暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

数据库抖动分析报告_CTWR

IT那活儿 2024-05-15
298
点击上方“IT那活儿”公众号--专注于企业全栈运维技术分享,不管IT什么活儿,干就完了!!!   

故障概述

二节点出现异常会话突增问题。

故障分析

2.1 等待事件
### 截取时间26分开始的event 

SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20240319 09:25:59 db file sequential read                                                  25
20240319 09:26:07 db file sequential read                                                  19
20240319 09:26:09                                                                                  23
20240319 09:26:09 db file sequential read                                                  24
20240319 09:26:17 db file sequential read                                                  11
20240319 09:26:19                                                                          14
20240319 09:26:19 db file sequential read                                                  21
20240319 09:26:29                                                                          16
20240319 09:26:29 db file sequential read                                                  43
20240319 09:26:37 db file sequential read                                                  14
20240319 09:26:40                                                                          19

SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20240319 09:26:40 db file sequential read                                                  40
20240319 09:26:50                                                                          12
20240319 09:26:50 db file sequential read                                                  44
20240319 09:26:57 gc buffer busy acquire                                                   33
20240319 09:27:00 db file sequential read                                                  12
20240319 09:27:00 gc buffer busy release                                                   18
20240319 09:27:00 buffer busy waits                                                   37
20240319 09:27:00 enq: SQ - contention                                                   58
20240319 09:27:00 enq: CT - CTWR process start/stop                                       373  <<<<<< CTWR 开始出现

20240319 09:27:07 cr request retry                                                   31
20240319 09:27:07 buffer busy waits                                                   37

SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20240319 09:27:07 library cache lock                                                   145
20240319 09:27:07 gc buffer busy acquire                                                   231
20240319 09:27:10                                                                          13
20240319 09:27:10 gc buffer busy release                                                   18
20240319 09:27:10 library cache lock                                                   95
20240319 09:27:10 latch: redo allocation                                                   101
20240319 09:27:10 buffer busy waits                                                   112
20240319 09:27:10 enq: SQ - contention                                                   339
20240319 09:27:10 enq: CT - CTWR process start/stop                                       565
20240319 09:27:17 gc cr failure                                                   11
20240319 09:27:17 cr request retry                                                   29

SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20240319 09:27:17 buffer busy waits                                                   44
20240319 09:27:17 gc buffer busy acquire                                                   245
20240319 09:27:17 library cache lock                                                   350
20240319 09:27:20                                                                          11
20240319 09:27:20 gc buffer busy release                                                   17
20240319 09:27:20 buffer busy waits                                                   160
20240319 09:27:20 latch: redo allocation                                                   225
20240319 09:27:20 library cache lock                                                   289
20240319 09:27:20 enq: SQ - contention                                                   511
20240319 09:27:20 enq: CT - CTWR process start/stop                                                  565

20240319 09:27:27 cr request retry                                                   36

SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20240319 09:27:27 buffer busy waits                                                   47


……


SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20240319 09:28:11 latch: redo copy                                                         11
20240319 09:28:11 gc current block 2-way                                                   12
20240319 09:28:11 gc buffer busy acquire                                                   17
20240319 09:28:11                                                                          23
20240319 09:28:11 enq: CT - CTWR process start/stop                                                   30
20240319 09:28:11 enq: US - contention 32  <<<<<<< enq: US - contention 出现

20240319 09:28:11 db file sequential read                                                  32
20240319 09:28:11 enq: SQ - contention                                                   41
20240319 09:28:11 enq: TX - index contention                                                   206
20240319 09:28:11 latch: redo allocation                                                   242
20240319 09:28:11 row cache lock                                                   391

SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20240319 09:28:11 buffer busy waits                                                   638
20240319 09:28:11 gc buffer busy release                                                   981

20240319 09:28:18 gc cr grant 2-way                                                   19
20240319 09:28:18 db file sequential read                                                  27
20240319 09:28:18 buffer busy waits                                                   31
20240319 09:28:18 gc buffer busy acquire                                                   87
20240319 09:28:18 enq: TX - index contention                                                   114
20240319 09:28:18 gc buffer busy release                                                   125
20240319 09:28:22                                                                          12
20240319 09:28:22 db file sequential read                                                  17
20240319 09:28:22 gc current grant busy                                                   18

SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20240319 09:28:22 gc index operation                                                   30
20240319 09:28:22 gc current block 2-way                                                   59
20240319 09:28:22 gc cr multi block request                                                   67
20240319 09:28:22 gc buffer busy acquire                                                   71
20240319 09:28:22 buffer busy waits                                                   102
20240319 09:28:22 gc current grant 2-way                                                   166
20240319 09:28:22 gc cr grant 2-way                                                   355

20240319 09:28:22 enq: US - contention                                                   467
20240319 09:28:22 enq: TX - index contention                                                   590
20240319 09:28:28                                                                          11

分析:首先出现异常的event是enq: CT - CTWR process start/stop,然后才是enq: US – contention争用 。
2.2 内存抖动及日志情况
### SGA内存抖动


   INST_ID COMPONENT                 STATUS    OPER_TYPE     OPER_MODE PARAMETER                 INITIAL_MB  TARGET_MB   FINAL_MB START_TIME        END_TIME
---------- ------------------------- --------- ------------- --------- ------------------------- ---------- ---------- ---------- ----------------- -----------------
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               1024       1280       1280 24-03-16 23:35:39 24-03-16 23:35:49
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               1024       1280       1280 24-03-16 23:35:39 24-03-16 23:35:50
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  57856      57600      57600 24-03-16 23:35:39 24-03-16 23:35:50
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               1280       1536       1536 24-03-16 23:35:51 24-03-16 23:35:52
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  57600      57344      57344 24-03-16 23:35:51 24-03-16 23:35:52
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               1536       1792       1792 24-03-16 23:35:53 24-03-16 23:35:53
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  57344      57088      57088 24-03-16 23:35:53 24-03-16 23:35:53
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               1536       1792       1792 24-03-16 23:35:53 24-03-16 23:35:54
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  57344      57088      57088 24-03-16 23:35:53 24-03-16 23:35:54
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               1536       1792       1792 24-03-16 23:35:53 24-03-16 23:35:55
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  57344      57088      57088 24-03-16 23:35:53 24-03-16 23:35:55
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  57088      56832      56832 24-03-16 23:35:56 24-03-16 23:35:57
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               1792       2048       2048 24-03-16 23:35:56 24-03-16 23:35:57
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               2048       2304       2304 24-03-16 23:35:58 24-03-16 23:35:58
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  56832      56576      56576 24-03-16 23:35:58 24-03-16 23:35:58
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               2048       2304       2304 24-03-16 23:35:58 24-03-16 23:36:42
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  56832      56576      56576 24-03-16 23:35:58 24-03-16 23:36:42
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  56832      56576      56576 24-03-16 23:35:58 24-03-16 23:36:45
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               2048       2304       2304 24-03-16 23:35:58 24-03-16 23:36:45
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  56576      56320      56320 24-03-16 23:37:12 24-03-16 23:37:13
         2 streams pool              COMPLETE  GROW          IMMEDIATE streams_pool_size               2304       2560       2560 24-03-16 23:37:12 24-03-16 23:37:13
         2 java pool                 COMPLETE  SHRINK        DEFERRED  java_pool_size                  1792       1536       1536 24-03-18 00:44:20 24-03-18 00:44:20
         2 DEFAULT buffer cache      COMPLETE  GROW          DEFERRED  db_cache_size                  56320      56576      56576 24-03-18 00:44:20 24-03-18 00:44:20
         2 DEFAULT buffer cache      COMPLETE  GROW          DEFERRED  db_cache_size                  56576      56832      56832 24-03-18 15:45:37 24-03-18 15:45:37
         2 java pool                 COMPLETE  SHRINK        DEFERRED  java_pool_size                  1536       1280       1280 24-03-18 15:45:37 24-03-18 15:45:37
         2 DEFAULT buffer cache      COMPLETE  GROW          DEFERRED  db_cache_size                  56832      57088      57088 24-03-19 00:42:46 24-03-19 00:42:46
         2 java pool                 COMPLETE  SHRINK        DEFERRED  java_pool_size                  1280       1024       1024 24-03-19 00:42:46 24-03-19 00:42:46
         2 DEFAULT buffer cache      ERROR     SHRINK        IMMEDIATE db_cache_size                  57088      56832      57088 24-03-19 09:27:06 24-03-19 09:27:30
         2 large pool                ERROR     GROW          IMMEDIATE large_pool_size                16384      16640      16384 24-03-19 09:27:06 24-03-19 09:27:30
         2 DEFAULT buffer cache      COMPLETE  SHRINK        IMMEDIATE db_cache_size                  57088      56832      56832 24-03-19 09:27:41 24-03-19 09:27:44
         2 large pool                COMPLETE  GROW          IMMEDIATE large_pool_size                16384      16640      16640 24-03-19 09:27:41 24-03-19 09:27:44

分析: SGA large_pool_size 开始进行稍微的增长,期间出现一次失败的情况。
###alert log

2024-03-19T09:26:25.344548+08:00
LOGMINER: End mining logfile for session 1 thread 2 sequence 506739, +DATA/db02/ONLINELOG/group26_261.log
2024-03-19T09:26:27.913945+08:00
LOGMINER: Begin mining logfile for session 1 thread 2 sequence 506740, +DATA/db02/ONLINELOG/group27_271.log
2024-03-19T09:26:30.445225+08:00
Archived Log entry 1763450 added for T-2.S-506739 ID 0x2611a2c0 LAD:1
2024-03-19T09:27:44.459395+08:00
public DBA buffer size has already reached the maximum size of 231112704. <<<<<<09:27:44,达到最大值。
public DBA buffer size has already reached the maximum size of 231112704.
public DBA buffer size has already reached the maximum size of 231112704.
public DBA buffer size has already reached the maximum size of 231112704.
public DBA buffer size has already reached the maximum size of 231112704.
public DBA buffer size has already reached the maximum size of 231112704.
public DBA buffer size has already reached the maximum size of 231112704.


####ctwr进程情况

 db02:/home/oracle(db022)$ps -ef|grep ctwr
  oracle 11731356        1   7   Mar 16      - 35:39 ora_ctwr_db022
  oracle 66781778 36112030   0 10:09:51  pts/8  0:00 grep ctwr
 db02:/home/oracle(db022)$ora log

'TAIL-1000F'||A.VALUE||'/'||'ALERT_'||B.INSTANCE_NAME||'.LOG'
----------------------------------------------------------------------------------------------------
tail -1000f /oracle/app/oracle/diag/rdbms/db02/db022/trace/alert_db022.log

 db02:/home/oracle(db022)$cd /oracle/app/oracle/diag/rdbms/db02/db022/trace/
 db02:/oracle/app/oracle/diag/rdbms/db02/db022/trace(db022)$
 db02:/oracle/app/oracle/diag/rdbms/db02/db022/trace(db022)$ls *ctwr*
db022_ctwr_11731356.trc db022_ctwr_11731356.trm
 db02:/oracle/app/oracle/diag/rdbms/db02/db022/trace(db022)$ls -l *ctwr*
-rw-r----- 1 oracle asmadmin 22475 Mar 19 09:28 db022_ctwr_11731356.trc
-rw-r----- 1 oracle asmadmin 5406 Mar 19 09:28 db022_ctwr_11731356.trm
 db02:/oracle/app/oracle/diag/rdbms/db02/db022/trace(db022)$

Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Build label:    RDBMS_12.2.0.1.0_AIX.PPC64_170125.170330
ORACLE_HOME: oracle/app/oracle/product/12.2.0/db
System name: AIX
Node name: db02
Release: 2
Version: 7
Machine: 00C057F84C00
Instance name: db022
Redo thread mounted by this instance: 2
Oracle process number: 111
Unix process pid: 11731356, image: oracle@ db02 (CTWR)

*** SESSION ID:(10435.14611) 2024-03-19T09:27:05.997453+08:00
*** CLIENT ID:() 2024-03-19T09:27:05.997469+08:00
*** SERVICE NAME:(SYS$BACKGROUND) 2024-03-19T09:27:05.997478+08:00
*** MODULE NAME:() 2024-03-19T09:27:05.997486+08:00
*** ACTION NAME:() 2024-03-19T09:27:05.997493+08:00
*** CLIENT DRIVER:() 2024-03-19T09:27:05.997500+08:00


*** TRACE FILE RECREATED AFTER BEING REMOVED ***


*** 2024-03-19T09:27:05.945577+08:00

*** 2024-03-19T09:27:06.048862+08:00

*** 2024-03-19T09:27:44.277588+08:00

*** 2024-03-19T09:27:44.459500+08:00

分析:BCT 是使用的large pool 来记录下数据块变化的。

2.3 回滚段争用情况
### 问题时间段,回滚段争用的SQL

SQL> select sql_id,count(*)
  2  from dba_hist_active_sess_history
  3  where to_char(sample_time,'yyyymmdd hh24:mi:ss') between '20240319 09:20:00' and '20240319 09:40:00'
  4 and event 
= 'enq: US - contention'
  5  group by sql_id having count(*) >10
  6 order by count(*)
;

SQL_ID COUNT(*)
------------- ----------
4znn9r0p7umd9 11
7a7xv00dvrxvu 19
azw81fzvdfwca 20
0rs28rxgacsf1 20
czvbh2m27qs8p 21
6t2846xks3a81 22
44stykyzw08fa 23
94mjrjps3pr6u 25
cmnwf6d828r63 26
81v5pbhnj71zc 30
73n4whp1snxkf 31

SQL_ID COUNT(*)
------------- ----------
2gxg2xxc0pc2w 40
cynayy51r7zpt 48
99b0p63w1zrv8 57
gfqwhjsjvkhr5 66
3607cxxpdqdtp 66
1x6vr8rj0f8jf 67
1dtvp0yk8214q 69
fyy6aquyrjats 77
138a4dcps6ssa 87
2kgaujdnr44rs 107
4mvbqc4zp3g13 109

SQL_ID COUNT(*)
------------- ----------
ac5aahr706km9 126
5t52qm5vhzvag 141
dpk5hxj79y5fd 166
ay26uw9mxy2hy 243
aty5x3fkfrqhb 526
4wgq4vxaf18zp 639
2q83h41tbg7rn 1952
cwzr69ka04rms 3507


30 rows selected

##标红的4个SQL等待很大。分析历史执行情况,无异常变化。

分析总结及建议

数据库在3.19的9:27分开始出现异常活动会话突增现象,直到9:35结束, 持续大约10分钟。
3.1 等待事件时间梳理
  • 1)在09:27分时,数据库出现大量enq: CT - CTWR process start/stop及少量的enq: SQ – contention等待事件。启用BCT功能,用于Rman备份的块跟踪功能。

  • 2)在09:27:07 出现 大量的library cache lock 等待事件,持续到09:28:01时。

  • 3)在09:28:11时,enq: CT - CTWR process start/stop消失 ,enq: US – contention和enq: TX - index contention事件开始出现,在09:28:52时达到最大值。09:29:03时, enq: IV -  contention开始出现。

  • 4)在09:32:55时, 数据库恢复正常。期间未手动处理会话。

针对  enq: SQ – contention 等待事件的SQL 3mmcwuyvjx2d9 和25r6h6hfw2rhg,主要是3mmcwuyvjx2d9,其中使用的sequences分别是HIBERNATE_SEQUENCE(cache_Size = 2000) 和 SEQ_PRPLCLAIMSLOGREAD (cache_size = 20)。查询上述SQL历史执行情况,并无明显增长的情况。
  • enq: CT - CTWR process start/stop 
    是针对BCT的等待事件,在alert日志中显示了public DBA buffer size has already reached the maximum size of 231112704信息,同时查询SGA在9:27分钟发现large pool有grow的操作,说明内存有抖动。
  • enq: US – contention 
    针对的online rollback segments的争用,当前数据库已经设置隐藏参数 _rollback_segment_count = 2000 。
3.2 总 结
对比分析上述异常情况,只有enq: CT - CTWR process start/stop 比较特殊。
在MOS中有记录 The BCT buffers ran out of space causing waits for buffers resulting in hang spike in CPU.
数据库异常时间较短并未手动处理,可以查询的信息有限。
建议:再次出现问题时做 hang ayalyze分析,确定下事务链关系。

参考:
Short Hangs waiting for Enq: CT - CTWR Large Concurrency Wait Event (Doc ID 2601295.1)

END


本文作者:李 波(上海新炬中北团队)

本文来源:“IT那活儿”公众号

文章转载自IT那活儿,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论