A remarkable case: even a 15-year veteran is seeing this for the first time


    A few days ago, while I was on a business trip, a reader asked me about a problem: their core database had run into a rather odd issue that had already occurred twice, and they sent me the relevant logs. I took some time this weekend to do a quick analysis and share it here for reference.

    According to the reader, the system hung at around 10:07 in the morning. Without further ado, let's look at the symptoms. Here is the alert log:

    Thu Feb 29 09:54:10 2024
    LNS: Standby redo logfile selected for thread 1 sequence 443904 for destination LOG_ARCHIVE_DEST_4
    Thu Feb 29 09:54:11 2024
    LNS: Standby redo logfile selected for thread 1 sequence 443904 for destination LOG_ARCHIVE_DEST_2
    Thu Feb 29 09:54:11 2024
    Archived Log entry 3795358 added for thread 1 sequence 443903 ID 0xf960dd75 dest 1:
    Thu Feb 29 09:55:28 2024
    Thread 1 cannot allocate new log, sequence 443905
    Checkpoint not complete
    Current log# 9 seq# 443904 mem# 0: +REDO/xxxx/onlinelog/group_9.266.1062239235
    Current log# 9 seq# 443904 mem# 1: +FRA/xxxx/onlinelog/group_9.513.1062239249
    Thu Feb 29 10:02:35 2024
    minact-scn: useg scan erroring out with error e:12751
    Thu Feb 29 10:06:51 2024
    Errors in file u01/app/oracle/diag/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_126505.trc:
    ORA-01013: user requested cancel of current operation
    Thu Feb 29 10:07:34 2024
    Thread 1 advanced to log sequence 443905 (LGWR switch)
    Current log# 10 seq# 443905 mem# 0: +REDO/xxxx/onlinelog/group_10.272.1062238997
    Current log# 10 seq# 443905 mem# 1: +FRA/xxxx/onlinelog/group_10.274.1062239003
    Thu Feb 29 10:07:34 2024
    opidcl aborting process unknown ospid (194109) as a result of ORA-1001
    Thu Feb 29 10:07:34 2024
    opidcl aborting process unknown ospid (191940) as a result of ORA-1001
    Thu Feb 29 10:07:34 2024
    opidcl aborting process unknown ospid (195118) as a result of ORA-1001
    Thu Feb 29 10:07:34 2024
    opidcl aborting process unknown ospid (193395) as a result of ORA-1001
    Thu Feb 29 10:07:34 2024
    LNS: Standby redo logfile selected for thread 1 sequence 443905 for destination LOG_ARCHIVE_DEST_4
    Thu Feb 29 10:07:34 2024
    LNS: Standby redo logfile selected for thread 1 sequence 443905 for destination LOG_ARCHIVE_DEST_2
    Thu Feb 29 10:07:35 2024
    opidcl aborting process unknown ospid (193729) as a result of ORA-1001
    .....
    opidcl aborting process unknown ospid (193238) as a result of ORA-1001
    Thu Feb 29 10:07:41 2024
    opidcl aborting process unknown ospid (214224) as a result of ORA-1001
    Thu Feb 29 10:07:45 2024
    Archived Log entry 3795361 added for thread 1 sequence 443904 ID 0xf960dd75 dest 1:
    Thu Feb 29 10:07:56 2024
    System State dumped to trace file u01/app/oracle/diag/diag/rdbms/xxxx/xxxx1/trace/xxxx1_diag_48529.trc
    Thu Feb 29 10:08:11 2024
    Thread 1 advanced to log sequence 443906 (LGWR switch)
    Current log# 6 seq# 443906 mem# 0: +REDO/xxxx/onlinelog/group_6.271.1062238957
    Current log# 6 seq# 443906 mem# 1: +FRA/xxxx/onlinelog/group_6.396.1062238963
    Thu Feb 29 10:08:12 2024
    LNS: Standby redo logfile selected for thread 1 sequence 443906 for destination LOG_ARCHIVE_DEST_2
    Thu Feb 29 10:08:19 2024
    Archived Log entry 3795366 added for thread 1 sequence 443905 ID 0xf960dd75 dest 1:
    Thu Feb 29 10:08:35 2024
    Auto-tuning: Starting background process GTXj
    Starting background process GTXj
    Thu Feb 29 10:08:35 2024
    GTXj started with pid=95, OS id=250157
    Thu Feb 29 10:08:37 2024
    Global Enqueue Services Deadlock detected. More info in file
    u01/app/oracle/diag/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmd0_48542.trc.
    Thu Feb 29 10:08:42 2024
    LNS: Standby redo logfile selected for thread 1 sequence 443906 for destination LOG_ARCHIVE_DEST_4
    Thu Feb 29 10:09:54 2024


    It looks like no log switch took place between 9:55 and 10:07. The LMD process also detected a deadlock, and a system state dump was produced, which is very valuable; I hope domestic databases will offer this capability in the future as well (our #MogDB already provides gstrace, and more enhancements are coming in later releases, so stay tuned. For the features of MogDB 5.0, see the official documentation: https://docs.mogdb.io/zh/mogdb/v5.0/characteristic-description-overview).

    Before diving into the trace, let's first check whether AWR or ASH reports are available.
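    For readers following along, AWR and ASH reports are generated with the standard scripts under $ORACLE_HOME/rdbms/admin, and recent in-memory ASH samples can be queried directly; the snippet below is only a minimal sketch (the 30-minute window is just an example):

        -- From SQL*Plus as a DBA user on the affected instance:
        @?/rdbms/admin/awrrpt.sql        -- AWR report, prompts for the snapshot range
        @?/rdbms/admin/ashrpt.sql        -- ASH report, prompts for begin/end time

        -- Quick look at ASH samples from the last 30 minutes (kept in memory):
        SELECT sample_time, session_id, event, blocking_session
          FROM v$active_session_history
         WHERE sample_time > SYSDATE - 30/1440
         ORDER BY sample_time;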

    From the ASH data, this database runs on very powerful hardware: a single node has 224 cores and roughly 1.5 TB of physical memory, given that the Oracle SGA alone is set to 1 TB (the largest I have ever seen; I had previously seen an SGA of 800 GB, but this one takes it to another level).
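    (As a side note, the hardware and SGA figures quoted above can be cross-checked from the instance itself; a minimal sketch using standard dynamic views:)

        -- CPU count and physical memory as seen by the instance
        SELECT stat_name, value
          FROM v$osstat
         WHERE stat_name IN ('NUM_CPUS', 'PHYSICAL_MEMORY_BYTES');

        -- SGA sizing parameters
        SELECT name, display_value
          FROM v$parameter
         WHERE name IN ('sga_target', 'sga_max_size', 'memory_target');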

    Looking at the ASH events, something appears to have gone wrong around log switching. As for buffer busy waits, they are usually caused by excessive concurrency (hot blocks, for example) or I/O problems. Let's look further at the top processes:

    Judging from ASH's top blocking sessions alone, everything seems to point at the Oracle LGWR process. If that process gets into trouble, log switches obviously cannot complete, every operation ends up blocked and waiting, and a system-wide hang is no surprise. The ASH information is rather terse, but it basically gives us a direction: LGWR may be the problem. To get a clearer picture of this environment, I went on to look at the AWR report.
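    The "top blocking session" view from ASH can be approximated with a query like the one below against dba_hist_active_sess_history; this is only a sketch, and the time window is an assumption based on the alert log timestamps:

        -- Which sessions blocked the most ASH samples during the hang window?
        SELECT blocking_session, blocking_session_serial#, COUNT(*) AS samples
          FROM dba_hist_active_sess_history
         WHERE sample_time BETWEEN TIMESTAMP '2024-02-29 09:50:00'
                               AND TIMESTAMP '2024-02-29 10:10:00'
           AND blocking_session IS NOT NULL
         GROUP BY blocking_session, blocking_session_serial#
         ORDER BY samples DESC;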

    I will skip the AWR top events, since they mirror the ASH view. From the load profile, the system is under heavy load, with 20,000+ SQL executes per second. Next, let's look at the redo-related statistics and LGWR's write performance:

    Statistic                                     Total            per Second     per Trans
    redo synch long waits                         0                0.00           0.00
    redo synch poll writes                        60,409           26.94          0.01
    redo synch polls                              88,363           39.41          0.02
    redo synch time                               22,885           10.21          0.00
    redo synch time (usec)                        232,862,072      103,864.61     47.19
    redo synch time overhead (usec)               228,728,940      102,021.09     46.36
    redo synch time overhead count (<128 msec)    327              0.15           0.00
    redo synch time overhead count (<2 msec)      52,177           23.27          0.01
    redo synch time overhead count (<32 msec)     1,288            0.57           0.00
    redo synch time overhead count (<8 msec)      5,790            2.58           0.00
    redo synch time overhead count (>=128 msec)   169              0.08           0.00
    redo synch writes                             60,409           26.94          0.01
    redo wastage                                  3,251,724,852    1,450,382.79   659.04

       



    (bucket columns show the % of waits in each latency range; "-" marks an empty cell)

    Event                                      Total Waits   <1ms    <2ms    <4ms    <8ms    <16ms   <32ms   <=1s    >1s
    latch: redo allocation                     2119          96.7    2.3     .8      .1      .1      -       -       -
    latch: redo writing                        243.6K        50.0    17.1    14.1    10.1    6.7     2.0     .0      -
    latch: row cache objects                   30.1K         98.5    .9      .4      .1      .1      .0      -       -
    latch: shared pool                         442           83.3    4.1     5.4     4.5     1.8     .9      -       -
    local write wait                           45            -       55.6    22.2    2.2     -       17.8    2.2     -
    lock deadlock retry                        26            100.0   -       -       -       -       -       -       -
    lock escalate retry                        1             100.0   -       -       -       -       -       -       -
    log buffer space                           5342          16.8    13.3    19.3    19.1    13.4    7.9     10.2    -
    log file parallel write                    916.2K        84.8    10.7    3.5     .8      .2      .0      .0      -
    log file sequential read                   30.3K         38.2    31.9    22.3    6.2     .8      .3      .2      -
    log file single write                      36            100.0   -       -       -       -       -       -       -
    log file switch (checkpoint incomplete)    457           -       -       -       .9      -       -       -       99.1


    From the data above, it does look a lot like the adaptive log file sync feature the reader suspected: there are indeed redo poll operations, and log file parallel write shows virtually no waits above 16 ms.
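    Whether adaptive log file sync is actually in play can be checked via the hidden parameter _use_adaptive_log_file_sync; the query below is the usual sketch against the X$ views (SYS only), and the LGWR trace file typically also records messages such as "Log file sync switching to polling" when the mode flips:

        -- Current setting of the hidden parameter (default TRUE from 11.2.0.3 onward)
        SELECT a.ksppinm  AS parameter,
               b.ksppstvl AS session_value,
               c.ksppstvl AS instance_value
          FROM x$ksppi a, x$ksppcv b, x$ksppsv c
         WHERE a.indx = b.indx
           AND a.indx = c.indx
           AND a.ksppinm = '_use_adaptive_log_file_sync';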

    It looks like we need to analyze the trace to get a closer look at what the LGWR process was doing at the time.
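    For reference, hanganalyze and system state dumps like the ones analyzed below are typically collected from a SYSDBA session with oradebug while the hang is in progress; this is only a sketch with commonly used dump levels, so adjust it to your own support guidance:

        -- In SQL*Plus as SYSDBA on a surviving instance:
        oradebug setmypid
        oradebug unlimit
        -- cluster-wide hang analysis at level 3
        oradebug -g all hanganalyze 3
        -- cluster-wide system state dump; level 266 also captures short stacks
        oradebug -g all dump systemstate 266
        -- show where the trace file was written
        oradebug tracefile_name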

    When I opened the system state dump trace, the first thing that caught my eye was a row cache lock deadlock. The first part of the file is the output produced by Oracle hanganalyze:

      *** 2024-02-29 10:06:50.381
      ===============================================================================
      HANG ANALYSIS:
      instances (db_name.oracle_sid): crmprod.crmprod1, crmprod.crmprod2
      oradebug_node_dump_level: 3
      analysis initiated by oradebug
      os thread scheduling delay history: (sampling every 1.000000 secs)
      0.000000 secs at [ 10:06:49 ]
      NOTE: scheduling delay has not been sampled for 0.704062 secs 0.000000 secs from [ 10:06:45 - 10:06:50 ], 5 sec avg
      0.000000 secs from [ 10:05:50 - 10:06:50 ], 1 min avg
      0.000000 secs from [ 10:01:51 - 10:06:50 ], 5 min avg
      vktm time drift history
      ===============================================================================


      Chains most likely to have caused the hang:
      [a] Chain 1 Signature: 'row cache lock'<='row cache lock' (cycle)
      Chain 1 Signature Hash: 0x75bdd0c
      [b] Chain 2 Signature: 'row cache lock'<='row cache lock' (cycle)
      Chain 2 Signature Hash: 0x75bdd0c
      [c] Chain 3 Signature: 'row cache lock'<='row cache lock' (cycle)
      Chain 3 Signature Hash: 0x75bdd0c


    So what is a "cycle"? Think of it as a loop, essentially a deadlock. From the trace, the operation involved is a sequence fetch:

        ===============================================================================
        Cycles:


        -------------------------------------------------------------------------------
        Chain 1:
        -------------------------------------------------------------------------------
        Oracle session identified by:
        {
        instance: 2 (xxxxx.xxxxx2)
        os id: 165520
        process id: 1798, oracle@xxxdb002
        session id: 209
        session serial #: 13143
        }
        is waiting for 'row cache lock' with wait info:
        {
        p1: 'cache id'=0xd
        p2: 'mode'=0x0
        p3: 'request'=0x5
        time in wait: 1.577432 sec
        heur. time in wait: 8 min 8 sec
        timeout after: 1.422568 sec
        wait id: 46539
        blocking: 1 session
        current sql: SELECT xxxxxxxxxtab.NEXTVAL FROM DUAL


    So could the row cache lock cycle (deadlock) contention caused by highly concurrent sequence fetches have affected the LGWR process? Let's keep that question open for now.
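    If the sequence angle were pursued further, the usual first check is the sequence's CACHE/ORDER settings, since row cache lock waits on sequences are normally relieved by a larger cache. This is a sketch only; the sequence name is masked in the trace, so the name below is a placeholder:

        -- Cache settings of the suspect sequence (placeholder name)
        SELECT sequence_owner, sequence_name, cache_size, order_flag
          FROM dba_sequences
         WHERE sequence_name = 'XXXXXXXXXTAB';

        -- Typical mitigation for heavy concurrent NEXTVAL traffic, especially in RAC:
        -- ALTER SEQUENCE <owner>.<sequence_name> CACHE 1000 NOORDER;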

    The earlier analysis suggested LGWR was the source, so let's keep going and see what LGWR was doing:

          ===============================================================================
          Non-intersecting chains:


          -------------------------------------------------------------------------------
          Chain 4:
          -------------------------------------------------------------------------------
          Oracle session identified by:
          {
          instance: 1 (xxxxprod.xxxxprod1)
          os id: 278128
          process id: 896, oracle@xxxxdb001
          session id: 5
          session serial #: 50255
          }
          is waiting for 'log file switch (checkpoint incomplete)' with wait info:
          {
          time in wait: 2 min 19 sec (last interval)
          time in wait: 8 min 9 sec (total)
          timeout after: never
          wait id: 1064069
          blocking: 0 sessions
          current sql: UPDATE xxx.T_xxx_xxx_APPLY
          SET APPLIER_ID = :1 ,
          LAST_UPD = SYSDATE,
          ACCNT_ID = :2 ,
          AGREE_NUM = :3 ,
          RECIPIENT_NAME = :4 ,
          RECIPIENT_NUM = :5 ,
          PROVINCE
          short stack: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2066<-kslwaitctx()+163<-kcrfws()+508<-kcbchg1_main()+16898<-kcbchg1()+205<-ktuchg2()+1480<-ktbchg2()+281<-kddchg()+799<-kduovw()+4218<-kduurp()+2455<-kdusru()+975<-kauupd()+412<-updrow()+2167<-qerupFetch()+860<-updaul()+1378<-updThreePhaseExe()+318<-updexe()+638<-opiexe()+10378<-kpoal8()+2380<-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opima
          wait history:
          * time between current wait and wait #1: 0.000000 sec
          1. event: 'latch: redo writing'
          time waited: 0.001789 sec
          wait id: 1064520 p1: 'address'=0x60023ff0
          p2: 'number'=0xcf
          p3: 'tries'=0x0
          * time between wait #1 and #2: 0.000000 sec
          2. event: 'log file switch (checkpoint incomplete)'
          time waited: 0.824966 sec (last interval)
          time waited: 5 min 49 sec (total)
          wait id: 1064069
          * time between wait #2 and #3: 0.000000 sec
          3. event: 'latch: redo writing'
          time waited: 0.000005 sec
          wait id: 1064519 p1: 'address'=0x60023ff0
          p2: 'number'=0xcf
          p3: 'tries'=0x0
          }
          and is blocked by
          => Oracle session identified by:
          {
          instance: 1 (xxxxrod.xxxxrod1)
          os id: 48602
          process id: 33, oracle@xxxxdb001 (LGWR)
          session id: 1123
          session serial #: 1
          }
          which is waiting for 'rdbms ipc message' with wait info:
          {
          p1: 'timeout'=0xc3
          time in wait: 28486208 min 10 sec
          timeout after: 0.000000 sec
          wait id: 393202531
          blocking: 1178 sessions
          current sql: <none>
          short stack: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-pread64()+19<-ksfd_skgfqio()+191<-ksfd_io()+767<-ksfdread()+581<-kfk_ufs_sync_io()+411<-kfk_submit_ufs_io()+247<-kfk_submit_io()+90<-kfk_io1()+1089<-kfkRequest()+14<-kfk_transitIO()+1712<-kfioSubmitIO()+4777<-kfioRequestPriv()+182<-kfioRequest()+691<-ksfd_kfioRequest()+649<-ksfd_osmio()+1053<-ksfd_io()+2782<-ksfdread()+581<-kccrbp()+488<-kccrec_rbl()+507<-kccext_ugg()+287<-kccrec_read_write()+212<-kccugg()+493<-kcc_get_record()+43<-kccgli(
          wait history:
          * time between current wait and wait #1: 0.000011 sec
          1. event: 'KSV master wait'
          time waited: 0.000057 sec
          wait id: 393202530
          * time between wait #1 and #2: 0.000068 sec
          2. event: 'control file sequential read'
          time waited: 0.000145 sec
          wait id: 393202529 p1: 'file#'=0x0
          p2: 'block#'=0x2d
          p3: 'blocks'=0x1
          * time between wait #2 and #3: 0.000005 sec
          3. event: 'control file sequential read'
          time waited: 0.000165 sec
          wait id: 393202528 p1: 'file#'=0x0
          p2: 'block#'=0x330
          p3: 'blocks'=0x1
          }


          Chain 4 Signature: 'rdbms ipc message'<='log file switch (checkpoint incomplete)'
          Chain 4 Signature Hash: 0xfb14fb68


    It looks like many UPDATE sessions were waiting on LGWR, and LGWR was blocking 1178 sessions. Yet at that moment LGWR's own wait event was 'rdbms ipc message', which is an idle wait event. That seems a bit odd. Meanwhile, looking at the hanganalyze level-3 output:

            *** 2024-02-29 10:06:50.700
            ===============================================================================
            HANG ANALYSIS DUMPS:
            oradebug_node_dump_level: 3
            ===============================================================================


            State of LOCAL nodes
            ([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):
            [4]/1/5/50255/0x12d22aea120/278128/NLEAF/[1122]
            [7]/1/8/64645/0x12e82b8e8c8/126496/NLEAF/[1122]
            [40]/1/41/30141/0x12e82b9dd28/222217/NLEAF/[7378]
            ......
            [1057]/1/1058/20481/0x12d62c8e228/192189/NLEAF/[1122]
            [1062]/1/1063/50875/0x12d22cb4460/223750/NLEAF/[1122]
            [1088]/1/1089/1/0x12d22cc69a0/48600/SINGLE_NODE_NW/
            [1091]/1/1092/51785/0x12e82d6b148/195897/NLEAF/[1122]
            [1092]/1/1093/31217/0x12d82c6a400/192238/NLEAF/[1122]
            [1122]/1/1123/1/0x12d42c8ef90/48602/LEAF/
            [1124]/1/1125/41567/0x12e82d7a5a8/195899/NLEAF/[1122]
            ......
            [7518]/1/7519/21069/0x12d8375e740/195527/NLEAF/[1122]
            [7520]/1/7521/36085/0x12d03773710/314154/NLEAF/[1122]
            [7524]/1/7525/61423/0x12e8385c3a8/191947/NLEAF/[1122]
            [7589]/1/7590/46795/0x12d637ad1a8/195530/NLEAF/[1122]
            [7590]/1/7591/18861/0x12e8387ac68/222252/NLEAF/[2798]


    Here 1/1123 is the LGWR process. It appears as a LEAF node with no adjacency list, which means that, judging from the blocking chains, it may indeed be the source of the hang. Since the system state dump is quite large, to see the details of a specific process we can simply search the file (for example in UltraEdit) for: PROCESS 33.

              PROCESS 33: LGWR
              ----------------------------------------
              SO: 0x12d427afb70, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
              proc=0x12d427afb70, name=process, file=ksu.h LINE:12721, pg=0
              (process) Oracle pid:33, ser:1, calls cur/top: 0x12d83785f48/0x12d83785f48
              flags : (0x6) SYSTEM
              flags2: (0x0), flags3: (0x10)
              intr error: 0, call error: 0, sess error: 0, txn error 0
              intr queue: empty
              ksudlp FALSE at location: 0
              (post info) last post received: 0 0 26
              last post received-location: ksa2.h LINE:285 ID:ksasnd
              last process to post me: 0x12d427e5018 239 0
              last post sent: 0 0 126
              last post sent-location: kjc.h LINE:1943 ID:KJCS Post snd proxy to flush msg
              last process posted by me: 0x12d0279f728 1 6
              (latch info) wait_event=0 bits=0x0
              Process Group: DEFAULT, pseudo proc: 0x12d02a887b0
              O/S info: user: oracle, term: UNKNOWN, ospid: 48602
              OSD pid info: Unix process pid: 48602, image: oracle@crmdb001 (LGWR)
              Short stack dump:
              ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-io_queue_run()+100<-skgfospo()+185<-skgfrwat()+399<-ksfdwtio()+756<-ksfdwat_internal()+649<-ksfdblock()+207<-kfk_block_ufs()+334<-kfk_block()+871<-kfk_transit_waitIO()+188<-kfk_transitIO()+1755<-kfioWaitIO()+721<-kfioRequestPriv()+351<-kfioRequest()+691<-ksfd_kfioRequest()+649<-ksfd_osmbio()+413<-ksfdbio()+1563<-kcrfw_do_write()+974<-kcrfw_redo_write()+2072<-ksbabs()+771<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
              ----------------------------------------
              SO: 0x12d83973a48, type: 14, owner: 0x12d427afb70, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
              proc=0x12d427afb70, name=channel handle, file=ksr2.h LINE:367, pg=0
              (broadcast handle) 0x12d83973a48 flag: (1) PUBLISHER,
              owner: 0x12d427afb70 - ospid: 48602
              event: 2, last message event: 0,
              last message waited event: 0,
              next message: (nil)(0), messages published: 0
              channel: (0x12d638df958) LGWR to NSA2 channel [name: 39]
              scope: 0, event: 9, last mesage event: 9,
              publishers/subscribers: 1/1,
              messages published: 4
              heuristic msg queue length: 0
              ......
              SO: 0x12d8395dc38, type: 14, owner: 0x12d427afb70, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
              proc=0x12d427afb70, name=channel handle, file=ksr2.h LINE:367, pg=0
              (broadcast handle) 0x12d8395dc38 flag: (2) ACTIVE SUBSCRIBER,
              owner: 0x12d427afb70 - ospid: 48602
              event: 16, last message event: 16,
              last message waited event: 16,
              next message: (nil)(0), messages read: 0
              channel: (0x12d638ea218) scumnt mount lock [name: 159]
              scope: 1, event: 4596, last mesage event: 0,
              publishers/subscribers: 0/58,
              messages published: 0
              heuristic msg queue length: 0
              ----------------------------------------
              SO: 0x12d42c8ef90, type: 4, owner: 0x12d427afb70, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
              proc=0x12d427afb70, name=session, file=ksu.h LINE:12729, pg=0
              (session) sid: 1123 ser: 1 trans: (nil), creator: 0x12d427afb70
              flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x409) -/-/INC
              DID: , short-term DID:
              txn branch: (nil)
              edition#: 0 oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
              ksuxds FALSE at location: 0
              service name: SYS$BACKGROUND
              Current Wait Stack:
              0: waiting for 'log file parallel write'
              files=0x2, blocks=0x40, requests=0xc
              wait_id=393564999 seq_num=36277 snap_id=1
              wait times: snap=28486210 min 16 sec, exc=28486210 min 16 sec, total=28486210 min 16 sec
              wait times: max=infinite, heur=28486210 min 16 sec
              wait counts: calls=0 os=0
              in_wait=1 iflags=0x5a0
              Wait State:
              fixed_waits=0 flags=0x22 boundary=(nil)/-1
              Session Wait History:
              elapsed time of 0.000235 sec since current wait
              0: waited for 'log file parallel write'
              files=0x2, blocks=0x3e, requests=0xc
              wait_id=393564998 seq_num=36276 snap_id=1
              wait times: snap=0.000712 sec, exc=0.000712 sec, total=0.000712 sec
              wait times: max=infinite
              wait counts: calls=0 os=0
              occurred after 0.000185 sec of elapsed time
              ......
              9: waited for 'log file parallel write'
              files=0x2, blocks=0x40, requests=0xc
              wait_id=393564999 seq_num=36277 snap_id=1
              wait times: snap=0.001645 sec, exc=0.001645 sec, total=0.001645 sec
              wait times: max=infinite
              wait counts: calls=0 os=0
              occurred after 0.000235 sec of elapsed time
              Sampled Session History of session 1123 serial 1
              ---------------------------------------------------
              The sampled session history is constructed by sampling
              the target session every 1 second. The sampling process
              captures at each sample if the session is in a non-idle wait,
              an idle wait, or not in a wait. If the session is in a
              non-idle wait then one interval is shown for all the samples
              the session was in the same non-idle wait. If the
              session is in an idle wait or not in a wait for
              consecutive samples then one interval is shown for all
              the consecutive samples. Though we display these consecutive
              samples in a single interval the session may NOT be continuously
              idle or not in a wait (the sampling process does not know).


              The history is displayed in reverse chronological order.


              sample interval: 1 sec, max history 120 sec
              ---------------------------------------------------
              [1 sample, 10:07:57]
              waited for 'log file parallel write', seq_num: 35813
              p1: 'files'=0x2
              p2: 'blocks'=0x9c
              p3: 'requests'=0xc
              time_waited: >= 0 sec (still in wait)
              。。。。。。
              waited for 'log file parallel write', seq_num: 29854
              p1: 'files'=0x2
              p2: 'blocks'=0x28e
              p3: 'requests'=0x20
              time_waited: 0.007661 sec (sample interval: 0 sec)
              [152 samples, 10:05:05 - 10:07:35]
              idle wait at each sample
              ---------------------------------------------------
              Sampled Session History Summary:
              longest_non_idle_wait: 'log file parallel write'
              [1 sample, 10:07:40 ]
              time_waited: 0.026762 sec (sample interval: 0 sec)


    It is not hard to see that the LGWR process was waiting on log file parallel write. Looking back at the wait event histogram, a small share of these waits fell into the 8-16 ms bucket, which is a fairly high response time for a high-end storage environment (out of 916.2K waits, roughly 1 percent, around 9,162 waits, took longer than 4 ms, with a tail reaching into the 8-16 ms range).
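    The same latency distribution can also be pulled outside of AWR, for example from v$event_histogram; a sketch, with values cumulative since instance startup:

        -- Latency buckets for LGWR's redo write I/O
        SELECT event, wait_time_milli, wait_count
          FROM v$event_histogram
         WHERE event = 'log file parallel write'
         ORDER BY wait_time_milli;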

    So what exactly is the problem? Could it be LGWR's adaptive redo write mechanism? To confirm, let's lay out LGWR's call stack:

              ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112
              <-__sighandler()<-io_queue_run()+100
              <-skgfospo()+185<-skgfrwat()+399<-ksfdwtio()+756
              <-ksfdwat_internal()+649<-ksfdblock()+207<-kfk_block_ufs()+334
              <-kfk_block()+871<-kfk_transit_waitIO()+188<-kfk_transitIO()+1755
              <-kfioWaitIO()+721<-kfioRequestPriv()+351
              <-kfioRequest()+691<-ksfd_kfioRequest()+649<-ksfd_osmbio()+413
              <-ksfdbio()+1563<-kcrfw_do_write()+974
              <-kcrfw_redo_write()+2072<-ksbabs()+771
              <-ksbrdp()+1045<-opirip()+623<-opidrv()+603
              <-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265
              <-main()+201<-__libc_start_main()+253

              Here is roughly what a few of the key functions in this stack mean:

              skgfospo: operating system dependent kernel generic file operating system poll for IO completion

              skgfrwat: operating system dependent kernel generic file requests wait for IO to complete

              ksfdwtio: kernel service (VOS) functions disk IO wait for IO

    Judging from the call stack, the LGWR process really was in a polling code path for I/O completion... so it does look like this feature was the cause, and now we have solid evidence. The reader later adjusted the parameter _use_adaptive_log_file_sync, which I believe was the right call.
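    For completeness, the change is usually applied as shown below. Since _use_adaptive_log_file_sync is a hidden parameter, treat this purely as a sketch and confirm with Oracle Support before making the change in production:

        -- Disable adaptive log file sync so foreground sessions always use post/wait
        ALTER SYSTEM SET "_use_adaptive_log_file_sync" = FALSE SCOPE = BOTH SID = '*';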

    One last thing: if you are interested in data recovery, you can subscribe to the course below. It does not just cover approaches to recovering databases after failures; it also explains data file structure, block structure, redo and undo internals and so on, which really helps deepen understanding and, I think, is also useful for those moving to domestic databases. If you subscribe, remember to add me on WeChat (ideally with a note such as "subscribed xxx") and I will add you to a small group where I occasionally answer questions and share technical knowledge, covering not only Oracle but also OceanBase, Dameng (DM), openGauss and more.





