This case involves a RAC node crash caused by an excessive number of child cursors. What could child cursors possibly have to do with an instance going down? Through this concrete case we will see how to start from the symptoms, peel back the layers one by one, follow the clues, and gradually reconstruct the scene of the crime to find the real reason the instance crashed.
14.6.1 Gathering Information and Pinpointing the Problem
One evening we received a call from a customer: one node of a core system had gone down and the instance had restarted itself. The business was not affected, but an unexplained restart on a core system is a direct threat to operations, and the customer wanted our help finding the cause as quickly as possible.
Hearing the description, our first reaction was relief: only a single node had restarted, so the problem itself was probably not severe, and at least the impact on the business was limited.
What the customer urgently needed was a quick explanation of the cause so that appropriate countermeasures could be taken, which meant getting data from the site as soon as possible. After all, the closer the diagnostic data is to the time of the failure, the more accurate and useful the analysis.
The fault information obtained from the customer was as follows.
The instance that restarted was node 2 of the database; this is a core system and carries business traffic at night as well.
Instance restarts had actually happened before, just not very often.
The database version was 11.2.0.1.
Of course, the customer's description alone is not enough. We logged in to the customer's servers remotely and started further checks. The key clues we collected are below; the most important is node 2's alert log, part of which is shown here.
Fri Jun 26 20:24:52 2015
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_p001_13581.trc (incident=204565):
ORA-00600: internal error code, arguments: [kksfbc-wrong-kkscsflgs], [39913467504], [33], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204565/orcl2_p001_13581_i204565.trc
.....
Fri Jun 26 21:50:26 2015
LCK0 (ospid: 29987) waits for latch 'object queue header operation' for 77 secs.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc (incident=204141):
ORA-29771: process MMON (OSID 29967) blocks LCK0 (OSID 29987) for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204141/orcl2_lmhb_29939_i204141.trc
MMON (ospid: 29967) is blocking LCK0 (ospid: 29987) in a wait
LMHB (ospid: 29939) kills MMON (ospid: 29967). Please check LMHB trace file for more detail.
Fri Jun 26 21:51:06 2015
Restarting dead background process MMON
Fri Jun 26 21:51:06 2015
MMON started with pid=213, OS id=16612
Fri Jun 26 21:54:10 2015
LMS1 (ospid: 29929) waits for latch 'object queue header operation' for 81 secs.
LCK0 (ospid: 29987) has not called a wait for 85 secs.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc (incident=204142):
ORA-29770: global enqueue process LMS1 (OSID 29929) is hung for more than 70 seconds
Fri Jun 26 21:54:20 2015
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc (incident=204143):
ORA-29770: global enqueue process LCK0 (OSID 29987) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204142/orcl2_lmhb_29939_i204142.trc
ERROR: Some process(s) is not making progress.
LMHB (ospid: 29939) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 29939): terminating the instance due to error 29770
Fri Jun 26 21:54:21 2015
opiodr aborting process unknown ospid (26414) as a result of ORA-1092
Fri Jun 26 21:54:21 2015
ORA-1092 : opitsk aborting process
Fri Jun 26 21:54:21 2015
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_diag_29889.trc
Instance terminated by LMHB, pid = 29939
The alert log shows one very obvious ORA-600 error, along with several other ORA errors (see the entries above). We analyzed each of them in turn.
ORA-600 [kksfbc-wrong-kkscsflgs] (Doc ID 970885.1) is indeed a known bug, documented in detail on MOS, as shown in Table 14-1.
Table 14-1
Bug      Fixed                                          Description
9067282  11.2.0.1.2, 11.2.0.1.BP01, 11.2.0.3, 12.1.0.1  ORA-600 [kksfbc-wrong-kkscsflgs] can occur
9066130  10.2.0.5, 11.1.0.7.2, 11.2.0.2, 12.1.0.1       OERI [kksfbc-wrong-kkscsflgs] / spin with multiple children
8828328  11.2.0.1.1, 11.2.0.1.BP04, 11.2.0.2, 12.1.0.1  OERI [kksfbc-wrong-kkscsflgs]
8661168  11.2.0.1.1, 11.2.0.1.BP01, 11.2.0.2, 12.1.0.1  OERI [kksfbc-wrong-kkscsflgs] can occur
However, MOS does not say that this bug can bring an instance down, so this ORA-600 appears to have little to do with the restart. Fine; we log it as an open issue and move on.
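Since each bug in Table 14-1 is fixed in specific patch sets or PSUs, it is worth confirming the exact patch level before ruling them in or out. The following is a minimal sketch of that check; it only reads the patch history registered in the data dictionary, and opatch lsinventory on the server remains the authoritative source.

-- Patch and upgrade history recorded in the dictionary (sketch; opatch is authoritative).
SELECT action_time, action, version, comments
  FROM dba_registry_history
 ORDER BY action_time;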
At the time of the fault, the LMHB process detected that MMON had been blocking the LCK0 process for more than 70 seconds, so it killed MMON; once killed, MMON restarts automatically.
We can see that MMON finished restarting at 21:51:06 on June 26.
Next, at 21:54:10 on June 26, LMS1 reported that it had been waiting for the 'object queue header operation' latch for 81 seconds, while LCK0 had not called a wait for 85 seconds.
To lay the clues out more clearly, we assembled the following timeline from node 2's alert log.
Jun 26 20:24:52  ORA-00600 [kksfbc-wrong-kkscsflgs]
Jun 26 21:50:26  LCK0 (ospid: 29987) waits for latch 'object queue header operation' for 77 secs
                 MMON (OSID 29967) blocks LCK0 (OSID 29987) for more than 70 seconds
                 MMON (ospid: 29967) is blocking LCK0 (ospid: 29987) in a wait
                 LMHB (ospid: 29939) kills MMON (ospid: 29967)
Jun 26 21:51:06  MMON started with pid=213, OS id=16612
Jun 26 21:54:10  LMS1 (ospid: 29929) waits for latch 'object queue header operation' for 81 secs
                 LCK0 (ospid: 29987) has not called a wait for 85 secs
                 ORA-29770: global enqueue process LMS1 (OSID 29929) is hung for more than 70 seconds
Jun 26 21:54:20  ORA-29770: global enqueue process LCK0 (OSID 29987) is hung for more than 70 seconds
                 ERROR: Some process(s) is not making progress. LMHB (ospid: 29939) is terminating the instance
                 LMHB (ospid: 29939): terminating the instance due to error 29770
The final entries show that at 21:54:20 the LMHB process forcibly terminated the database instance, and the reason was that LMHB found LCK0 hung for more than 70 seconds. The earlier entries also show that LCK0 had already stopped doing anything by 21:54:10, and that LMS1 was stuck waiting for the latch at that point. Obviously, if LMS1 cannot do its work, Oracle will forcibly evict and restart the problem node to preserve cluster data consistency and avoid a split brain.
14.6.2 Digging Layer by Layer to Find the Culprit
In the fault logs, LMS1 sat waiting the whole time. So what was LMS1 waiting for, and why was LCK0 hung? Let's work through the more detailed information.
First, let's see exactly what LMS1 was doing.
We examine orcl2_lmhb_29939_i204142.trc, a trace file generated at 21:54:10 on June 26.
SO: 0x9a1175160, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x9a1175160, name=process, file=ksu.h LINE:11459, pg=0
(process) Oracle pid:14, ser:1, calls cur/top: 0x9b17e5330/0x9b17e0e60
          flags : (0x6) SYSTEM
          flags2: (0x100),  flags3: (0x0)
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
ksudlp FALSE at location: 0
  (post info) last post received: 0 0 116
              last post sent-location: ksl2.h LINE:2160 ID:kslges
              last process posted by me: 9811032c8 1 14
  (latch info) wait_event=0 bits=a
    Location from where call was made: kcbo.h LINE:890 ID:kcbo_unlink_q_bg:
      waiting for 993cfec60 Child object queue header operation level=5 child#=117
        Location from where latch is held: kcl2.h LINE:3966 ID:kclbufs:
        Context saved from call: 0
        state=busy(shared) [value=0x4000000000000001] wlstate=free [value=0]
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           14 (95, 1435326858, 4)
           21 (94, 1435326858, 7)
           waiter count=2
        gotten 73961423 times wait, failed first 4752 sleeps 1927
        gotten 33986 times nowait, failed: 4
        possible holder pid = 36 ospid=29987
      on wait list for 993cfec60
      holding    (efd=5) 9b59be480 Child gc element level=3 child#=20
        Location from where latch is held: kcl2.h LINE:3535 ID:kclbla:
        Context saved from call: 0
        state=busy(exclusive) [value=0x200000000000000e, holder orapid=14] wlstate=free [value=0]
      holding    (efd=5) 9b45cac50 Child cache buffers chains level=1 child#=61221
        Location from where latch is held: kcl2.h LINE:3140 ID:kclbla:
        Context saved from call: 0
        state=busy(exclusive) [value=0x200000000000000e, holder orapid=14] wlstate=free [value=0]
    Process Group: DEFAULT, pseudo proc: 0x9b11ca008
    O/S info: user: oracle, term: UNKNOWN, ospid: 29929
    OSD pid info: Unix process pid: 29929, image: oracle@ebtadbsvr2 (LMS1)
From LMS1's process state we can see that the resource it is waiting for (the 'object queue header operation' latch) is held by ospid=29987. So what is 29987?
To identify ospid=29987, we keep reading the trace.
SO: 0x9911283b0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x9911283b0, name=process, file=ksu.h LINE:11459, pg=0
(process) Oracle pid:36, ser:2, calls cur/top: 0x9b17e58e0/0x9b17e58e0
          flags : (0x6) SYSTEM
          flags2: (0x0),  flags3: (0x0)
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
ksudlp FALSE at location: 0
  (post info) last post received: 0 0 35
              last post received-location: ksr2.h LINE:603 ID:ksrpublish
              last process to post me: 981110608 118 0
              last post sent: 0 0 36
              last post sent-location: ksr2.h LINE:607 ID:ksrmdone
              last process posted by me: 9911283b0 2 6
  (latch info) wait_event=0 bits=20
    holding    (efd=3) 993cfec60 Child object queue header operation level=5 child#=117
        Location from where latch is held: kcl2.h LINE:3966 ID:kclbufs:
        Context saved from call: 0
        state=busy(shared) [value=0x4000000000000001] wlstate=free [value=0]
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           14 (95, 1435326858, 4)
           21 (94, 1435326858, 7)
           waiter count=2
    Process Group: DEFAULT, pseudo proc: 0x9b11ca008
    O/S info: user: oracle, term: UNKNOWN, ospid: 29987
    OSD pid info: Unix process pid: 29987, image: oracle@ebtadbsvr2 (LCK0)
The last line tells us plainly that 29987 is the LCK0 process. That means the resource LMS1 is waiting on is held by LCK0, and a second process, orapid=21, is waiting for it as well. Further analysis showed that orapid=21 is DBW2, a database writer process.
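Had the trace not named the processes, the mapping from OS pid to background process could also have been confirmed from the dynamic views. The query below is a minimal sketch; the spid values are simply the ones seen in this trace.

-- Map OS pids to their Oracle background process names and sessions.
-- The spid values are the ones from this case; substitute as needed.
SELECT p.pid, p.spid, p.pname, s.sid, s.program
  FROM v$process p
  LEFT JOIN v$session s ON s.paddr = p.addr
 WHERE p.spid IN ('29987', '29929');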
The database alert log confirms that at 21:54:10 on June 26 LCK0 had not responded for 85 seconds:
LCK0 (ospid: 29987) has not called a wait for 85 secs
Counting back from that timestamp, LCK0 must have stopped responding at roughly 21:52:45 on June 26. Clearly, if we can work out why LCK0 hung, we will know the cause of this outage.
So let's look at LCK0's trace file and see whether it offers any clues. The LCK0 trace contains the following.
*** 2015-06-26 21:50:29.329
Process diagnostic dump for oracle@ebtadbsvr2 (LCK0), OS id=29987, pid: 36, proc_ser: 2, sid: 1729, sess_ser: 3
-----------------------------------------------
loadavg : 6.47 26.96 34.97
Memory (Avail / Total) = 10598.05M / 64421.55M
Swap (Avail / Total) = 20256.00M / 20479.99M
F S UID      PID   PPID  C PRI NI ADDR       SZ       WCHAN  STIME TTY      TIME     CMD
0 S oracle 29987      1  0  76  0 - 10541946 semtim   Apr05  ?     01:25:21 ora_lck0_orcl2
Short stack dump:
ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semop()+7<-skgpwwait()+156<-kslgess()+1799<-ksl_get_shared_latch()+620<-kclbufs()+272<-kclanticheck()+412<-kclahrt()+88<-ksbcti()+212<-ksbabs()+1732<-kclabs()+186<-ksbrdp()+923<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36

*** 2015-06-26 21:54:18.438
Process diagnostic dump for oracle@ebtadbsvr2 (LCK0), OS id=29987, pid: 36, proc_ser: 2, sid: 1729, sess_ser: 3
-------------------------------------------------------------------------------
loadavg : 2.04 13.34 27.63
Memory (Avail / Total) = 9519.06M / 64421.55M
Swap (Avail / Total) = 20256.00M / 20479.99M
F S UID      PID   PPID  C PRI NI ADDR       SZ       WCHAN  STIME TTY      TIME     CMD
0 R oracle 29987      1  0  85  0 - 10541965 ?        Apr05  ?     01:26:55 ora_lck0_orcl2
Short stack dump:
ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-kcbo_get_next_qheader()+255<-kclbufs()+321<-kcldirtycheck()+231<-kclahrt()+93<-ksbcti()+212<-ksbabs()+1732<-kclabs()+186<-ksbrdp()+923<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36
Looking at these snapshots of LCK0: at the first timestamp, 21:50:29, the wchan is semtim. wchan shows what a sleeping process is waiting on, and semtim means that at that moment LCK0 was sleeping on a semaphore operation. While a process sleeps like this, the resources it holds are not released.
At the second timestamp, 21:54:18, LCK0's wchan is '?', which means unknown (an empty wchan would indicate the process is running). So from 21:50 through 21:54 LCK0 never recovered. What, then, was blocking LCK0; in other words, who was holding the resource it needed?
At the same time we can see that plenty of memory and swap were free, and the CPUs were not particularly busy.
Continuing through the trace, we find that at the 21:50 timestamp LCK0 was blocked by the MMON process.
SO: 0x9d10f97c0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x9d10f97c0, name=process, file=ksu.h LINE:11459, pg=0
(process) Oracle pid:31, ser:1, calls cur/top: 0x965657408/0x9b17e3f18
          flags : (0x2) SYSTEM
          flags2: (0x20),  flags3: (0x0)
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
ksudlp FALSE at location: 0
  (post info) last post received: 0 0 35
              last post received-location: ksr2.h LINE:603 ID:ksrpublish
              last process to post me: 9911283b0 2 6
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: 9911283b0 2 6
  (latch info) wait_event=0 bits=26
    holding    (efd=7) 993cfec60 Child object queue header operation level=5 child#=117
        Location from where latch is held: kcbo.h LINE:884 ID:kcbo_link_q:
        Context saved from call: 0
        state=busy(exclusive) [value=0x200000000000001f, holder orapid=31] wlstate=free [value=0]
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           36 (82, 1435326627, 1)
           21 (81, 1435326627, 1)
           waiter count=2
    holding    (efd=7) 9b5a5d630 Child cache buffers lru chain level=2 child#=233
        Location from where latch is held: kcb2.h LINE:3601 ID:kcbzgws:
        Context saved from call: 0
        state=busy [holder orapid=31] wlstate=free [value=0]
    holding    (efd=7) 9c2a99938 Child cache buffers chains level=1 child#=27857
        Location from where latch is held: kcb2.h LINE:3214 ID:kcbgtcr: fast path (cr pin):
        Context saved from call: 12583184
        state=busy(exclusive) [value=0x200000000000001f, holder orapid=31] wlstate=free [value=0]
    Process Group: DEFAULT, pseudo proc: 0x9b11ca008
    O/S info: user: oracle, term: UNKNOWN, ospid: 29967
    OSD pid info: Unix process pid: 29967, image: oracle@ebtadbsvr2 (MMON)
The trace above shows that 993cfec60, the resource everyone has been waiting for (a child 'object queue header operation' latch), is held by the MMON process.
At 21:50:29 LCK0 was waiting for MMON to release the resource, and MMON was already misbehaving, so at 21:51:06 LMHB killed and restarted it. Yet even after the restart, the fallout from MMON's anomaly meant that at 21:54:18 LCK0 still could not acquire the resource normally, and at 21:54:20 LMHB forcibly restarted the entire instance.
So the ultimate culprit is the MMON process.
Fri Jun 26 21:50:26 2015
LCK0 (ospid: 29987) waits for latch 'object queue header operation' for 77 secs.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc (incident=204141):
ORA-29771: process MMON (OSID 29967) blocks LCK0 (OSID 29987) for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204141/orcl2_lmhb_29939_i204141.trc
MMON (ospid: 29967) is blocking LCK0 (ospid: 29987) in a wait
LMHB (ospid: 29939) kills MMON (ospid: 29967). Please check LMHB trace file for more detail.
Fri Jun 26 21:51:06 2015
Restarting dead background process MMON
Fri Jun 26 21:51:06 2015
MMON started with pid=213, OS id=16612
Oracle's MMON process is responsible for collecting AWR information. Since the incident is tied to it, the next step is naturally to examine its traces.
MMON's traces show the process working through object statistics, and they also show many cursors carrying far too many child cursors.
User=b1456358 Session=c146d760 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=558d5760
LibraryHandle:  Address=2f79eb08 Hash=3dec6f4a LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
  ObjectName:  Name=select time_mp, scn, num_mappings, tim_scn_map from smon_scn_time
               where scn = (select max(scn) from smon_scn_time where scn <= :1)
    FullHashValue=c36d5a579fdc3e19733192893dec6f4a Namespace=SQL AREA(00) Type=CURSOR(00)
    Identifier=1038905162 OwnerIdn=0
  Statistics:  InvalidationCount=0 ExecutionCount=23741 LoadCount=107 ActiveLocks=1
               TotalLockCount=6093 TotalPinCount=1
  Counters:  BrokenCount=1 RevocablePointer=1 KeepDependency=106 KeepHandle=106
             BucketInUse=6092 HandleInUse=6092
  Concurrency:  DependencyMutex=2f79ebb8(0, 0, 0, 0) Mutex=2f79ec30(0, 87578, 0, 0)
  Flags=RON/PIN/TIM/PN0/DBN/[10012841]
  WaitersLists:
    Lock=2f79eb98[2f79eb98,2f79eb98]
    Pin=2f79eba8[2f79eb78,2f79eb78]
  Timestamp:  Current=04-05-2015 09:48:42
  LibraryObject:  Address=dff3bc60 HeapMask=0000-0001-0001 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000]
    ChildTable:  size='112'
      Child:  id='0' Table=dff3cb60 Reference=dff3c5b0 Handle=2f79e908
      Child:  id='1' Table=dff3cb60 Reference=dff3c8e0 Handle=2f4e2d90
      Child:  id='2' Table=dff3cb60 Reference=df3e7400 Handle=2f8c9e90
      Child:  id='3' Table=dff3cb60 Reference=df3e76e8 Handle=2f8abce8
      ......
      ......
      Child:  id='101' Table=dc86f748 Reference=df02d368 Handle=288e6460
      Child:  id='102' Table=dc86f748 Reference=dd65c3e0 Handle=274d0b40
      Child:  id='103' Table=dc86f748 Reference=dd65c6c8 Handle=29aa92f8
      Child:  id='104' Table=dc86f748 Reference=dd65c9b8 Handle=26f3a460
      Child:  id='105' Table=dc86f748 Reference=dd65ccd0 Handle=25c02dd8
  NamespaceDump:
    Parent Cursor:  SQL_id=76cckj4yysvua parent=0x8dff3bd48 maxchild=106 plk=y ppn=n
      Current Cursor Sharing Diagnostics Nodes:
      ......
      ......
        Child Node: 100  ID=34 reason=Rolling Invalidate Window Exceeded(2) size=0x0
          already processed:
        Child Node: 101  ID=34 reason=Rolling Invalidate Window Exceeded(2) size=0x0
          already processed:
There is a great deal of output like this. Clearly this parent cursor carries a huge number of child cursors, which is why the AWR report for 20:00 to 21:00 (the period before node 2 restarted) is full of 'cursor: mutex S' waits. Table 14-2 lists the wait events for that period.
Table 14-2

Event                      Waits       Time(s)  Avg wait (ms)  % DB time  Wait Class
DB CPU                                 47,072                  93.05
cursor: mutex S            31,751,317  18,253   1              36.08      Concurrency
gc cr multi block request  359,897     1,281    4              2.53       Cluster
gc buffer busy acquire     10,465      686      66             1.36       Cluster
library cache lock         9,285       550      59             1.09       Concurrency
When MMON collects system information through its internal SQL, this situation simply should not arise; yet here MMON went wrong, and its anomaly looks tied to the excessive number of child cursors.
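A quick way to confirm this kind of high-version-count problem is to query the shared pool directly. The sketch below uses an arbitrary threshold of 100 child cursors; for a suspect statement such as the one seen in the MMON trace, v$sql_shared_cursor then explains why each child could not be shared.

-- Parent cursors with an unusually high number of children (threshold is arbitrary).
SELECT sql_id, version_count, loaded_versions, executions
  FROM v$sqlarea
 WHERE version_count > 100
 ORDER BY version_count DESC;

-- Reasons the children of one suspect cursor were not shared
-- (sql_id taken from the MMON trace above).
SELECT *
  FROM v$sql_shared_cursor
 WHERE sql_id = '76cckj4yysvua';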
The instance was ultimately terminated because the LCK0 anomaly led LMHB to forcibly terminate it. In the diag dump taken just before the instance went down, LCK0 was still hung and was blocking three other sessions, as shown below.
SO: 0x9914dbce8, type: 4, owner: 0x9911283b0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x9911283b0, name=session, file=ksu.h LINE:11467, pg=0
(session) sid: 1729 ser: 3 trans: (nil), creator: 0x9911283b0
          flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
          flags2: (0x408) -/-
          DID: , short-term DID:
          txn branch: (nil)
          oct: 0, prv: 0, SQL: (nil), pSQL: (nil), user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
 Not in wait; last wait ended 1 min 39 sec ago
There are 3 sessions blocked by this session.
Dumping one waiter:
  inst: 2, sid: 1009, ser: 1
  wait event: 'latch: object queue header operation'
    p1: 'address'=0x993cfec60
    p2: 'number'=0xa2
    p3: 'tries'=0x0
  row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
  min_blocked_time: 81 secs, waiter_cache_ver: 14285
Wait State:
  fixed_waits=0 flags=0x20 boundary=(nil)/-1
For a database process, a state of busy rather than dead while holding a latch it never releases means the process is effectively hung, and the latch LCK0 holds is 'object queue header operation'. The MOS documentation describes this latch as follows: scans of the object queue in the buffer cache can hold the "object queue header operation" latch.
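When a hang like this is caught while it is still happening (rather than reconstructed from dumps afterwards), the holder and the waiters of the latch can also be checked directly from the dynamic views. A minimal sketch:

-- Who currently holds the latch (run on the affected instance during the hang).
SELECT pid, sid, laddr, name
  FROM v$latchholder
 WHERE name = 'object queue header operation';

-- Sessions currently waiting on it.
SELECT sid, event, p1text, p1, seconds_in_wait
  FROM v$session_wait
 WHERE event = 'latch: object queue header operation';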
14.6.3 Applying the Right Remedy and Clearing the Fault
Based on the analysis above, we concluded that LCK0's trouble was related to cursors carrying too many child cursors, which in turn comes from the child-cursor count limit in 11.2.0.1 being far too loose. The cursor obsolescence feature was actually introduced back in 10g: the 10g limit on child cursors per parent was 1024, and children beyond that were marked obsolete. That threshold was removed in the early 11g releases, which allows a single parent cursor to accumulate a huge number of children (a high version count) and caused a whole series of cursor sharing performance problems in releases before 11.2.0.3. In practice this means that databases on 11.2.0.1 and 11.2.0.2 are liable to show massive 'cursor: mutex S' and 'library cache lock' waits, symptoms that can then trigger further failures.
We therefore normally recommend that any database below 11.2.0.4 be upgraded to 11.2.0.4 as soon as possible (11.2.0.3 already ships with _cursor_obsolete_threshold, defaulting to 100), or that _cursor_features_enabled together with event 106001 be used to forcibly rein in excessive child cursors.
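On 11.2.0.3 and later, the threshold that is actually in effect can be confirmed with a query along these lines (a sketch; the x$ tables require a SYS connection):

-- Check the hidden cursor-obsolescence settings (connect as SYS).
SELECT i.ksppinm  AS parameter,
       v.ksppstvl AS current_value
  FROM x$ksppi  i,
       x$ksppcv v
 WHERE i.indx = v.indx
   AND i.ksppinm IN ('_cursor_obsolete_threshold', '_cursor_features_enabled');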
14.6.4 A Thorough Review: Lasting Lessons from One Outage
The analysis of this case had its twists and turns. The most common causes of a single-node failure are things like heartbeat or node resource problems, whereas the trigger here was rather more obscure: first an ORA-600 error appeared, then messages about MMON being killed, all of which made the investigation look murky for a while. In the end, of course, we tracked down the real root of the problem.
This case leaves us with the following lessons.
The choice of database version is absolutely a key factor in system stability.
Do not assume that performance problems merely hurt the user experience; some of them can set off a whole chain of system failures.
Do not analyze problems on assumptions alone; work through the traces step by step and combine internals knowledge with experience to identify the cause accurately.
Never take an excess of child cursors lightly; ideally, find the root cause and fix it. If that is genuinely impractical, you can set the hidden parameter _cursor_features_enabled and event 106001 to forcibly invalidate child cursors and keep their numbers down, as shown below.
SQL> alter system set "_cursor_features_enabled"=300 scope=spfile;
SQL> alter system set event='106001 trace name context forever,level 1024' scope=spfile;
Then a normal restart of the database completes the change.
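After the restart, a quick way to confirm that both settings were picked up from the spfile is a check along these lines (a sketch):

-- Verify the spfile settings after the restart.
SELECT name, value
  FROM v$spparameter
 WHERE name IN ('_cursor_features_enabled', 'event');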