Broadly speaking, database problems fall into two categories. In the first, the database genuinely malfunctions: it cannot run normally and may even disrupt the business. In the second, the problem is latent, that is, a performance problem. Analyzing these two categories involves both similarities and differences, which we will illustrate with a concrete case.
Case Symptoms and Overview
A customer's core production database was a 2-node Oracle RAC database, version 10.2.0.4, running on IBM AIX, with more than 10 TB of data. Over a recent period this database repeatedly hung, severely disrupting production. Within roughly a month the failure occurred almost daily, sometimes even outside busy business hours. This article analyzes the root cause of these failures and presents the corresponding solution, in the hope of offering readers some ideas for fault diagnosis.
The following is the fault log for the 8:00 to 9:00 a.m. window on January 7, 2014 (to protect the customer's information, the actual database name has been replaced with orcl). This is the alert log of instance 1.
Tue Jan 7 07:51:08 2014
ALTER SYSTEM ARCHIVE LOG
Tue Jan 7 07:51:08 2014
Thread 1 cannot allocate new log, sequence 75616
Checkpoint not complete
  Current log# 1 seq# 75615 mem# 0: /dev/rlv9
Tue Jan 7 08:15:52 2014
PMON failed to delete process, see PMON trace file
Tue Jan 7 08:23:34 2014
WARNING: inbound connection timed out (ORA-3136)
Tue Jan 7 08:24:53 2014
Errors in file /u01/app/oracle10g/admin/orcldb/udump/orcldb1_ora_10503176.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01013: user requested cancel of current operation
ORA-06512: at line 2
...portions omitted...
Tue Jan 7 08:52:22 2014
Shutting down instance (abort)
License high water mark = 3510
Instance terminated by USER, pid = 5731764
Tue Jan 7 09:01:45 2014
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
Around 7:51 a.m. the alert log reported "Checkpoint not complete" and the log could not be switched; around 8:23 it reported that sessions could not log in; and at 8:52 the instance was forcibly shut down (abort) and then restarted.
Before the database was shut down, a system state dump was taken. The trace file orcldb1_ora_5727520.trc contains the following important information.
*** 2014-01-07 08:37:45.920
SYSTEM STATE
------------PROCESS 207:--------------
SO: 7000028079afb58, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=207, calls cur/top: 70000275222f680/70000275222f680, flag: (0) -
          int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 128
            last post received-location: kclrget
            last process to post me: 70000280f9952b0 1 6
            last post sent: 0 0 123
            last post sent-location: kcrfw_redo_gen: wake LGWR after redo copy
            last process posted by me: 7000028079a3e58 1 6
(latch info) wait_event=0 bits=2
    holding    (efd=10) 7000027ce79ac58 Child cache buffers chains level=1 child#=40228
        Location from where latch is held: kcbgtcr: kslbegin excl:
        Context saved from call: 2160934423
        state=busy(exclusive) (val=0x20000000000000cf) holder orapid = 207
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         40 (3851, 1389055081, 1)
        waiter count=1
Process Group: DEFAULT, pseudo proc: 70000280fb3a820
O/S info: user: oracle, term: pts/2, ospid: 5297410
OSD pid info: Unix process pid: 5297410, image: oracle@kjorcldb1 (TNS V1-V3)
...portions omitted...
  SO: 700002802e78798, type: 4, owner: 7000028079afb58, flag: INIT/-/-/0x00
  (session) sid: 4339 trans: 700002775cc8cc8, creator: 7000028079afb58, flag: (41) USR/- BSY/-/-/-/-/-
            DID: 0001-00CF-00005C72, short-term DID: 0001-00CF-00005C73
            txn branch: 0
            oct: 6, prv: 0, SQL: 7000026e943b618, pSQL: 70000268a5dd9d0, user: 49/ORCLADM
  service name: SYS$USERS
  O/S info: user: oracle, term: pts/2, ospid: 3114266, machine: kjorcldb1
            program: SQLplus@kjorcldb1 (TNS V1-V3)
  application name: SQL*Plus, hash value=3669949024
  last wait for 'db file sequential read' blocking sess=0x0 seq=91 wait_time=26947
              seconds since wait started=4456
              file#=d5, block#=5b789, blocks=1
  Dumping Session Wait History
   for 'db file sequential read' count=1 wait_time=26947
              file#=d5, block#=5b789, blocks=1
   for 'db file sequential read' count=1 wait_time=426
...portions omitted...
    LIBRARY OBJECT LOCK: lock=700002752e1b380 handle=7000026e943b618 mode=N
    call pin=0 session pin=0 hpc=0000 hlc=0000
    htl=700002752e1b400[700002687a44a90,70000272d437c30] htb=700002687a44a90 ssga=700002687a44538
    user=700002802e78798 session=700002802e78798 count=1 flags=[0000] savepoint=0x52cad289
    LIBRARY OBJECT HANDLE: handle=7000026e943b618 mtx=7000026e943b748(1) cdp=1
    name=
UPDATE jl_jlzzda_13 a SET zcbh = (SELECT zcbh FROM pc_dnb_jg_13 b WHERE a.zzbh = b.zzbh AND a.zcbh <> b.zcbh AND ( TRIM (b.zcbh) = :"SYS_B_0" OR TRIM (b.zcbh) = :"SYS_B_1" OR TRIM (b.zcbh) = :"SYS_B_2")) WHERE a.zcbh <> :"SYS_B_3" AND a.zcbh <> :"SYS_B_4" AND EXISTS (SELECT :"SYS_B_5" FROM pc_dnb_jg_13 b WHERE a.zzbh = b.zzbh AND a.zcbh <> b.zcbh AND ( TRIM (b.zcbh) = :"SYS_B_6" OR TRIM (b.zcbh) = :"SYS_B_7" OR TRIM (b.zcbh) = :"SYS_B_8"))
Detailed Fault Analysis
The system state dump reveals the following points.
(1) The Oracle process (orapid=207, ospid=5297410, session id=4339) holds a cache buffers chains child latch, child#=40228, at address 7000027ce79ac58.
(2) This process is blocking another process (orapid=40), and has been blocking it for 3851 seconds.
(3) The process's last wait event was a single-block disk read ('db file sequential read'), which started 4456 seconds earlier. The system state dump began at 08:37:45 on January 7, 2014, so 4456 seconds before that, i.e. the time the latch was acquired, falls between 07:23:29 and 07:28:29 on January 7, 2014 (because the system state dump itself takes some time to produce, there is a margin of error of up to 5 minutes; see the short calculation after this list).
(4) This process was created by a SQL*Plus connection made locally on the database host, and it was executing the following SQL statement.
UPDATE jl_jlzzda_13 a
   SET zcbh = (SELECT zcbh
                 FROM pc_dnb_jg_13 b
                WHERE a.zzbh = b.zzbh
                  AND a.zcbh <> b.zcbh
                  AND (TRIM (b.zcbh) = :"SYS_B_0" OR TRIM (b.zcbh) = :"SYS_B_1" OR TRIM (b.zcbh) = :"SYS_B_2"))
 WHERE a.zcbh <> :"SYS_B_3"
   AND a.zcbh <> :"SYS_B_4"
   AND EXISTS (SELECT :"SYS_B_5"
                 FROM pc_dnb_jg_13 b
                WHERE a.zzbh = b.zzbh
                  AND a.zcbh <> b.zcbh
                  AND (TRIM (b.zcbh) = :"SYS_B_6" OR TRIM (b.zcbh) = :"SYS_B_7" OR TRIM (b.zcbh) = :"SYS_B_8"))
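As a side note, the start-of-hold time in point (3) can be reproduced directly in SQL by subtracting the elapsed wait from the dump timestamp; a minimal sketch (values taken from the dump above):

-- 4456 seconds before the dump timestamp 08:37:45
SELECT TO_DATE('2014-01-07 08:37:45', 'yyyy-mm-dd hh24:mi:ss') - 4456/86400
       AS approx_latch_hold_start
  FROM dual;
-- => 2014-01-07 07:23:29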
There is an obvious anomaly here: the process has held the latch for far too long. Under normal circumstances a latch is held for microseconds, yet this process has held it for thousands of seconds. This is usually caused by the process itself being abnormal or hung.
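When an instance is still responsive, a long latch hold like this can also be spotted with an ordinary query against the v$ views. The following is only a sketch using the standard v$session, v$latch_holder and v$latch_children views, not part of the original diagnosis:

-- Sessions waiting on a cache buffers chains child latch and its current holder
SELECT w.sid   AS waiter_sid,
       w.p1raw AS latch_addr,
       h.sid   AS holder_sid,
       c.child#, c.gets, c.misses, c.sleeps
  FROM v$session        w
  JOIN v$latch_holder   h ON h.laddr = w.p1raw
  JOIN v$latch_children c ON c.addr  = w.p1raw
 WHERE w.event = 'latch: cache buffers chains';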
Next, let us look at the process being blocked by it (orapid=40).
------------PROCESS 40: ----------------------------------------
SO: 7000028069d2bf8, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=40, calls cur/top: 70000280012d810/70000280012d810, flag: (6) SYSTEM
          int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 114
            last post received-location: kcbbza
            last process to post me: 7000028039b5570 1 6
            last post sent: 504403329233300568 122 2
            last post sent-location: kslges
            last process posted by me: 7000028039b35f0 1 14
(latch info) wait_event=0 bits=0
    Location from where call was made: kcbkzs:
  waiting for 7000027ce79ac58 Child cache buffers chains level=1 child#=40228
        Location from where latch is held: kcbgtcr: kslbegin excl:
        Context saved from call: 2160934423
        state=busy(exclusive) (val=0x20000000000000cf) holder orapid = 207
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         40 (3837, 1389055067, 2)
        waiter count=1
        gotten 7812722 times wait, failed first 65133 sleeps 380
        gotten 36115 times nowait, failed: 14223
        possible holder pid = 207 ospid=5297410
      on wait list for 7000027ce79ac58
Process Group: DEFAULT, pseudo proc: 70000280fb3a820
O/S info: user: oracle, term: UNKNOWN, ospid: 2508016
OSD pid info: Unix process pid: 2508016, image: oracle@kjorcldb1 (CKPT)
  SO: 70000280112aa90, type: 4, owner: 7000028069d2bf8, flag: INIT/-/-/0x00
  (session) sid: 6567 trans: 0, creator: 7000028069d2bf8, flag: (51) USR/- BSY/-/-/-/-/-
            DID: 0001-0028-00000004, short-term DID: 0001-0028-00000005
            txn branch: 0
            oct: 0, prv: 0, SQL: 0, pSQL: 0, user: 0/SYS
  service name: SYS$BACKGROUND
  waiting for 'latch: cache buffers chains' blocking sess=0x0 seq=26223 wait_time=0 seconds since wait started=3835
              address=7000027ce79ac58, number=7a, tries=31f0
  Dumping Session Wait History
   for 'latch: cache buffers chains' count=1 wait_time=292977
From the information above, the following points can be confirmed:
(1) The blocked process is CKPT, the checkpoint process, which is a critical background process.
(2) CKPT is blocked waiting for the cache buffers chains latch, and has been blocked for 3837 seconds, i.e. since roughly 07:33:48 on January 7, 2014 (again, because the system state dump takes time, there may be an error of up to 5 minutes).
Clearly, with CKPT blocked for such a long time, checkpoints could not complete.
Tue Jan 7 07:18:53 2014
Thread 1 advanced to log sequence 75613 (LGWR switch)
  Current log# 3 seq# 75613 mem# 0: /dev/rlv11
Tue Jan 7 07:31:01 2014
Thread 1 advanced to log sequence 75614 (LGWR switch)
  Current log# 2 seq# 75614 mem# 0: /dev/rlv10
Tue Jan 7 07:43:04 2014
Thread 1 advanced to log sequence 75615 (LGWR switch)
  Current log# 1 seq# 75615 mem# 0: /dev/rlv9
Tue Jan 7 07:51:08 2014
ALTER SYSTEM ARCHIVE LOG
Tue Jan 7 07:51:08 2014
Thread 1 cannot allocate new log, sequence 75616
Checkpoint not complete
  Current log# 1 seq# 75615 mem# 0: /dev/rlv9
Each instance has only three online redo log groups. After the three switches shown above, every group was waiting on a checkpoint that could not complete, so by 07:51 no further switch was possible. Being unable to switch the log means no data changes can be made at all. Moreover, the database has a logon trigger that records logon information, which requires inserting data; under these circumstances user logons could not succeed either, and they simply hung (or hung until they timed out).
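The redo log configuration that makes this scenario so unforgiving can be checked with a simple query; a sketch using the standard v$log view (not output from the case system):

-- How many online log groups does each thread have, and can they be reused?
SELECT thread#, group#, sequence#, status, archived
  FROM v$log
 ORDER BY thread#, group#;
-- With only 3 groups per thread, a single incomplete checkpoint quickly
-- leaves no group in a reusable (INACTIVE) state.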
Because CKPT was hung, SMON, another critical background process, also remained hung.
-------------------------------------PROCESS 41:-------------------------------------
SO: 7000028059c6be8, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=41, calls cur/top: 70000280012dae8/70000280012dae8, flag: (16) SYSTEM
          int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 154
            last post received-location: ktmpsm
            last process to post me: 7000028039b35f0 1 14
            last post sent: 0 0 90
            last post sent-location: KJCS Post snd proxy to flush msg
            last process posted by me: 7000028059c4488 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000280fb3a820
O/S info: user: oracle, term: UNKNOWN, ospid: 103858
OSD pid info: Unix process pid: 103858, image: oracle@kjorcldb1 (SMON)
  (session) sid: 6566 trans: 7000027734e5c60, creator: 7000028059c6be8, flag: (100051) USR/- BSY/-/-/-/-/-
            DID: 0001-0029-00000002, short-term DID: 0001-0029-00000003
            txn branch: 0
            oct: 0, prv: 0, SQL: 0, pSQL: 0, user: 0/SYS
  service name: SYS$BACKGROUND
  waiting for 'DFS lock handle' blocking sess=0x0 seq=59072 wait_time=0 seconds since wait started=3835
              type|mode=43490005, id1=1, id2=2
  Dumping Session Wait History
   for 'DFS lock handle' count=1 wait_time=488290
              type|mode=43490005, id1=1, id2=2
   for 'DFS lock handle' count=1 wait_time=488291
              type|mode=43490005, id1=1, id2=2
...portions omitted...
----------------------------------------
SO: 7000027fd890990, type: 18, owner: 7000028003a2c80, flag: INIT/-/-/0x00
----------enqueue 0x7000027fd890990------------------------
lock version : 3392757
Owner node : 0
grant_level : KJUSERNL
req_level : KJUSEREX
bast_level : KJUSERNL
notify_func : 0
resp : 7000025f95f7e50
procp : 7000028003a2c80
pid : 103858
proc version : 0
oprocp : 0
opid : 0
group lock owner : 0
xid : 0000-0000-00000000
dd_time : 0.0 secs
dd_count : 0
timeout : 0.0 secs
On_timer_q? : N
On_dd_q? : N
lock_state : OPENING CONVERTING
Open Options : KJUSERPROCESS_OWNED
Convert options : KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
History : 0x49a5149a
Msg_Seq : 0x0
res_seq : 4090
valblk : 0x00000000000000000000000000000000 .
----------resource 0x7000025f95f7e50----------------------
resname : [0x1][0x2],[CI]
Local node : 0
dir_node : 1
master_node : 1
hv idx : 98
hv last r.inc : 4
current inc : 4
hv status : 0
hv master : 1
open options :
Held mode : KJUSEREX
Cvt mode : KJUSERNL
Next Cvt mode : KJUSERNL
msg_seq : f0010
res_seq : 4090
grant_bits : KJUSERNL KJUSERPR
grant mode : KJUSERNL KJUSERCR KJUSERCW KJUSERPR KJUSERPW KJUSEREX
count : 1 0 0 1 0 0
val_state : KJUSERVS_NOVALUE
valblk : 0x00000000000000000000000000000000 .
access_node : 0
vbreq_state : 0
state : x8
resp : 7000025f95f7e50
On Scan_q? : N
Total accesses: 8014573
Imm. accesses: 7891534
Granted_locks : 1
Cvting_locks : 1
value_block: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
GRANTED_Q :
lp 7000027fa0ce178 gl KJUSERPR rp 7000025f95f7e50 [0x1][0x2],[CI]
   master 1 pid 2508016 bast 0 rseq 4090 mseq 0 history 0x95614956
   open opt KJUSERPROCESS_OWNED
CONVERT_Q:
lp 7000027fd890990 gl KJUSERNL rl KJUSEREX rp 7000025f95f7e50 [0x1][0x2],[CI]
   master 1 pid 103858 bast 0 rseq 4090 mseq 0 history 0x49a5149a
   convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
----------------------------------------
From the information above we can see that process 2508016 sits in the GRANTED_Q of the resource. That process is CKPT, i.e. CKPT holds the resource in protected read (KJUSERPR) mode, while SMON needs to acquire it in exclusive (KJUSEREX) mode. These two modes are incompatible, so SMON is blocked.
At this point we can conclude that the abnormal process (orapid=207) held a cache buffers chains child latch for a long time, which kept CKPT blocked, which in turn prevented checkpoints from completing, and ultimately made it impossible to modify any data or to connect to the database.
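For reference, the system state dump and hang analysis used in this article are normally collected from a SYSDBA session with oradebug. The commands below are only a sketch of the usual procedure (it assumes a preliminary connection is still possible while normal logons hang), not a transcript from this case:

-- sqlplus -prelim / as sysdba     (preliminary connection, skips normal session setup)
oradebug setmypid
oradebug unlimit
oradebug dump systemstate 266      -- 266 = 10 + 256, also captures short stacks
oradebug -g all hanganalyze 3      -- cluster-wide hang analysis on RAC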
The usual reasons why a process becomes abnormal like this and brings down the system are:
(1) An operating system bug that prevents the process from being scheduled. In the trace file we can see that procstack failed to obtain the process call stack, which indicates the process is abnormal at the operating system level rather than spinning or stuck inside Oracle.
(2) An Oracle bug. Searching MOS, we found one bug that might be relevant: Bug 6859515 "Diagnostic collection may hang or crash the instance".
How do we determine that a bug actually matches the failure? We have to search through a large amount of log and trace data, typically the DIAG process trace and the system state dump. In this case, the DIAG process trace file shows symptoms that match Bug 6859515 quite well.
*** 2014-01-07 07:24:56.655
Dump requested by process [orapid=46]
REQUEST:custom dump [2] with parameters [46][207][2][2]
Dumping process info of pid[207.5297410] requested by pid[46.1008052]
Dumping process 207.5297410 info:
*** 2014-01-07 07:24:56.655
Dumping diagnostic information for ospid 5297410:
OS pid = 5297410
loadavg : 4.78 4.81 4.52
swap info: free_mem = 77226.37M rsv = 228.00M
           alloc = 636.91M avail = 58368.00M swap_free = 57731.09M
      F S      UID     PID  PPID   C PRI NI ADDR       SZ     WCHAN  STIME  TTY   TIME CMD
 240001 A   oracle 5297410     1 120 120 20 3acdbd4590 126808      23:46:25    - 27:22 oracleorcldb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
open: The file access permissions do not allow the specified action.
procstack: write(/proc/5297410/ctl): The requested resource is busy.
5297410: oracleorcldb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
0x0000000000000000  ????????() + ??
From the data above we can see that the process with ospid=5297410 is exactly the process abnormally holding the latch. A dump of this process was requested at 07:24:56 on January 7, 2014, which falls inside the window in which the process became abnormal (between 07:23:29 and 07:28:29 on January 7, 2014). The diagnostic dump was performed by the DIAG process at the request of the process with orapid=46, ospid=1008052.
The line "procstack: write(/proc/5297410/ctl): The requested resource is busy." indicates that the process has become abnormal.
PROCESS 46:
----------------------------------------
SO: 7000028079a4638, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=46, calls cur/top: 70000275b01c398/70000275b01c398, flag: (0) -
          int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 55
            last post received-location: kjata: wake up enqueue owner
            last process to post me: 7000028059c4488 1 6
            last post sent: 0 0 90
            last post sent-location: KJCS Post snd proxy to flush msg
            last process posted by me: 7000028059c4488 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000280fb3a820
O/S info: user: oracle, term: pts/3, ospid: 1008052
OSD pid info: Unix process pid: 1008052, image: oracle@kjorcldb1 (TNS V1-V3)
  SO: 70000280ffd40b0, type: 4, owner: 7000028079a4638, flag: INIT/-/-/0x00
  (session) sid: 6122 trans: 70000276e23b848, creator: 7000028079a4638, flag: (41) USR/- BSY/-/-/-/-/-
            DID: 0001-002E-001DD0F9, short-term DID: 0001-002E-001DD0FA
            txn branch: 0
            oct: 6, prv: 0, SQL: 7000026e943b618, pSQL: 70000268a5dd9d0, user: 49/ORCLADM
  service name: SYS$USERS
  O/S info: user: oracle, term: pts/3, ospid: 3371970, machine: kjorcldb1
            program: SQLplus@kjorcldb1 (TNS V1-V3)
  application name: SQL*Plus, hash value=3669949024
  waiting for 'enq: TX - row lock contention' blocking sess=0x0 seq=4560 wait_time=0 seconds since wait started=5330
              name|mode=54580006, usn<<16 | slot=2b70003, sequence=5324
  Dumping Session Wait History
   for 'enq: TX - row lock contention' count=1 wait_time=488292
              name|mode=54580006, usn<<16 | slot=2b70003, sequence=5324
   for 'enq: TX - row lock contention' count=1 wait_time=488292
              name|mode=54580006, usn<<16 | slot=2b70003, sequence=5324
From the data above, the process with orapid=46, ospid=1008052 has been waiting on a TX row lock. It, too, was created by a SQL*Plus connection to the database, and it is executing the same SQL statement as the abnormal process. It is blocked by the abnormal process (orapid=207), and the blocking started roughly 5330 seconds before 08:37:45, i.e. around 07:08:55.
SYS@xj11g> select to_char(sample_time,'yyyy-mm-dd hh24:mi:ss') sample_time,session_id, event,program,blocking_session
  2  from dba_hist_active_sess_history where instance_number=1 and dbid=3516340257
  3  and sample_id=144574100
  4  order by sample_time,session_id;

SAMPLE_TIME         SESSION_ID EVENT                         PROGRAM                              BLOCKING_SESSION
------------------- ---------- ----------------------------- ------------------------------------ ----------------
2014-01-07 07:15:32       3532                               DFGD_LONG@kj_orcl_app2 (TNS V1-V3)
2014-01-07 07:15:32       4323                               jlzztb.d@kj_orcl_app1 (TNS V1-V3)
2014-01-07 07:15:32       4339                               SQLplus@kjorcldb1 (TNS V1-V3)
2014-01-07 07:15:32       4739                               SQLplus@kjorcldb1 (TNS V1-V3)
2014-01-07 07:15:32       5009 enq: TX - row lock contention DFGD_LONG@kj_orcl_app1 (TNS V1-V3)               3532
2014-01-07 07:15:32       6122 enq: TX - row lock contention SQLplus@kjorcldb1 (TNS V1-V3)                    4339
Session 6122 was waiting on the transaction lock, and the blocking session was 4339, i.e. the session of the process with orapid=207. Because the SQL ran slowly, the row lock was held for too long, so session 6122 (the process with orapid=46) asked the DIAG process to dump the process with orapid=207. That dump triggered the bug and left the orapid=207 process in an abnormal state.
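The same blocking chain can be pulled out of ASH hierarchically rather than by eyeballing the rows. The query below is a sketch that reuses the dbid, instance number and sample_id shown above:

SELECT LPAD(' ', 2*(LEVEL-1)) || session_id AS sess_chain,
       event, blocking_session
  FROM (SELECT session_id, event, blocking_session
          FROM dba_hist_active_sess_history
         WHERE dbid = 3516340257
           AND instance_number = 1
           AND sample_id = 144574100)
 START WITH blocking_session IS NULL
CONNECT BY PRIOR session_id = blocking_session;
-- Session 4339 appears as a root, with session 6122 indented underneath it.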
At around 12:30 on January 9, 2014, the instance on database node 2 hung again. The DIAG trace shows the same pattern: a dump issued by DIAG triggered the bug and left a process abnormal. This time the abnormal process held a redo-related latch (the hang analysis below shows LGWR waiting on 'latch: redo allocation'), so every other process, including LGWR, could neither generate nor write redo, no data could be modified, a large number of sessions were blocked within a short time, and the instance hung completely.
*** 2014-01-09 12:25:06.473
Dumping diagnostic information for ospid 385380:
OS pid = 385380
loadavg : 12.82 13.52 16.37
swap info: free_mem = 31221.03M rsv = 256.00M
           alloc = 674.18M avail = 65536.00M swap_free = 64861.82M
      F S      UID    PID  PPID  C PRI NI ADDR       SZ    WCHAN  STIME  TTY   TIME CMD
 240001 A   oracle 385380     1 56  88 20 1ea0cf7590 97272      19:08:20    - 39:10 oracleorcldb2 (LOCAL=NO)
open: The file access permissions do not allow the specified action.
procstack: write(/proc/385380/ctl): The requested resource is busy.
385380: oracleorcldb2 (LOCAL=NO)
0x0000000000000000  ????????() + ??
*** 2014-01-09 12:25:06.624
----------------------------------------
SO: 7000028079acc18, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=165, calls cur/top: 700002641f86eb0/700002641f86eb0, flag: (0) -
          int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 136
            last post received-location: kclrcvt
            last process to post me: 7000028059c5448 1 6
            last post sent: 0 0 117
            last post sent-location: kcbzww
            last process posted by me: 7000028049b5968 7 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000280fb3a820
O/S info: user: oracle, term: UNKNOWN, ospid: 385380
OSD pid info: Unix process pid: 385380, image: oracle@kjorcldb2
----------------------------------------
SO: 70000280fdaf820, type: 4, owner: 7000028079acc18, flag: INIT/-/-/0x00
(session) sid: 3252 trans: 700002775b06728, creator: 7000028079acc18, flag: (100141) USR/- BSY/-/-/-/-/-
          DID: 0002-00A5-00000036, short-term DID: 0002-00A5-00000037
          txn branch: 0
          oct: 0, prv: 0, SQL: 0, pSQL: 70000262f26cbd8, user: 49/ORCLADM
O/S info: user: yxmis, term: , ospid: 25364, machine: kj_orcl_app1
          program: DFGD_LONG@kj_orcl_app1 (TNS V1-V3)
application name: DFGD_LONG@kj_orcl_app1 (TNS V1-V3), hash value=3372202780
last wait for 'gc current request' blocking sess=0x0 seq=8372 wait_time=1086
            seconds since wait started=0
            file#=1f9, block#=2c0a, id#=2000008
The hang analysis output confirms that session 3252 is indeed the source of the blocking.
*** 2014-01-09 12:33:12.692
==============
HANG ANALYSIS:
==============
Found 416 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <1/6568/1/0x69d2bf8/4429090/latch: redo allocation>
Found 13 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <1/4035/5/0x7a51a38/1242406/log file sync>
Open chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/3252/61/0x79acc18/385380/No Wait>
 -- <1/6568/1/0x69d2bf8/4429090/latch: redo allocation>
 -- <1/2590/3/0x5a263a8/2778422/log file sync>
416 sessions are waiting on session 6568, i.e. session 6568 is blocking 416 processes. Session 6568 itself is waiting on 'latch: redo allocation', and it is in turn blocked by session 3252.
Session 6568 is the LGWR process. Obviously, if LGWR cannot acquire the redo allocation latch it cannot write redo to the log file, and therefore transactions cannot commit; within a short time a large number of processes end up waiting on 'log file sync', i.e. waiting for LGWR to finish writing the redo.
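Whether a given SID really belongs to LGWR (or some other background process) can be double-checked with a query such as the following sketch (standard v$session / v$process / v$bgprocess joins; SID 6568 is taken from the hang analysis above):

SELECT s.sid, s.program, p.spid, b.name AS bg_name
  FROM v$session   s
  JOIN v$process   p ON p.addr  = s.paddr
  LEFT JOIN v$bgprocess b ON b.paddr = p.addr
 WHERE s.sid = 6568;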
Next, consider the incident that began around 10:13 on January 8, 2014, when the instance on node 2 first became very slow and then hung. This, too, was caused by a DIAG dump triggering the bug.
*** 2014-01-08 10:13:21.472
Dumping diagnostic information for ospid 652210:
OS pid = 652210
loadavg : 31.73 28.30 25.81
swap info: free_mem = 96209.99M rsv = 256.00M
           alloc = 608.77M avail = 65536.00M swap_free = 64927.23M
      F S      UID    PID  PPID  C PRI NI ADDR       SZ    WCHAN  STIME  TTY   TIME CMD
 240001 A   oracle 652210     1 68  94 20 3aff9d6590 97376      23:03:53    - 40:07 oracleorcldb2 (LOCAL=NO)
open: The file access permissions do not allow the specified action.
procstack: write(/proc/652210/ctl): The requested resource is busy.
652210: oracleorcldb2 (LOCAL=NO)
*** 2014-01-08 10:13:21.611
----------------------------------------
SO: 7000028069f3b98, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=509, calls cur/top: 70000261e0f2a90/70000261e0f2a90, flag: (0) -
          int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 55
            last post received-location: kjata: wake up enqueue owner
            last process to post me: 7000028059c4488 1 6
            last post sent: 0 0 117
            last post sent-location: kcbzww
            last process posted by me: 7000028059e7b88 1 0
(latch info) wait_event=0 bits=30
Process Group: DEFAULT, pseudo proc: 70000280fb3a820
O/S info: user: oracle, term: UNKNOWN, ospid: 652210
OSD pid info: Unix process pid: 652210, image: oracle@kjorcldb2
  SO: 700002805ff0c40, type: 4, owner: 7000028069f3b98, flag: INIT/-/-/0x00
  (session) sid: 6064 trans: 70000276e1c5678, creator: 7000028069f3b98, flag: (100141) USR/- BSY/-/-/-/-/-
            DID: 0002-01FD-00000001, short-term DID: 0002-01FD-00000002
            txn branch: 0
            oct: 0, prv: 0, SQL: 0, pSQL: 70000280daaa140, user: 49/ORCLADM
  O/S info: user: yxmis, term: , ospid: 25365, machine: kj_orcl_app1
            program: DFGD_LONG@kj_orcl_app1 (TNS V1-V3)
  application name: DFGD_LONG@kj_orcl_app1 (TNS V1-V3), hash value=3372202780
  last wait for 'gc current request' blocking sess=0x0 seq=22792 wait_time=315
              seconds since wait started=0
              file#=147, block#=11c2b2, id#=2000001
Now let us look at the ASH data.
SYS@xj11g> col program for a40
SYS@xj11g> select to_char(sample_time,'yyyy-mm-dd hh24:mi:ss') sample_time,session_id, event,program,blocking_session
  2  from dba_hist_active_sess_history where sample_time>=sysdate-3
  3  and instance_number=2 and dbid=3516340257
  4  and sample_id=144689900
  5  order by sample_time,session_id;

SAMPLE_TIME         SESSION_ID EVENT                         PROGRAM                                  BLOCKING_SESSION
------------------- ---------- ----------------------------- ---------------------------------------- ----------------
2014-01-08 10:13:15       4218                               KHFW_LONG@kj_orcl_web1 (TNS V1-V3)
2014-01-08 10:13:15       4541 db file sequential read       BBGL_LONG_D@kj_orcl_cxbb3 (TNS V1-V3)
2014-01-08 10:13:15       4828                               ZHCX_LONG_C@kj_orcl_cxbb1 (TNS V1-V3)
2014-01-08 10:13:15       5002 db file sequential read       ORCL_TDGL_LONG@kj_orcl_cxbb2 (TNS V1-V3)
2014-01-08 10:13:15       5022 db file parallel read         BBGL_LONG_A@kj_orcl_cxbb3 (TNS V1-V3)
2014-01-08 10:13:15       5237 gc cr multi block request     JL_JLZC_LONG@kj_orcl_app2 (TNS V1-V3)
2014-01-08 10:13:15       5417 db file sequential read       ZHCX_LONG_C@kj_orcl_app7 (TNS V1-V3)
2014-01-08 10:13:15       5440                               ORCL_ZHCX_LONG@kj_orcl_app7 (TNS V1-V3)
2014-01-08 10:13:15       6064 enq: TX - row lock contention DFGD_LONG@kj_orcl_app1 (TNS V1-V3)
2014-01-08 10:13:15       6083 enq: TX - row lock contention DFGD_LONG@kj_orcl_app2 (TNS V1-V3)
2014-01-08 10:13:15       6181 db file sequential read       BBGL_LONG_B@kj_orcl_app5 (TNS V1-V3)
2014-01-08 10:13:15       6205 enq: TX - row lock contention DFGD_LONG@kj_orcl_app1 (TNS V1-V3)
2014-01-08 10:13:15       6595                               oracle@kjorcldb2 (LMS5)

SYS@xj11g> select to_char(sample_time,'yyyy-mm-dd hh24:mi:ss') sample_time,session_id, event,program,blocking_session
  2  from dba_hist_active_sess_history where sample_time>=sysdate-3
  3  and instance_number=2 and dbid=3516340257
  4  and sample_id=144689910
  5  order by sample_time,session_id;

SAMPLE_TIME         SESSION_ID EVENT                         PROGRAM                                  BLOCKING_SESSION
------------------- ---------- ----------------------------- ---------------------------------------- ----------------
2014-01-08 10:13:25       3932 log file sync                 DFFX_LONG@ORCL-APP4 (TNS V1-V3)                      6568
2014-01-08 10:13:25       3942 log file sync                 DFYW_SHORT@kj_orcl_cxbb3 (TNS V1-V3)                 6568
2014-01-08 10:13:25       4056 db file scattered read        plSQLdev.exe
2014-01-08 10:13:25       4061 log file sync                 process.d@kj_orcl_app1 (TNS V1-V3)                   6568
2014-01-08 10:13:25       4134 log file sync                 DFYW_LONG_D@kj_orcl_cxbb1 (TNS V1-V3)                6568
... a large number of similar rows omitted to save space ...
2014-01-08 10:13:25       6460                               YZX_LONG@kj_orcl_app1 (TNS V1-V3)
2014-01-08 10:13:25       6461 log file sync                 DFFX_SHORT@kj_orcl_app5 (TNS V1-V3)                  6568
2014-01-08 10:13:25       6462 log file sync                 QXGL_SHORT@kj_orcl_app3 (TNS V1-V3)                  6568
2014-01-08 10:13:25       6470 log file sync                 QXGL_SHORT@kj_orcl_app3 (TNS V1-V3)                  6568
2014-01-08 10:13:25       6525 log file sync                 KHFW_SHORT@kj_orcl_web2 (TNS V1-V3)                  6568
2014-01-08 10:13:25       6582 gcs log flush sync            oracle@kjorcldb2 (LMSi)                              6568
2014-01-08 10:13:25       6596 gcs log flush sync            oracle@kjorcldb2 (LMS4)                              6568
The ASH data shows that at 10:13:15 the system was still normal; just 10 seconds later (the ASH samples stored in AWR are taken 10 seconds apart), at 10:13:25, massive blocking had appeared. The DIAG dump that triggered the bug occurred at 10:13:21, right in between.
Case Summary
Based on the in-depth analysis above, we concluded that the main reason for the recent spate of failures was Oracle Bug 6859515, "Diagnostic collection may hang or crash the instance". In more detail:
(1) When an Oracle process has been waiting too long for some resource (most commonly a transaction lock, i.e. a TX lock), it sends a dump request to the DIAG process. DIAG then dumps the blocking process, collecting very detailed data for diagnosing the blocking problem: OS-level information about the process, its call stack, its state inside Oracle, the state of its session, and so on.
(2) DIAG collects the call stack of the process being diagnosed by running operating system level tools such as the gdb debugger or procstack (which tool is used depends on the operating system).
Taking a process's call stack with gdb or procstack is not safe and can leave the target process abnormal or hung. If that process happens to be holding an important resource such as a latch, the resource will never be released; and because the process is merely abnormal or hung rather than gone, Oracle's PMON process will not clean up and release the resources it keeps holding.
A latch is a low-level, very frequently used internal memory lock in Oracle. A latch is normally held for no more than milliseconds; if a process holds one for a long time, other processes requesting that latch also hang. Eventually critical background processes such as CKPT and LGWR hang because their requests for the latch never get a response, and the hang of a critical background process hangs the whole database.
(3) The symptoms differ depending on which latch the abnormal process is holding.
1) If it holds a cache buffers chains child latch: there are a great many child latches of this kind, each protecting only a subset of the buffer headers, so sessions are blocked (hung) only when they access a block protected by the held child latch. Those processes keep waiting on 'latch: cache buffers chains', and then waits related to block access, such as 'buffer busy waits' and 'gc buffer busy', appear. If a hung process is in the middle of a transaction, TX lock waits appear as well. If a block protected by the latch is dirty, DBWn (the database writer) hangs waiting for the latch; if CKPT needs the latch it also hangs, and the end result is that checkpoints cannot complete, logs cannot switch, and the database hangs. Because there are so many cache buffers chains child latches, it can take a long time, even several hours, from the moment one child latch is stuck in an abnormal process until the database hangs (the child-latch counts can be compared with the query sketched after this list).
2) If it holds a redo-related latch such as the redo copy latch or the redo allocation latch: these latches either have a single instance or very few children, and they are used extremely frequently, so within a short time a critical background process such as LGWR hangs, a large number of processes wait on 'log file sync', and critical processes such as LMS hang on 'gcs log flush sync'. The end result is a large number of hung processes and a hung system within a short time.
3) If it holds a parse-related latch such as the shared pool latch or a library cache latch: large numbers of parse-related waits likewise appear within a short time, causing many processes to hang and then the whole system.
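The different failure patterns in 1) to 3) largely follow from how many child latches each latch family has. A sketch of a query to compare them (latch names as they appear in v$latch_children):

SELECT name, COUNT(*) AS child_latches
  FROM v$latch_children
 WHERE name IN ('cache buffers chains', 'redo allocation', 'redo copy')
 GROUP BY name
 ORDER BY child_latches DESC;
-- 'cache buffers chains' typically has tens of thousands of children,
-- while the redo latches have only a handful.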
(4) When gdb or procstack leaves a process abnormal, the process may or may not be killable with the kill command. If it can be killed, PMON automatically cleans up the resources the abnormal process was holding and the system recovers. If it cannot be killed, or if PMON itself is already blocked and hung, killing processes will not bring the system back, and the database must be forcibly restarted to restore service.
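Before attempting an OS-level kill, it helps to map the stuck session to its server process first; a minimal sketch (:stuck_sid is a placeholder bind, not a value from this case):

SELECT s.sid, s.serial#, p.spid AS os_pid, s.program
  FROM v$session s
  JOIN v$process p ON p.addr = s.paddr
 WHERE s.sid = :stuck_sid;
-- kill -9 <os_pid> can then be tried; if the process cannot be killed,
-- as described above, a forced restart is the only way out.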
To resolve the problem, install database patch 8929701: TRACKING BUG FOR 6859515 ON AIX. With this patch installed, DIAG no longer uses OS tools such as gdb or procstack to obtain the call stack of the process being diagnosed; it uses Oracle's internal oradebug short_stack mechanism instead. If the patch cannot be applied, the hidden parameter "_ksb_disable_diagpid" can be set to TRUE as a temporary workaround. Note that with this setting, process diagnostic information is no longer collected automatically, so useful diagnostic data may be missing when other problems occur.
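For completeness, the temporary workaround mentioned above would be set roughly as follows (a sketch only; hidden parameters must be double-quoted, and the SCOPE/SID clause assumes an spfile-managed RAC database):

ALTER SYSTEM SET "_ksb_disable_diagpid" = TRUE SCOPE = SPFILE SID = '*';
-- Takes effect after the instances are restarted; remember that DIAG will
-- then no longer collect process diagnostics automatically.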
In this case the problem was ultimately resolved by applying the patch.
From this case, I think there are two takeaways worth keeping in mind.
1) Many people install a database without applying any patches at all. Even some experienced DBAs only apply the latest PSU, yet some patches with a large impact on the system are not included in the PSU. For example, the MOS note "11.2.0.4 Patch Set - Availability and Known Issues (Doc ID 1562139.1)" lists the most frequently encountered bugs found in release 11.2.0.4. Where possible, apply the patches for all of the bugs it mentions.
2) When analyzing a problem, the alert log, system state dumps, hang analyze traces, the DIAG process trace, and ASH (active session history) data are all very useful sources of information. Cross-referencing multiple sources prevents both the dead end of having too little information from any single source and the wrong conclusions a single source can lead to.