问题描述
环境是一套11.2.0.3 2nodes RAC on hpux-ia31, alert中出现ora-32701 hangmgr错误, 从trace文件中发现是m000进程是mmon的辅助进程,用于flush AWR相关数据,有一个wait event: enq: WF – contention, 这也是flush AWR数据时相关的enqueue等待,但是blocker进程是not in wait, (另mmon hang是可以直接kill 该进程spid, 通常会在不重启实例的前掉下重启该进程)。
专家解答
这里简单的记录该问题
# db alert
Thread 2 advanced to log sequence 170773 (LGWR switch) Current log# 8 seq# 170773 mem# 0: /dev/tza_oravg02/rtza_redo08 Thu May 26 17:08:11 2016 Errors in file /oracle/app/oracle/diag/rdbms/ucisa/anbob2/trace/anbob2_dia0_29439.trc (incident=720137): ORA-32701: Possible hangs up to hang ID=0 detected Incident details in: /oracle/app/oracle/diag/rdbms/ucisa/anbob2/incident/incdir_720137/anbob2_dia0_29439_i720137.trc Thu May 26 17:08:12 2016 Sweep [inc][720137]: completed System State dumped to trace file /oracle/app/oracle/diag/rdbms/ucisa/anbob2/incident/incdir_720137/anbob2_m001_22862_i720137_a.trc DIA0 terminating blocker (ospid: 13964 sid: 5649 ser#: 7241) of hang with ID = 3 requested by master DIA0 process on instance 1 Hang Resolution Reason: Although the number of affected sessions did not justify automatic hang resolution initially, this previously ignored hang was automatically resolved. by terminating session sid: 5649 ospid: 13964 DIA0 successfully terminated session sid:5649 ospid:13964 with status 31. Thu May 26 17:09:46 2016 Errors in file /oracle/app/oracle/diag/rdbms/ucisa/anbob2/trace/anbob2_dia0_29439.trc (incident=720138): ORA-32701: Possible hangs up to hang ID=0 detected Incident details in: /oracle/app/oracle/diag/rdbms/ucisa/anbob2/incident/incdir_720138/anbob2_dia0_29439_i720138.trc Thu May 26 17:09:47 2016 Sweep [inc][720138]: completed Thu May 26 17:09:47 2016 Sweep [inc2][720138]: completed Sweep [inc2][720137]: completed System State dumped to trace file /oracle/app/oracle/diag/rdbms/ucisa/anbob2/incident/incdir_720138/anbob2_m003_23913_i720138_a.trc DIA0 terminating blocker (ospid: 13964 sid: 5649 ser#: 7241) of hang with ID = 3 requested by master DIA0 process on instance 1 Hang Resolution Reason: Although the number of affected sessions did not justify automatic hang resolution initially, this previously ignored hang was automatically resolved. by terminating the process DIA0 successfully terminated process ospid:13964. Thu May 26 17:17:16 2016 Thread 2 advanced to log sequence 170774 (LGWR switch) Current log# 9 seq# 170774 mem# 0: /dev/tza_oravg02/rtza_redo09
# trace file: /oracle/app/oracle/diag/rdbms/ucisa/anbob2/trace/anbob2_dia0_29439.trc
Dump continued from file: /oracle/app/oracle/diag/rdbms/ucisa/anbob2/trace/anbob2_dia0_29439.trc ORA-32701: Possible hangs up to hang ID=0 detected ========= Dump for incident 720137 (ORA 32701) ======== ----- Beginning of Customized Incident Dump(s) ----- There are resolvable hangs on your system. Hang Manger will attempt to resolve these hangs. Some information about these is output below. Complete information is available in an incident trace file on instance 1. If the hang is to be resolved by terminating the session or process that is the root or victim of the hang, additional information will be output on the local instance of that session or process. Below are the hangs for which resolution will be attempted. *** 2016-05-26 17:08:11.765 Resolvable Hangs in the System Root Chain Total Hang Hang Hang Inst Root #hung #hung Hang Hang Resolution ID Type Status Num Sess Sess Sess Conf Span Action ----- ---- -------- ---- ----- ----- ----- ------ ------ ------------------- 3 HANG RSLNPEND 2 5649 2 2 HIGH GLOBAL Terminate Process Hang Resolution Reason: Although the number of affected sessions did not justify automatic hang resolution initially, this previously ignored hang was automatically resolved. inst# SessId Ser# OSPID PrcNm Event ----- ------ ----- --------- ----- ----- 1 6238 38599 3697 M000 enq: WF - contention 2 5649 7241 13964 M000 not in wait Dumping process info of pid[212.13964] (sid:5649, ser#:7241) requested by master DIA0 process on instance 1. *** 2016-05-26 17:08:11.765 Process diagnostic dump for oracle@qdtza2 (M000), OS id=13964, pid: 212, proc_ser: 66, sid: 5649, sess_ser: 7241 ------------------------------------------------------------------------------- os thread scheduling delay history: (sampling every 1.000000 secs) 0.000000 secs at [ 17:08:11 ] NOTE: scheduling delay has not been sampled for 0.103181 secs 0.000667 secs from [ 17:08:07 - 17:08:12 ], 5 sec avg 0.003777 secs from [ 17:07:12 - 17:08:12 ], 1 min avg 0.000976 secs from [ 17:03:12 - 17:08:12 ], 5 min avg loadavg : 0.19 0.18 0.17 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Swapinfo : Avail = 380207.03Mb Used = 165771.44Mb Swap free = 214385.22Mb Kernel rsvd = 15198.44Mb Free Mem = 83933.84Mb F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD 1401 R oracle 13964 1 254 178 20 e000000d77ef2040 102092 - 17:00:09 ? 7:57 ora_m000_anbob2 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Short stack dump: ksedsts()+544<-ksdxfstk()+48<-ksdxcb()+3216<-sspuser()+688<-<-kghalf()+385<-kqlfbctc()+1024<-kgligi()+128<-kglic_cbk()+832<-kglic0()+1376<-kglic()+144<-kqlfbct()+560<-qerfxFetch()+2784<-qerjotRowProc()+2432<-qerhjWalkHashBucket()+768<-qerhjInnerProbeHashTable()+1168<-qerfxFetch()+1680<-rwsfcd()+320<-qerhjFetch()+1040<-qerjotFetch()+3664<-rwsfcd()+320<-qerltcFetch()+1696<-insexe()+1584<-opiexe()+16192<-kpoal8()+4624<-opiodr()+2416<-kpoodrc()+64<-rpiswu2()+1120<-kpoodr()+1008<-upirtrc()+2640<-kpurcsc()+256<-kpuexec()+6672<-OCIStmtExecute()+80<-kewrose_oci_stmt_exec()+96<-kewrgwxf1_gwrsql_exft_1()+576<-kewrgwxf_gwrsql_exft()+928<-kewrftbs_flush_table_by_sql()+352<-kewrfabt_flush_attribute_table()+1120<-kewrfsb_flush_sqlbind()+80<-kewrft_flush_table()+800<-kewrftec_flush_table_ehdlcx()+832<-kewrftsq_flush_topsql()+624<-kewrft_flush_table()+800<-kewrftec_flush_table_ehdlcx()+832<-kewrfat_flush_all_tables()+2144<-kewrfos_flush_onesnap()+272<-kewrfsc_flush_snapshot_c()+832<-kewrafs_auto_flush_slave()+1008<-kebm_slave_main()+944<-ksvrdp()+5440<-opirip()+1360<-opidrv()+1152<-sou2o()+256<-opimai_real()+352<-ssthrdmain()+576<-main()+336<-main_opd_entry()+80 ------------------------------------------------------------------------------- Process diagnostic dump actual duration=0.392000 sec (max dump time=15.000000 sec) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< *** 2016-05-26 17:08:12.156 current sql: insert into wrh$_sql_bind_metadata (snap_id, dbid, sql_id, name, position, dup_position, datatype, datatype_string, character_sid, precision, scale, max_length) SELECT /*+ ordered use_nl(bnd) index(bnd sql_id) */ :lah_snap_id, :dbid, bnd.sql_id, name, position, dup_position, datatype, dataty ---------------------------------------- SO: 0xc000002071d7c428, type: 2, owner: 0x0000000000000000, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xc000002071d7c428, name=process, file=ksu.h LINE:12616 ID:, pg=0 (process) Oracle pid:212, ser:66, calls cur/top: 0xc000001fb321ec20/0xc000001fb31300e8 flags : (0x2) SYSTEM flags2: (0x30), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 80 last post received-location: kji.h LINE:3418 ID:kjata: wake up enqueue owner last process to post me: c0000020b1d72b30 2 6 last post sent: 0 0 26 last post sent-location: ksa2.h LINE:285 ID:ksasnd last process posted by me: c000002071d60198 2 6 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 0xc0000020a20d45a0 O/S info: user: oracle, term: UNKNOWN, ospid: 13964 OSD pid info: Unix process pid: 13964, image: oracle@qdtza2 (M000) ---------------------------------------- SO: 0xc0000020505807f8, type: 10, owner: 0xc000002071d7c428, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0xc000002071d7c428, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0 (FOB) c0000020505807f8 flags=2050 fib=c0000020c8afa058 incno=0 pending i/o cnt=0 fname=/dev/tza_oravg04/rtza_lv16_403 fno=820 lblksz=16384 fsiz=1105919 service name: SYS$BACKGROUND Current Wait Stack: Not in wait; last wait ended 7 min 48 sec ago <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< There are 1 sessions blocked by this session. Dumping one waiter: inst: 1, sid: 6238, ser: 38599 wait event: 'enq: WF - contention' p1: 'name|mode'=0x57460006 p2: '0'=0x46 p3: '0'=0x0 row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 <<<<<<<<<<<<<<<<<<<<<<<<<< min_blocked_time: 445 secs, waiter_cache_ver: 50989 Wait State: fixed_waits=0 flags=0x20 boundary=0x0000000000000000/-1 Session Wait History: elapsed time of 7 min 48 sec since last wait 0: waited for 'gc cr block 2-way' =0x3, =0x115a3, =0x1 wait_id=5905 seq_num=5906 snap_id=1 wait times: snap=0.000424 sec, exc=0.000424 sec, total=0.000424 sec wait times: max=infinite wait counts: calls=1 os=1 occurred after 0.000712 sec of elapsed time 1: waited for 'gc current block 2-way' =0x3, =0x13148, =0x1 wait_id=5904 seq_num=5905 snap_id=1 wait times: snap=0.000325 sec, exc=0.000325 sec, total=0.000325 sec wait times: max=infinite wait counts: calls=1 os=1 occurred after 0.001039 sec of elapsed time 2: waited for 'enq: WF - contention' name|mode=0x57460006, 0=0x38, 0=0x0 wait_id=5903 seq_num=5904 snap_id=1 wait times: snap=0.000257 sec, exc=0.000257 sec, total=0.000257 sec wait times: max=infinite wait counts: calls=2 os=2 occurred after 1.078632 sec of elapsed time 3: waited for 'enq: WF - contention' name|mode=0x57460006, 0=0x46, 0=0x0 wait_id=5902 seq_num=5903 snap_id=1 wait times: snap=0.000282 sec, exc=0.000282 sec, total=0.000282 sec wait times: max=infinite wait counts: calls=2 os=2 occurred after 0.002117 sec of elapsed time 4: waited for 'gc current block 2-way' =0x123, =0x31410, =0x2000001 wait_id=5901 seq_num=5902 snap_id=1 wait times: snap=0.000276 sec, exc=0.000276 sec, total=0.000276 sec wait times: max=infinite wait counts: calls=1 os=1 occurred after 0.000529 sec of elapsed time 5: waited for 'gc current block 2-way' =0x123, =0x71a2, =0x2000009 wait_id=5900 seq_num=5901 snap_id=1 wait times: snap=0.000141 sec, exc=0.000141 sec, total=0.000141 sec wait times: max=infinite wait counts: calls=1 os=1 occurred after 0.000231 sec of elapsed time ...
有个类似 Bug 20733574 : HANG DETECTED WHILE CREATING AWR SNAPSHOT, 但当时因为收集信息不足,开发未确认为bug;
还有一种解决方法是跳过收集与该表相关的数据(这个案例是wrh$_sql_bind_metadata记录的是SQL Bind Metadata),使用下面的命令:
alter system set “_awr_disabled_flush_tables” = ‘wrh$_sql_bind_metadata’;
该命令是动态的,不需要重启实例, 禁用多个表时参数用逗号分隔;
原理禁用了和上面表相关的统计信息flush ash to awr, 除了上述的bug ,还有其它一些原因时如空间不足、hang、也可以禁用,除了上面的禁用方法还有另一种禁用方法,先查询AWR Table Info (KEW layer)使用下面的SQL:
select table_id_kewrtb, table_name_kewrtb from x$kewrtb order by 1;
从上面的查询中得到table_id用于下面的level 值,比如我们这里表wrh$_sql_bind_metadata对应的是56
alter session set events ‘immediate trace name awr_flush_table_off level 56’;
后期如果想打开flushing to awr如果使用第一种方法就是把”_awr_disabled_flush_tables” reset掉;如果第二种方法禁用,打开使用
alter session set events ‘immediate trace name awr_flush_table_on level 56’;