【此为"一森咖记"微信——第43篇文章】
【引言】
Acticve DataGuard突然实例宕机。通过查看Alert的Trace发现是LGWR进程把实例宕掉了。
具体alert trace如下:
【报错信息】
ORA-04021: timeout occurred while waiting to lock object
LGWR (ospid: 27954): terminating the instance due to error 4021
Wed Sep 11 02:28:54 2019
opiodr aborting process unknown ospid (28247) as a result of ORA-1092
Wed Sep 11 02:28:54 2019
ORA-1092 : opitsk aborting process
Wed Sep 11 02:28:54 2019
System state dump requested by (instance=1, osid=27954 (LGWR)), summary=[abnormal instance termination].
System State dumped to trace file oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc
Instance terminated by LGWR, pid = 27954
error 4021 detected in background process
可以看到这里首先出现了ORA-04021: timeout occurred while waiting to lock object,紧接着LGWR便terminating the instance。接着看一下Trace文件/oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc。
Trace file oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = opt/oracle/product/11.2.0.4/db_1
System name: HP-UX
Node name: ethan-db
Release: B.11.31
Version: U
Machine: ia64
Instance name: ORCL
Redo thread mounted by this instance: 1
Oracle process number: 16
Unix process pid: 27954, image: oracle@ethan-db (LGWR)
*** 2019-08-22 15:14:45.875
*** SESSION ID:(1162.1) 2019-08-22 15:14:45.875
*** CLIENT ID:() 2019-08-22 15:14:45.875
*** SERVICE NAME:() 2019-08-22 15:14:45.875
*** MODULE NAME:() 2019-08-22 15:14:45.875
*** ACTION NAME:() 2019-08-22 15:14:45.875
DDE rules only execution for: ORA 312
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
Successfully dispatched
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 0 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----
DDE rules only execution for: ORA 313
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
DDE Action 'DB_STRUCTURE_INTEGRITY_CHECK' was flood controlled
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (FLOOD CONTROLLED, 0 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----
*** 2019-09-11 02:28:54.002
ORA-04021: timeout occurred while waiting to lock object
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+592<-kjzdssdmp()+720<-kjzduptcctx()+512<-kjzdicrshnfy()+160<-$cold_ksuitm()+5808<-$cold_ksbrdp()+2768<-opirip()+1312<-opidrv()+1152<-sou2o()+256<-
opimai_real()+352<-ssthrdmain()+608<-main()+336<-main_opd_entry()+80
----- End of Abridged Call Stack Trace -----
*** 2019-09-11 02:28:54.017
LGWR (ospid: 27954): terminating the instance due to error 4021
ksuitm: waiting up to [5] seconds before killing DIAG(27921)
TRACE给的信息也有限,只给了一些堆栈信息。
【分析】
借助MOS查下ORA-04021: timeout occurred while waiting to lock object。发现文档ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (文档 ID 2183882.1)与遇到的问题是一致的。判断理由:alter日志报错信息相同,第二,trace内容也一致。由文章得知(文档 ID 2183882.1),应该为Bug 16717701 – ADG SHOULD GET THE INSTANCE PARSE LOCK WITH A TIMEOUT或Bug 11712267 – ACTIVE DATA GUARD DATABASE HUNG ON ‘LIBRARY CACHE: MUTEX X’ WAIT EVENT。
当ADG通过日志恢复时,LGWR将实例状态对象锁定为独占模式。这样的结果是LGWR可以阻止SQL的解析,SQL的解析也能阻止LGWR。为了减少SQL解析对LGWR的影响,根据官网的solution可以将cursor_sharing设置为force(文档 ID 2183882.1)。
SYS@ORCL> show parameter cursor_sharing;
NAME TYPE
------------------------------------ ---------------------------------
VALUE
------------------------------
cursor_sharing string
EXACT
这里简单介绍下参数cursor_sharing,官网介绍如下:
CURSOR_SHARING determines what kind of SQL statements can share the same cursors.
Values:
(1)FORCE
Allows the creation of a new cursor if sharing an existing cursor, or if the cursor plan is not optimal.
(2)SIMILAR
Causes statements that may differ in some literals, but are otherwise identical, to share a cursor, unless the literals affect either the meaning of the statement or the degree to which the plan is optimized.
(3)EXACT
Only allows statements with identical text to share the same cursor.
Notes:
(1)If you set CURSOR_SHARING, then Oracle recommends the FORCE setting unless you are in a DSS environment. FORCE limits the growth of child cursors that can occur when the setting is SIMILAR.
(2)The value of the CURSOR_SHARING parameter has performance implications. Refer to Oracle Database Performance Tuning Guide before setting this parameter.
简单介绍下上述内容:
Cursor_sharing参数是用来告诉Oracle在什么情况下可以共享游标,即SQL重用。
Cursor_sharing参数有3个值可以设置:
1. EXACT:通常来说,exa7ct值是Oracle推荐的,也是默认的,它要求SQL语句在完全相同时才会重用,否则会被重新执行硬解析操作。
2. SIMILAR:similar是在Oracle认为某条SQL语句的谓词条件可能会影响到它的执行计划时,才会被重新分析,否则将重用SQL。
3. FORCE:force是在任何情况下,无条件重用SQL。
注意:
上面所说的SQL重用,仅仅是指谓词条件不同的SQL语句,实际上这样的SQL基本上都在执行同样的业务操作。
查看视图v$sesstat 和v$statname 查看LGWR阻止的SQL
SYS@ORCL> select a.*,b.name from v$sesstat a , v$statname b
2 where a.statistic#=b.statistic#
3 and a.sid=(select distinct sid from v$mystat)
4 and b.name like '%parse%';
SID STATISTIC# VALUE NAME
---------- ---------- ---------- ------------------------------
1743 264 0 ADG parselock X get attempts
1743 265 0 ADG parselock X get successes
1743 622 4 parse time cpu
1743 623 5 parse time elapsed
1743 624 20 parse count (total)
1743 625 7 parse count (hard)
1743 626 0 parse count (failures)
1743 627 0 parse count (describe)
8 rows selected.
MOS(文档 ID 2183882.1)提供的solution如下:
SOLUTION
Issue matches with bug 11712267 and bug 16717701
Since two bugs are matching with the case,
You can try with option (1) . As per Bug 11712267
change the cursor_sharing to force on Active dataguard (ADG).
Monitor your environment for sometime.
If it crashes again then follow with the option (2)Option (2):
As per bug description
LGWR can request DBINSTANCE lock in X mode without any timeout which can lead to a hang deadlock.
Both fixes are already included in 11.2.0.4 but the fix is DISABLED by default.== > To ENABLE the fix one has to set == > "_adg_parselock_timeout" > to the number of centi-seconds == > LGWR should wait before backing off and retrying the request.
Value should be in centi seconds. == > I Don't think there is really any hard fast rule for a value - at default (0) it will not timeout.A value representing a few seconds seems reasonable - if LGWR has been stuck for say 5 seconds waiting it seems reasonable guess it is not going to get the lock.
The param just causes it to abort the current attempt and retry If you want to play safe can start with a higher value then decrease later.A higher value will just mean more sessions blocked for longer in case of the deadlock situation.500 Seems reasonable , but I have no data to base it on.
There should be a statistic "ADG parselock X get attempts" If it gets set too small that value would likely increase a lot due to keep timing out and retrying.
This is a dynamic parameter
Follow option (1) .
change the cursor_sharing to force on ADG
If issue re-appears then follow option (2) as below
Please set "_adg_parselock_timeout" to 500 == >
SQL > alter system set "_adg_parselock_timeout"=500 scope=both sid='*';
简单翻译如下:
1.将cursor_sharing更改为force,减少SQL解析的时间。
2.如果再次发生该问题,将隐含参数”_adg_parselock_timeout”设置成500,加长解析时间,该参数作用防止超时;注意此参数可动态修改。
SQL > alter system set "_adg_parselock_timeout"=500 scope=both sid='*';
【参考】
https://blog.51cto.com/lyzbg/2090812
【参考】
https://blog.csdn.net/vic_qxz/article/details/88296722
【参考】
ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (文档 ID 2183882.1)
【参考】
http://download.oracle.com/docs/cd/E11882_01/server.112/e17110/initparams042.htm#REFRN10025
欢迎关注个人微信微信;
长按以下二维码或微信搜索“一森咖记”
往期精彩文章
=====================================