排故篇：ADG实例被LGWR进程宕掉？！

一森咖记 2019-09-11

2789

【此为"一森咖记"微信——第43篇文章】

【引言】

Acticve DataGuard突然实例宕机。通过查看Alert的Trace发现是LGWR进程把实例宕掉了。

具体alert trace如下：

【报错信息】

ORA-04021: timeout occurred while waiting to lock object

LGWR (ospid: 27954): terminating the instance due to error 4021

Wed Sep 11 02:28:54 2019

opiodr aborting process unknown ospid (28247) as a result of ORA-1092

Wed Sep 11 02:28:54 2019

ORA-1092 : opitsk aborting process

Wed Sep 11 02:28:54 2019

System state dump requested by (instance=1, osid=27954 (LGWR)), summary=[abnormal instance termination].

System State dumped to trace file oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc

Instance terminated by LGWR, pid = 27954

error 4021 detected in background process

可以看到这里首先出现了ORA-04021: timeout occurred while waiting to lock object，紧接着LGWR便terminating the instance。接着看一下Trace文件/oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc。

Trace file oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

With the Partitioning, OLAP, Data Mining and Real Application Testing options

ORACLE_HOME = opt/oracle/product/11.2.0.4/db_1

System name: HP-UX

Node name: ethan-db

Release: B.11.31

Version: U

Machine: ia64

Instance name: ORCL

Redo thread mounted by this instance: 1

Oracle process number: 16

Unix process pid: 27954, image: oracle@ethan-db (LGWR)

*** 2019-08-22 15:14:45.875

*** SESSION ID:(1162.1) 2019-08-22 15:14:45.875

*** CLIENT ID:() 2019-08-22 15:14:45.875

*** SERVICE NAME:() 2019-08-22 15:14:45.875

*** MODULE NAME:() 2019-08-22 15:14:45.875

*** ACTION NAME:() 2019-08-22 15:14:45.875

DDE rules only execution for: ORA 312

----- START Event Driven Actions Dump ----

---- END Event Driven Actions Dump ----

----- START DDE Actions Dump -----

Executing SYNC actions

----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----

Successfully dispatched

----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 0 csec) -----

Executing ASYNC actions

----- END DDE Actions Dump (total 0 csec) -----

DDE rules only execution for: ORA 313

----- START Event Driven Actions Dump ----

---- END Event Driven Actions Dump ----

----- START DDE Actions Dump -----

Executing SYNC actions

----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----

DDE Action 'DB_STRUCTURE_INTEGRITY_CHECK' was flood controlled

----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (FLOOD CONTROLLED, 0 csec) -----

Executing ASYNC actions

----- END DDE Actions Dump (total 0 csec) -----

*** 2019-09-11 02:28:54.002

ORA-04021: timeout occurred while waiting to lock object

kjzduptcctx: Notifying DIAG for crash event

----- Abridged Call Stack Trace -----

ksedsts()+592<-kjzdssdmp()+720<-kjzduptcctx()+512<-kjzdicrshnfy()+160<-$cold_ksuitm()+5808<-$cold_ksbrdp()+2768<-opirip()+1312<-opidrv()+1152<-sou2o()+256<-

opimai_real()+352<-ssthrdmain()+608<-main()+336<-main_opd_entry()+80

----- End of Abridged Call Stack Trace -----

*** 2019-09-11 02:28:54.017

LGWR (ospid: 27954): terminating the instance due to error 4021

ksuitm: waiting up to [5] seconds before killing DIAG(27921)

TRACE给的信息也有限，只给了一些堆栈信息。

【分析】

借助MOS查下ORA-04021: timeout occurred while waiting to lock object。发现文档ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (文档 ID 2183882.1)与遇到的问题是一致的。判断理由：alter日志报错信息相同，第二，trace内容也一致。由文章得知（文档 ID 2183882.1），应该为Bug 16717701 – ADG SHOULD GET THE INSTANCE PARSE LOCK WITH A TIMEOUT或Bug 11712267 – ACTIVE DATA GUARD DATABASE HUNG ON ‘LIBRARY CACHE: MUTEX X’ WAIT EVENT。

当ADG通过日志恢复时，LGWR将实例状态对象锁定为独占模式。这样的结果是LGWR可以阻止SQL的解析，SQL的解析也能阻止LGWR。为了减少SQL解析对LGWR的影响，根据官网的solution可以将cursor_sharing设置为force（文档 ID 2183882.1）。

SYS@ORCL> show parameter cursor_sharing;
 
NAME                                 TYPE
------------------------------------ ---------------------------------
VALUE
------------------------------
cursor_sharing                       string
EXACT

这里简单介绍下参数cursor_sharing，官网介绍如下：

CURSOR_SHARING determines what kind of SQL statements can share the same cursors.

Values:

（1）FORCE

Allows the creation of a new cursor if sharing an existing cursor, or if the cursor plan is not optimal.

（2）SIMILAR

Causes statements that may differ in some literals, but are otherwise identical, to share a cursor, unless the literals affect either the meaning of the statement or the degree to which the plan is optimized.

（3）EXACT

Only allows statements with identical text to share the same cursor.

Notes:

（1）If you set CURSOR_SHARING, then Oracle recommends the FORCE setting unless you are in a DSS environment. FORCE limits the growth of child cursors that can occur when the setting is SIMILAR.

（2）The value of the CURSOR_SHARING parameter has performance implications. Refer to Oracle Database Performance Tuning Guide before setting this parameter.

简单介绍下上述内容：

Cursor_sharing参数是用来告诉Oracle在什么情况下可以共享游标，即SQL重用。

Cursor_sharing参数有3个值可以设置：

1. EXACT：通常来说，exa7ct值是Oracle推荐的，也是默认的，它要求SQL语句在完全相同时才会重用，否则会被重新执行硬解析操作。

2. SIMILAR：similar是在Oracle认为某条SQL语句的谓词条件可能会影响到它的执行计划时，才会被重新分析，否则将重用SQL。

3. FORCE：force是在任何情况下，无条件重用SQL。

注意：

上面所说的SQL重用，仅仅是指谓词条件不同的SQL语句，实际上这样的SQL基本上都在执行同样的业务操作。

查看视图v$sesstat 和v$statname 查看LGWR阻止的SQL

SYS@ORCL> select a.*,b.name from v$sesstat a , v$statname b
  2  where a.statistic#=b.statistic#
  3  and a.sid=(select distinct sid from v$mystat)
  4  and b.name like '%parse%';
       SID STATISTIC#      VALUE NAME
---------- ---------- ---------- ------------------------------
      1743        264          0 ADG parselock X get attempts
      1743        265          0 ADG parselock X get successes
      1743        622          4 parse time cpu
      1743        623          5 parse time elapsed
      1743        624         20 parse count (total)
      1743        625          7 parse count (hard)
      1743        626          0 parse count (failures)
      1743        627          0 parse count (describe)
8 rows selected.

MOS（文档 ID 2183882.1）提供的solution如下：

SOLUTION

Issue matches with bug 11712267 and bug 16717701

Since two bugs are matching with the case,

You can try with option (1) . As per Bug 11712267

change the cursor_sharing to force on Active dataguard (ADG).

Monitor your environment for sometime.

If it crashes again then follow with the option (2)Option (2):

As per bug description

LGWR can request DBINSTANCE lock in X mode without any timeout which can lead to a hang deadlock.

Both fixes are already included in 11.2.0.4 but the fix is DISABLED by default.== > To ENABLE the fix one has to set == > "_adg_parselock_timeout" > to the number of centi-seconds == > LGWR should wait before backing off and retrying the request.

Value should be in centi seconds. == > I Don't think there is really any hard fast rule for a value - at default (0) it will not timeout.A value representing a few seconds seems reasonable - if LGWR has been stuck for say 5 seconds waiting it seems reasonable guess it is not going to get the lock.

The param just causes it to abort the current attempt and retry If you want to play safe can start with a higher value then decrease later.A higher value will just mean more sessions blocked for longer in case of the deadlock situation.500 Seems reasonable , but I have no data to base it on.

There should be a statistic "ADG parselock X get attempts" If it gets set too small that value would likely increase a lot due to keep timing out and retrying.

This is a dynamic parameter

Follow option (1) .

change the cursor_sharing to force on ADG

If issue re-appears then follow option (2) as below

Please set "_adg_parselock_timeout" to 500 == >

SQL > alter system set "_adg_parselock_timeout"=500 scope=both sid='*';

简单翻译如下：

1.将cursor_sharing更改为force，减少SQL解析的时间。

2.如果再次发生该问题，将隐含参数”_adg_parselock_timeout”设置成500，加长解析时间，该参数作用防止超时；注意此参数可动态修改。

SQL > alter system set "_adg_parselock_timeout"=500 scope=both sid='*';

【参考】

https://blog.51cto.com/lyzbg/2090812

【参考】

https://blog.csdn.net/vic_qxz/article/details/88296722

【参考】

ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (文档 ID 2183882.1)

【参考】

http://download.oracle.com/docs/cd/E11882_01/server.112/e17110/initparams042.htm#REFRN10025

欢迎关注个人微信微信；

长按以下二维码或微信搜索“一森咖记”

往期精彩文章

=====================================

干货：RHEL7.2生产环境下双节点12c RAC搭建实操
干货：RHEL7.1环境下双节点Oracle RAC搭建实操
LINUX环境：MySQL和Oracle开机自启动，咋搞？
What：ASM自动脱落了
实操：12C RAC环境下的ADG同步库搭建
Oracle Rac：关闭透明大页的原因及方法
Oracle ADG同步技术，DBA必备的一种“后悔药”
Logminer：oracle人为误操作之恢复神器
“神器”：Oracle日志采集分析工具——TFA
技术八卦篇：Oracle云时代下的一个不再默认存在账号：Scott

oracle

文章转载自一森咖记，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

排故篇：ADG实例被LGWR进程宕掉？！

评论