暂无图片
暂无图片
1
暂无图片
暂无图片
暂无图片

排故篇:ADG实例被LGWR进程宕掉?!

一森咖记 2019-09-11
1570

【此为"一森咖记"微信——第43篇文章】

【引言

Acticve DataGuard突然实例宕机。通过查看Alert的Trace发现是LGWR进程把实例宕掉了。

具体alert trace如下:


【报错信息】

ORA-04021: timeout occurred while waiting to lock object

LGWR (ospid: 27954): terminating the instance due to error 4021

Wed Sep 11 02:28:54 2019

opiodr aborting process unknown ospid (28247) as a result of ORA-1092

Wed Sep 11 02:28:54 2019

ORA-1092 : opitsk aborting process

Wed Sep 11 02:28:54 2019

System state dump requested by (instance=1, osid=27954 (LGWR)), summary=[abnormal instance termination].

System State dumped to trace file oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc

Instance terminated by LGWR, pid = 27954

 

error 4021 detected in background process


可以看到这里首先出现了ORA-04021: timeout occurred while waiting to lock object,紧接着LGWR便terminating the instance。接着看一下Trace文件/oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc。


Trace file oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

With the Partitioning, OLAP, Data Mining and Real Application Testing options

ORACLE_HOME = opt/oracle/product/11.2.0.4/db_1

System name:    HP-UX

Node name:      ethan-db

Release:        B.11.31

Version:        U

Machine:        ia64

Instance name: ORCL

Redo thread mounted by this instance: 1

Oracle process number: 16

Unix process pid: 27954, image: oracle@ethan-db (LGWR)

 

*** 2019-08-22 15:14:45.875

*** SESSION ID:(1162.1) 2019-08-22 15:14:45.875

*** CLIENT ID:() 2019-08-22 15:14:45.875

*** SERVICE NAME:() 2019-08-22 15:14:45.875

*** MODULE NAME:() 2019-08-22 15:14:45.875

*** ACTION NAME:() 2019-08-22 15:14:45.875

 

DDE rules only execution for: ORA 312

----- START Event Driven Actions Dump ----

---- END Event Driven Actions Dump ----

----- START DDE Actions Dump -----

Executing SYNC actions

----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----

Successfully dispatched

----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 0 csec) -----

Executing ASYNC actions

----- END DDE Actions Dump (total 0 csec) -----

DDE rules only execution for: ORA 313

----- START Event Driven Actions Dump ----

---- END Event Driven Actions Dump ----

----- START DDE Actions Dump -----

Executing SYNC actions

----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----

DDE Action 'DB_STRUCTURE_INTEGRITY_CHECK' was flood controlled

----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (FLOOD CONTROLLED, 0 csec) -----

Executing ASYNC actions

----- END DDE Actions Dump (total 0 csec) -----

 

*** 2019-09-11 02:28:54.002

ORA-04021: timeout occurred while waiting to lock object

kjzduptcctx: Notifying DIAG for crash event

----- Abridged Call Stack Trace -----

ksedsts()+592<-kjzdssdmp()+720<-kjzduptcctx()+512<-kjzdicrshnfy()+160<-$cold_ksuitm()+5808<-$cold_ksbrdp()+2768<-opirip()+1312<-opidrv()+1152<-sou2o()+256<-

opimai_real()+352<-ssthrdmain()+608<-main()+336<-main_opd_entry()+80

----- End of Abridged Call Stack Trace -----

*** 2019-09-11 02:28:54.017

LGWR (ospid: 27954): terminating the instance due to error 4021

ksuitm: waiting up to [5] seconds before killing DIAG(27921)


TRACE给的信息也有限,只给了一些堆栈信息。


【分析】

借助MOS查下ORA-04021: timeout occurred while waiting to lock object。发现文档ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (文档 ID 2183882.1)与遇到的问题是一致的。判断理由:alter日志报错信息相同,第二,trace内容也一致。由文章得知(文档 ID 2183882.1),应该为Bug 16717701 – ADG SHOULD GET THE INSTANCE PARSE LOCK WITH A TIMEOUT或Bug 11712267 – ACTIVE DATA GUARD DATABASE HUNG ON ‘LIBRARY CACHE: MUTEX X’ WAIT EVENT。


当ADG通过日志恢复时,LGWR将实例状态对象锁定为独占模式。这样的结果是LGWR可以阻止SQL的解析,SQL的解析也能阻止LGWR。为了减少SQL解析对LGWR的影响,根据官网的solution可以将cursor_sharing设置为force文档 ID 2183882.1)。

    SYS@ORCL> show parameter cursor_sharing;

    NAME TYPE
    ------------------------------------ ---------------------------------
    VALUE
    ------------------------------
    cursor_sharing string
    EXACT

    这里简单介绍下参数cursor_sharing,官网介绍如下:

     CURSOR_SHARING determines what kind of SQL statements can share the same cursors.

     Values:

    1FORCE

           Allows the creation of a new cursor if sharing an existing cursor, or if the cursor plan is not optimal.

     2SIMILAR

           Causes statements that may differ in some literals, but are otherwise identical, to share a cursor, unless the literals affect either the meaning of the statement or the degree to which the plan is optimized.

     3EXACT

           Only allows statements with identical text to share the same cursor.


    Notes:

           1If you set CURSOR_SHARING, then Oracle recommends the FORCE setting unless you are in a DSS environment. FORCE limits the growth of child cursors that can occur when the setting is SIMILAR.

       2The value of the CURSOR_SHARING parameter has performance implications. Refer to Oracle Database Performance Tuning Guide before setting this parameter.


    简单介绍下上述内容:

    Cursor_sharing参数是用来告诉Oracle在什么情况下可以共享游标,即SQL重用。

    Cursor_sharing参数有3个值可以设置:

    1. EXACT:通常来说,exa7ct值是Oracle推荐的,也是默认的,它要求SQL语句在完全相同时才会重用,否则会被重新执行硬解析操作。

    2. SIMILARsimilar是在Oracle认为某条SQL语句的谓词条件可能会影响到它的执行计划时,才会被重新分析,否则将重用SQL

    3. FORCEforce是在任何情况下,无条件重用SQL


    注意:

    上面所说的SQL重用,仅仅是指谓词条件不同的SQL语句,实际上这样的SQL基本上都在执行同样的业务操作。


    查看视图v$sesstat 和v$statname 查看LGWR阻止的SQL

      SYS@ORCL> select a.*,b.name from v$sesstat a , v$statname b
      2 where a.statistic#=b.statistic#
      3 and a.sid=(select distinct sid from v$mystat)
      4 and b.name like '%parse%';
      SID STATISTIC# VALUE NAME
      ---------- ---------- ---------- ------------------------------
      1743 264 0 ADG parselock X get attempts
      1743 265 0 ADG parselock X get successes
      1743 622 4 parse time cpu
      1743 623 5 parse time elapsed
      1743 624 20 parse count (total)
      1743 625 7 parse count (hard)
      1743 626 0 parse count (failures)
      1743 627 0 parse count (describe)
      8 rows selected.


      MOS(文档 ID 2183882.1)提供的solution如下:

      SOLUTION

      Issue matches with bug 11712267 and bug 16717701

      Since two bugs are matching with the case,

      You can try with option (1) . As per Bug 11712267

      change the cursor_sharing to force on Active dataguard (ADG).

      Monitor your environment for sometime.

      If it crashes again then follow with the option (2)Option (2):

      As per bug description

      LGWR can request DBINSTANCE lock in X mode without any timeout which can lead to a hang deadlock.

      Both fixes are already included in 11.2.0.4 but the fix is DISABLED by default.== > To ENABLE the fix one has to set == > "_adg_parselock_timeout" > to the number of centi-seconds == > LGWR should wait before backing off and retrying the request.

      Value should be in centi seconds. == > I Don't think there is really any hard fast rule for a value - at default (0) it will not timeout.A value representing a few seconds seems reasonable - if LGWR has been stuck for say 5 seconds waiting it seems reasonable guess it is not going to get the lock.

      The param just causes it to abort the current attempt and retry If you want to play safe can start with a higher value then decrease later.A higher value will just mean more sessions blocked for longer in case of the deadlock situation.500 Seems reasonable , but I have no data to base it on.

      There should be a statistic "ADG parselock X get attempts" If it gets set too small that value would likely increase a lot due to keep timing out and retrying.

      This is a dynamic parameter

      Follow option (1) .

      change the cursor_sharing to force on ADG

      If issue re-appears then follow option (2) as below

      Please set "_adg_parselock_timeout" to 500 == >

      SQL > alter system set "_adg_parselock_timeout"=500 scope=both sid='*';


      简单翻译如下:

      1.将cursor_sharing更改为force,减少SQL解析的时间。

      2.如果再次发生该问题,将隐含参数”_adg_parselock_timeout”设置成500,加长解析时间,该参数作用防止超时;注意此参数可动态修改。

        SQL > alter system set "_adg_parselock_timeout"=500 scope=both sid='*';


        【参考】

        https://blog.51cto.com/lyzbg/2090812

        【参考】

        https://blog.csdn.net/vic_qxz/article/details/88296722

        【参考】

        ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (文档 ID 2183882.1)

        【参考】

        http://download.oracle.com/docs/cd/E11882_01/server.112/e17110/initparams042.htm#REFRN10025



        欢迎关注个人微信微信;

        长按以下二维码或微信搜索“一森咖记”


        往期精彩文章

        =====================================

        1. 干货:RHEL7.2生产环境下双节点12c RAC搭建实操

        2. 干货:RHEL7.1环境下双节点Oracle RAC搭建实操

        3. LINUX环境:MySQL和Oracle开机自启动,咋搞?

        4. What:ASM自动脱落了

        5. 实操:12C RAC环境下的ADG同步库搭建

        6. Oracle Rac:关闭透明大页的原因及方法

        7. Oracle ADG同步技术,DBA必备的一种“后悔药”

        8. Logminer:oracle人为误操作之恢复神器

        9. “神器”:Oracle日志采集分析工具——TFA

        10. 技术八卦篇:Oracle云时代下的一个不再默认存在账号:Scott




        文章转载自一森咖记,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

        评论