暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

震惊:Oracle 11gR2 RAC ADG 并没有高可用

JiekeXu之路 2020-11-03
990

作者 | JiekeXu

来源 | JiekeXu之路(ID: JiekeXu_IT)

转载请联系授权 | (微信ID:xxq1426321293)

大家好,我是 JiekeXu,很高兴又和大家见面了,今天分享一篇 Oracle 11gR2 RAC ADG 并没有高可用的案例本文首发于微信公众号【JiekeXu之路】,欢迎点击上方蓝字关注我,标星或置顶,更多干货第一时间到达!

摘  要:上个月中的一次主机超过运行天数重启主机时发现 Oracle 11gR2 RAC ADG 并没有高可用,Linux 下 11204 的物理备库由于节点一主机关闭导致节点二异常关闭。

因 ADG 灾备环境在上个月时主机超过运行天数需要重启主机 RAC1 即节点一,便直接关闭了 GI 集群,理所当然本节点的数据库将正常关闭,但是却发现另一个节点数据库状态是 Mounted 的,这就有点奇怪了,下面来一起看看。

    --检查集群状态全为 ONLINE,数据库为 Open,Readonly  
    JiekeXu-rac1:/home/grid$crsctl stat res -t
    --------------------------------------------------------------------------------
    NAME TARGET STATE SERVER STATE_DETAILS
    --------------------------------------------------------------------------------
    Local Resources
    --------------------------------------------------------------------------------
    ora.ARCH.dg
    ONLINE ONLINE JiekeXu-rac1
    ONLINE ONLINE JiekeXu-rac2
    ora.DATA.dg
    ONLINE ONLINE JiekeXu-rac1
    ONLINE ONLINE JiekeXu-rac2
    ora.LISTENER.lsnr
    ONLINE ONLINE JiekeXu-rac1
    ONLINE ONLINE JiekeXu-rac2
    ora.OCR.dg
    ONLINE ONLINE JiekeXu-rac1
    ONLINE ONLINE JiekeXu-rac2
    ora.asm
    ONLINE ONLINE JiekeXu-rac1 Started
    ONLINE ONLINE JiekeXu-rac2 Started
    ora.gsd
    OFFLINE OFFLINE JiekeXu-rac1
    OFFLINE OFFLINE JiekeXu-rac2
    ora.net1.network
    ONLINE ONLINE JiekeXu-rac1
    ONLINE ONLINE JiekeXu-rac2
    ora.ons
    ONLINE ONLINE JiekeXu-rac1
    ONLINE ONLINE JiekeXu-rac2
    ora.registry.acfs
    ONLINE ONLINE JiekeXu-rac1
    ONLINE ONLINE JiekeXu-rac2
    --------------------------------------------------------------------------------
    Cluster Resources
    --------------------------------------------------------------------------------
    ora.LISTENER_SCAN1.lsnr
    1 ONLINE ONLINE JiekeXu-rac1
    ora.cvu
    1 ONLINE ONLINE JiekeXu-rac1
    ora.JiekeXuadg.db
    1 ONLINE ONLINE JiekeXu-rac1 Open,Readonly
    2 ONLINE ONLINE JiekeXu-rac2 Open,Readonly
    ora.JiekeXu-rac1.vip
    1 ONLINE ONLINE JiekeXu-rac1
    ora.JiekeXu-rac2.vip
    1 ONLINE ONLINE JiekeXu-rac2
    ora.oc4j
    1 ONLINE ONLINE JiekeXu-rac1
    ora.scan1.vip
    1 ONLINE ONLINE JiekeXu-rac1
    JiekeXu-rac1:/home/grid$
    JiekeXu-rac1:/home/grid$exit
    JiekeXu-rac1:/home/oracle$sqlplus as sysdba
    SQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 16 21:39:51 2020
    Copyright (c) 1982, 2013, Oracle. All rights reserved.
    Connected to:
    Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
    With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
    Data Mining and Real Application Testing options
    SQL> set linesize 600;
    col NAME for a20;
    col PCT_FREE for a30;
    select GROUP_NUMBER,NAME,TYPE,STATE,total_mb/1024 total_gb,free_mb/1024 free_gb,round((free_mb/total_mb)*100,2)||'%' pct_free from v$asm_diskgroup;
    SQL> SQL> SQL>
    GROUP_NUMBER NAME TYPE STATE TOTAL_GB FREE_GB PCT_FREE
    ------------ -------------------- ------ ----------- ---------- ---------- ------------------------------
    1 OCR NORMAL MOUNTED 3 2.09570313 69.86%
    2 ARCH EXTERN CONNECTED 100 21.8701172 21.87%
    3 DATA EXTERN CONNECTED 200 12.4785156 6.24%
    --检查数据库磁盘使用率,主备同步情况
    SQL> set linesize 150;
    set pagesize 20;
    column name format a13;
    column value format a20;
    column unit format a30;
    column TIME_COMPUTED format a30;
    select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
    SQL> SQL> SQL> SQL> SQL> SQL>
    NAME VALUE UNIT TIME_COMPUTED
    ------------- -------------------- ------------------------------ ------------------------------
    transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 21:39:58
    apply lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 21:39:58
    SQL> set line 345
    col instance_name for a15
    col host_name for a30
    select inst_id,INSTANCE_NAME,HOST_name,status,version,STARTUP_TIME from gv$instance;
    SQL> SQL> SQL>
    INST_ID INSTANCE_NAME HOST_NAME STATUS VERSION STARTUP_TIME
    ---------- --------------- ------------------------------ ------------ ----------------- -------------------
    1 JiekeXuadg1 JiekeXu-rac1 OPEN 11.2.0.4.0 2020-08-16 03:36:59
    2 JiekeXuadg2 JiekeXu-rac2 OPEN 11.2.0.4.0 2020-08-16 02:01:57

    --检查数据库主机名,状态,启动时间,打开模式等等
    SQL> select name,DATABASE_ROLE,OPEN_MODE,DB_UNIQUE_NAME,GUARD_STATUS,CREATED from gv$database;

    NAME DATABASE_ROLE OPEN_MODE DB_UNIQUE_NAME GUARD_S CREATED
    ------------- ---------------- -------------------- ------------------------------ ------- -------------------
    JiekeXu PHYSICAL STANDBY READ ONLY WITH APPLY JiekeXuadg NONE 2018-10-09 09:48:19
    JiekeXu PHYSICAL STANDBY READ ONLY WITH APPLY JiekeXuadg NONE 2018-10-09 09:48:19
    --检查 MRP0 进程应用日志情况
    SQL> SELECT PROCESS, STATUS,SEQUENCE#,thread# FROM V$MANAGED_STANDBY;

    PROCESS STATUS SEQUENCE# THREAD#
    --------- ------------ ---------- ----------
    ARCH CLOSING 4450 2
    ARCH CLOSING 4471 2
    ARCH CONNECTED 0 0
    ARCH CLOSING 4477 2
    ARCH CLOSING 4508 2
    ARCH CLOSING 4477 1
    ARCH CLOSING 4475 1
    ARCH CLOSING 4446 2
    ARCH CLOSING 4476 1
    ARCH CLOSING 4476 2
    MRP0 APPLYING_LOG 4513 2
    RFS IDLE 0 0
    RFS       IDLE                  0          0

    13 rows selected.
    SQL> exit

    查看应用连接,直接停止集群,然后 reboot

      JiekeXu-rac1:/home/oracle$exit
      JiekeXu-rac1:~ # cd app/product/11.2.0/grid/bin
      JiekeXu-rac1:/app/product/11.2.0/grid/bin # ps -ef | grep LOCAL=NO | grep -v grep | awk '{print $2}' | wc -l
      2
      JiekeXu-rac1:/app/product/11.2.0/grid/bin # ./crsctl stop crs
      CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.crsd' on 'JiekeXu-rac1'
      CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.ARCH.dg' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.OCR.dg' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.registry.acfs' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.JiekeXuadg.db' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.oc4j' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.cvu' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.cvu' on 'JiekeXu-rac1' succeeded
      CRS-2672: Attempting to start 'ora.cvu' on 'JiekeXu-rac2'
      CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.scan1.vip' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac1'
      CRS-2676: Start of 'ora.cvu' on 'JiekeXu-rac2' succeeded
      CRS-2677: Stop of 'ora.ARCH.dg' on 'JiekeXu-rac1' succeeded
      CRS-2677: Stop of 'ora.registry.acfs' on 'JiekeXu-rac1' succeeded
      CRS-2677: Stop of 'ora.scan1.vip' on 'JiekeXu-rac1' succeeded
      CRS-2672: Attempting to start 'ora.scan1.vip' on 'JiekeXu-rac2'
      CRS-2677: Stop of 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac1' succeeded
      CRS-2672: Attempting to start 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac2'
      CRS-2676: Start of 'ora.scan1.vip' on 'JiekeXu-rac2' succeeded
      CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac2'
      CRS-2676: Start of 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac2' succeeded
      CRS-2677: Stop of 'ora.oc4j' on 'JiekeXu-rac1' succeeded
      CRS-2672: Attempting to start 'ora.oc4j' on 'JiekeXu-rac2'
      CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac2' succeeded
      CRS-2677: Stop of 'ora.JiekeXuadg.db' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.DATA.dg' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.DATA.dg' on 'JiekeXu-rac1' succeeded
      CRS-2677: Stop of 'ora.OCR.dg' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.asm' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.asm' on 'JiekeXu-rac1' succeeded
      CRS-2676: Start of 'ora.oc4j' on 'JiekeXu-rac2' succeeded
      CRS-2673: Attempting to stop 'ora.ons' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.ons' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.net1.network' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.net1.network' on 'JiekeXu-rac1' succeeded
      CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'JiekeXu-rac1' has completed
      CRS-2677: Stop of 'ora.crsd' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.mdnsd' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.ctssd' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.evmd' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.asm' on 'JiekeXu-rac1'
      CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.ctssd' on 'JiekeXu-rac1' succeeded
      CRS-2677: Stop of 'ora.evmd' on 'JiekeXu-rac1' succeeded
      CRS-2677: Stop of 'ora.mdnsd' on 'JiekeXu-rac1' succeeded
      CRS-2677: Stop of 'ora.drivers.acfs' on 'JiekeXu-rac1' succeeded
      CRS-2677: Stop of 'ora.asm' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.cssd' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.cssd' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.crf' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.crf' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.gipcd' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.gipcd' on 'JiekeXu-rac1' succeeded
      CRS-2673: Attempting to stop 'ora.gpnpd' on 'JiekeXu-rac1'
      CRS-2677: Stop of 'ora.gpnpd' on 'JiekeXu-rac1' succeeded
      CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'JiekeXu-rac1' has completed
      CRS-4133: Oracle High Availability Services has been stopped.
      JiekeXu-rac1:/app/product/11.2.0/grid/bin #
      Broadcast message from root (pts/0) (Fri Oct 16 21:45:46 2020):

      The system is going down for reboot NOW!
      Connection to 192.16.11X.1XX closed by remote host.
      Connection to 192.16.11X.1XX closed.
      bye

      重启主机后检查集群状态

        JiekeXu-rac1:/home/grid$crsctl stat res -t 
        --------------------------------------------------------------------------------
        NAME TARGET STATE SERVER STATE_DETAILS
        --------------------------------------------------------------------------------
        Local Resources
        --------------------------------------------------------------------------------
        ora.ARCH.dg
        ONLINE ONLINE JiekeXu-rac1
        ONLINE ONLINE JiekeXu-rac2
        ora.DATA.dg
        ONLINE ONLINE JiekeXu-rac1
        ONLINE ONLINE JiekeXu-rac2
        ora.LISTENER.lsnr
        ONLINE ONLINE JiekeXu-rac1
        ONLINE ONLINE JiekeXu-rac2
        ora.OCR.dg
        ONLINE ONLINE JiekeXu-rac1
        ONLINE ONLINE JiekeXu-rac2
        ora.asm
        ONLINE ONLINE JiekeXu-rac1 Started
        ONLINE ONLINE JiekeXu-rac2 Started
        ora.gsd
        OFFLINE OFFLINE JiekeXu-rac1
        OFFLINE OFFLINE JiekeXu-rac2
        ora.net1.network
        ONLINE ONLINE JiekeXu-rac1
        ONLINE ONLINE JiekeXu-rac2
        ora.ons
        ONLINE ONLINE JiekeXu-rac1
        ONLINE ONLINE JiekeXu-rac2
        ora.registry.acfs
        ONLINE ONLINE JiekeXu-rac1
        ONLINE ONLINE JiekeXu-rac2
        --------------------------------------------------------------------------------
        Cluster Resources
        --------------------------------------------------------------------------------
        ora.LISTENER_SCAN1.lsnr
        1 ONLINE ONLINE JiekeXu-rac2
        ora.cvu
        1 ONLINE ONLINE JiekeXu-rac2
        ora.JiekeXuadg.db
              1        ONLINE  OFFLINE                               Instance Shutdown                   
        2 ONLINE INTERMEDIATE JiekeXu-rac2 Mounted (Closed)
        ora.JiekeXu-rac1.vip
        1 ONLINE ONLINE JiekeXu-rac1
        ora.JiekeXu-rac2.vip
        1 ONLINE ONLINE JiekeXu-rac2
        ora.oc4j
        1 ONLINE ONLINE JiekeXu-rac2
        ora.scan1.vip
        1 ONLINE ONLINE JiekeXu-rac2

        启动节点一数据库,应用 MRP0 进程检查同步情况

          JiekeXu-rac1:/home/grid$exit
          logout
          JiekeXu-rac1:~ # su - oracle
          JiekeXu-rac1:/home/oracle$
          JiekeXu-rac1:/home/oracle$sqlplus as sysdba

          SQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 16 22:13:25 2020

          Copyright (c) 1982, 2013, Oracle. All rights reserved.

          Connected to an idle instance.

          SQL> startup
          ORACLE instance started.

          Total System Global Area 7516033024 bytes
          Fixed Size 2267872 bytes
          Variable Size 1996490016 bytes
          Database Buffers 5502926848 bytes
          Redo Buffers 14348288 bytes
          Database mounted.
          Database opened.
          SQL> set linesize 150;
          set pagesize 20;
          column name format a13;
          column value format a20;
          column unit format a30;
          column TIME_COMPUTED format a30;
          select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
          SQL> SQL> SQL> SQL> SQL> SQL>
          NAME VALUE UNIT TIME_COMPUTED
          ------------- -------------------- ------------------------------ ------------------------------
          transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:13:58
          apply lag day(2) to second(0) interval 10/16/2020 22:13:58

          SQL> alter database recover managed standby database using current logfile disconnect from session;

          Database altered.

          SQL> select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');

          NAME VALUE UNIT TIME_COMPUTED
          ------------- -------------------- ------------------------------ ------------------------------
          transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:14:31
          apply lag day(2) to second(0) interval 10/16/2020 22:14:31

          SQL> SELECT PROCESS, STATUS,SEQUENCE#,thread# FROM V$MANAGED_STANDBY;

          PROCESS STATUS SEQUENCE# THREAD#
          --------- ------------ ---------- ----------
          ARCH CONNECTED 0 0
          ARCH CONNECTED 0 0
          ARCH CONNECTED 0 0
          ARCH CONNECTED 0 0
          ARCH CONNECTED 0 0
          ARCH CONNECTED 0 0
          ARCH CONNECTED 0 0
          ARCH CONNECTED 0 0
          ARCH CONNECTED 0 0
          ARCH CONNECTED 0 0
          RFS IDLE 0 0
          RFS IDLE 4516 1
          RFS IDLE 4516 2
          RFS IDLE 0 0
          MRP0 APPLYING_LOG 4516 2

          15 rows selected.

          SQL> select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');

          NAME VALUE UNIT TIME_COMPUTED
          ------------- -------------------- ------------------------------ ------------------------------
          transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:14:39
          apply lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:14:39

          SQL>
          SQL> exit
          Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
          With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
          Data Mining and Real Application Testing options

          登陆节点二查看数据库状态也是变成了 MOUNTED 这里并没有做任何操作,实在是奇怪吧。

            JiekeXu-rac1:/home/oracle$ssh JiekeXu-rac2
            Last login: Sun Aug 16 02:34:33 2020 from JiekeXu-rac1
            JiekeXu-rac2:/home/oracle$sqlplus / as sysdba 
            SQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 16 21:56:04 2020

            Copyright (c) 1982, 2013, Oracle. All rights reserved.

            Connected to:
            Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
            With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
            Data Mining and Real Application Testing options

            SQL> set line 345
            col instance_name for a15
            col host_name for a30
            select inst_id,INSTANCE_NAME,HOST_name,status,version,STARTUP_TIME from gv$instance;
            SQL> SQL> SQL>
            INST_ID INSTANCE_NAME HOST_NAME STATUS VERSION STARTUP_TIME
            ---------- --------------- ------------------------------ ------------ ----------------- -------------------
            2 JiekeXuadg2 JiekeXu-rac2 MOUNTED 11.2.0.4.0 2020-08-16 02:01:57
                     1 JiekeXuadg1      JiekeXu-rac1         OPEN         11.2.0.4.0        2020-10-16 22:13:29
                     
            --这里数据库是 MOUNTED 由之前的 open 直接变成了 MOUNTED。
            --故在这里说一句,众所周知数据库启动过程为started、mounted、open 三阶段,
            --那么关闭过程是对应有三阶段的过程 close 到 MOUNTED,dismount 到 STARTED 然后 shutdown 关闭实例。
            --如下所示,单点数据库实例关闭过程 
            SYS@JiekeXu>col HOST_NAME for a18
            SYS@JiekeXu>select INSTANCE_NAME,HOST_NAME,VERSION,STATUS from v$instance;

            INSTANCE_NAME HOST_NAME VERSION STATUS
            ---------------- ------------------ ----------------- ------------
            JiekeXu JiekeXu 11.2.0.4.0 OPEN

            SYS@JiekeXu>alter database close;

            Database altered.

            SYS@JiekeXu>select INSTANCE_NAME,HOST_NAME,VERSION,STATUS from v$instance;

            INSTANCE_NAME HOST_NAME VERSION STATUS
            ---------------- ------------------ ----------------- ------------
            JiekeXu JiekeXu 11.2.0.4.0 MOUNTED

            SYS@JiekeXu>alter database dismount;

            Database altered.

            SYS@JiekeXu>select INSTANCE_NAME,HOST_NAME,VERSION,STATUS from v$instance;

            INSTANCE_NAME HOST_NAME VERSION STATUS
            ---------------- ------------------ ----------------- ------------
            JiekeXu JiekeXu 11.2.0.4.0 STARTED

            SYS@JiekeXu>shutdown;
            ORA-01507: database not mounted

            ORACLE instance shut down.
            SYS@JiekeXu>               

            这里节点二直接 open 数据库检查同步情况

              SQL> alter database open;

              Database altered.
              SQL> set line 345
              col instance_name for a15
              col host_name for a30
              select inst_id,INSTANCE_NAME,HOST_name,status,version,STARTUP_TIME from gv$instance;
              SQL> SQL> SQL>
              INST_ID INSTANCE_NAME HOST_NAME STATUS VERSION STARTUP_TIME
              ---------- --------------- ------------------------------ ------------ ----------------- -------------------
              2 JiekeXuadg2 JiekeXu-rac2 OPEN 11.2.0.4.0 2020-08-16 02:01:57
              1 JiekeXuadg1 JiekeXu-rac1 OPEN 11.2.0.4.0 2020-10-16 22:13:29

              SQL> set linesize 150;
              set pagesize 20;
              column name format a13;
              column value format a20;
              column unit format a30;
              column TIME_COMPUTED format a30;
              select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
              SQL> SQL> SQL> SQL> SQL> SQL>
              NAME VALUE UNIT TIME_COMPUTED
              ------------- -------------------- ------------------------------ ------------------------------
              transport lag +00 00:00:00 day(2) to second(0) interval 10/17/2020 00:52:52
              apply lag +00 00:00:00 day(2) to second(0) interval 10/17/2020 00:52:52

              --查看数据库补丁 PSU 11.2.0.4.180717
              SQL> set line 150
              SQL> col ACTION_TIME for a30
              SQL> col ACTION for a8
              col NAMESPACE for a8
              col VERSION for a10
              col BUNDLE_SERIES for a5
              col COMMENTS for a20
              select * from dba_registry_history;

              ACTION_TIME ACTION NAMESPAC VERSION ID BUNDL COMMENTS
              ------------------------------ -------- -------- ---------- ---------- ----- --------------------
              24-AUG-13 12.03.45.119862 PM APPLY SERVER 11.2.0.4 0 PSU Patchset 11.2.0.2.0
              09-OCT-18 09.49.44.442749 AM APPLY SERVER 11.2.0.4 180717 PSU PSU 11.2.0.4.180717

              下面则通过 alert 日志查看问题

                --节点 2 的 alert 日志

                Fri Oct 16 21:41:17 2020
                Reconfiguration started (old inc 12, new inc 14)
                List of instances:
                2 (myinst: 2)
                Global Resource Directory frozen
                * dead instance detected - domain 0 invalid = TRUE
                Communication channels reestablished
                Master broadcasted resource hash value bitmaps
                Non-local Process blocks cleaned out
                Fri Oct 16 21:41:17 2020
                LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                Set master node info
                Submitted all remote-enqueue requests
                Dwn-cvts replayed, VALBLKs dubious
                All grantable enqueues granted
                Post SMON to start 1st pass IR
                Submitted all GCS remote-cache requests
                Post SMON to start 1st pass IR
                Fix write in gcs resources
                Reconfiguration complete

                --集群重配完成后实例直接 crash
                Fri Oct 16 21:41:18 2020
                Recovery session aborted due to instance crash
                Close the database due to aborted recovery session
                SMON: disabling tx recovery
                All dispatchers and shared servers shutdown
                CLOSE: killing server sessions.
                CLOSE: all sessions shutdown successfully.
                SMON: disabling tx recovery
                SMON: disabling cache recovery
                Fri Oct 16 21:41:29 2020
                RFS[4]: Possible network disconnect with primary database
                Fri Oct 16 21:41:32 2020
                idle dispatcher 'D000' terminated, pid = (27, 1)
                Fri Oct 16 21:41:50 2020
                RFS[5]: Possible network disconnect with primary database
                Fri Oct 16 21:42:17 2020
                Decreasing number of real time LMS from 2 to 0
                Fri Oct 16 21:47:00 2020
                Primary database is in MAXIMUM PERFORMANCE mode
                RFS[8]: Assigned to RFS process 27227
                RFS[8]: Selected log 5 for thread 1 sequence 4514 dbid 1813902435 branch 989056101
                Fri Oct 16 21:47:01 2020
                Primary database is in MAXIMUM PERFORMANCE mode
                RFS[9]: Assigned to RFS process 27236
                RFS[9]: Selected log 9 for thread 2 sequence 4514 dbid 1813902435 branch 989056101
                Fri Oct 16 21:47:01 2020
                RFS[10]: Assigned to RFS process 27264
                RFS[10]: Selected log 8 for thread 2 sequence 4513 dbid 1813902435 branch 989056101
                Fri Oct 16 21:47:02 2020
                Archived Log entry 5708 added for thread 2 sequence 4513 ID 0x6fa9692f dest 1:
                Fri Oct 16 21:47:54 2020
                Archived Log entry 5709 added for thread 1 sequence 4514 ID 0x6fa9692f dest 1:
                Fri Oct 16 21:47:54 2020
                RFS[8]: Selected log 5 for thread 1 sequence 4515 dbid 1813902435 branch 989056101
                Fri Oct 16 21:47:55 2020
                RFS[9]: Selected log 8 for thread 2 sequence 4515 dbid 1813902435 branch 989056101
                Fri Oct 16 21:47:55 2020
                Archived Log entry 5710 added for thread 2 sequence 4514 ID 0x6fa9692f dest 1:
                Fri Oct 16 21:52:25 2020
                Reconfiguration started (old inc 14, new inc 16)
                List of instances:
                1 2 (myinst: 2)
                Global Resource Directory frozen
                Communication channels reestablished
                Fri Oct 16 21:52:25 2020
                * domain 0 valid = 0 according to instance 1
                Master broadcasted resource hash value bitmaps
                Non-local Process blocks cleaned out
                Fri Oct 16 21:52:25 2020
                LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                Fri Oct 16 21:52:25 2020
                LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                Set master node info
                Submitted all remote-enqueue requests
                Dwn-cvts replayed, VALBLKs dubious
                All grantable enqueues granted
                Post SMON to start 1st pass IR
                Submitted all GCS remote-cache requests
                Post SMON to start 1st pass IR
                Fix write in gcs resources
                Reconfiguration complete
                Fri Oct 16 21:52:32 2020
                Managed Standby Recovery starting Real Time Apply
                Fri Oct 16 21:53:25 2020
                Increasing number of real time LMS from 0 to 2
                Reconfiguration started (old inc 16, new inc 18)
                List of instances:
                2 (myinst: 2)
                Global Resource Directory frozen
                * dead instance detected - domain 0 invalid = TRUE
                Communication channels reestablished
                Master broadcasted resource hash value bitmaps
                Non-local Process blocks cleaned out
                Fri Oct 16 21:53:34 2020
                Fri Oct 16 21:53:34 2020
                LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                Set master node info
                Submitted all remote-enqueue requests
                Dwn-cvts replayed, VALBLKs dubious
                All grantable enqueues granted
                Post SMON to start 1st pass IR
                Submitted all GCS remote-cache requests
                Post SMON to start 1st pass IR
                Fix write in gcs resources
                Reconfiguration complete
                Fri Oct 16 21:53:51 2020
                RFS[11]: Assigned to RFS process 30085
                RFS[11]: Selected log 6 for thread 1 sequence 4513 dbid 1813902435 branch 989056101
                Fri Oct 16 21:53:53 2020
                Archived Log entry 5711 added for thread 1 sequence 4513 ID 0x6fa9692f dest 1:
                Fri Oct 16 21:54:35 2020
                Decreasing number of real time LMS from 2 to 0
                Reconfiguration started (old inc 18, new inc 20)
                List of instances:
                1 2 (myinst: 2)
                Global Resource Directory frozen
                Communication channels reestablished
                Fri Oct 16 21:54:44 2020
                * domain 0 valid = 0 according to instance 1
                Master broadcasted resource hash value bitmaps
                Non-local Process blocks cleaned out
                Fri Oct 16 21:54:44 2020
                LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                Fri Oct 16 21:54:44 2020
                LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                Set master node info
                Submitted all remote-enqueue requests
                Dwn-cvts replayed, VALBLKs dubious
                All grantable enqueues granted
                Post SMON to start 1st pass IR
                Submitted all GCS remote-cache requests
                Post SMON to start 1st pass IR
                Fix write in gcs resources
                Reconfiguration complete
                Fri Oct 16 21:54:51 2020
                Managed Standby Recovery starting Real Time Apply
                Fri Oct 16 21:54:54 2020
                Archived Log entry 5712 added for thread 1 sequence 4515 ID 0x6fa9692f dest 1:
                Fri Oct 16 21:54:55 2020
                Archived Log entry 5713 added for thread 2 sequence 4515 ID 0x6fa9692f dest 1:
                Fri Oct 16 21:55:33 2020
                Managed Standby Recovery starting Real Time Apply
                Fri Oct 16 21:55:33 2020
                Reconfiguration started (old inc 20, new inc 22)
                List of instances:
                1 2 (myinst: 2)
                Global Resource Directory frozen
                Communication channels reestablished
                Fri Oct 16 21:55:33 2020
                * domain 0 valid = 1 according to instance 1
                Master broadcasted resource hash value bitmaps
                Non-local Process blocks cleaned out
                Fri Oct 16 21:55:33 2020
                LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                Fri Oct 16 21:55:33 2020
                LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
                Set master node info
                Submitted all remote-enqueue requests
                Dwn-cvts replayed, VALBLKs dubious
                All grantable enqueues granted
                Submitted all GCS remote-cache requests
                Fix write in gcs resources
                Reconfiguration complete
                Fri Oct 16 21:56:23 2020
                alter database open
                Picked Lamport scheme to generate SCNs
                Fri Oct 16 21:56:23 2020
                SMON: enabling cache recovery
                No Resource Manager plan active
                Physical standby database opened for read only access.
                Completed: alter database open
                Fri Oct 16 21:56:34 2020
                Increasing number of real time LMS from 0 to 2
                Fri Oct 16 23:58:42 2020

                从日志中发现 "Close the database due to aborted recovery session"  节点二数据库 close 是因为节点一 recover session 终止了,下面查看节点一的日志直接 abort 了,没有相信的日志信息。

                  节点 1 的 alert 日志
                  --21:41:16 节点一直接 abort 没有相信日志信息,相当于异常断电
                  Fri Oct 16 21:41:16 2020
                  Shutting down instance (abort)
                  License high water mark = 13
                  USER (ospid: 26959): terminating the instance
                  Instance terminated by USER, pid = 26959
                  Fri Oct 16 21:41:18 2020
                  Instance shutdown complete

                  -- 22:11:08 数据库再一次启动,大页相关配置信息,可见这里并没有配置,不过后面有推荐设置 3601
                  Fri Oct 16 22:11:08 2020
                  Starting ORACLE instance (normal)
                  ************************ Large Pages Information *******************
                  Per process system memlock (soft) limit = UNLIMITED

                  Total Shared Global Region in Large Pages = 0 KB (0%)

                  Large Pages used by this instance: 0 (0 KB)
                  Large Pages unused system wide = 0 (0 KB)
                  Large Pages configured system wide = 0 (0 KB)
                  Large Page size = 2048 KB

                  RECOMMENDATION:
                  Total System Global Area size is 7202 MB. For optimal performance,
                  prior to the next instance restart:
                  1. Increase the number of unused large pages by
                  at least 3601 (page size 2048 KB, total size 7202 MB) system wide to
                  get 100% of the System Global Area allocated with large pages
                  ********************************************************************
                  LICENSE_MAX_SESSION = 0
                  LICENSE_SESSIONS_WARNING = 0
                  Initial number of CPU is 4
                  Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.
                  [name='eth1:1', type=1, ip=169.254.111.0, mac=00-50-56-b9-xx-xx, net=10.xx.xx.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
                  Public Interface 'eth0' configured from GPnP for use as a public interface.
                  [name='eth0', type=1, ip=10.xx.xx.8, mac=fa-16-3e-77-xx-xx, net=10.xx.xx.0/24, mask=255.255.255.0, use=public/1]
                  Public Interface 'eth0:1' configured from GPnP for use as a public interface.
                  [name='eth0:1', type=1, ip=10.xx.xx.1, mac=fa-16-3e-77-xx-xx, net=10.xx.xx.0/24, mask=255.255.255.0, use=public/1]
                  Picked latch-free SCN scheme 3

                  --这里全部都是数据库参数相关设置
                  Autotune of undo retention is turned on.
                  LICENSE_MAX_USERS = 0
                  SYS auditing is disabled
                  Starting up:
                  Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
                  With the Partitioning, Real Application Clusters, OLAP, Data Mining
                  and Real Application Testing options.
                  ORACLE_HOME = app/product/11.2.0/db
                  System name: Linux
                  Node name: JiekeXu-rac1
                  Release: 3.0.76-0.11-default
                  Version: #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990)
                  Machine: x86_64
                  VM name: VMWare Version: 6
                  Using parameter settings in server-side pfile app/product/11.2.0/db/dbs/initJiekeXuadg1.ora
                  System parameters with non-default values:
                  processes = 2000
                  sessions = 3024
                  event = "28401 TRACE NAME CONTEXT FOREVER, LEVEL 1"
                  event = "10949 TRACE NAME CONTEXT FOREVER"
                  spfile = "+DATA/JiekeXuadg/parameterfile/spfile.256.1046662725"
                  sga_target = 7200M
                  control_files = "+DATA/JiekeXuadg/controlfile/current.257.995197687"
                  db_file_name_convert = "+DATA"
                  db_file_name_convert = "+DATA"
                  log_file_name_convert = "+DATA"
                  log_file_name_convert = "+DATA"
                  db_block_size = 8192
                  compatible = "11.2.0.4.0"
                  log_archive_dest_1 = "location=+ARCH valid_for=(all_logfiles,all_roles) db_unique_name=JiekeXuadg"
                  log_archive_dest_2 = "service=JiekeXu LGWR ASYNC valid_for=(online_logfiles,primary_role) db_unique_name=JiekeXu"
                  log_archive_dest_state_1 = "enable"
                  log_archive_dest_state_2 = "enable"
                  fal_client = "JiekeXuadg"
                  fal_server = "JiekeXu"
                  log_archive_config = "dg_config=(JiekeXuadg,JiekeXu)"
                  log_archive_format = "%t_%s_%r.dbf"
                  log_archive_max_processes= 10
                  db_files = 1024
                  cluster_database = TRUE
                  db_create_file_dest = "+DATA"
                  db_recovery_file_dest = ""
                  db_recovery_file_dest_size= 30G
                  standby_file_management = "AUTO"
                  thread = 1
                  undo_tablespace = "UNDOTBS1"
                  undo_retention = 1440
                  instance_number = 1
                  remote_login_passwordfile= "EXCLUSIVE"
                  db_domain = ""
                  service_names = "JiekeXu"
                  dispatchers = "(PROTOCOL=TCP) (SERVICE=JiekeXuXDB)"
                  remote_listener = "db-cluster-scan:1521"
                  session_cached_cursors = 100
                  job_queue_processes = 15
                  audit_file_dest = "/app/oracle/admin/JiekeXu/adump"
                  audit_trail = "NONE"
                  db_name = "JiekeXu"
                  db_unique_name = "JiekeXuadg"
                  open_cursors = 500
                  pga_aggregate_target = 2397M
                  deferred_segment_creation= TRUE
                  enable_ddl_logging = TRUE
                  diagnostic_dest = "/app/oracle"
                  Cluster communication is configured to use the following interface(s) for this instance
                  169.254.111.0
                  cluster interconnect IPC version:Oracle UDP/IP (generic)
                  IPC Vendor 1 proto 2
                  Fri Oct 16 22:11:16 2020
                  PMON started with pid=2, OS id=8557
                  Fri Oct 16 22:11:16 2020
                  PSP0 started with pid=3, OS id=8559
                  Fri Oct 16 22:11:17 2020
                  VKTM started with pid=4, OS id=8748 at elevated priority
                  VKTM running at (1)millisec precision with DBRM quantum (100)ms

                  这种情况还真是比较少见的,于是只能通过搜索引擎来查看相关问题了。MOS 中 Doc ID 1357597.1 给出了明确解释:

                    In an Active Data Guard RAC standby, if the redo apply instance crashes in the middle of media recovery, it leaves the RAC cache fusion locks on the surviving instances 
                    and the data files on disk in an in-flux state. In such a situation, queries on the surviving instances can potentially see inconsistent data. To resolve this in-flux state, the entire standby database is closed.
                    Upon opening the first instance after such a close, the buffer caches and datafiles are made consistent again.
                    翻译下来就是:在 Active Data Guard RAC 备库中,如果 redo apply 实例在恢复过程中崩溃,它会使幸存实例和磁盘上的数据文件上的 RAC 缓存融合锁定处于通量状态。在这种情况下,对幸存实例的查询可能会看到不一致的数据。
                    要解决此处于流动状态,将关闭整个备用数据库。在关闭后打开第一个实例时,缓冲区缓存和数据文件再次保持一致。

                    大概意思就是说,如果 apply redo 应用日志的实例进程异常终止后,其它所有OPEN READ ONLY 的实例会 close, 因为在 RAC ADG 环境中,如果实例在应用日志过程中中断 crash, 会把CACHE FUSION 的锁留到残留幸存的实例中,会导致数据查询不一致,因次需要关闭数据库,重新打开来保证 buffer cache 和 datafile 的一致状态。如果配置了 DG Broker 这个操作可以自动完成, 版本大于11.2.0.2,如果没有配置 DG broker,手动方式直接 open 就可以了,接着手动执行应用日志命令,不需要重启实例,继续在幸存的节点上应用日志。

                    在墨天轮中,Anbob 大神写到从 12.1 版本引入了新特性 ”ADG instance recovery” ,解决的是当 redo apply instance crash 时,影响其它实例也 close 问题,从 12.1 以后保存下来的 ADG 实例会自动做 adg instance recovery ,保证数据一致性。11204 中要安装了较新的 PSU,修复了 bug 18331944 和 19516448,同时再配置隐藏参数””_adg_instance_recovery=TRUE””。关于隐含参数介绍可查看我之前的一篇文章:

                    Oracle 隐含参数查看
                    https://www.modb.pro/db/24621
                      SQL> col name for a30
                      SQL> col value for a30
                      SQL> col DESCRIPTION for a70
                      SQL> set line 234
                      SQL> select a.ksppinm name, b.ksppstvl value,a.ksppdesc description from x$ksppi a, x$ksppcv b where a.indx = b.indx and a.ksppinm like '_adg_instance_recovery';

                      NAME VALUE DESCRIPTION
                      ------------------------------ ------------------------------ ----------------------------------------------------------------------
                      _adg_instance_recovery         FALSE                          enable ADG instance recovery



                        EXPLANATION
                        --------------

                        Before this feature (prior to 12.1), when the ADG apply instance crashed in the middle of applying changes (standby media recovery session is on-going), all remaining open instances will be closed.

                        But from 12.1 when the apply instance crashed in the middle of applying changes, one of the remaining open instances will be automatically posted to do "ADG instance recovery", after the ADG instance recovery.We can see this,
                        ADG instance recovery by checking the alert log, for the messages like "Beginning ADG Instance Recovery" and "Completed ADG Instance Recovery".
                        If DG broker is enabled then Broker will start the MRP on any of the surviving instances.

                        Please note that the new feature is enabled by default in 12.1. We backported it to 11.2.0.4, you can enable it on 11.2.0.4 by setting "_adg_instance_recovery=TRUE".
                        The default behaviour on 11.2.0.4 is to close all remaining open instances

                        翻译:在此特性之前(12.1之前),当 ADG apply 实例在应用更改的过程中崩溃(备用媒体恢复会话正在进行)时,所有剩余的打开实例将被关闭。 
                        但是从 12.1 开始,当应用实例在应用更改的过程中崩溃时,在 ADG 实例恢复之后,剩余的一个打开的实例将被自动提交去做“ADG实例恢复”。我们可以通过查看警报日志看到 ADG 实例恢复,对于“开始ADG实例恢复”和“完成ADG实例恢复”这样的消息。如果 DG broker 是启用的,那么 broker 将在任何存活的实例上启动 MRP。
                        请注意,12.1 中的新特性是默认启用的。我们将它向后移植到 11.2.0.4,您可以通过设置 “_adg_instance_recovery=TRUE” 在 11.2.0.4 上启用它。
                        11.2.0.4 的默认行为是关闭所有其余打开的实例。

                        下面是一个 MOS 上测试案例分享一下:

                          Test case
                          ------------

                          I have 2 node primary and 3 node physical standby(instance number 2,4 and 5).

                          MRP running on instance 2 and other instances are in open read only.

                          1. On instance 2,


                          SQL> select d.open_mode,i.instance_number from v$database d, v$instance i;

                          OPEN_MODE INSTANCE_NUMBER
                          -------------------- ---------------
                          READ ONLY WITH APPLY 2

                          SQL> select process,status,sequence#,thread# from v$managed_standby where process like '%MRP%';


                          PROCESS STATUS SEQUENCE# THREAD#
                          --------- ------------ ---------- ----------
                          MRP0 WAIT_FOR_LOG 22 1

                          SQL> show parameter _adg

                          NAME TYPE VALUE
                          ------------------------------------ ----------- ------------------------------
                          _adg_instance_recovery boolean TRUE

                          2. On instance 4,

                          SQL> select process,status,sequence#,thread# from v$managed_standby where process like '%MRP%';

                          no rows selected

                          SQL> show parameter _adg

                          NAME TYPE VALUE
                          ------------------------------------ ----------- ------------------------------
                          _adg_instance_recovery boolean TRUE

                          3. Shut abort instance 2.

                          SQL> shut abort
                          ORACLE instance shut down.
                          SQL> exit


                          4. Instance 4 still open_read only.


                          SQL> select d.open_mode,i.instance_number from v$database d, v$instance i;


                          OPEN_MODE INSTANCE_NUMBER
                          -------------------- ---------------
                          READ ONLY 4
                           
                          NOTE :
                          1. As said before if DG broker configured then Broker will start MRP on any one of the available instances.
                          2. If you want to use ADG instance recovery feature on 11.2.0.4 (including 11.2.0.4 BP or PSUs), please make sure the following patches for following fixes are present. The ADG instance recovery feature is not usable without those fixes.
                          - Bug fix for 18331944.
                          - Bug fix for 19516448.

                          当然,我的生产灾备环境是没有相关补丁和参数设置的,PSU 11.2.0.4.180717 过旧,故会产生此类问题。


                          参考链接:

                          https://www.modb.pro/db/32021

                          Active Data Guard RAC Standby – Apply Instance Node Failure Impact (Doc ID 1357597.1)

                          Behavior of Active Dataguard(ADG) When Apply Node Aborts/Crash (Doc ID 1613719.1)

                          今天的分享就到这里了,如果本文对您有一丁点儿帮助,请多支持“在看”与转发,不求小费了哪怕是一个小小的赞,您的鼓励都将是我熬夜写文章最大的动力,让我有一直写下去的动力,最后一起加油,奥利给



                          Oracle 12c 及以上版本补丁更新说明及下载方法(收藏版)

                          Oracle 11.2.0.4 RAC 最新补丁下载(11.2.0.4.200714)

                          案例分享|Oracle 11g RAC 数据库连接数过高处理办法

                          11g RAC 在线存储迁移实现 OCR 磁盘组完美替换

                          如何通过 Shell 监控异常等待事件和活跃会话 

                          我的 OCM 之路|书写无悔青春追梦永不止步

                          Oracle 19c 之多租户 PDB 连接与访问(三)

                          案例:RMAN 备份控制文件报错 ORA-00230

                          Oracle 12C 最新补丁下载与安装操作指北

                          DBA 常用的软件工具有哪些(分享篇)?

                          Oracle 11g 临时表空间管理

                          Oracle 每日一题系列合集


                          一键三连分享、在看与点赞”,给我充点儿电吧~


                          文章转载自JiekeXu之路,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

                          评论