
Oracle RAC Cluster OCR Registry Recovery: A Case Study

I recently helped a customer work through two incidents of corrupted cluster OCR/voting-disk data, and this is a good opportunity to write up the handling process. The production database in this case runs version 11.2.0.4 in a two-node deployment.
I. How It Works
The Oracle Cluster Registry (OCR) stores all configuration information for the cluster's RAC databases. Because this configuration is common to, and shared by, every node, the OCR is kept on shared disk.

Since the OCR sits on shared storage, every node in the cluster accesses it through its local CRSD process, which maintains a cached copy of the OCR in memory. To protect the OCR's integrity, Oracle does not let every node write to it directly: only one process, known as the OCR Master, may modify the OCR. The OCR Master refreshes its own local OCR cache as well as the OCR caches on the other nodes. Put differently, OCR clients on each node always read from the local OCR cache; when a node needs to change the OCR, its local CRSD process forwards the change to the OCR Master, which applies it.
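If you want to see which node currently holds the OCR Master role, the crsd log normally records it. A minimal sketch, assuming the 11.2 log layout and noting that the exact message wording can vary by patch level:

    # run on any node as the grid owner; $GRID_HOME is your Grid Infrastructure home
    # crsd.l* also catches rotated logs (crsd.l01, crsd.l02, ...)
    grep -i "OCR MASTER" $GRID_HOME/log/$(hostname -s)/crsd/crsd.l*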

II. Common Management Commands

1. Using the ocrdump command

The OCR stores conventional key-value pairs arranged as a tree with three branches: SYSTEM, DATABASE, and CRS. Use ocrdump to export the whole registry, or pass parameters to export just one branch. Run with no arguments, it writes a file named OCRDUMPFILE in the current directory; the options below tailor the output as needed.

    -stdout     print the dump to the screen instead of creating a file
    filename    write the dump to the named file (e.g., ocrdump.txt)
    -keyname    print only the specified key and its subkeys
    -xml        emit the dump in XML format
    Example: ocrdump -keyname SYSTEM -xml ocrdump.txt


    Command: ocrdump
    Command: cat OCRDUMPFILE
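A few more invocation patterns are handy; a hedged sketch (the file path and key name below are illustrative):

    ocrdump -stdout                              # print to the terminal, create no file
    ocrdump /tmp/ocr_full.txt                    # write the full dump to a named file
    ocrdump -stdout -keyname SYSTEM.css -xml     # dump only one key subtree, as XML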


    Check the cluster version:
    crsctl query crs activeversion

2. Checking OCR health

ocrcheck verifies the consistency of the OCR contents; run it directly, with no parameters. Each run also writes a log file, as shown below:

    Command: ocrcheck
    find /u01 -name "ocrcheck_*"
    /u01/app/11.2.0/grid/log/nhzhbg-db1/client/ocrcheck_88595.log


    # more ocrcheck_88595.log
    Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
    2025-01-09 22:06:04.415: [OCRCHECK][1104815936]ocrcheck starts...
    2025-01-09 22:06:06.092: [OCRCHECK][1104815936]protchcheck: OCR status : total = [262120], used = [2956], avail = [259164]
    2025-01-09 22:06:27.855: [OCRCHECK][1104815936]Exiting [status=success]...
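On 11.2 the same tool can also check the node-local registry; a short sketch (the -local flag targets the OLR rather than the shared OCR):

    ocrcheck           # consistency check of the shared OCR; run as root for a full check
    ocrcheck -local    # consistency check of this node's OLR (Oracle Local Registry)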

3. Manually backing up the OCR

OCR configuration is managed with the ocrconfig tool, which must be run as root.

    Command: ocrconfig -manualbackup
    2025/01/09 22:14:00     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup_20250109_221400.ocr
    Check the existing OCR backups; an automatic backup normally runs every 4 hours:
    Command: ocrconfig -showbackup
    2025/01/07 04:23:03     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup00.ocr
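ocrconfig can also filter the backup listing and relocate the automatic backups; a hedged sketch (the backup directory below is hypothetical):

    ocrconfig -showbackup auto               # list automatic backups only
    ocrconfig -showbackup manual             # list manual backups only
    ocrconfig -backuploc /u01/ocr_backup     # hypothetical path: redirect automatic backups there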

4. Repairing the OCR

"Amnesia" occurs when one node updates part of the OCR (say, after adding a disk) while another node, down for a reboot or some other fault, cannot receive the change in time; the failed node then comes back without knowing the OCR was modified. This situation is what we call amnesia.

    ocrconfig -repair
    # repair the primary OCR
    ocrconfig -repair ocr device_name
    # repair the mirror OCR
    ocrconfig -repair ocrmirror device_name

5. Replacing the OCR

    ocrconfig -replace
    ocrconfig -replace ocr
    ocrconfig -replace ocrmirror
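The ocr/ocrmirror keywords above are the older 10g form; on 11.2 both -repair and -replace take flag-style arguments instead. A hedged sketch, where +OCRDG and +NEWDG are hypothetical disk group names:

    ocrconfig -repair -add +OCRDG                    # re-register an OCR location on a node that was down
    ocrconfig -repair -delete +OCRDG                 # remove a stale OCR location from that node
    ocrconfig -replace +OCRDG -replacement +NEWDG    # move the OCR to a new location cluster-wide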

III. Case Intake

1. The monitoring platform detected that both database servers had rebooted abnormally at the same time. Logging into the OS showed the cluster stack in a failed state: the OHASD, CRS, CSS, and EVM services could no longer be contacted.

    Command: crsctl check crs
    Output:
    CRS-4638: Cannot communicate with Oracle High Availability Services
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
    CRS-4534: Cannot communicate with Event Manager


    Command: ps -ef|grep d.bin and ps -ef|grep ohasd


    Command: crsctl stat res -t
    Output:
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4000: Command Status failed, or completed with errors.

2. We then checked the clusterware alert log (alert<hostname>.log) and saw that the CSS cluster synchronization daemon had hit a fatal error and was terminating.

    2025-01-07 11:16:04.793
    [cssd(2738)]CRS-1656:The CSS daemon is terminating due to a fatal error;
    Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/<node1_hostname>/cssd/ocssd.log
    2025-01-07 11:16:04.910
    [cssd(2738)]CRS-1652:Starting clean up of CRSD resources.
    2025-01-07 11:16:05.409
    [cssd(2738)]CRS-1604:CSSD voting file is offline: /dev/oracleasm/disks/VOTE02
    details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/<node1_hostname>/cssd/ocssd.log.
    2025-01-07 11:16:05.409
    [cssd(2738)]CRS-1604:CSSD voting file is offline: /dev/oracleasm/disks/VOTE03
    details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/<node1_hostname>/cssd/ocssd.log.

3. Digging into the cssd log, the error stack shows the cluster failing right after a call to clssnmvReadDskHeartbeat, the disk-heartbeat reader. When the disk heartbeat cannot be read, the corresponding handling kicks in: the node is evicted from the cluster and the server reboots. This mechanism is essential for preserving cluster stability and data integrity.
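Before trusting the eviction verdict, it is worth confirming at the OS level whether the voting disks can be read at all. A minimal read-only sketch using the device names from this case:

    # read 1 MB from each voting disk, bypassing the page cache; any hang or I/O
    # error here points at the storage path rather than at clusterware itself
    dd if=/dev/oracleasm/disks/VOTE02 of=/dev/null bs=1M count=1 iflag=direct
    dd if=/dev/oracleasm/disks/VOTE03 of=/dev/null bs=1M count=1 iflag=direct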

    OS log:
    Jan  7 11:16:08 abrt-hook-ccpp: Process 2738 (ocssd.bin) of user 54322 killed by SIGABRT - dumping core


    CSS daemon log:
    2025-01-07 11:16:04.793: [    CSSD][3252147968]clssnmvReadDskHeartbeat: node 3 record written by node 2
    2025-01-07 11:16:04.793: [    CSSD][3252147968]clssnmRemoveNodeInTerm: node 1, <node1_hostname> terminated due to Normal Shutdown.
    Removing from member and connected bitmaps
    2025-01-07 11:16:04.793: [    CSSD][3252147968]###################################
    2025-01-07 11:16:04.793: [    CSSD][3252147968]clssscExit: CSSD aborting from thread clssnmvWorkerThread
    2025-01-07 11:16:04.793: [    CSSD][3252147968]###################################
    2025-01-07 11:16:04.793: [    CSSD][3252147968](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
    2025-01-07 11:16:04.793: [    CSSD][3252147968]


    Stack trace:
    Call Stack Trace
    clone -> start_thread -> clssscthrdmain -> clssnmvWorkerThread -> clssnmvReadDskHeartbeat -> clssscExit
4. Further down in the cssd log, the data on the three OCR quorum (voting) disks was reported inconsistent; two of the disks were taken offline, which no longer satisfies the cluster's redundancy requirement.
    2025-01-07 11:16:05.408: [    CSSD][3236308736]clssnmRemoveNodeInTerm: node 1, nhzhbg-db1 terminated due to Normal Shutdown. Removing from member and connected bitmaps
    2025-01-07 11:16:05.408: [    CSSD][3236308736]clssscExit: abort already set 1
    2025-01-07 11:16:05.409: [    CSSD][3241056000](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 251143204 ms for voting file (/dev/oracleasm/disks/VOTE02)
    2025-01-07 11:16:05.409: [    CSSD][3241056000]clssnmvDiskAvailabilityChange: voting file /dev/oracleasm/disks/VOTE02 now offline
    2025-01-07 11:16:05.409: [    CSSD][3241056000](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 1643987470 ms for voting file (/dev/oracleasm/disks/VOTE03)
    2025-01-07 11:16:05.409: [    CSSD][3241056000]clssnmvDiskAvailabilityChange: voting file /dev/oracleasm/disks/VOTE03 now offline
    2025-01-07 11:16:05.409: [    CSSD][3241056000](:CSSNM00018:)clssnmvDiskCheck: Aborting, 1 of 3 configured voting disks available, need 2
    2025-01-07 11:16:05.409: [    CSSD][3241056000]clssnmRemoveNodeInTerm: node 1, nhzhbg-db1 terminated due to Normal Shutdown. Removing from member and connected bitmaps
    2025-01-07 11:16:05.409: [    CSSD][3241056000]clssscExit: abort already set 1
    2025-01-07 11:16:05.409: [   SKGFD][3250554624]Lib :UFS:: closing handle 0x7fc698039dc0 for disk :/dev/oracleasm/disks/VOTE02:
    2025-01-07 11:16:05.409: [   SKGFD][3247400704]Lib :UFS:: closing handle 0x7fc68808e1f0 for disk :/dev/oracleasm/disks/VOTE02:
IV. Case Handling

Multiple attempts to restart the clusterware on a single node failed to bring the cluster up, so the OCR had to be rebuilt.

1. Check the OCR backup information

    [root@node1 ~]# ocrconfig -showbackup
    nhzhbg-db2     2025/01/07 04:23:03     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup00.ocr
    nhzhbg-db2     2025/01/07 00:23:01     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup01.ocr
    nhzhbg-db2     2025/01/06 20:23:00     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup02.ocr
    nhzhbg-db2     2025/01/06 00:22:59     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/day.ocr
    nhzhbg-db2     2024/12/26 04:22:24     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/week.ocr
    PROT-25: Manual backups for the Oracle Cluster Registry are not available

2. Remove or wipe the three OCR ASM disks

    Check the disks currently attached to the server:
    [root@~]# lsblk
    NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    fd0               2:0    1    4K  0 disk
    sda               8:0    0   10G  0 disk
    └─sda1            8:1    0   10G  0 part
    sdb               8:16   0   10G  0 disk
    └─sdb1            8:17   0   10G  0 part
    sdc               8:32   0   10G  0 disk
    └─sdc1            8:33   0   10G  0 part
    sdd               8:48   0  500G  0 disk
    └─sdd1            8:49   0  500G  0 part
    sde               8:64   0  200G  0 disk
    └─sde1            8:65   0  200G  0 part
    sr0              11:0    1 1024M  0 rom
    vda             252:0    0   80G  0 disk
    ├─vda1          252:1    0    1G  0 part /boot
    └─vda2          252:2    0   78G  0 part
      ├─centos-root 253:0    0   62G  0 lvm  /
      └─centos-swap 253:1    0   16G  0 lvm  [SWAP]


    Wipe the partitions with dd (in production, double-check the devices before running this; it destroys data):
    [root@ ~]# dd if=/dev/zero of=/dev/sda1 bs=1024k count=1024
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 1.32494 s, 810 MB/s
    [root@ ~]# dd if=/dev/zero of=/dev/sdb1 bs=1024k count=1024
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 1.33162 s, 806 MB/s
    [root@ ~]# dd if=/dev/zero of=/dev/sdc1 bs=1024k count=1024
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 1.39061 s, 772 MB/s


    oracleasm deletedisk OCR1
    oracleasm deletedisk OCR2
    oracleasm deletedisk OCR3
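Before handing the partitions back to ASM, it is worth confirming that the old headers really are gone. A hedged sketch using kfed, which ships with Grid Infrastructure ($GRID_HOME is your Grid home):

    # a fully wiped header should report kfbh.type as KFBTYP_INVALID
    $GRID_HOME/bin/kfed read /dev/sda1 | grep -E 'kfbh.type|kfdhdb.dskname'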

3. Create new ASM disks

    # oracleasm scandisks
    Reloading disk partitions: done
    Cleaning any stale ASM disks...
    Cleaning disk "VOTE01"
    Cleaning disk "VOTE02"
    Cleaning disk "VOTE03"
    Scanning system for ASM disks...

    # oracleasm createdisk OCR1 /dev/sda1
    Writing disk header: done
    Instantiating disk: done

    # oracleasm createdisk OCR2 /dev/sdb1
    Writing disk header: done
    Instantiating disk: done

    # oracleasm createdisk OCR3 /dev/sdc1
    Writing disk header: done
    Instantiating disk: done

    # oracleasm scandisks
    Reloading disk partitions: done
    Cleaning any stale ASM disks...
    Scanning system for ASM disks...

    # oracleasm listdisks
    ARCH
    DATA
    OCR1
    OCR2
    OCR3
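To double-check that each new label points at the intended partition, oracleasm can map labels back to block devices; a short sketch:

    # -p prints the matching /dev entries for each labeled disk
    oracleasm querydisk -p OCR1
    oracleasm querydisk -p OCR2
    oracleasm querydisk -p OCR3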

4. Start the clusterware on a single node in exclusive mode

    # crsctl start crs -excl -nocrs
    CRS-4123: Oracle High Availability Services has been started.
    CRS-2672: Attempting to start 'ora.mdnsd' on 'node1'
    CRS-2676: Start of 'ora.mdnsd' on 'node1' succeeded
    CRS-2672: Attempting to start 'ora.gpnpd' on 'node1'
    CRS-2676: Start of 'ora.gpnpd' on 'node1' succeeded
    CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
    CRS-2672: Attempting to start 'ora.gipcd' on 'node1'
    CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
    CRS-2676: Start of 'ora.gipcd' on 'node1' succeeded
    CRS-2672: Attempting to start 'ora.cssd' on 'node1'
    CRS-2672: Attempting to start 'ora.diskmon' on 'node1'
    CRS-2676: Start of 'ora.diskmon' on 'node1' succeeded
    CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
    CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'node1'
    CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
    CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'node1' succeeded
    CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
    CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
    CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
    CRS-2672: Attempting to start 'ora.asm' on 'node1'
    CRS-2676: Start of 'ora.asm' on 'node1' succeeded
    # crsctl check crs
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4529: Cluster Synchronization Services is online
    CRS-4534: Cannot communicate with Event Manager
    # crsctl status res -init

5. Recreate the disk group, keeping the same name as the old OCR disk group. The first attempt failed with ORA-15020: each disk was being discovered twice, once via the ORCL: driver prefix and once via the /dev/oracleasm/disks path, so ASM saw duplicates. Narrowing asm_diskstring to a single discovery path resolved it.

    SQL> create diskgroup VOTE normal redundancy disk '/dev/oracleasm/disks/OCR1','/dev/oracleasm/disks/OCR2','/dev/oracleasm/disks/OCR3' ATTRIBUTE 'compatible.rdbms' = '11.2', 'compatible.asm' = '11.2';
    *
    ERROR at line 1:
    ORA-15018: diskgroup cannot be created
    ORA-15020: discovered duplicate ASM disk "OCR_0002"




    col path for a60
    set linesize 1000
    select group_number, disk_number, header_status, name, path, total_mb from v$asm_disk;

    GROUP_NUMBER DISK_NUMBER HEADER_STATU NAME  PATH                        TOTAL_MB
    ------------ ----------- ------------ ----- --------------------------- --------
               0           0 MEMBER             ORCL:ARCH                          0
               0           1 MEMBER             ORCL:DATA                          0
               0           2 PROVISIONED        ORCL:OCR1                          0
               0           3 PROVISIONED        ORCL:OCR2                          0
               0           9 MEMBER             /dev/oracleasm/disks/DATA          0
               0           5 PROVISIONED        /dev/oracleasm/disks/OCR3          0
               0           6 PROVISIONED        /dev/oracleasm/disks/OCR2          0
               0           7 PROVISIONED        /dev/oracleasm/disks/OCR1          0
               0           8 MEMBER             /dev/oracleasm/disks/ARCH          0
               0           4 PROVISIONED        ORCL:OCR3                          0




    SQL> show parameter asm_diskstring
    NAME              TYPE     VALUE
    ----------------- -------- ------
    asm_diskstring    string

    SQL> alter system set asm_diskstring='/dev/oracleasm/disks';
    System altered.

    GROUP_NUMBER DISK_NUMBER HEADER_STATU NAME       PATH                       TOTAL_MB
    ------------ ----------- ------------ ---------- -------------------------- --------
               3           1 MEMBER       VOTE_0001  /dev/oracleasm/disks/OCR2     10239
               2           0 MEMBER       DATA_0000  /dev/oracleasm/disks/DATA    511999
               3           2 MEMBER       VOTE_0002  /dev/oracleasm/disks/OCR3     10239
               1           0 MEMBER       ARCH_0000  /dev/oracleasm/disks/ARCH    204799
               3           0 MEMBER       VOTE_0000  /dev/oracleasm/disks/OCR1     10239




    SQL> create diskgroup VOTE normal redundancy disk '/dev/oracleasm/disks/OCR1','/dev/oracleasm/disks/OCR2','/dev/oracleasm/disks/OCR3' attribute 'compatible.asm'='11.2.0.4.0', 'compatible.rdbms'='11.2.0.4.0';
    Diskgroup created.
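A quick way to confirm the new disk group is mounted with the expected redundancy, as a sketch (run as the grid owner against the local ASM instance, e.g. with ORACLE_SID=+ASM1 set):

    sqlplus -S / as sysasm <<'EOF'
    -- NORMAL redundancy and a plausible free/total ratio confirm the create worked
    select name, state, type, total_mb, free_mb from v$asm_diskgroup;
    EOF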

6. Restore the OCR with ocrconfig.

    # ocrconfig -restore /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup00.ocr
    # ocrcheck
    # crsctl query css votedisk
    Located 0 voting disk(s).
    # crsctl replace votedisk +VOTE
    Successful addition of voting disk 9f86cdc3ee0d4fbdbfdb0ff029c22746.
    Successful addition of voting disk 0022be1f0fe34ff8bf0948096e5e8e8c.
    Successful addition of voting disk 7f07b3bd08a64f3abfd3102b232b066c.
    Successfully replaced voting disk group with +VOTE.
    CRS-4266: Voting file(s) successfully replaced
    # crsctl query css votedisk
    ##  STATE    File Universal Id                File Name Disk group
    --  -----    -----------------                --------- ---------
     1. ONLINE   9f86cdc3ee0d4fbdbfdb0ff029c22746 (/dev/oracleasm/disks/OCR1) [VOTE]
     2. ONLINE   0022be1f0fe34ff8bf0948096e5e8e8c (/dev/oracleasm/disks/OCR2) [VOTE]
     3. ONLINE   7f07b3bd08a64f3abfd3102b232b066c (/dev/oracleasm/disks/OCR3) [VOTE]
    Located 3 voting disk(s).

7. Create a pfile by hand first, then restore the ASM instance's spfile from it. We hit an error here as well.

    # su - grid
    Last login: Fri Nov 22 14:05:46 CST 2024

    $ cat pfile.ora
    +ASM1.asm_diskgroups='DATA','ARCH'#Manual Mount
    +ASM2.asm_diskgroups='DATA','ARCH'#Manual Mount
    *.asm_diskstring='/dev/oracleasm/disks/*'
    *.asm_power_limit=1
    *.diagnostic_dest='/u01/app/grid'
    *.instance_type='asm'
    *.large_pool_size=12M
    *.memory_max_target=4294967296
    *.memory_target=4294967296
    *.remote_login_passwordfile='EXCLUSIVE'
    *.sga_max_size=4294967296


    SQL> show parameter spfile;
    NAME      TYPE     VALUE
    --------- -------- ------
    spfile    string

    SQL> create spfile from pfile='/home/grid/pfile.ora';
    create spfile from pfile='/home/grid/pfile.ora'
    *
    ERROR at line 1:
    ORA-17502: ksfdcre:4 Failed to create file
    +VOTE/nyzhbg-cluster1/asmparameterfile/registry.253.1181765237
    ORA-15177: cannot operate on system aliases

    SQL> create spfile from memory;
    File created.
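As a hedged follow-up, asmcmd can confirm where the ASM spfile now lives, since 11.2 registers its location in the GPnP profile:

    # run as the grid owner with the ASM environment set
    asmcmd spget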

8. Restart the clusterware; this time the cluster services come up normally. (The check below was captured while the stack was still starting, before CRSD and EVMD had finished registering.)

    # crsctl stop crs -f
    # crsctl start crs
    # crsctl check crs
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4529: Cluster Synchronization Services is online
    CRS-4534: Cannot communicate with Event Manager
    # crsctl status res -t -init

9. Because the business had already failed over to the ADG standby and the standby's IP had been repointed to the SCAN IP, we also had to update the cluster's SCAN configuration.

    # srvctl disable scan_listener
    # srvctl stop scan_listener
    # srvctl config scan
    # srvctl modify scan -n nhzhbg-cluster1-scan

    # vi /etc/hosts
    # cat /etc/hosts
    xx.xx.xx.xx nhzhbg-cluster1-scan

    # srvctl enable scan_listener
    # srvctl start scan_listener
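A short verification sketch after the SCAN change (both commands exist in 11.2 srvctl):

    srvctl config scan             # confirm the SCAN now resolves to the new name/IP
    srvctl status scan_listener    # confirm the SCAN listener is running again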

10. Start the clusterware on the other node.

    # oracleasm scandisks
    Reloading disk partitions: done
    Cleaning any stale ASM disks...
    Scanning system for ASM disks...
    Instantiating disk "OCR1"
    Instantiating disk "OCR2"
    Instantiating disk "OCR3"

    # crsctl start crs
11. At this point the cluster services were fully restored. Rebuilding the cluster's quorum (voting) disks brought the cluster back up, which further confirms that inconsistent data on those disks caused the outage.

That's all. I hope this write-up helps whoever is reading it; if you found it useful, share it with friends and colleagues so we can learn and improve together.

