
Troubleshooting: crsd Fails to Start on Oracle 19c RAC


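Preface

A year or two ago I installed a 19c RAC in virtual machines on my personal laptop, which has 16 GB of RAM (see my earlier write-up on silently installing Oracle 19c RAC on RHEL 7.7 virtual machines under VMware 16). I rarely use this cluster; the last time was August of last year. When I powered it on over the past two days, the host ran out of memory, sat at 100% CPU and 100% memory, froze, and the whole PC rebooted. That hard reset is probably what caused the failure described below: after I reopened the virtual machines, the CRSD process would not start on either node.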

VMware® Workstation 16 Pro:16.1.1 build-17801498
OS: Red Hat Enterprise Linux Server release 7.7 (Maipo), 4g/8g, 100g disk
Oracle: RAC 19c RU12
Connected to:
Oracle Database 19c Enterprise Edition Release - Production
------------ --------------- ----------- ------ ------------- ------------ ------------------- ----------------------- ----------- -----------------
           1 ARCH            MOUNTED     EXTERN            10   9.86328125          9.86328125                       0           0         .13671875
           2 DATA            CONNECTED    EXTERN            20   14.1210938          14.1210938                       0           0        5.87890625
           3 OCR             MOUNTED     NORMAL            9   7.98046875          2.49023438                    3072           0        1.01953125




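Running /u01/app/19.0.0/grid/bin/crsctl start res ora.crsd -init does not bring up the crsd process. Rebooting the host does not help either, and node 2 shows the same problem.


The ASM disks are still accessible, which suggests the ASM instance itself is working.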


Checking the crsd log

Starting with 12c, the RAC cluster logs moved into the ADR home: $ADR_BASE/diag/crs/hostname/crs https://www.modb.pro/db/43099

In 11g RAC, the cluster logs were under $GRID_HOME/log/hostname/.
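As a small sketch of the path rule above, the helper below builds the 12c-and-later trace directory from an ADR base and a host name. The function name crs_trace_dir is my own for illustration, and the trailing /trace component is taken from the directory used later in this post; it only formats a string and does not call any Oracle tool.

```shell
# Build the Clusterware trace directory for 12c and later.
# $1 = ADR base (for example /u01/app/grid), $2 = short host name.
# crs_trace_dir is a hypothetical helper name, not an Oracle utility.
crs_trace_dir() {
    local adr_base="${1%/}"   # drop one trailing slash, if present
    local host="$2"
    printf '%s/diag/crs/%s/crs/trace\n' "$adr_base" "$host"
}
```

For example, cd "$(crs_trace_dir /u01/app/grid jiekexu-r1)" lands in the directory where crsd.trc is written on node 1 of this cluster.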

jiekexu-r1:/home/grid(+ASM1)$ adrci

ADRCI: Release - Production on Thu Feb 1 16:51:18 2024

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

ADR base = "/u01/app/grid"
adrci> show home
ADR Homes: 
adrci> show problems


[root@jiekexu-r1 ~]# cd /u01/app/grid/diag/crs/jiekexu-r1/crs/trace
[root@jiekexu-r1 trace]# ls -lrt crsd*
[root@jiekexu-r1 trace]# vim crsd.trc
2024-01-31 19:15:39.582 :  OCRRAW:844039936: rtnode:3: invalid tnode 145
2024-01-31 19:15:39.582 :  OCRRAW:844039936: propropen:0: could not read tnode addrd=0
2024-01-31 19:15:39.583 :  OCRRAW:844039936: proprseterror: Error in accessing physical storage [26] Marking context invalid.
2024-01-31 19:15:39.583 :  OCRRAW:844039936: proprdc: backend_ctx->prop_ctx_tag=PROPCTXT
2024-01-31 19:15:39.583 :  OCRRAW:844039936: proprdc: backend_ctx->prop_valid=0
2024-01-31 19:15:39.583 :  OCRRAW:844039936: proprdc: backend_ctx->prop_boot_mode=1
2024-01-31 19:15:39.584 :  OCRRAW:844039936: proprdc: begin dumping backenctx->prop_ctx
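Rather than paging through the trace in vim, the error can be pulled out with awk. This is a minimal sketch that matches only the proprseterror message seen in the trace above; the two function names are my own, and other OCR failure signatures would need their own patterns.

```shell
# Both helpers read a crsd trace on stdin.
# The function names are illustrative; the pattern is the message from this trace.

# Print how many "physical storage" OCR errors the trace contains.
count_ocr_storage_errors() {
    awk '/proprseterror: Error in accessing physical storage/ {n++}
         END {print n + 0}'
}

# Print the date and time of the first such error, or nothing if there is none.
first_ocr_storage_error() {
    awk '/proprseterror: Error in accessing physical storage/ {print $1, $2; exit}'
}
```

Typical use would be count_ocr_storage_errors < crsd.trc from inside the trace directory.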


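After several restarts, the CRSD log always shows the same error: proprseterror: Error in accessing physical storage [26] Marking context invalid. So what is this error? I had to turn to a search engine. To my surprise, MOS returned nothing; for a moment I thought I had pasted the message incorrectly, but a second try found nothing either. Bing was more helpful: a cnblogs post, 《由于OCR文件损坏造成Oracle RAC不能启动的现象和处理方法》 (symptoms and handling of an Oracle RAC that cannot start because the OCR file is corrupted), shows the same error and resolves it by restoring the OCR from a backup. This is only a personal virtual machine, so I had nothing to lose by trying the same fix.

TFA can also package the logs from around the time of the failure. The following collects everything from 09:00 to 11:00 this morning.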

[root@jiekexu-r1 bin]# pwd
[root@jiekexu-r1 bin]# /opt/oracle.ahf/tfa/bin/tfactl diagcollect -all -from "Feb/1/2024 09:00:00" -to "Feb/1/2024 11:00:00"
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.
The -all switch is being deprecated as collection of all components is the default behavior. TFA will continue to collect all components.
Collecting data for all nodes
Scanning files from Feb/1/2024 09:00:00 to Feb/1/2024 11:00:00

Collection Id : 20240201110645jiekexu-r1

Detailed Logging at : /u01/app/grid/oracle.ahf/data/repository/collection_Thu_Feb_01_11_06_46_CST_2024_node_all/diagcollect_20240201110645_jiekexu-r1.log
2024/02/01 11:06:52 CST : NOTE : Any file or directory name containing the string .com will be renamed to replace .com with dotcom
2024/02/01 11:06:52 CST : Collection Name : tfa_Thu_Feb_01_11_06_46_CST_2024.zip
2024/02/01 11:06:52 CST : Collecting diagnostics from hosts : [jiekexu-r2, jiekexu-r1]
2024/02/01 11:06:53 CST : Scanning of files for Collection in progress...
2024/02/01 11:06:53 CST : Collecting additional diagnostic information...
2024/02/01 11:07:18 CST : Getting list of files satisfying time range [02/01/2024 09:00:00 CST, 02/01/2024 11:00:00 CST]
2024/02/01 11:07:52 CST : Collecting ADR incident files...
2024/02/01 11:12:35 CST : Completed collection of additional diagnostic information...
2024/02/01 11:12:38 CST : Completed Local Collection
2024/02/01 11:12:38 CST : Remote Collection in Progress...
|          Collection Summary          |
| Host       | Status    | Size | Time |
| jiekexu-r2 | Completed | 18MB | 337s |
| jiekexu-r1 | Completed | 23MB | 346s |

Logs are being collected to: /u01/app/grid/oracle.ahf/data/repository/collection_Thu_Feb_01_11_06_46_CST_2024_node_all
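Since the output above says the -all switch is being deprecated, the same collection can be requested with only the time window. The sketch below just prints the command for review instead of running it; the function name and the TFACTL override variable are my own, while the tfactl path and the -from/-to options are the ones used in this post.

```shell
# Print (do not run) a tfactl diagcollect command for a time window.
# $1 = start time, $2 = end time, in the "Mon/D/YYYY HH:MM:SS" form used above.
# TFACTL is a hypothetical override for sites where AHF is installed elsewhere.
tfa_collect_cmd() {
    local tfactl="${TFACTL:-/opt/oracle.ahf/tfa/bin/tfactl}"
    printf '%s diagcollect -from "%s" -to "%s"\n' "$tfactl" "$1" "$2"
}
```

Calling tfa_collect_cmd "Feb/1/2024 09:00:00" "Feb/1/2024 11:00:00" prints the command line, which can then be pasted into a root shell.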

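The message "WARNING - AHF Software is older than 180 days" means the installed AHF is out of date. You can download and install the latest AHF; see this tutorial for installation and basic usage.

Unzip the archive into /tmp and look at the collected logs under /tmp/jiekexu-r1 (the host name). I will not go through them here; interested readers can explore them on their own. Next, we go straight to restoring the OCR.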

unzip /u01/app/grid/oracle.ahf/data/repository/collection_Thu_Feb_01_11_06_46_CST_2024_node_all/jiekexu-r1.tfa_Thu_Feb_01_11_06_46_CST_2024.zip -d /tmp

Restoring the OCR

[root@jiekexu-r1 ~]# /u01/app/19.0.0/grid/bin/ocrconfig -showbackup

jiekexu-r1     2023/08/01 17:20:58     +OCR:/jiekexu-racscan/OCRBACKUP/backup00.ocr.262.1143739253     3998055650

jiekexu-r1     2023/08/01 11:26:31     +OCR:/jiekexu-racscan/OCRBACKUP/backup01.ocr.265.1143717731     3998055650

jiekexu-r1     2022/07/14 18:50:09     +OCR:/jiekexu-racscan/OCRBACKUP/backup02.ocr.258.1110048603     3998055650

jiekexu-r1     2023/08/01 11:26:31     +OCR:/jiekexu-racscan/OCRBACKUP/day.ocr.259.1143718007     3998055650

jiekexu-r1     2023/08/01 11:26:31     +OCR:/jiekexu-racscan/OCRBACKUP/week.ocr.260.1143718019     3998055650
PROT-25: Manual backups for the Oracle Cluster Registry are not available
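To pick the newest backup out of that listing without reading it by eye, the output can be filtered with awk. This is a rough sketch that assumes the column layout shown above (node, date, time, file, id); the function name is my own.

```shell
# Read "ocrconfig -showbackup" output on stdin and print the newest backup file.
# Assumes the columns shown above: node, YYYY/MM/DD, HH:MM:SS, file name, id.
# Message lines such as PROT-25 are skipped because field 2 is not a date.
latest_ocr_backup() {
    awk 'NF >= 4 && $2 ~ /^[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ {
             stamp = $2 " " $3            # sorts correctly as plain text
             if (stamp > best) { best = stamp; file = $4 }
         }
         END { if (file != "") print file }'
}
```

With the listing above, the result is the backup00 file taken on 2023/08/01 at 17:20:58. All of these backups were already about six months old when the failure happened, so anything registered in the cluster after that date is not covered by them.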


Restore the OCR with the ocrconfig -restore command

[root@jiekexu-r1 ~]# /u01/app/19.0.0/grid/bin/ocrconfig -restore /home/grid/backup_ocr.ocr
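The transcript here does not show what state the stack was in when the restore ran. For reference, the order I understand Oracle's Clusterware administration guide to describe for Linux is: stop the stack on every node, start one node in exclusive mode without crsd, restore, verify, stop again, then start normally. The sketch below only prints those commands for review; the function name is my own, and the list is an outline under that assumption rather than a record of what was run in this post.

```shell
# Print a restore checklist; nothing is executed.
# $1 = Grid home, $2 = OCR backup file to restore from.
# ocr_restore_plan is a hypothetical helper name.
ocr_restore_plan() {
    local bin="$1/bin"
    local backup="$2"
    # 1. On every node: stop the stack (-f because crsd is already down).
    printf '%s/crsctl stop crs -f\n' "$bin"
    # 2. On one node only: start in exclusive mode without crsd.
    printf '%s/crsctl start crs -excl -nocrs\n' "$bin"
    # 3. On that node: restore the OCR from the chosen backup.
    printf '%s/ocrconfig -restore %s\n' "$bin" "$backup"
    # 4. Verify the restored OCR.
    printf '%s/ocrcheck\n' "$bin"
    # 5. Leave exclusive mode.
    printf '%s/crsctl stop crs -f\n' "$bin"
    # 6. On every node: start the stack normally.
    printf '%s/crsctl start crs -wait\n' "$bin"
}
```

Running ocr_restore_plan /u01/app/19.0.0/grid /home/grid/backup_ocr.ocr prints six lines that can be checked one by one before anything is executed as root.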



[root@jiekexu-r1 ~]# /u01/app/19.0.0/grid/bin/crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
[root@jiekexu-r1 ~]# /u01/app/19.0.0/grid/bin/crsctl stop crs -f
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
[root@jiekexu-r1 ~]# /u01/app/19.0.0/grid/bin/crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'jiekexu-r1'
CRS-2673: Attempting to stop 'ora.storage' on 'jiekexu-r1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'jiekexu-r1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'jiekexu-r1'
CRS-2677: Stop of 'ora.storage' on 'jiekexu-r1' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'jiekexu-r1'
CRS-2673: Attempting to stop 'ora.evmd' on 'jiekexu-r1'
CRS-2673: Attempting to stop 'ora.asm' on 'jiekexu-r1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'jiekexu-r1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'jiekexu-r1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'jiekexu-r1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'jiekexu-r1' succeeded
CRS-2677: Stop of 'ora.asm' on 'jiekexu-r1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'jiekexu-r1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'jiekexu-r1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'jiekexu-r1'
CRS-2677: Stop of 'ora.cssd' on 'jiekexu-r1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'jiekexu-r1'
CRS-2673: Attempting to stop 'ora.crf' on 'jiekexu-r1'
CRS-2677: Stop of 'ora.crf' on 'jiekexu-r1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'jiekexu-r1'
CRS-2677: Stop of 'ora.gpnpd' on 'jiekexu-r1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'jiekexu-r1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'jiekexu-r1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@jiekexu-r1 ~]### ---- With the -wait option, the RAC cluster startup progress is printed to the screen
[root@jiekexu-r1 ~]# /u01/app/19.0.0/grid/bin/crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.evmd' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'jiekexu-r1'
CRS-2676: Start of 'ora.evmd' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'jiekexu-r1'
CRS-2676: Start of 'ora.gpnpd' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'jiekexu-r1'
CRS-2676: Start of 'ora.gipcd' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'jiekexu-r1'
CRS-2676: Start of 'ora.cssdmonitor' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.diskmon' on 'jiekexu-r1'
CRS-2676: Start of 'ora.diskmon' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.crf' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.cssd' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.ctssd' on 'jiekexu-r1'
CRS-2676: Start of 'ora.ctssd' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'jiekexu-r1'
CRS-2676: Start of 'ora.asm' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'jiekexu-r1'
CRS-2676: Start of 'ora.storage' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'jiekexu-r1'
CRS-2676: Start of 'ora.crsd' on 'jiekexu-r1' succeeded
CRS-6023: Starting Oracle Cluster Ready Services-managed resources
CRS-2672: Attempting to start 'ora.ASMNET1LSNR_ASM.lsnr' on 'jiekexu-r1'
CRS-6017: Processing resource auto-start for servers: jiekexu-r1
CRS-2672: Attempting to start 'ora.jiekexu-r1.vip' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.scan1.vip' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.qosmserver' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.jiekexu-r2.vip' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.ons' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.chad' on 'jiekexu-r1'
CRS-2676: Start of 'ora.jiekexu-r1.vip' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.LISTENER.lsnr' on 'jiekexu-r1'
CRS-2676: Start of 'ora.ASMNET1LSNR_ASM.lsnr' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.jiekexu-r2.vip' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.chad' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'jiekexu-r1'
CRS-2676: Start of 'ora.LISTENER.lsnr' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.ons' on 'jiekexu-r1' succeeded
CRS-2679: Attempting to clean 'ora.jiekexu.db' on 'jiekexu-r1'
CRS-2681: Clean of 'ora.jiekexu.db' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.jiekexu.db' on 'jiekexu-r1'
CRS-2676: Start of 'ora.qosmserver' on 'jiekexu-r1' succeeded
CRS-2676: Start of 'ora.jiekexu.db' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.jiekexu.jiekexu_single.svc' on 'jiekexu-r1'
CRS-2676: Start of 'ora.jiekexu.jiekexu_single.svc' on 'jiekexu-r1' succeeded
CRS-6016: Resource auto-start has completed for server jiekexu-r1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.
[root@jiekexu-r1 ~]# /u01/app/19.0.0/grid/bin/crsctl status res -t -init 

Start node 2 normally

[root@jiekexu-r2 ~]# /u01/app/19.0.0/grid/bin/crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.evmd' on 'jiekexu-r2'
CRS-2672: Attempting to start 'ora.mdnsd' on 'jiekexu-r2'
CRS-2676: Start of 'ora.evmd' on 'jiekexu-r2' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'jiekexu-r2'
CRS-2676: Start of 'ora.gpnpd' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'jiekexu-r2'
CRS-2676: Start of 'ora.gipcd' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'jiekexu-r2'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'jiekexu-r2'
CRS-2676: Start of 'ora.cssdmonitor' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'jiekexu-r2'
CRS-2672: Attempting to start 'ora.diskmon' on 'jiekexu-r2'
CRS-2676: Start of 'ora.diskmon' on 'jiekexu-r2' succeeded
CRS-2676: Start of 'ora.crf' on 'jiekexu-r2' succeeded
CRS-2676: Start of 'ora.cssd' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'jiekexu-r2'
CRS-2672: Attempting to start 'ora.ctssd' on 'jiekexu-r2'
CRS-2676: Start of 'ora.ctssd' on 'jiekexu-r2' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'jiekexu-r2'
CRS-2676: Start of 'ora.asm' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'jiekexu-r2'
CRS-2676: Start of 'ora.storage' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'jiekexu-r2'
CRS-2676: Start of 'ora.crsd' on 'jiekexu-r2' succeeded
CRS-6017: Processing resource auto-start for servers: jiekexu-r2
CRS-2673: Attempting to stop 'ora.jiekexu-r2.vip' on 'jiekexu-r1'
CRS-2672: Attempting to start 'ora.chad' on 'jiekexu-r2'
CRS-2672: Attempting to start 'ora.ons' on 'jiekexu-r2'
CRS-2677: Stop of 'ora.jiekexu-r2.vip' on 'jiekexu-r1' succeeded
CRS-2672: Attempting to start 'ora.jiekexu-r2.vip' on 'jiekexu-r2'
CRS-2676: Start of 'ora.jiekexu-r2.vip' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.LISTENER.lsnr' on 'jiekexu-r2'
CRS-2676: Start of 'ora.chad' on 'jiekexu-r2' succeeded
CRS-2676: Start of 'ora.LISTENER.lsnr' on 'jiekexu-r2' succeeded
CRS-33672: Attempting to start resource group 'ora.asmgroup' on server 'jiekexu-r2'
CRS-2672: Attempting to start 'ora.asmnet1.asmnetwork' on 'jiekexu-r2'
CRS-2676: Start of 'ora.asmnet1.asmnetwork' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.ASMNET1LSNR_ASM.lsnr' on 'jiekexu-r2'
CRS-2676: Start of 'ora.ASMNET1LSNR_ASM.lsnr' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'jiekexu-r2'
CRS-2676: Start of 'ora.ons' on 'jiekexu-r2' succeeded
CRS-2676: Start of 'ora.asm' on 'jiekexu-r2' succeeded
CRS-33676: Start of resource group 'ora.asmgroup' on server 'jiekexu-r2' succeeded.
CRS-2672: Attempting to start 'ora.DATA.dg' on 'jiekexu-r2'
CRS-2676: Start of 'ora.DATA.dg' on 'jiekexu-r2' succeeded
CRS-2679: Attempting to clean 'ora.jiekexu.db' on 'jiekexu-r2'
CRS-2681: Clean of 'ora.jiekexu.db' on 'jiekexu-r2' succeeded
CRS-2672: Attempting to start 'ora.jiekexu.db' on 'jiekexu-r2'
CRS-2676: Start of 'ora.jiekexu.db' on 'jiekexu-r2' succeeded
CRS-6016: Resource auto-start has completed for server jiekexu-r2
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.



set line  240 
col HOST_NAME for a30 
16:19:35 SYS@JiekeXu2> 
INSTANCE_NAME    HOST_NAME                      VERSION           STARTUP_TIME        STATUS
---------------- ------------------------------ ----------------- ------------------- ------------
JiekeXu2         jiekexu-r2                   2024-02-01 15:47:57 OPEN
JiekeXu1         jiekexu-r1                   2024-02-01 15:44:30 OPEN


Manually backing up the OCR

As root, take a manual OCR backup with the following command:

/u01/app/19.0.0/grid/bin/ocrconfig -manualbackup


/u01/app/19.0.0/grid/bin/ocrconfig -showbackup

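By default, the OCR is backed up automatically every 4 hours, and Oracle keeps the 5 most recent OCR backups: the three latest 4-hourly copies, one daily copy, and one weekly copy.

Checking OCR integrity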

$ cluvfy comp ocr -n all



[root@jiekexu-r1 ~]# /u01/app/19.0.0/grid/bin/ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          4
         Total space (kbytes)     :     901284
         Used space (kbytes)      :      84464
         Available space (kbytes) :     816820
         ID                       :  608646820
         Device/File Name         :       +OCR
                                    Device/File integrity check succeeded

                                    Device/File not configured

                                    Device/File not configured

                                    Device/File not configured

                                    Device/File not configured

         Cluster registry integrity check succeeded

         Logical corruption check succeeded
[root@jiekexu-r1 ~]# /u01/app/19.0.0/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   14ed0aa13ffd4f89bfe2d79061f96fbc (/dev/asm_ocr03) [OCR]
 2. ONLINE   27cc8fbc135f4fd3bf574b4d2e62531e (/dev/asm_ocr01) [OCR]
 3. ONLINE   c5c806e3a2414f74bf1c70f2add4a821 (/dev/asm_ocr02) [OCR]
Located 3 voting disk(s).
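The two checks above can be turned into a pass/fail test for a script. This is a minimal sketch keyed to the exact messages and layout shown in the output above; the function names are my own, and the expected number of voting disks (3 here) depends on the disk group redundancy.

```shell
# Read "ocrcheck" output on stdin; succeed only if both the cluster registry
# integrity check and the logical corruption check report success.
ocr_check_ok() {
    awk '/Cluster registry integrity check succeeded/ {a = 1}
         /Logical corruption check succeeded/         {b = 1}
         END { exit !(a && b) }'
}

# Read "crsctl query css votedisk" output on stdin; print the ONLINE disk count.
online_votedisks() {
    awk '$1 ~ /^[0-9]+\.$/ && $2 == "ONLINE" {n++} END {print n + 0}'
}
```

For example, /u01/app/19.0.0/grid/bin/ocrcheck | ocr_check_ok && echo "OCR looks healthy" gives a single line to watch after a restore.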



WeChat official account: JiekeXu DBA之路
CSDN: https://blog.csdn.net/JiekeXu

Last modified: 2024-02-23 10:31:47
