数据库集群crs磁盘组dismount，crs服务丢失故障分析

RootLiu 2021-05-25

5042

早上刚上班就遇到几个问题，其中有几套rac出现了相似的故障，即业务反应监听无法连接，环境均是11.2.0.4 两节点的rac，登录环境查看时发现如下特定：

监听服务无public地址
CRS磁盘组宕机
crsd进程丢失

一、问题分析

1、查看监听状态及集群状态

PPS[root@amlsdb01 ~]#su - grid
grid@amlsdb02:/home/grid$ lsnrctl status
LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 08-MAR-2021 09:11:28
Copyright (c) 1991, 2013, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 11.2.0.4.0 - Production
Start Date                11-JAN-2021 08:37:04
Uptime                    56 days 0 hr. 34 min. 23 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   home/db/grid/11.2.0/network/admin/listener.ora
Listener Log File         /home/db/grid/diag/tnslsnr/amlsdb01/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.6.81.165)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+ASM1", status READY, has 1 handler(s) for this service...
Service "amldb" has 1 instance(s).
  Instance "amldb1", status READY, has 1 handler(s) for this service...
Service "amldbXDB" has 1 instance(s).
  Instance "amldb1", status READY, has 1 handler(s) for this service...
The command completed successfully
grid@amlsdb01:/home/grid$
grid@amlsdb01:/home/grid$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

监听服务里只有VIP地址，没有public IP，集群状态无法查看。

2、查看hosts及私网

grid@amlsdb01:/home/grid$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
#Public IP address
10.6.81.121  amlsdb01
10.6.81.122  amlsdb02
#Virtual IP address
10.6.81.165   amlsdb01-vip
10.6.81.166   amlsdb02-vip
# Private IP address
6.81.121.1   amlsdb01-pri
6.81.121.2   amlsdb02-pri
#SCAN IP address
10.6.81.167  amlsdb-scan
grid@amlsdb01:/home/grid$ ping 6.81.121.1
PING 6.81.121.1 (6.81.121.1) 56(84) bytes of data.
64 bytes from 6.81.121.1: icmp_seq=1 ttl=64 time=0.028 ms
64 bytes from 6.81.121.1: icmp_seq=2 ttl=64 time=0.024 ms
^C
--- 6.81.121.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1350ms
rtt min/avg/max/mdev = 0.024/0.026/0.028/0.002 ms
grid@amlsdb01:/home/grid$ ping 6.81.121.2
PING 6.81.121.2 (6.81.121.2) 56(84) bytes of data.
64 bytes from 6.81.121.2: icmp_seq=1 ttl=64 time=0.205 ms
64 bytes from 6.81.121.2: icmp_seq=2 ttl=64 time=0.205 ms
^C
--- 6.81.121.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1557ms
rtt min/avg/max/mdev = 0.205/0.205/0.205/0.000 ms
grid@amlsdb01:/home/grid$

私网两个节点均可以互相ping通。

3、分析集群日志

2021-03-03 15:21:25.384:
[ohasd(48740)]CRS-1301:Oracle High Availability Service started on node amlsdb01.
2021-03-03 15:21:25.384:
[ohasd(48740)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2021-03-03 15:21:28.718:
[/home/db/grid/11.2.0/bin/orarootagent.bin(48842)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2021-03-03 15:21:33.103:
[gpnpd(48948)]CRS-2328:GPNPD started on node amlsdb01.
2021-03-03 15:21:35.529:
[cssd(49014)]CRS-1713:CSSD daemon is started in clustered mode
2021-03-03 15:21:37.347:
[ohasd(48740)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2021-03-03 15:21:55.235:
[cssd(49014)]CRS-1707:Lease acquisition for node amlsdb01 number 1 completed
2021-03-03 15:21:56.487:
[cssd(49014)]CRS-1605:CSSD voting file is online: /dev/asmdisk3-crs; details in /home/db/grid/11.2.0/log/amlsdb01/cssd/ocssd.log.
2021-03-03 15:21:56.492:
[cssd(49014)]CRS-1605:CSSD voting file is online: /dev/asmdisk2-crs; details in /home/db/grid/11.2.0/log/amlsdb01/cssd/ocssd.log.
2021-03-03 15:21:56.498:
[cssd(49014)]CRS-1605:CSSD voting file is online: /dev/asmdisk1-crs; details in /home/db/grid/11.2.0/log/amlsdb01/cssd/ocssd.log.
2021-03-03 15:22:00.024:
[cssd(49014)]CRS-1601:CSSD Reconfiguration complete. Active nodes are amlsdb01 amlsdb02 .
2021-03-03 15:22:02.537:
[ctssd(49443)]CRS-2403:The Cluster Time Synchronization Service on host amlsdb01 is in observer mode.
2021-03-03 15:22:02.785:
[ctssd(49443)]CRS-2407:The new Cluster Time Synchronization Service reference node is host amlsdb02.
2021-03-03 15:22:02.786:
[ctssd(49443)]CRS-2401:The Cluster Time Synchronization Service started on host amlsdb01.
2021-03-03 15:22:04.523:
[ohasd(48740)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
[client(49523)]CRS-10001:03-Mar-21 15:22 ACFS-9391: Checking for existing ADVM/ACFS installation.
[client(49528)]CRS-10001:03-Mar-21 15:22 ACFS-9392: Validating ADVM/ACFS installation files for operating system.
[client(49530)]CRS-10001:03-Mar-21 15:22 ACFS-9393: Verifying ASM Administrator setup.
[client(49533)]CRS-10001:03-Mar-21 15:22 ACFS-9308: Loading installed ADVM/ACFS drivers.
[client(49536)]CRS-10001:03-Mar-21 15:22 ACFS-9154: Loading 'oracleoks.ko' driver.
[client(49564)]CRS-10001:03-Mar-21 15:22 ACFS-9154: Loading 'oracleadvm.ko' driver.
[client(49619)]CRS-10001:03-Mar-21 15:22 ACFS-9154: Loading 'oracleacfs.ko' driver.
[client(49680)]CRS-10001:03-Mar-21 15:22 ACFS-9327: Verifying ADVM/ACFS devices.
[client(49682)]CRS-10001:03-Mar-21 15:22 ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
[client(49686)]CRS-10001:03-Mar-21 15:22 ACFS-9156: Detecting control device '/dev/ofsctl'.
[client(49691)]CRS-10001:03-Mar-21 15:22 ACFS-9322: completed
2021-03-03 15:22:17.813:
[ctssd(49443)]CRS-2409:The clock on host amlsdb01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2021-03-03 15:22:26.488:
[crsd(50046)]CRS-1012:The OCR service started on node amlsdb01.
2021-03-03 15:22:26.608:
[evmd(49463)]CRS-1401:EVMD started on node amlsdb01.
2021-03-03 15:22:27.662:
[crsd(50046)]CRS-1201:CRSD started on node amlsdb01.
2021-03-03 15:22:28.406:
[/home/db/grid/11.2.0/bin/oraagent.bin(50167)]CRS-5011:Check of resource "amldb" failed: details at "(:CLSN00007:)" in "/home/db/grid/11.2.0/log/amlsdb01/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2021-03-03 15:22:33.554:
[crsd(50046)]CRS-2772:Server 'amlsdb01' has been assigned to pool 'Generic'.
2021-03-03 15:22:33.636:
[crsd(50046)]CRS-2772:Server 'amlsdb01' has been assigned to pool 'ora.amlsdb'.
2021-03-03 15:22:33.636:
[crsd(50046)]CRS-2772:Server 'amlsdb01' has been assigned to pool 'ora.amldb'.
2021-03-03 15:23:15.321:
[crsd(50046)]CRS-2807:Resource 'ora.amlsdb.db' failed to start automatically.
2021-03-03 15:23:15.321:
[crsd(50046)]CRS-2807:Resource 'ora.amlsdb02.vip' failed to start automatically.
2021-03-03 15:52:18.167:
[ctssd(49443)]CRS-2409:The clock on host amlsdb01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2021-03-03 16:22:18.523:
[ctssd(49443)]CRS-2409:The clock on host amlsdb01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2021-03-03 18:09:17.415:
[cssd(49014)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 9180 msecs.
2021-03-03 18:09:25.714:
[cssd(49014)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 17480 msecs.
2021-03-03 18:09:33.716:
[cssd(49014)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 25480 msecs.
2021-03-03 18:16:43.836:
[ctssd(49443)]CRS-2409:The clock on host amlsdb01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
。。。。。
上述日志每隔半个小时报一次
2021-03-03 21:23:03.456:
[client(24951)]CRS-10051:CVU found following errors with Clusterware setup : PRVF-7532 : 节点 "amlsdb01" 上缺少程序包 "pdksh"
PRVF-5415 : 检查以确定 NTP 守护程序或服务是否运行失败
PRVG-1101 : SCAN 名称 "amlsdb-scan" 无法解析
PRVF-4657 : "amlsdb-scan" (IP 地址: 10.6.81.167) 的名称解析设置检查失败
PRVF-4664 : 发现与 SCAN 名称 "amlsdb-scan" 不一致的名称解析条目
2021-03-03 22:30:46.767:
[ctssd(49443)]CRS-2409:The clock on host amlsdb01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
。。。。。
以上日志持续输出，直到3月5号，开始报如下错误：
2021-03-05 19:23:25.900:
[ctssd(49443)]CRS-2409:The clock on host amlsdb01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2021-03-05 19:37:37.006:
[crsd(50046)]CRS-1006:The OCR location +CRS is inaccessible. Details in /home/db/grid/11.2.0/log/amlsdb01/crsd/crsd.log.
2021-03-05 19:37:37.007:
[crsd(50046)]CRS-1006:The OCR location +CRS is inaccessible. Details in /home/db/grid/11.2.0/log/amlsdb01/crsd/crsd.log.
2021-03-05 19:37:58.350:
[/home/db/grid/11.2.0/bin/orarootagent.bin(50179)]CRS-5822:Agent '/home/db/grid/11.2.0/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:122} in /home/db/grid/11.2.0/log/amlsdb01/agent/crsd/orarootagent_root/orarootagent_root.log.
2021-03-05 19:37:58.352:
[/home/db/grid/11.2.0/bin/oraagent.bin(50163)]CRS-5822:Agent '/home/db/grid/11.2.0/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:7} in /home/db/grid/11.2.0/log/amlsdb01/agent/crsd/oraagent_grid/oraagent_grid.log.
2021-03-05 19:37:58.353:
[/home/db/grid/11.2.0/bin/scriptagent.bin(50190)]CRS-5822:Agent '/home/db/grid/11.2.0/bin/scriptagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:7:10} in /home/db/grid/11.2.0/log/amlsdb01/agent/crsd/scriptagent_grid/scriptagent_grid.log.
2021-03-05 19:37:58.354:
[/home/db/grid/11.2.0/bin/oraagent.bin(50305)]CRS-5822:Agent '/home/db/grid/11.2.0/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:9:12324} in /home/db/grid/11.2.0/log/amlsdb01/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2021-03-05 19:37:58.355:
[ohasd(48740)]CRS-2765:Resource 'ora.crsd' has failed on server 'amlsdb01'.
2021-03-05 19:37:59.681:
[crsd(42690)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /home/db/grid/11.2.0/log/amlsdb01/crsd/crsd.log.
2021-03-05 19:37:59.689:
[crsd(42690)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /home/db/grid/11.2.0/log/amlsdb01/crsd/crsd.log.
2021-03-05 19:38:00.381:
[ohasd(48740)]CRS-2765:Resource 'ora.crsd' has failed on server 'amlsdb01'.

可以看到2021-03-03 15:21:25启动集群，2021-03-03 15:52:18时钟同步异常，集群及observer模式运行。从上述日志可以看出，最开始是集群时间出了问题。直到2021-03-05 19:37:37报错crs出现问题，OCR所在的磁盘组+CRS不通，导致crsd服务无法正常服务，具体信息在trace日志/home/db/grid/11.2.0/log/amlsdb01/crsd/crsd.log。

关于ctss我们知道如下特点：

1）可以采用操作系统的NTP服务，也可以使用Oracle自带的服务ctss，如果ntp没有启用，那么Oracle会自动启用自己的ctssd进程。

2）NTP和CTSS是可以共存的，且NTP的优先级要高于CTSS。

3）如果系统中同时有NTP和CTSS，那么集群的时间是由NTP同步的，CTSS会处于观望（Observer）模式，只有当集群关闭所有的NTP服务，CTSS才会处于激活（Active）模式。在一个集群中，只要有一个节点的ntp处于活动状态，那么集群的所有节点的CTSS都会处于激活（Active）模式

需要注意的是，要让CTSS处于激活（Active）模式，则不仅要关闭ntp服务（/sbin/service ntpd stop），还要删除/etc/ntp.conf文件（mv /etc/ntp.conf /etc/ntp.conf.bak），否则不能启用CTSS。

4、trace日志分析

2021-03-05 17:08:35.897: [UiServer][2071967488]{1:18733:4024} Sending message to PE. ctx= 0x7fc73001a290, Client PID: 49780
2021-03-05 17:08:35.898: [   CRSPE][2074068736]{1:18733:4024} Cmd : 0x7fc73411ff00 : flags: EVENT_TAG | FORCE_TAG | QUEUE_TAG
2021-03-05 17:08:35.898: [   CRSPE][2074068736]{1:18733:4024} Processing PE command id=4081. Description: [Stop Resource : 0x7fc73411ff00]
2021-03-05 17:08:35.898: [   CRSPE][2074068736]{1:18733:4024} Expression Filter : (((NAME == ora.CRS.dg) AND (LAST_SERVER == amlsdb01)) AND (STATE != OFFLINE))
2021-03-05 17:08:35.898: [   CRSPE][2074068736]{1:18733:4024} Expression Filter : (((NAME == ora.CRS.dg) AND (LAST_SERVER == amlsdb01)) AND (STATE != OFFLINE))
2021-03-05 17:08:35.899: [   CRSPE][2074068736]{1:18733:4024} Attribute overrides for the command: USR_ORA_OPI = true;
2021-03-05 17:08:35.899: [   CRSPE][2074068736]{1:18733:4024} Filtering duplicate ops: server [] state [OFFLINE]
2021-03-05 17:08:35.899: [   CRSPE][2074068736]{1:18733:4024} Op 0x7fc734126800 has 6 WOs
2021-03-05 17:08:35.900: [   CRSPE][2074068736]{1:18733:4024} RI [ora.CRS.dg amlsdb01 1] new target state: [OFFLINE] old value: [ONLINE]
2021-03-05 17:08:35.900: [   CRSPE][2074068736]{1:18733:4024} RI [ora.CRS.dg amlsdb01 1] new internal state: [STOPPING] old value: [STABLE]
2021-03-05 17:08:35.900: [   CRSPE][2074068736]{1:18733:4024} Sending message to agfw: id = 85445
2021-03-05 17:08:35.900: [   CRSPE][2074068736]{1:18733:4024} CRS-2673: Attempting to stop 'ora.CRS.dg' on 'amlsdb01'
2021-03-05 17:08:35.900: [  CRSOCR][2688939776]{1:18733:4024} Multi Write Batch processing...
2021-03-05 17:08:35.900: [    AGFW][2691041024]{1:18733:4024} Agfw Proxy Server received the message: RESOURCE_STOP[ora.CRS.dg amlsdb01 1] ID 4099:85445
2021-03-05 17:08:35.900: [    AGFW][2691041024]{1:18733:4024} Agfw Proxy Server forwarding the message: RESOURCE_STOP[ora.CRS.dg amlsdb01 1] ID 4099:85445 to the agent /home/db/grid/11.2.0/bin/oraagent_grid
2021-03-05 17:08:35.901: [UiServer][2071967488]{1:18733:4024} Container [ Name: ORDER
    MESSAGE:
    TextMessage[CRS-2673: Attempting to stop 'ora.CRS.dg' on 'amlsdb01']
    MSGTYPE:
    TextMessage[3]
    OBJID:
    TextMessage[ora.CRS.dg amlsdb01 1]
    WAIT:
    TextMessage[0]
]
2021-03-05 17:08:35.901: [ COMMCRS][2071967488]clscsendx: (0x7fc7480e7f60) Connection not active

可以看到crs磁盘组被停掉了。

5、asm日志分析

Wed Mar 03 15:22:12 2021
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 4
Number of processor cores in the system is 4
Number of processor sockets in the system is 2
Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.
  [name='eth1:1', type=1, ip=169.254.151.227, mac=00-50-56-9f-41-e2, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'eth0' configured from GPnP for use as a public interface.
  [name='eth0', type=1, ip=10.6.81.121, mac=00-50-56-9f-15-c4, net=10.6.81.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:1' configured from GPnP for use as a public interface.
  [name='eth0:1', type=1, ip=10.6.81.167, mac=00-50-56-9f-15-c4, net=10.6.81.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:3' configured from GPnP for use as a public interface.
  [name='eth0:3', type=1, ip=10.6.81.165, mac=00-50-56-9f-15-c4, net=10.6.81.0/24, mask=255.255.255.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /home/db/grid/11.2.0/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support  enabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /home/db/grid/11.2.0
System name:    Linux
Node name:    amlsdb01
Release:    2.6.32-431.el6.x86_64
Version:    #1 SMP Sun Nov 10 22:19:54 EST 2013
Machine:    x86_64
VM name:    VMWare Version: 6
Using parameter settings in server-side spfile +CRS/amlsdb-cluster/asmparameterfile/registry.253.939411949
System parameters with non-default values:
  large_pool_size          = 12M
  instance_type            = "asm"
  remote_login_passwordfile= "EXCLUSIVE"
  asm_diskstring           = "/dev/asm*"
  asm_diskgroups           = "DATA"
  asm_power_limit          = 1
  diagnostic_dest          = "/home/db/oracle"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.151.227
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
。。。。。。
NOTE: GMON heartbeating for grp 1
GMON querying group 1 at 4 for pid 23, osid 49857
NOTE: cache opening disk 0 of grp 1: CRS_0000 path:/dev/asmdisk1-crs
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: CRS_0001 path:/dev/asmdisk2-crs
NOTE: F1X0 found on disk 1 au 2 fcn 0.0
NOTE: cache opening disk 2 of grp 1: CRS_0002 path:/dev/asmdisk3-crs
NOTE: F1X0 found on disk 2 au 2 fcn 0.0
NOTE: cache mounting (first) normal redundancy group 1/0x8093464B (CRS)
* allocate domain 1, invalid = TRUE
kjbdomatt send to inst 2
NOTE: attached to recovery domain 1
NOTE: cache recovered group 1 to fcn 0.572
NOTE: redo buffer size is 256 blocks (1053184 bytes)
NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (CRS)
Process LGWR (pid 49809) is running at high priority QoS for Exadata I/O
NOTE: LGWR found thread 1 closed at ABA 11.73
NOTE: LGWR mounted thread 1 for diskgroup 1 (CRS)
NOTE: LGWR opening thread 1 at fcn 0.572 ABA 12.74
NOTE: cache mounting group 1/0x8093464B (CRS) succeeded
NOTE: cache ending mount (success) of group CRS number=1 incarn=0x8093464b
GMON querying group 2 at 5 for pid 23, osid 49857
NOTE: cache opening disk 0 of grp 2: DATA_0000 path:/dev/asmdisk1-data
NOTE: F1X0 found on disk 0 au 2 fcn 0.90688862
NOTE: cache opening disk 1 of grp 2: DATA_0001 path:/dev/asmdisk2-data
NOTE: cache opening disk 2 of grp 2: DATA_0002 path:/dev/asmdisk3-data
NOTE: cache opening disk 3 of grp 2: DATA_0003 path:/dev/asmdisk4-data
NOTE: cache opening disk 4 of grp 2: DATA_0004 path:/dev/asmdisk5-data
NOTE: cache opening disk 5 of grp 2: DATA_0005 path:/dev/asmdisk6-data
NOTE: cache opening disk 6 of grp 2: DATA_0006 path:/dev/asmdisk7-data
NOTE: cache opening disk 7 of grp 2: DATA_0007 path:/dev/asmdisk8-data
NOTE: cache mounting (not first) external redundancy group 2/0x4533464C (DATA)
kjbdomatt send to inst 2
NOTE: attached to recovery domain 2
NOTE: redo buffer size is 256 blocks (1053184 bytes)
NOTE: LGWR attempting to mount thread 1 for diskgroup 2 (DATA)
NOTE: LGWR found thread 1 closed at ABA 966.8944
NOTE: LGWR mounted thread 1 for diskgroup 2 (DATA)
NOTE: LGWR opening thread 1 at fcn 0.136958502 ABA 967.8945
NOTE: cache mounting group 2/0x4533464C (DATA) succeeded
NOTE: cache ending mount (success) of group DATA number=2 incarn=0x4533464c
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup CRS was mounted
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 2
SUCCESS: diskgroup DATA was mounted
SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:0:2} */
SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:0:2} */
Wed Mar 03 15:22:25 2021
NOTE: [crsd.bin@amlsdb01 (TNS V1-V3) 50046] opening OCR file
Starting background process ASMB
Wed Mar 03 15:22:26 2021
ASMB started with pid=25, OS id=50071
Wed Mar 03 15:22:26 2021
NOTE: client +ASM1:+ASM registered, osid 50073, mbr 0x0
Wed Mar 03 15:22:26 2021
WARNING: failed to online diskgroup resource ora.CRS.dg (unable to communicate with CRSD/OHASD)
Wed Mar 03 15:22:26 2021
NOTE: Attempting voting file refresh on diskgroup CRS
NOTE: Refresh completed on diskgroup CRS
. Found 3 voting file(s).
NOTE: Voting file relocation is required in diskgroup CRS
NOTE: Attempting voting file relocation on diskgroup CRS
NOTE: Successful voting file relocation on diskgroup CRS
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
Wed Mar 03 15:22:42 2021
NOTE: client amlsdb1:amlsdb registered, osid 50519, mbr 0x1
NOTE: ASM client amlsdb1:amlsdb disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /home/db/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_50519.trc
Wed Mar 03 15:22:54 2021
NOTE: client amldb1:amldb registered, osid 50820, mbr 0x1
Wed Mar 03 15:23:28 2021
ALTER SYSTEM SET local_listener=' (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.6.81.165)(PORT=1521))))' SCOPE=MEMORY SID='+ASM1';
Fri Mar 05 17:08:05 2021
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
Fri Mar 05 17:08:05 2021
NOTE: process _b000_+asm1 (130862) initiating offline of disk 0.509851336 (CRS_0000) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (130862) initiating offline of disk 1.509851335 (CRS_0001) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (130862) initiating offline of disk 2.509851334 (CRS_0002) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 8 for pid 27, osid 130862
ERROR: no read quorum in group: required 2, found 0 disks
NOTE: checking PST for grp 1 done.
NOTE: initiating PST update: grp = 1, dsk = 0/0x1e63b6c8, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 1/0x1e63b6c7, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 2/0x1e63b6c6, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 9 for pid 27, osid 130862
ERROR: no read quorum in group: required 2, found 0 disks
Fri Mar 05 17:08:05 2021
NOTE: cache dismounting (not clean) group 1/0x8093464B (CRS)
WARNING: Offline for disk CRS_0000 in mode 0x7f failed.
WARNING: Offline for disk CRS_0001 in mode 0x7f failed.
NOTE: messaging CKPT to quiesce pins Unix process pid: 130864, image: oracle@amlsdb01 (B001)
WARNING: Offline for disk CRS_0002 in mode 0x7f failed.
Fri Mar 05 17:08:05 2021
NOTE: halting all I/Os to diskgroup 1 (CRS)
Fri Mar 05 17:08:05 2021
NOTE: LGWR doing non-clean dismount of group 1 (CRS)
NOTE: LGWR sync ABA=12.76 last written ABA 12.76
Fri Mar 05 17:08:05 2021
NOTE: No asm libraries found in the system
Fri Mar 05 17:08:05 2021
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
ASM Health Checker found 1 new failures
Fri Mar 05 17:08:05 2021
List of instances:
1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 28)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE
0 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
freeing rdom 1
Fri Mar 05 17:08:05 2021
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0x8093464B (CRS)
SQL> alter diskgroup CRS dismount force /* ASM SERVER:2157135435 */
Fri Mar 05 17:08:05 2021
NOTE: cache deleting context for group CRS 1/0x8093464b
GMON dismounting group 1 at 10 for pid 28, osid 130864
NOTE: Disk CRS_0000 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0001 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0002 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 1
Fri Mar 05 17:08:35 2021
SUCCESS: diskgroup CRS was dismounted
SUCCESS: alter diskgroup CRS dismount force /* ASM SERVER:2157135435 */
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group CRS
Fri Mar 05 17:08:35 2021
NOTE: diskgroup resource ora.CRS.dg is offline
Fri Mar 05 17:08:35 2021
Errors in file /home/db/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_50069.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /home/db/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_50069.trc:
ORA-15078: ASM diskgroup was forcibly dismounted

从asm日志可以看出，Wed Mar 03 15:22:12 2021的时候启动的asm实例，到Fri Mar 05 17:08:05 2021，group1的三块盘超时15秒，失去响应，crs磁盘组被强制dismount了。也就是说在这个时候，才是真正导致的最开始反馈的故障现象，crs无法识别到OCR，crsd服务出现问题，导致的业务无法连接到监听。但是为啥磁盘连接会超时呢？

6、主机日志分析

从主机日志可以看出，磁盘IO超时时间段，networkmanager启动过网卡，由此怀疑是由于私网网卡重启，但是日志没有明确显示，或者虚拟机还是存储存在问题，联系虚拟机和存储工程师排查，没有检查出异常。

二、问题处理

1、时钟同步的问题

可以看到集群日志报错是时钟同步出现了问题，因为该故障当时处理的时候查看两个节点时间是同步的，由于紧急，随机重启了集群，没有深入分析，先列出如下分析思路。

检查分析

1）检查两个节点时间是否存在差异

ssh rac-2 date

date

2)检查OS的NTPD服务

# service ntpd status

ntpd is stopped

3）检查ctssd的运行情况

su - grid

crsctl stat res ora.ctssd -init

4) 借助crs的cluvfy工具诊断ctss错误的原因

cluvfy comp clocksync -n all -verbose

处理

为了解决时钟同步的问题，可以用以下方法解决（两个节点都执行）

方法一：

1）停用ctssd （grid用户执行）

crsctl stop resource ora.ctssd -init

2) 停止ntpd (root用户执行）

stopsrc -s xntpd

3) 启动ntpd,加入”-X“ 模式（root用户执行，启动需几分钟）

startsrc -s xntpd -a "-X"

4) 启动ctssd （grid 用户执行）

crsctl start resource ora.ctssd -init

方法二：

要让CTSS处于激活（Active）模式，则不仅要关闭ntp服务（/sbin/service ntpd stop），还要删除/etc/ntp.conf文件（mv /etc/ntp.conf /etc/ntp.conf.bak），否则不能启用CTSS。

该案例中检查ntp服务是stop状态，通过mv ntp配置文件使ctss处理active状态，具体步骤如下：

1）检查ctss状态

crsctl check ctss

2）移动ntp配置文件

mv /etc/ntp.conf /etc/ntp.conf.bak

3）再次检查ctss状态

crsctl check ctss

4）检查ctss服务

crsctl stat res ora.ctssd -t -init

发现已经是active状态了，集群日志也显示active了。本次以该方法处理，详细过程如下：

2、CRS磁盘组force dismount的问题

1）由于主机日志有显示networkmanager频繁让网卡重启，所以停掉networkmanager。

2）默认ASM 有个Delayed ASM PST heart beats的时间为15秒，超过15秒会被dismount，在dg的冗余配置成external redundanccy下可避免被dismount（normal/high redundancy 会发生这种情况）。asm有个隐藏参数会设置响应时间的长短，ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1)。

上述mos描述可以得知：

a、集群会定期检查磁盘心跳，通信是否正常，通 ocssd进程检测。

b、检测针对normal和high冗余模式，对于external冗余，不会遇到这个错误；

c、默认情况是15s超时，也就是说15s磁盘组还是没有对ASM实例响应的话，就会dismount磁盘组。

出现这种情况的几个场景：

a、多路径设备的物理路径中的某些路径丢失。

b、failover过程中。

c、存储/多路径/ OS维护过程中。

针对该问题的解决办法是修改隐藏参数_asm_hbeatiowait

SQL> set lines 300  
SQL> show parameter asm
NAME                                 TYPE                   VALUE
------------------------------------ ---------------------- ------------------------------
asm_diskgroups                       string                 DATA, CRS
asm_diskstring                       string                 /dev/asm*
asm_power_limit                      integer                1
asm_preferred_read_failure_groups    string
Enter value for par: hbea
old   1:  select  x.ksppinm  name,  y.ksppstvl  value,  y.ksppstdf  isdefault,  decode(bitand(y.ksppstvf,7),1,'MODIFIED',4,'SYSTEM_MOD','FALSE')  ismod,  decode(bitand(y.ksppstvf,2),2,'TRUE','FALSE')  isadj from  sys.x$ksppi x,  sys.x$ksppcv y where  x.inst_id = userenv('Instance') and        y.inst_id = userenv('Instance') and  x.indx = y.indx and x.ksppinm like '%_&par%' order by  translate(x.ksppinm, ' _', ' ')
new   1:  select  x.ksppinm  name,  y.ksppstvl  value,  y.ksppstdf  isdefault,  decode(bitand(y.ksppstvf,7),1,'MODIFIED',4,'SYSTEM_MOD','FALSE')  ismod,  decode(bitand(y.ksppstvf,2),2,'TRUE','FALSE')  isadj from  sys.x$ksppi x,  sys.x$ksppcv y where  x.inst_id = userenv('Instance') and        y.inst_id = userenv('Instance') and  x.indx = y.indx and x.ksppinm like '%_hbea%' order by  translate(x.ksppinm, ' _', ' ')
NAME                                     VALUE                ISDEFAULT          ISMOD                ISADJ
---------------------------------------- -------------------- ------------------ -------------------- ----------
_asm_hbeatiowait                         15                   TRUE               FALSE                FALSE
_asm_hbeatwaitquantum                    2                    TRUE               FALSE                FALSE
SQL>

设置参数：

SQL> alter system set "_asm_hbeatiowait"=120 scope=spfile;

System altered.

SQL> exit

最后重启集群服务后所有服务正常，监听服务也正常。

三、总结

1、集群时钟同步方式可以有两种，使用系统ntp或者自己的ctss。两者可以共存，如果系统中同时有NTP和CTSS，那么集群的时间是由NTP同步的，CTSS会处于观望（Observer）模式，只有当集群关闭所有的NTP服务，CTSS才会处于激活（Active）模式。在一个集群中，只要有一个节点的ntp处于活动状态，那么集群的所有节点的CTSS都会处于激活（Active）模式。

2、磁盘心跳检测超时默认是15秒，单在AIX上，rw_timeout设置会影响此设置，默认设置为30秒。

3、crs磁盘dismount后集群和数据库实例不一定宕机。

数据库

文章转载自RootLiu，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

数据库集群crs磁盘组dismount，crs服务丢失故障分析

评论