Time Jump Leaves the ASMB Process Stuck and Restarts the Instance [Archived Incident]

Original · IT泥瓦工 · 2023-05-08

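On some systems, a database host whose clock lags by a few minutes causes the documents entered by business users to carry the wrong time. In this incident, a vendor operations engineer adjusted the host time, and the instance on that node restarted as a result. Background: a new NTP server had been purchased, and an NTP client was installed on the host and used to correct the clock. The correction made the host time jump, and the ASMB process became abnormal.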

1. Fault environment

Operating system: CentOS 7.6
Database version: Oracle RAC 12.1.0.2

2. ASM fault log on the node

Tue May 17 16:49:20 2022
WARNING: client [prod2:prod:rac-cluster] not responsive for 396s; state=0x1. killing pid 11755
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_gen0_5172.trc
WARNING: fencing client [prod2:prod:rac-cluster] after 396 seconds (mbr 2)
WARNING: client [+ASM2:+ASM:wlrac-cluster] not responsive for 396s; state=0x1. pid 5262
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_gen0_5172.trc
WARNING: client [-MGMTDB:_mgmtdb:rac-cluster] not responsive for 396s; state=0x1. killing pid 7989
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_gen0_5172.trc
WARNING: fencing client [-MGMTDB:_mgmtdb:rac-cluster] after 396 seconds (mbr 1)
WARNING: ASMB has not responded for 396 seconds
NOTE: ASM umbilicus running slower than expected, ASMB diagnostic requested after 396 seconds
NOTE: ASMB process state dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_gen0_5172.trc
ERROR: terminating instance because ASMB is stuck for 396 seconds
Dumping diagnostic data in directory=[cdmp_20220517164922], requested by (instance=2, osid=5172 (GEN0)), summary=[abnormal instance termination].
Tue May 17 16:49:22 2022
Instance terminated by GEN0, pid = 5172

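The log shows that prod2:prod:rac-cluster, +ASM2:+ASM:wlrac-cluster, -MGMTDB:_mgmtdb:rac-cluster and ASMB were all reported as hung or blocked for 396 s. The related processes were killed, and the ASM instance terminated abnormally. The key line is: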

ERROR: terminating instance because ASMB is stuck for 396 seconds
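
When reviewing a log like this, it can help to pull the stall durations out per client. The following is a minimal sketch, not an Oracle tool: the regular expressions are assumptions based only on the lines in the excerpt above, and `summarize` is a hypothetical helper name.

```python
import re

# Sample lines copied from the alert log excerpt in this article.
SAMPLE = """\
WARNING: client [prod2:prod:rac-cluster] not responsive for 396s; state=0x1. killing pid 11755
WARNING: client [+ASM2:+ASM:wlrac-cluster] not responsive for 396s; state=0x1. pid 5262
WARNING: client [-MGMTDB:_mgmtdb:rac-cluster] not responsive for 396s; state=0x1. killing pid 7989
WARNING: ASMB has not responded for 396 seconds
ERROR: terminating instance because ASMB is stuck for 396 seconds
"""

# Assumed patterns, derived from the excerpt only.
CLIENT_RE = re.compile(r"client \[([^\]]+)\] not responsive for (\d+)s")
ASMB_RE = re.compile(r"ASMB is stuck for (\d+) seconds")


def summarize(log_text):
    """Return ({client: stall_seconds}, asmb_stuck_seconds or None)."""
    clients = {}
    asmb_stuck = None
    for line in log_text.splitlines():
        m = CLIENT_RE.search(line)
        if m:
            clients[m.group(1)] = int(m.group(2))
            continue
        m = ASMB_RE.search(line)
        if m:
            asmb_stuck = int(m.group(1))
    return clients, asmb_stuck


clients, asmb_stuck = summarize(SAMPLE)
for name, seconds in clients.items():
    print(f"{name}: {seconds}s")
print(f"ASMB stuck: {asmb_stuck}s")

# True when every client reports exactly the same stall duration.
same_duration = len(set(clients.values())) == 1
print(f"all clients report the same duration: {same_duration}")
```

Several unrelated clients reporting exactly the same stall duration at the same moment is one hint that the cause may be a clock discontinuity rather than each process hanging on its own.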

3. System log

The system log shows that at 16:49:18 NTP adjusted the host time, moving the clock forward by 392.8 s.
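
The figures from the two logs can be lined up with a little arithmetic. This sketch uses only values quoted in this article; it assumes the adjustment was a forward step and that the 16:49:18 timestamp was written with the clock already corrected, neither of which the logs shown here confirm.

```python
from datetime import datetime, timedelta

# Values quoted in this article.
step_applied = datetime(2022, 5, 17, 16, 49, 18)         # NTP adjustment (system log)
step_seconds = 392.8                                     # size of the adjustment
asm_reported_stall = 396                                 # seconds (ASM alert log)
instance_terminated = datetime(2022, 5, 17, 16, 49, 22)  # "Instance terminated by GEN0"

# Assumption: forward step, timestamp taken after the correction.
clock_before_step = step_applied - timedelta(seconds=step_seconds)

# Part of the reported stall that the step does not account for.
unexplained = round(asm_reported_stall - step_seconds, 1)

# Delay between the adjustment and the instance termination.
delay = (instance_terminated - step_applied).total_seconds()

print("clock read before the step:", clock_before_step.time())
print("stall not explained by the step:", unexplained, "s")
print("step to termination:", delay, "s")
```

Under those assumptions, almost all of the 396 s that ASM reported is accounted for by the clock step itself, and the instance was terminated a few seconds after the adjustment.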

4. Possible causes of this type of fault

a. Database or other resource contention that blocks ASMB

b. A host time jump

c. A software bug

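5. Recommendations

a. When correcting the time with NTP, prefer the -x option so that the clock is slewed in small increments; the correction then takes time to complete.

b. Make the adjustment during a maintenance window, with the cluster resources stopped.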
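
To get a feel for how long a slewed correction takes, here is a rough estimate in Python. It assumes the classic ntpd behaviour described in its documentation: a default step threshold of 128 ms, a threshold of 600 s when -x is given, and a maximum slew rate of 0.5 ms per second. Other implementations such as chronyd behave differently, and `plan_correction` is a hypothetical helper, not part of any NTP package.

```python
# Assumed ntpd defaults (taken from the ntpd documentation, not from this incident).
STEP_THRESHOLD_DEFAULT = 0.128   # seconds; larger offsets are stepped
STEP_THRESHOLD_WITH_X = 600.0    # seconds; threshold when ntpd runs with -x
MAX_SLEW_RATE = 0.0005           # seconds of correction per second (500 ppm)


def plan_correction(offset_seconds, use_x):
    """Return ("step", 0.0) or ("slew", estimated_seconds_to_converge)."""
    threshold = STEP_THRESHOLD_WITH_X if use_x else STEP_THRESHOLD_DEFAULT
    magnitude = abs(offset_seconds)
    if magnitude > threshold:
        # Offset too large to slew: the clock jumps in one go.
        return ("step", 0.0)
    # Slewing amortizes the offset at the maximum slew rate.
    return ("slew", magnitude / MAX_SLEW_RATE)


for use_x in (False, True):
    kind, seconds = plan_correction(392.8, use_x)
    days = seconds / 86400
    print(f"-x={use_x}: {kind}, about {days:.1f} days to converge")
```

With the 392.8 s offset from this incident, the estimate is a step without -x and a slew lasting roughly nine days with -x. That is why a slewed correction needs to be planned well ahead, and why stopping cluster resources in a maintenance window is the alternative when the clock has to be fixed quickly.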

6. Additional information

In pre-11gR2 clusters, system times are to be synchronized across cluster nodes using NTPD and NTPD should be configured to slew time to prevent false reboots.  Configure NTP client as per Document 759143.1 to take corrective action on this issue.

With 11gR2, Cluster Time Synchronization Daemon (CTSSD) can be used in place of NTPD. CTSSD will synchronize time with a reference node in the cluster when an NTPD is not found to be configured. Should you require synchronization from an external time source you must use NTPD which will cause CTSSD to run in "observer" mode.

Last modified: 2023-05-08 22:45:17
