四九年入国军
2022-06-01
AIX 7.1 + Oracle 11.2.0.4 RAC: looking for the cause of node 2's automatic host reboot

Node 2 (32.161) rebooted at 10:11 this morning. On node 2 only the logs written after the reboot are available; there is nothing from before the reboot.

Clue 1:

      2022-06-01 10:11:59.193: [cssd(11207262)]CRS-1612:Network communication with node ncshisdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.848 seconds

Node 1 logged this warning. My understanding is that this heartbeat loss occurred while node 2 was already in the process of rebooting.

Clue 2:

Starting with 11.2.0.2, when any of the following happens, Grid Infrastructure (GI) restarts the clusterware stack instead of rebooting the node (see the command sketch after this list):

1. A node loses network heartbeats continuously for longer than misscount.

2. A node cannot access a majority of the voting files (VF).

3. A member kill is escalated to a node kill.

In earlier versions, the clusterware (CRS) would simply reboot the node in these situations.
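These thresholds and the voting file status can be verified on a running node with the standard crsctl commands (a minimal sketch; values shown are the usual 11.2 defaults, not taken from this environment):

# CSS network heartbeat timeout (misscount), 30 seconds by default in 11.2
crsctl get css misscount

# confirm that all voting files are online and accessible
crsctl query css votedisk

# overall state of the clusterware stack on the local node
crsctl check crs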

So why did my node 2 get a full machine reboot rather than just a clusterware restart? I'd appreciate an expert's explanation.
7 replies
四九年入国军
Uploaded attachment: rac日志.zip
Uncopyrightable

The timeout is probably related to the network or to the connection to the shared disks.

Starting with 11.2.0.2, because of the new rebootless restart feature, the node is not rebooted; instead the clusterware stack is restarted.

I can't find the exact reference right now; hopefully other experts can add to this~


Rebootless Node Fencing

In versions before 11.2.0.2 Oracle Clusterware tried to prevent a split-brain with a fast reboot (better: reset) of the server(s) without waiting for ongoing I/O operations or synchronization of the file systems. This mechanism has been changed in version 11.2.0.2 (first 11g Release 2 patch set). After deciding which node to evict, the clusterware:

. attempts to shut down all Oracle resources/processes on the server (especially processes generating I/Os)

. will stop itself on the node

. afterwards Oracle High Availability Service Daemon (OHASD) will try to start the Cluster Ready Services (CRS) stack again. Once the cluster interconnect is back online, all relevant cluster resources on that node will automatically start

. kill the node if stop of resources or processes generating I/O is not possible (hanging in kernel mode, I/O path, etc.)

This behavior change is particularly useful for non-cluster aware applications.


Prior to 11g R2, during voting disk failures the node would be rebooted to protect the integrity of the cluster. But the underlying problem is not necessarily just a communication issue: the node may be hanging, or an I/O operation may be hanging, so the decision to reboot can potentially be the wrong one. So Oracle Clusterware now fences the node without rebooting it. This is a big achievement and a change in the way the cluster is designed.

The reason to avoid the reboot is that during a reboot, resources need to be re-mastered and the remaining nodes have to re-form the cluster. In a large cluster with many nodes this can be a very expensive operation, so Oracle fences the node by killing the offending processes: the clusterware stack on that node is shut down, but the node itself is not. Once the I/O path or the network heartbeat is available again, the clusterware is started again. The data is still protected, but without the pain of rebooting nodes. In the cases where a reboot is needed to protect integrity, the clusterware will still decide to reboot the node.
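If the rebootless path had been taken, the host would have stayed up while OHASD restarted the stack. On a node that is up, this can be distinguished from a host reboot roughly like this (a minimal sketch with standard 11.2 crsctl and OS commands; not output from this system):

# lower-stack (OHASD-managed) resources such as ora.cssd, ora.ctssd, ora.crsd
crsctl stat res -t -init

# compare OS uptime with the start time of the clusterware daemons:
# a stack-only restart leaves OS uptime untouched, a host reboot resets it
uptime
ps -ef | egrep 'ohasd|ocssd' | grep -v grep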

Root__Liu

Could you provide the ASM logs, GPnP logs, and GIPC logs from both nodes, the database alert logs, the host (OS) logs from both nodes, and OSWatcher output or other host-level monitoring data?


From the attached files, my preliminary judgment is that node 2's host went down and rebooted first, and the cluster was the victim. In other words, the host rebooted for some reason; the clusterware was running normally when the host crashed, which is why nothing was written to its logs, and the messages in node 1's log only appeared once node 1 detected that node 2's network was unreachable.

We need more information from you to analyze this further.
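For reference, on 11.2 Grid Infrastructure those clusterware logs normally sit under the Grid home, and on AIX the error report usually records why a host went down (paths and commands below are the usual defaults, assumed rather than confirmed for this system):

# clusterware logs, per node, under the Grid home
#   $GRID_HOME/log/<hostname>/alert<hostname>.log    clusterware alert log
#   $GRID_HOME/log/<hostname>/cssd/ocssd.log         CSS daemon (heartbeats, evictions)
#   $GRID_HOME/log/<hostname>/gpnpd/gpnpd.log        GPnP daemon
#   $GRID_HOME/log/<hostname>/gipcd/gipcd.log        GIPC daemon
# ASM and database alert logs are under the ADR ($ORACLE_BASE/diag/...)

# AIX host-level evidence for the reboot
errpt -a | more        # full error report, including crash/reboot entries
last reboot            # reboot history from /var/adm/wtmp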



First the heartbeat check failed, so node 2 was to be evicted and its clusterware restarted; but node 2 could not be terminated, the clusterware restart failed, and in the end the machine itself was rebooted.

手机用户8432

Checks:

1. cluvfy comp clocksync -n all — verify that clock synchronization is healthy on every node.

2. crsctl start res ora.ctssd -init — then look at the CTSS running mode.

If CTSS runs in active mode, check whether the time on the two nodes is consistent; if it runs in observer mode, check whether the NTP service is working properly. (A short sketch of these checks follows.)
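A quick way to run those checks (standard 11.2 and AIX commands; assumed defaults, not verified against this system):

# reports whether CTSS is running in active or observer mode
crsctl check ctss

# clock synchronization check across all cluster nodes
cluvfy comp clocksync -n all

# NTP status on AIX (relevant when CTSS is in observer mode)
lssrc -s xntpd
ntpq -p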

四九年入国军

Node 2 had already rebooted shortly after 10:11:59; node 1's heartbeat-loss messages are from 10:12, by which time the machine was already down.

四九年入国军
Question closed: if it remains unresolved I will open a new thread.