咖啡哥
2021-02-26
Help needed, unresolved for a week: Oracle 19.3 RAC, installing the grid software, root.sh on node 2 fails at step 17 of 19: 'StartCluster'

Environment

OS: Oracle Linux 7.7, kernel 4.14.35-1902.3.2.el7uek.x86_64
Software version: Oracle 19.3 for x86-64
CPU: 48 cores
Memory: 628 GB

Problem description

Running the root script on the second node fails with:

Died at /u01/app/19.3.0/grid/crs/install/crsinstall.pm line 1970.

2021/02/26 09:56:18 CLSRSC-594: Executing installation step 17 of 19: 'StartCluster'.
2021/02/26 09:57:18 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.evmd' on 'eamdb02'
CRS-2672: Attempting to start 'ora.mdnsd' on 'eamdb02'
CRS-2676: Start of 'ora.mdnsd' on 'eamdb02' succeeded
CRS-2676: Start of 'ora.evmd' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'eamdb02'
CRS-2676: Start of 'ora.gpnpd' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'eamdb02'
CRS-2676: Start of 'ora.gipcd' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'eamdb02'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'eamdb02'
CRS-2676: Start of 'ora.cssdmonitor' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'eamdb02'
CRS-2672: Attempting to start 'ora.diskmon' on 'eamdb02'
CRS-2676: Start of 'ora.diskmon' on 'eamdb02' succeeded
CRS-2676: Start of 'ora.crf' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'eamdb02'
CRS-2676: Start of 'ora.cssdmonitor' on 'eamdb02' succeeded
CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00086:) in /u01/app/grid/diag/crs/eamdb02/crs/trace/ocssd.trc.
CRS-2883: Resource 'ora.cssd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVH-0116 : Path "/u01/app/19.3.0/grid/crs/install/cmdllroot.sh" with permissions "rw-r--r--" does not have execute permissions for the owner, file's group, and others on node "eamdb02".
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/crs/install/cmdllroot.sh" did not match the expected value on node "eamdb02". [Expected = "grid(54322)" ; Found = "root(0)"]
PRVG-2032 : Group of file "/u01/app/19.3.0/grid/crs/install/cmdllroot.sh" did not match the expected value on node "eamdb02". [Expected = "oinstall(54321)" ; Found = "root(0)"]
CRS-4000: Command Start failed, or completed with errors.
2021/02/26 10:06:49 CLSRSC-117: Failed to start Oracle Clusterware stack
Died at /u01/app/19.3.0/grid/crs/install/crsinstall.pm line 1970.

ocssd.trc reports "has a disk HB, but no network HB":

====/u01/app/grid/diag/crs/eamdb02/crs/trace/ocssd.trc

2021-02-26 09:56:38.244 :    CSSD:523484416: [     INFO] clssnmvDHBValidateNCopy: node 1, eamdb01, has a disk HB, but no network HB, DHB has rcfg 509799178, wrtcnt, 1035, LATS 1442214, lastSeqNo 0, uniqueness 1614304356, timestamp 1614304597/125793074
2021-02-26 09:56:38.244 :    CSSD:523484416: [     INFO] clssnmvDiskAvailabilityChange: voting file /dev/oracleasm/disks/CRS2 now online
2021-02-26 09:56:38.245 :    CSSD:523484416: [     INFO] clssnmvDiskAvailabilityChange: voting file /dev/oracleasm/disks/CRS1 now online
2021-02-26 09:56:38.247 :    CSSD:523484416: [     INFO] clssnmvDiskAvailabilityChange: voting file /dev/oracleasm/disks/CRS3 now online
2021-02-26 09:56:38.247 :    CSSD:523484416: [     INFO] clssnmlGetLease:Node does not have a valid lease going for lease acquistion
2021-02-26 09:56:38.247 :    CSSD:523484416: [     INFO] clssnmlpickslot:Optimize the lease acquisition for Fixed configuration slot provided by root scripts  with slot 2
2021-02-26 09:56:39.063 :    CSSD:340465408: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2021-02-26 09:56:40.063 :    CSSD:340465408: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2021-02-26 09:56:41.063 :    CSSD:340465408: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2021-02-26 09:56:41.248 :    CSSD:523484416: [     INFO] clssnmvDHBValidateNcopy: Saving DHB uniqueness for node(1/eamdb01), latestInfo(1614304356), readInfo(1614304356), nodeInfoDHB(0)
2021-02-26 09:56:41.248 :    CSSD:523484416: [     INFO] clssnmvDHBValidateNcopy: Setting LATS valid due to second DHB seen on disk(0x7f33b42a0450) for node(1/eamdb01) nodeStatus 0x3
2021-02-26 09:56:41.248 :    CSSD:523484416: [     INFO] clssnmvDHBValidateNcopy: Copying unique 1614304356 to node structure for node eamdb01, number 1; previous unique value was 0
2021-02-26 09:56:41.248 :    CSSD:523484416: [     INFO] clssnmvDHBValidateNCopy: node 1, eamdb01, has a disk HB, but no network HB, DHB has rcfg 509799178, wrtcnt, 1044, LATS 1445214, lastSeqNo 1035, uniqueness 1614304356, timestamp 1614304600/125796074
2021-02-26 09:56:41.250 :    CSSD:523484416: [     INFO] clssnmlpickslot:Optimizing lease acquisition with slot 2
2021-02-26 09:56:41.252 :    CSSD:4115277568: [     INFO] clssscthrdmain: Starting thread clssnmvLeaseAqIoThread
2021-02-26 09:56:41.253 :    CLSF:4115277568: Allocated CLSF context
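
To see how often CSSD reports a peer that is visible on the voting disks but silent on the interconnect, the trace can be scanned for the message quoted above. This is only a rough sketch: the regular expression is built from the two lines in this excerpt and is an assumption about the general format, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern taken from the excerpt above, e.g.
# "... node 1, eamdb01, has a disk HB, but no network HB, ..."
NO_NET_HB = re.compile(r"node (\d+), ([^,\s]+), has a disk HB, but no network HB")


def count_missing_network_hb(lines):
    """Return {(node_number, node_name): count} for disk-only heartbeat messages."""
    hits = Counter()
    for line in lines:
        match = NO_NET_HB.search(line)
        if match:
            hits[(int(match.group(1)), match.group(2))] += 1
    return dict(hits)


# Example use (the path is the one reported by CRS-1609 on your node):
#   with open("/u01/app/grid/diag/crs/eamdb02/crs/trace/ocssd.trc", errors="replace") as f:
#       print(count_missing_network_hb(f))
```

A steadily growing count for the other node while its disk heartbeat keeps advancing points at the private network rather than at storage.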

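I opened an SR, and support keeps saying it is a private network problem. The private network used to go through a switch; I changed it to a direct connection and it still fails.
Ping and ssh over the private network both work, although traceroute occasionally stalls.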

15 answers
咖啡哥
2021-03-01

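Thanks for all the suggestions. With help from an expert at Enmotech (恩墨), the cause has been found.
The problem was caused by the rp_filter kernel parameter not being set.

rp_filter (Reverse Path Filtering) defines how a network interface performs reverse-path validation on the packets it receives. It has three values, 0, 1 and 2:
0: Reverse-path validation is disabled.
1: Strict reverse-path validation. For every incoming packet, the kernel checks whether the reverse route is the best route, that is, whether a reply to the source address would leave through the interface the packet arrived on. If not, the packet is dropped.
2: Loose reverse-path validation. For every incoming packet, the kernel checks whether the source address is reachable, that is, whether a reverse route exists through any interface. If not, the packet is dropped.

The default is 1.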
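
To check which of these modes is actually in force on a node, the values can be read from /proc. The sketch below assumes the usual Linux layout under /proc/sys/net/ipv4/conf and relies on the kernel rule that the larger of the `all` value and the per-interface value is the one applied; the function names and the `conf_root` parameter are illustrative only.

```python
from pathlib import Path

MODES = {0: "off", 1: "strict", 2: "loose"}


def effective_rp_filter(iface, conf_root="/proc/sys/net/ipv4/conf"):
    """Return the rp_filter value applied to iface: max(conf/all, conf/<iface>)."""
    root = Path(conf_root)
    per_iface = int((root / iface / "rp_filter").read_text().strip())
    all_value = int((root / "all" / "rp_filter").read_text().strip())
    return max(per_iface, all_value)


def describe(iface, conf_root="/proc/sys/net/ipv4/conf"):
    """Format the effective value with its mode name, e.g. 'bond0: rp_filter=1 (strict)'."""
    value = effective_rp_filter(iface, conf_root)
    return f"{iface}: rp_filter={value} ({MODES.get(value, 'unknown')})"


# Example use on a cluster node (interface names are whatever your private NICs are called):
#   print(describe("bond0"))
```

Because the maximum is used, a per-interface value of 0 has no effect while `all` is 1, whereas a per-interface value of 2 does take effect.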

The Oracle installation documentation explains this as well:

Multiple Private Interconnects and Oracle Linux

Without these rp_filter parameter settings systems, interconnect packets can be blocked or discarded.
The rp_filter values set the Reverse Path filter to no filtering (0), to strict filtering (1), or to loose filtering (2). Set the rp_filter value for the private interconnects to either 0 or 2. Setting the private interconnect NIC to 1 can cause connection issues on the private interconnect. It is not considered unsafe to disable or relax this filtering, because the private interconnect should be on a private and isolated network.

For example, where eth1 and eth2 are the private interconnect NICs, and eth0 is the public network NIC, set the rp_filter of the private address to 2 (loose filtering), the public address to 1 (strict filtering), using the following entries in /etc/sysctl.conf:

net.ipv4.conf.eth2.rp_filter = 2 
net.ipv4.conf.eth1.rp_filter = 2 
net.ipv4.conf.eth0.rp_filter = 1

Adding the following to /etc/sysctl.conf solved the problem:

net.ipv4.conf.bond0.rp_filter = 2
net.ipv4.conf.bond1.rp_filter = 2
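
As a quick sanity check of entries like these, the file text can be parsed and every private NIC confirmed to be set to 0 or 2, as the Oracle note quoted above requires. This is a minimal sketch: the function names are invented for illustration, and it only reads the text it is given, so settings coming from other files or from the running kernel are not considered.

```python
import re

# Matches lines such as "net.ipv4.conf.bond0.rp_filter = 2"
RP_LINE = re.compile(r"^\s*net\.ipv4\.conf\.(.+?)\.rp_filter\s*=\s*(\d+)\s*$")


def parse_rp_filter(sysctl_text):
    """Return {interface: value} for every rp_filter line in the given text."""
    settings = {}
    for line in sysctl_text.splitlines():
        if line.lstrip().startswith(("#", ";")):
            continue  # skip comment lines
        match = RP_LINE.match(line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


def private_nics_to_fix(sysctl_text, private_ifaces):
    """List private interfaces that are missing or not set to 0 or 2."""
    settings = parse_rp_filter(sysctl_text)
    return [iface for iface in private_ifaces if settings.get(iface) not in (0, 2)]


# Example use:
#   with open("/etc/sysctl.conf") as f:
#       print(private_nics_to_fix(f.read(), ["bond0", "bond1"]))
```

After editing the file, `sysctl -p` reloads it so the values apply without a reboot.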

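What I still do not understand is why, with the private network bonded, it works when the MTU is set to 9000 but not with the default 1500. If anyone knows, please let me know.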

刘德秋
2022-03-09
I ran into the same problem, and none of the settings you described fixed it. Do you have any other approach?
lscomeon
2021-02-26

Check connectivity with Cluvfy. Write a loop that runs the check 100 times and look at the results. Also check the firewall.
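
One way to write such a loop is sketched below. It uses the system `ping` rather than Cluvfy, purely to keep the example self-contained; the probe is passed in as a function so any other check can be substituted. The host name in the usage comment is hypothetical.

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Send one ICMP echo with the Linux ping command; True if it was answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def repeat_probe(probe, attempts=100):
    """Call probe() attempts times; return the 1-based indices of the failed attempts."""
    return [i for i in range(1, attempts + 1) if not probe()]


# Example use ("eamdb01-priv" stands in for the other node's private address):
#   failures = repeat_probe(lambda: ping_once("eamdb01-priv"))
#   print(f"{len(failures)} of 100 probes failed: {failures}")
```

In this thread plain ping over the private network already succeeded, so a clean run of this loop does not by itself rule out the interconnect.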

文成
2021-02-26

Check the firewall; try turning it off.
Is the private network address on the same subnet as the LAN?

始于脚下
2021-02-26

PRVH-0116 : Path "/u01/app/19.3.0/grid/crs/install/cmdllroot.sh" with permissions "rw-r--r--" does not have execute permissions for the owner, file's group, and others on node "eamdb02".
Check the permissions on /u01/app/19.3.0/grid/crs/install/cmdllroot.sh; this file needs execute permission. Check its owner and group as well, and the directory permissions too.
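
The check described here can be scripted so that the same report can be produced on both nodes and compared. The sketch below only inspects what the PRVH-0116 and PRVG-2031/2032 messages complain about (execute bits, owner and group); the function names are made up for illustration.

```python
import os
import stat


def missing_exec_bits(path):
    """Return which of owner/group/others lack execute permission on path."""
    mode = os.stat(path).st_mode
    wanted = {"owner": stat.S_IXUSR, "group": stat.S_IXGRP, "others": stat.S_IXOTH}
    return [who for who, bit in wanted.items() if not mode & bit]


def owner_ids(path):
    """Return (uid, gid) of path, for comparison with the expected grid/oinstall IDs."""
    info = os.stat(path)
    return info.st_uid, info.st_gid


# Example use on each node:
#   p = "/u01/app/19.3.0/grid/crs/install/cmdllroot.sh"
#   print(missing_exec_bits(p), owner_ids(p))
```

The error output in the question expects uid 54322 (grid) and gid 54321 (oinstall) for this file, so `owner_ids` returning `(0, 0)` would match the reported mismatch.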

lscomeon
2021-02-26

Please post a copy of the information the SR asked for.

咖啡哥
2021-02-26

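I already tried changing the owner and permissions of cmdllroot.sh, and it still fails:

--- After changing the permissions, running it again still reports a permission-related error: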

2021/02/20 16:55:37 CLSRSC-594: Executing installation step 17 of 19: 'StartCluster'.
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'eamdb02'
CRS-2672: Attempting to start 'ora.evmd' on 'eamdb02'
CRS-2676: Start of 'ora.mdnsd' on 'eamdb02' succeeded
CRS-2676: Start of 'ora.evmd' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'eamdb02'
CRS-2676: Start of 'ora.gpnpd' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'eamdb02'
CRS-2676: Start of 'ora.gipcd' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'eamdb02'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'eamdb02'
CRS-2676: Start of 'ora.cssdmonitor' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'eamdb02'
CRS-2672: Attempting to start 'ora.diskmon' on 'eamdb02'
CRS-2676: Start of 'ora.diskmon' on 'eamdb02' succeeded
CRS-2676: Start of 'ora.crf' on 'eamdb02' succeeded
CRS-2676: Start of 'ora.cssd' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'eamdb02'
CRS-2672: Attempting to start 'ora.ctssd' on 'eamdb02'
CRS-2676: Start of 'ora.ctssd' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'eamdb02'
CRS-2676: Start of 'ora.crsd' on 'eamdb02' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'eamdb02' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'eamdb02'
CRS-2676: Start of 'ora.asm' on 'eamdb02' succeeded
CRS-2883: Resource 'ora.crsd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
CRS-4000: Command Start failed, or completed with errors.
2021/02/20 17:08:48 CLSRSC-117: Failed to start Oracle Clusterware stack
Died at /u01/app/19.3.0/grid/crs/install/crsinstall.pm line 1970.

@lscomeon, a lot of information was uploaded to the SR. Which part would you like to see?

咖啡哥
2021-02-26

The yum repository configured on the server is for CentOS 7, but my operating system is Oracle Linux. Could the installed packages be incompatible?

始于脚下
2021-02-27

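I suggest carefully re-checking every configuration step of the installation and then working through the errors one at a time. It is probably not a compatibility issue: the dependency packages are already installed, deployment on node 1 succeeded, and the logs show nothing compatibility-related. If you are short on time, the fastest approach is probably to wipe it and start over.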

始于脚下
2021-02-27

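The permissions are probably all messed up. Following the error messages, compare against the working permissions on node 1 and change node 2 to match.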

咖啡哥
2021-02-27

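@始于脚下
The permissions are the same as on node 1. I also asked some experienced people, and the errors reported when running root.sh are not necessarily accurate. The failure happens when the cluster starts, where ocssd.trc reports "has a disk HB, but no network HB".
That is probably where the problem lies.
Next I plan to remove the dual-NIC bonding and test again.
Apart from the dependency packages, which I have not uninstalled and reinstalled, I have already removed and reinstalled grid many times for testing.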

咖啡哥
2021-03-01

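Based on my testing, there are two configurations that work:

  1. The private network is not bonded.
  2. The private network is bonded, but the MTU is set to 9000.

Does anyone know why?

I have seen MOS documents that recommend setting the private network MTU to 9000, but the switch has to be configured to match. Our network engineer says nothing was set on the switch: the private network is just a VLAN, which is layer 2, so no MTU can be set on it.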
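
Whether a given MTU really passes end to end can be tested with a ping that forbids fragmentation. The sketch below only computes the payload size and builds the command; it assumes IPv4 without IP options and the Linux iputils `ping` flags `-M do` and `-s`. The host name in the usage comment is hypothetical.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet of size mtu."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def df_ping_command(host, mtu, count=3):
    """Build a ping command that sets Don't Fragment and fills the packet up to mtu."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(icmp_payload_for_mtu(mtu)), host]


# Example use ("eamdb01-priv" stands in for the other node's private address):
#   import subprocess
#   subprocess.run(df_ping_command("eamdb01-priv", 9000))
```

If the 9000-byte variant fails while the 1500-byte one succeeds, something on the path (NIC, bond or switch port) is not passing jumbo frames.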

lscomeon
2021-03-01

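Look at the private network topology; the problem is most likely there.

New switches all support jumbo frames.
MTU is a layer 3 concept, so there is no MTU setting unless the VLAN does routing, which is unnecessary here.

Since it works without bonding on the private network, look at how the bonding was set up: check the configuration, whether it is team or bond, which mode it uses, and what the topology looks like.

Private network bonding and MTU 9000 are not inherently related; most likely a configuration mismatch is causing a conflict.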
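
For the bond case, the mode and member interfaces can be read from the status file the Linux bonding driver exposes under /proc/net/bonding/. The sketch below parses that text; it covers only the kernel bonding driver (not team), and the function name is made up for illustration.

```python
def parse_bonding(text):
    """Extract the bonding mode and slave interface names from /proc/net/bonding/<bond> text."""
    info = {"mode": None, "slaves": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            info["mode"] = value
        elif key == "Slave Interface":
            info["slaves"].append(value)
    return info


# Example use on each node, then compare the two results:
#   with open("/proc/net/bonding/bond0") as f:
#       print(parse_bonding(f.read()))
```

Running this on both nodes makes a mismatch in mode or member NICs easy to spot.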

咖啡哥
2021-03-01

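On my earlier question: why, with the private network bonded, does it work with MTU 9000 but not with the default 1500?

This is probably still related to rp_filter=1.
Possibly some of the packets Oracle sends over the private network are larger than 1500 bytes but smaller than 9000.
With bonding, such a packet is split into two or more pieces that are sent through different ports, so reverse-path validation decides the route is not the best one and drops the packet.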

文成
2021-03-01

Thanks for sharing!

咖啡哥
2021-03-03
Question closed: the problem has been resolved. Thanks to everyone for the active responses.
W
2022-10-12
How did you solve it? I have tried many approaches and none of them work.