RAC Heartbeat Mechanisms
Heartbeat types
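- Cluster heartbeat (cssdagent, cssdmonitor)
  - Connectivity between nodes
  - A shared location that holds each node's connectivity information, recorded and updated promptly
  - Self-monitoring of the local node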
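- Network heartbeat (Network HeartBeat, NHB)
  - Ensures connectivity between nodes so that each node's state can be confirmed
  - The ocssd.bin process sends a network heartbeat to the other nodes every second and takes action when heartbeats go wrong
  - Related ocssd.bin threads
    - Sending thread: sends a network heartbeat to the other nodes every second
    - Analysis thread: analyzes the heartbeat information; if a node's heartbeats keep going missing, it notifies the cluster to reconfigure
    - Dispatch thread: receives messages and delivers them to the appropriate thread
    - Cluster reconfiguration thread: starts the reconfiguration when it receives a reconfiguration notice from the analysis thread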
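- Disk heartbeat (Disk HeartBeat, DHB)
  - Comes from the voting disk
  - Resolves split-brain
  - Once split-brain occurs, the reconfiguration thread uses the information on the voting disk to learn which nodes can still reach each other, and from that determines how many sub-clusters the cluster has split into
  - Related threads
    - Disk heartbeat thread: writes disk heartbeats to the voting disk and also reads the kill block information on the voting disk to determine whether the local node must reboot
    - Disk heartbeat monitor thread: monitors whether the disk heartbeat thread can send heartbeats normally and read the kill block information correctly
    - Kill block thread: monitors the kill block information in the voting file (VF)
  - Use an odd number of voting disks, so that more than half of them can be accessed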
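- Local heartbeat (Local HeartBeat, LHB)
  - Monitors ocssd.bin and the state of the local node
  - While sending the network heartbeat every second, ocssd.bin also sends its own state to the local cssdagent and cssdmonitor
  - Related threads
    - Sending thread
  - In 11.2 and later, the local state is integrated into the overall heartbeat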
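The analysis thread's job can be pictured as a simple timeout check. The sketch below is only an illustration, not how ocssd.bin is implemented: `nhb_lost` and `check_nodes` are hypothetical helpers, timestamps are plain epoch seconds, and a gap of `misscount` seconds or more is assumed to count as a lost heartbeat.

```shell
# Hypothetical helper: succeed when a node's network heartbeat counts as lost.
# $1 = epoch seconds of the last NHB seen, $2 = current epoch seconds,
# $3 = misscount threshold in seconds (assumed default: 30).
nhb_lost() {
    [ $(( $2 - $1 )) -ge "${3:-30}" ]
}

# Read "node last_seen" pairs from stdin and classify each node.
# $1 = current epoch seconds, $2 = misscount threshold in seconds.
check_nodes() {
    now=$1
    misscount=$2
    while read -r node last_seen; do
        if nhb_lost "$last_seen" "$now" "$misscount"; then
            echo "$node: heartbeat lost, reconfiguration needed"
        else
            echo "$node: ok"
        fi
    done
}

# Demo with made-up timestamps: node 1 was seen 5 s ago, node 2 was seen 40 s ago.
printf 'oel7n01 995\noel7n02 960\n' | check_nodes 1000 30
# oel7n01: ok
# oel7n02: heartbeat lost, reconfiguration needed
```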
Log analysis
- Analyzing the ocssd.trc log
2022-11-18 09:45:24.016 : CSSD:3559126784: [ INFO] clssnmSendingThread: sending status msg to all nodes
2022-11-18 09:45:24.017 : CSSD:3559126784: [ INFO] clssnmSendingThread: sent 5 status msgs to all nodes --->[local heartbeat]
2022-11-18 09:45:25.548 : CSSD:3584321280: [ INFO] clssgmcpGroupDataResp: sending type 5, size 164, status 0 to clientID 1:23:0
2022-11-18 09:45:25.809 : CSSD:3587475200: [ INFO] : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1 --->[ASM heartbeat]
2022-11-18 09:45:25.809 : CSSD:3587475200: [ INFO] : Sending member data change to GMP for group HB+ASM, memberID 17:2:1
2022-11-18 09:45:25.810 : CSSD:3599832832: [ INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:1, datatype 1 datasize 4 --->[ASM heartbeat update]
2022-11-18 09:45:25.810 : CSSD:3584321280: [ INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:41:2
2022-11-18 09:45:26.450 : CSSD:3584321280: [ INFO] clssgmcpGroupDataResp: Completed request with sequence number(201) for clientID 1:42:0
2022-11-18 09:45:26.450 : CSSD:3584321280: [ INFO] clssgmcpGroupDataResp: sending type 5, size 167, status 0 to clientID 1:42:0
2022-11-18 09:45:27.866 : CSSD:3587475200: [ INFO] : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1
2022-11-18 09:45:27.866 : CSSD:3587475200: [ INFO] : Sending member data change to GMP for group HB+ASM, memberID 17:2:1
2022-11-18 09:45:27.866 : CSSD:3599832832: [ INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:1, datatype 1 datasize 4
2022-11-18 09:45:27.866 : CSSD:3584321280: [ INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:41:2
2022-11-18 09:45:28.370 : CSSD:3591948032: [ INFO] clssgmpcGMCReqWorkerThread: processing msg (0x7f9cc40414f0) type 2, msg size 76, payload (0x7f9cc404151c) size 32, sequence 2232, for clientID 1:41:2
2022-11-18 09:45:28.639 : CSSD:3584321280: [ INFO] clssgmcpGroupDataResp: Completed request with sequence number(202) for clientID 1:42:0
2022-11-18 09:45:28.639 : CSSD:3584321280: [ INFO] clssgmcpGroupDataResp: sending type 5, size 167, status 0 to clientID 1:42:0
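To get a quick overview of a trace like the one above, the lines can be grouped by the function name that follows `[ INFO]`. This is a minimal sketch that assumes every line has the whitespace-separated layout shown above; `summarize_trc` is a hypothetical helper name, not an Oracle tool.

```shell
# Count trace lines per CSSD function name (the token after "[ INFO]").
# Lines whose message starts with a bare ":" carry no function name
# and are grouped under "(none)".
# Reads the files given as arguments, or stdin when none are given.
summarize_trc() {
    awk '$4 ~ /^CSSD:/ {
             fn = $7
             sub(/:$/, "", fn)
             if (fn == "") fn = "(none)"
             count[fn]++
         }
         END { for (fn in count) print count[fn], fn }' "$@" |
        sort -k1,1nr -k2
}
```

Running `summarize_trc ocssd.trc` prints one `count name` line per function, most frequent first, which makes it easier to see which threads dominate the trace.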
- NHB
[root@oel7n01 trace]# cat ocssd.trc |grep NHB
- DHB
[root@oel7n01 trace]# cat ocssd.trc |grep DHB
2022-12-06 21:42:57.386 : CSSD:1122473728: [ INFO] clssnmvReadDskHeartbeat: Reading DHBs to get the latest info for node(2/oel7n02), LATSvalid(0), nodeInfoDHB uniqueness(0)
2022-12-06 21:42:57.386 : CSSD:1122473728: [ INFO] clssnmvDHBValidateNcopy: Saving DHB uniqueness for node(2/oel7n02), latestInfo(1670334162), readInfo(1670334162), nodeInfoDHB(0)
2022-12-06 21:42:57.386 : CSSD:1122473728: [ INFO] clssnmvDHBValidateNcopy: Setting LATS valid due to second DHB seen on disk(0x7fc33c0fa110) for node(2/oel7n02) nodeStatus 0x1
2022-12-06 21:49:34.317 : CSSD:1122473728: [ INFO] clssnmvReadDskHeartbeat: Reading DHBs to get the latest info for node(2/oel7n02), LATSvalid(0), nodeInfoDHB uniqueness(1670334162)
2022-12-06 21:49:34.317 : CSSD:1122473728: [ INFO] clssnmvDHBValidateNcopy: Setting LATS valid due to uniqueness change for node(2/oel7n02), nodeInfoDHB(1670334162), readInfo(1670334565)
2022-12-06 21:49:34.317 : CSSD:1122473728: [ INFO] clssnmvDHBValidateNcopy: Saving DHB uniqueness for node(2/oel7n02), latestInfo(1670334162), readInfo(1670334565), nodeInfoDHB(1670334162)
- LHB
[root@oel7n01 trace]# cat ocssd.trc |grep LHB
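The three grep commands above can be folded into one loop. A small sketch, with `count_hb` as a hypothetical helper: it only counts the lines that contain each abbreviation and says nothing about whether the heartbeats themselves are healthy.

```shell
# Print how many lines of a trace file mention each heartbeat abbreviation.
# $1 = path to the trace file, for example ocssd.trc.
count_hb() {
    for hb in NHB DHB LHB; do
        # grep -c prints 0 when nothing matches, so every type gets a line.
        printf '%s %s\n' "$hb" "$(grep -c "$hb" "$1")"
    done
}

# Usage on a cluster node (not executed here):
#   count_hb ocssd.trc
```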
Default values and how to change them
- Query the NHB and DHB defaults
#NHB
[root@oel7n01 ~]# crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
#DHB
[root@oel7n01 ~]# crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.
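# The default network heartbeat threshold (misscount) is 30 s and the default disk heartbeat threshold (disktimeout) is 200 s
- Change the NHB and DHB values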
#NHB
[root@oel7n02 ~]# crsctl set css misscount 50
CRS-4684: Successful set of parameter misscount to 50 for Cluster Synchronization Services.
[root@oel7n01 ~]# crsctl get css misscount
CRS-4678: Successful get misscount 50 for Cluster Synchronization Services.
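# Note: the value was changed on node 2, and a query on node 1 returns the same value, so the heartbeat threshold is consistent across nodes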
#DHB
[root@oel7n01 ~]# crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 50
CRS-4696: Failed to set parameter disktimeout to 50 due to conflicting parameter misscount; the new value for disktimeout must be greater than 50.
[root@oel7n01 trace]# crsctl set css disktimeout 51
CRS-4684: Successful set of parameter disktimeout to 51 for Cluster Synchronization Services.
# disktimeout must be greater than misscount, so with misscount set to 50 the minimum disktimeout is 51
[root@oel7n01 trace]# crsctl set css disktimeout 1000
CRS-4684: Successful set of parameter disktimeout to 1000 for Cluster Synchronization Services.
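The CRS-4696 error above implies a simple rule: disktimeout must be strictly greater than misscount. A pre-check along these lines could catch the conflict before calling crsctl. `css_value` and `disktimeout_ok` are hypothetical helpers, and the parsing assumes the exact CRS-4678 message layout shown in this article.

```shell
# Extract the numeric value from a
# "CRS-4678: Successful get <name> <value> for ..." line read from stdin.
css_value() {
    awk '$1 == "CRS-4678:" { print $5 }'
}

# Succeed only when the proposed disktimeout ($1) is greater than misscount ($2).
disktimeout_ok() {
    [ "$1" -gt "$2" ]
}

# Demo using the message text from this article instead of a live cluster.
misscount=$(echo 'CRS-4678: Successful get misscount 50 for Cluster Synchronization Services.' | css_value)
for candidate in 50 51; do
    if disktimeout_ok "$candidate" "$misscount"; then
        echo "disktimeout $candidate: accepted"
    else
        echo "disktimeout $candidate: rejected, must be greater than $misscount"
    fi
done
# disktimeout 50: rejected, must be greater than 50
# disktimeout 51: accepted
```

On a live node the first step would read `crsctl get css misscount | css_value` instead of the hard-coded message.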
# Easter egg
# Let's see how large a value is accepted
[root@oel7n01 trace]# crsctl set css disktimeout 100000000
CRS-4684: Successful set of parameter disktimeout to 100000000 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 10000000000000000000000
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]# crsctl set css disktimeout 1000000000000000000
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]# crsctl set css disktimeout 1000000000000000
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]# crsctl set css disktimeout 10000000000000
CRS-4684: Successful set of parameter disktimeout to 1316134912 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 1316134913
CRS-4684: Successful set of parameter disktimeout to 1316134913 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 2000000000
CRS-4684: Successful set of parameter disktimeout to 2000000000 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 20000000000
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]# crsctl set css disktimeout 9000000000
CRS-4684: Successful set of parameter disktimeout to 410065408 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 9000099999
CRS-4684: Successful set of parameter disktimeout to 410165407 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 499999999
CRS-4684: Successful set of parameter disktimeout to 499999999 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 999999999
CRS-4684: Successful set of parameter disktimeout to 999999999 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 9999999999
CRS-4684: Successful set of parameter disktimeout to 1410065407 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 99999999999
CRS-4684: Successful set of parameter disktimeout to 1215752191 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 999999999999
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]# crsctl set css disktimeout 99999999999
CRS-4684: Successful set of parameter disktimeout to 1215752191 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 1215759999
CRS-4684: Successful set of parameter disktimeout to 1215759999 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 1219999999
CRS-4684: Successful set of parameter disktimeout to 1219999999 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 9999999999
CRS-4684: Successful set of parameter disktimeout to 1410065407 for Cluster Synchronization Services.
......
[root@oel7n01 trace]# crsctl set css disktimeout 2200099999
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]# crsctl set css disktimeout 2109999999
CRS-4684: Successful set of parameter disktimeout to 2109999999 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 2109999999
[root@oel7n01 trace]# crsctl set css disktimeout 1000
CRS-4684: Successful set of parameter disktimeout to 1000 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 2109999999
CRS-4684: Successful set of parameter disktimeout to 2109999999 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 1000
CRS-4684: Successful set of parameter disktimeout to 1000 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 200
CRS-4684: Successful set of parameter disktimeout to 200 for Cluster Synchronization Services.
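The odd-looking values in the transcript above are consistent with the input being reduced modulo 2^32 and then read as a signed 32-bit integer, with negative results rejected. This is an inference from the outputs shown here, not documented crsctl behavior. If it holds, the largest accepted value would be 2147483647, although the transcript does not test that exact number. The sketch below reproduces the arithmetic; `to_int32` is a hypothetical helper and only handles inputs that fit in the shell's 64-bit arithmetic.

```shell
# Reduce a non-negative decimal input modulo 2^32 and reinterpret the
# result as a signed 32-bit integer (assumed model of the observed behavior).
to_int32() {
    v=$(( $1 % 4294967296 ))
    if [ "$v" -ge 2147483648 ]; then
        v=$(( $v - 4294967296 ))
    fi
    echo "$v"
}

# Inputs taken from the transcript above.
for n in 10000000000000 9000000000 9999999999 99999999999 2109999999 2200099999 20000000000; do
    r=$(to_int32 "$n")
    if [ "$r" -lt 0 ]; then
        echo "$n -> $r (negative, would be rejected)"
    else
        echo "$n -> $r"
    fi
done
# 10000000000000 -> 1316134912
# 9000000000 -> 410065408
# 9999999999 -> 1410065407
# 99999999999 -> 1215752191
# 2109999999 -> 2109999999
# 2200099999 -> -2094867297 (negative, would be rejected)
# 20000000000 -> -1474836480 (negative, would be rejected)
```

The non-negative results match the values crsctl reported as set, and the negative ones line up with the "Negative values are not allowed" messages.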
Last modified: 2022-12-21 15:14:23
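[Copyright notice] This article is original content by a 墨天轮 (modb.pro) user. Any reproduction must state the source (墨天轮), the article link, the author, and other basic information; otherwise the author and 墨天轮 reserve the right to pursue liability. If you find content on 墨天轮 that appears to be plagiarized or infringing, please report it by email to contact@modb.pro with supporting evidence; once verified, 墨天轮 will remove the content immediately.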




