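Problem
- After deployment the cluster was in an abnormal state: the cm_server on each of the three nodes reported itself as Standby and the other two as Down, and the DN (datanode) state was abnormal as well;
- Node 1: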
[omm@xxx cm_server]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
---------------------------------------------------------------------------------
1 xxx xxx 1 /panweidb/database/panweidb/cm/cm_server Standby
2 xxx xxx 2 /panweidb/database/panweidb/cm/cm_server Down
3 xxx xxx 3 /panweidb/database/panweidb/cm/cm_server Down
cm_ctl: can't connect to cm_server.
Maybe cm_server is not running, or timeout expired. Please try again.
[omm@xxx cm_server]$ gs_om -t query
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
------------------------------------------------------------------------
1 xxx xxx 15400 6001 P Pending Need repair(Disconnected)
2 xxx xxx 15400 6002 S Pending Need repair(Disconnected)
3 xxx xxx 15400 6003 S Pending Need repair(Disconnected)
- Node 2:
[omm@xxx ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
---------------------------------------------------------------------------------
1 xxx xxx 1 /panweidb/database/panweidb/cm/cm_server Down
2 xxx xxx 2 /panweidb/database/panweidb/cm/cm_server Standby
3 xxx xxx 3 /panweidb/database/panweidb/cm/cm_server Down
cm_ctl: can't connect to cm_server.
Maybe cm_server is not running, or timeout expired. Please try again.
- Node 3:
[omm@xxx ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
---------------------------------------------------------------------------------
1 xxx xxx 1 /panweidb/database/panweidb/cm/cm_server Down
2 xxx xxx 2 /panweidb/database/panweidb/cm/cm_server Down
3 xxx xxx 3 /panweidb/database/panweidb/cm/cm_server Standby
cm_ctl: can't connect to cm_server.
Maybe cm_server is not running, or timeout expired. Please try again.
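Analysis
- Initially suspected broken ssh or pssh mutual trust, but a check showed it was working normally;
- After stopping the node with cm_ctl stop -n 1 and starting the database with gs_ctl start, the database state was normal;
- Suspected a primary-election failure. While reviewing the DCC logs, found the following error in dcc.dlog: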
UTC+8 2024-09-02 11:39:27.614|DCF|8703|ERROR>[MEC]cs_connect fail,peer_url=xxx:15001, err code 501, err msg Failed to establish tcp connection to [xxx]:[15001], errno 113. [/root/component/dcf/DCF/src/network/mec/mec_func.c:463]
- Checked the firewall status and found that it had not been disabled.
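
The dcc.dlog line above carries the clue that points at the network: the peer address, the port, and the OS errno. On Linux, errno 113 is EHOSTUNREACH ("No route to host"), which can appear when a firewall on the peer rejects the connection. A minimal Python sketch for pulling those fields out of such a line follows. The regular expression is written against the single log line quoted above and is an assumption about the format, not a documented DCF log grammar; the function names are hypothetical.

```python
import errno
import os
import re

# Assumed pattern, derived only from the one dcc.dlog line quoted in this post.
CONNECT_FAIL = re.compile(
    r"cs_connect fail,peer_url=(?P<host>[^:,\s]+):(?P<port>\d+)"
    r".*?errno (?P<errno>\d+)"
)


def parse_connect_failure(line):
    """Return (host, port, errno_code) for a cs_connect failure line, else None."""
    match = CONNECT_FAIL.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port")), int(match.group("errno"))


def describe_errno(code):
    """Return the symbolic name and message for an errno value on this platform."""
    # errno numbering is platform-specific; run this on the same OS as the cluster.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# The log line from this post, with the host masked as in the original.
sample = (
    "UTC+8 2024-09-02 11:39:27.614|DCF|8703|ERROR>[MEC]cs_connect fail,"
    "peer_url=xxx:15001, err code 501, err msg Failed to establish tcp "
    "connection to [xxx]:[15001], errno 113."
)

parsed = parse_connect_failure(sample)
if parsed is not None:
    host, port, code = parsed
    print(f"peer={host} port={port} errno={code} ({describe_errno(code)})")
```

Running this on a cluster node gives the local meaning of the errno, which helps separate "connection refused" (the peer process is not listening) from "no route to host" (the packet was dropped or rejected on the way).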
Solution
- After the firewall was disabled on all three nodes, the cluster returned to normal.
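
To confirm that the nodes can now reach each other, a plain TCP connect to each peer is enough. The sketch below is one way to do that in Python. The function names are hypothetical, and the port to probe is an assumption taken from the log line above (15001); adjust it to whatever your deployment actually uses. A successful connect also requires the peer process to be listening on that port.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connect; return (True, None) on success or (False, error) on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers refused connections, unreachable hosts, timeouts and DNS failures.
        return False, exc


def check_peers(peers, port, timeout=3.0):
    """Probe every peer and return a dict of {host: (ok, error)}."""
    return {host: probe_tcp(host, port, timeout) for host in peers}
```

For example, calling check_peers with the IP addresses of the other two nodes and port 15001 from each node in turn shows which directions are still blocked; the error object in a failed result carries the errno for comparison with the value in dcc.dlog.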
Cause
- The likely cause is that the firewall was not permanently disabled during deployment.




