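ALM-37024 Cluster Balance State Abnormal
Alarm Description
This alarm is generated when the primary/standby relationship of instances in the cluster changes and the resulting relationship differs from the cluster's initial state.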
Alarm Attributes
| Alarm ID | Alarm Severity | Automatically Cleared |
|---|---|---|
| 37024 | Major | Yes |
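Alarm Parameters
| Parameter | Description |
|---|---|
| Source | Name of the cluster for which the alarm is generated |
| ServiceName | Name of the service for which the alarm is generated |
| RoleName | Name of the role for which the alarm is generated |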
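Impact on the System
This alarm indicates that the primary/standby relationship of a GTM or Datanode has changed and no longer matches the relationship established at installation. Too many primary instances may have been switched to a single node. The workload then concentrates on that node, which leaves the cluster load unbalanced and degrades cluster performance.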
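To check whether primary instances have piled up on one node, the Datanode State section of the `gs_om -t status --detail` output can be tallied per host. The sketch below is illustrative only: the function name `count_primary_dn_per_host` is hypothetical, and it assumes every instance entry has the field order seen in the sample output in this document (node ID, host name, IP, instance ID, data path, initial-role flag, current state), with the entries of one row separated by `|`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count current primary DN instances per host.
# Reads the text of "gs_om -t status --detail" on stdin.
count_primary_dn_per_host() {
  awk '
    /^[ \t]*\[ *Datanode State *\]/ { in_dn = 1; next }  # enter the Datanode section
    /^[ \t]*\[/                     { in_dn = 0 }         # any other section header ends it
    in_dn {
      n = split($0, groups, "|")
      for (i = 1; i <= n; i++) {
        m = split(groups[i], f, " ")
        # Assumed fields: node_id host ip instance_id path role_flag state [detail]
        if (m >= 7 && f[7] == "Primary") count[f[2]]++
      }
    }
    END { for (h in count) print h, count[h] }
  ' | sort
}
```

For example, `gs_om -t status --detail | count_primary_dn_per_host` prints one line per host with its current number of primary DNs. A host whose count is higher than the others is carrying the extra load.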
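Possible Causes
The primary/standby relationship of Datanode instances is abnormal:
- The primary Datanode instance has failed and cannot provide services.
- The primary and standby Datanode instances are disconnected.
- The primary and standby Datanode instances were switched over manually.
The primary/standby relationship of GTM instances is abnormal:
- The primary GTM instance has failed and cannot provide services.
- The primary and standby GTM instances are disconnected.
- The primary and standby GTM instances were switched over manually.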
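Whichever cause applies, the affected instances can be spotted by comparing each instance's initial-role flag with its current state in the `gs_om -t status --detail` output. The following is a minimal sketch, not a vendor tool: `list_role_changes` is a hypothetical name, and the field positions (instance ID in field 4, role flag in field 6, current state in field 7) are assumed from the sample output in this document. It treats `P` as "initially primary" and `S` as "initially standby".

```shell
#!/usr/bin/env bash
# Hypothetical helper: list GTM/Datanode instances whose current state
# no longer matches their initial role flag (P or S).
# Reads the text of "gs_om -t status --detail" on stdin.
list_role_changes() {
  awk '
    {
      n = split($0, groups, "|")
      for (i = 1; i <= n; i++) {
        m = split(groups[i], f, " ")
        # Skip headers, separators, and entries without a role flag.
        if (m < 7 || f[4] !~ /^[0-9]+$/) continue
        if (f[6] == "P" && f[7] != "Primary")
          print f[4], f[2], "initial=primary", "current=" f[7]
        if (f[6] == "S" && f[7] != "Standby")
          print f[4], f[2], "initial=standby", "current=" f[7]
      }
    }
  '
}
```

A state of `Down` points to a failed instance, while a `P` instance in `Standby` state or an `S` instance in `Primary` state points to a switchover. The output alone does not tell whether the switchover was automatic or manual.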
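Procedure
Check the cause of the alarm.
- On FusionInsight Manager, choose "O&M > Alarm > Alarms". In the alarm list, click the icon in the row containing this alarm, and obtain the cluster name, node host name, and instance name from "Location".
- Choose "Cluster > Name of the cluster for which the alarm is generated > Services > MPPDB > Instances" to obtain the nodes where the MPPDB service is installed.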
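- Log in as user omm to any node where the MPPDB service is installed, source the environment variables, and run gs_om -t status --detail to check the cluster status. The following commands assume that the cluster installation directory is "/opt/huawei/Bigdata".
    source /opt/huawei/Bigdata/mppdb/.mppdbgs_profile
    gs_om -t status --detail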
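- If the output resembles the following, with cluster_state Normal and balanced No, a primary/standby switchover has occurred. In the Datanode State section, instance 6011 carries the flag P, meaning that it was initially a primary DN, but its current state is Standby Normal. Repair it by following "Resetting Instance Status" in the product documentation.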
[ CMServer State ] node node_ip instance state ------------------------------------------------------------------------------------------- 1 SZX1000071373 10.90.57.221 1 /opt/huawei/Bigdata/mppdb/cm/cm_server Primary 2 SZX1000071374 10.90.57.222 2 /opt/huawei/Bigdata/mppdb/cm/cm_server Standby [ Cluster State ] cluster_state : Normal redistributing : No balanced : No [ Coordinator State ] node node_ip instance state ------------------------------------------------------------------------------------------ 1 SZX1000071373 10.90.57.221 5001 /srv/BigData/mppdb/data1/coordinator Normal 2 SZX1000071374 10.90.57.222 5002 /srv/BigData/mppdb/data1/coordinator Normal 3 SZX1000071375 10.90.57.223 5003 /srv/BigData/mppdb/data1/coordinator Normal [ Central Coordinator State ] node node_ip instance state -------------------------------------------------------------------------------- 2 SZX1000071374 10.90.57.222 5002 /srv/BigData/mppdb/data1/coordinator Normal [ GTM State ] node node_ip instance state sync_state ------------------------------------------------------------------------------------------------------------ 2 SZX1000071374 10.90.57.222 1001 /opt/huawei/Bigdata/mppdb/gtm P Primary Connection ok Sync 1 SZX1000071373 10.90.57.221 1002 /opt/huawei/Bigdata/mppdb/gtm S Standby Connection ok Sync [ Datanode State ] node node_ip instance state | node node_ip instance state | node node_ip instance state ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 1 SZX1000071373 10.90.57.221 6001 /srv/BigData/mppdb/data1/master1 P Primary Normal | 2 SZX1000071374 10.90.57.222 6002 /srv/BigData/mppdb/data1/slave1 S Standby Normal | 3 SZX1000071375 10.90.57.223 3002 /srv/BigData/mppdb/data1/dummyslave1 R 
Secondary Normal 1 SZX1000071373 10.90.57.221 6003 /srv/BigData/mppdb/data2/master2 P Primary Normal | 3 SZX1000071375 10.90.57.223 6004 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 2 SZX1000071374 10.90.57.222 3003 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal 1 SZX1000071373 10.90.57.221 6005 /srv/BigData/mppdb/data3/master3 P Primary Normal | 2 SZX1000071374 10.90.57.222 6006 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 3 SZX1000071375 10.90.57.223 3004 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal 1 SZX1000071373 10.90.57.221 6007 /srv/BigData/mppdb/data4/master4 P Primary Normal | 3 SZX1000071375 10.90.57.223 6008 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 2 SZX1000071374 10.90.57.222 3005 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal 2 SZX1000071374 10.90.57.222 6009 /srv/BigData/mppdb/data1/master1 P Primary Normal | 3 SZX1000071375 10.90.57.223 6010 /srv/BigData/mppdb/data1/slave1 S Standby Normal | 1 SZX1000071373 10.90.57.221 3006 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal 2 SZX1000071374 10.90.57.222 6011 /srv/BigData/mppdb/data2/master2 P Standby Normal | 1 SZX1000071373 10.90.57.221 6012 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 3 SZX1000071375 10.90.57.223 3007 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal 2 SZX1000071374 10.90.57.222 6013 /srv/BigData/mppdb/data3/master3 P Primary Normal | 3 SZX1000071375 10.90.57.223 6014 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 1 SZX1000071373 10.90.57.221 3008 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal 2 SZX1000071374 10.90.57.222 6015 /srv/BigData/mppdb/data4/master4 P Primary Normal | 1 SZX1000071373 10.90.57.221 6016 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 3 SZX1000071375 10.90.57.223 3009 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal 3 SZX1000071375 10.90.57.223 6017 /srv/BigData/mppdb/data1/master1 P Primary Normal | 1 SZX1000071373 10.90.57.221 6018 /srv/BigData/mppdb/data1/slave1 S 
Standby Normal | 2 SZX1000071374 10.90.57.222 3010 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal 3 SZX1000071375 10.90.57.223 6019 /srv/BigData/mppdb/data2/master2 P Primary Normal | 2 SZX1000071374 10.90.57.222 6020 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 1 SZX1000071373 10.90.57.221 3011 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal 3 SZX1000071375 10.90.57.223 6021 /srv/BigData/mppdb/data3/master3 P Primary Normal | 1 SZX1000071373 10.90.57.221 6022 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 2 SZX1000071374 10.90.57.222 3012 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal 3 SZX1000071375 10.90.57.223 6023 /srv/BigData/mppdb/data4/master4 P Primary Normal | 2 SZX1000071374 10.90.57.222 6024 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 1 SZX1000071373 10.90.57.221 3013 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal- 如果集群状态如下所示,cluster_state为Degraded,执行6。
[ CMServer State ] node node_ip instance state ------------------------------------------------------------------------------------------- 1 SZX1000071373 10.90.57.221 1 /opt/huawei/Bigdata/mppdb/cm/cm_server Primary 2 SZX1000071374 10.90.57.222 2 /opt/huawei/Bigdata/mppdb/cm/cm_server Standby [ Cluster State ] cluster_state : Degraded redistributing : No balanced : No [ Coordinator State ] node node_ip instance state ------------------------------------------------------------------------------------------ 1 SZX1000071373 10.90.57.221 5001 /srv/BigData/mppdb/data1/coordinator Normal 2 SZX1000071374 10.90.57.222 5002 /srv/BigData/mppdb/data1/coordinator Normal 3 SZX1000071375 10.90.57.223 5003 /srv/BigData/mppdb/data1/coordinator Normal [ Central Coordinator State ] node node_ip instance state -------------------------------------------------------------------------------- 2 SZX1000071374 10.90.57.222 5002 /srv/BigData/mppdb/data1/coordinator Normal [ GTM State ] node node_ip instance state sync_state ------------------------------------------------------------------------------------------------------------ 2 SZX1000071374 10.90.57.222 1001 /opt/huawei/Bigdata/mppdb/gtm P Primary Connection ok Sync 1 SZX1000071373 10.90.57.221 1002 /opt/huawei/Bigdata/mppdb/gtm S Standby Connection ok Sync [ Datanode State ] node node_ip instance state | node node_ip instance state | node node_ip instance state ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 1 SZX1000071373 10.90.57.221 6001 /srv/BigData/mppdb/data1/master1 P Primary Normal | 2 SZX1000071374 10.90.57.222 6002 /srv/BigData/mppdb/data1/slave1 S Standby Normal | 3 SZX1000071375 10.90.57.223 3002 /srv/BigData/mppdb/data1/dummyslave1 R 
Secondary Normal 1 SZX1000071373 10.90.57.221 6003 /srv/BigData/mppdb/data2/master2 P Primary Normal | 3 SZX1000071375 10.90.57.223 6004 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 2 SZX1000071374 10.90.57.222 3003 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal 1 SZX1000071373 10.90.57.221 6005 /srv/BigData/mppdb/data3/master3 P Primary Normal | 2 SZX1000071374 10.90.57.222 6006 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 3 SZX1000071375 10.90.57.223 3004 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal 1 SZX1000071373 10.90.57.221 6007 /srv/BigData/mppdb/data4/master4 P Primary Normal | 3 SZX1000071375 10.90.57.223 6008 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 2 SZX1000071374 10.90.57.222 3005 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal 2 SZX1000071374 10.90.57.222 6009 /srv/BigData/mppdb/data1/master1 P Down Disk damaged | 3 SZX1000071375 10.90.57.223 6010 /srv/BigData/mppdb/data1/slave1 S Primary Normal | 1 SZX1000071373 10.90.57.221 3006 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal 2 SZX1000071374 10.90.57.222 6011 /srv/BigData/mppdb/data2/master2 P Primary Normal | 1 SZX1000071373 10.90.57.221 6012 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 3 SZX1000071375 10.90.57.223 3007 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal 2 SZX1000071374 10.90.57.222 6013 /srv/BigData/mppdb/data3/master3 P Primary Normal | 3 SZX1000071375 10.90.57.223 6014 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 1 SZX1000071373 10.90.57.221 3008 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal 2 SZX1000071374 10.90.57.222 6015 /srv/BigData/mppdb/data4/master4 P Primary Normal | 1 SZX1000071373 10.90.57.221 6016 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 3 SZX1000071375 10.90.57.223 3009 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal 3 SZX1000071375 10.90.57.223 6017 /srv/BigData/mppdb/data1/master1 P Primary Normal | 1 SZX1000071373 10.90.57.221 6018 /srv/BigData/mppdb/data1/slave1 
S Standby Normal | 2 SZX1000071374 10.90.57.222 3010 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal 3 SZX1000071375 10.90.57.223 6019 /srv/BigData/mppdb/data2/master2 P Primary Normal | 2 SZX1000071374 10.90.57.222 6020 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 1 SZX1000071373 10.90.57.221 3011 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal 3 SZX1000071375 10.90.57.223 6021 /srv/BigData/mppdb/data3/master3 P Primary Normal | 1 SZX1000071373 10.90.57.221 6022 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 2 SZX1000071374 10.90.57.222 3012 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal 3 SZX1000071375 10.90.57.223 6023 /srv/BigData/mppdb/data4/master4 P Primary Normal | 2 SZX1000071374 10.90.57.222 6024 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 1 SZX1000071373 10.90.57.221 3013 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal- 如上所示加粗斜体部分,dn_6009状态是Down,备dn_6010升主,导致节点SZX1000071374上dn主实例增多,首先使用gs_replace修复损坏的dn_6009。
Note:
This procedure uses an abnormal Datanode switchover as an example. An abnormal GTM switchover is handled in the same way.
    omm@SZX1000071374:/srv/BigData/mppdb/data2> gs_replace -t config -h SZX1000071374
    Fixing all the CMAgents instances.
    There are [0] CMAgents need to be repaired in cluster.
    Configuring replacement instances.
    Successfully configured replacement instances.
    Successfully fixed all the CMAgents instances.
    Configuring
    Waiting for promote peer instances.
    .
    Successfully upgraded standby instances.
    Deleting failed CN from pgxc_node.
    No CN needs to be fixed.
    Configuring replacement instances.
    Successfully configured replacement instances.
    Setting the SCTP.
    Successfully set the SCTP.
    Configuration succeeded.
- Run the following command to start the instance on the host where it is being replaced.
    omm@SZX1000071374:/srv/BigData/mppdb/data2> gs_replace -t start -h SZX1000071374
    Starting.
    ======================================================================
    Successfully started instance process. Waiting to become Normal.
    ======================================================================
    .
    ======================================================================
    Start succeeded on all nodes.
    Start succeeded.
- Then reset the instance status.
    omm@SZX1000071374:/srv/BigData/mppdb/data2> gs_om -t switch --reset
    Operating: Switch reset.
    cm_ctl: cmserver is rebalancing the cluster automatically.
    .....
    cm_ctl: switchover successfully.
    Operation succeeded: Switch reset.
- Wait a while and check whether the alarm persists.
  - If yes, go to step 10 (collecting fault information).
  - If no, no further action is required.
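The branching in the steps above depends on two fields of the Cluster State section, cluster_state and balanced. The sketch below shows one way to read them from the status output and print the matching next action. The function names `cluster_field` and `suggest_action` are hypothetical, the sketch assumes the `key : value` layout shown in the sample output, and it only prints advice: it runs neither gs_replace nor gs_om.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of a "key : value" line,
# for example cluster_state or balanced. Reads status text on stdin.
cluster_field() {
  awk -v key="$1" '$1 == key && $2 == ":" { print $3; exit }'
}

# Hypothetical helper: map cluster_state/balanced to the next step
# described in this procedure. Reads status text on stdin.
suggest_action() {
  local status state balanced
  status=$(cat)
  state=$(printf '%s\n' "$status" | cluster_field cluster_state)
  balanced=$(printf '%s\n' "$status" | cluster_field balanced)
  case "$state/$balanced" in
    Normal/Yes) echo "ok: cluster is balanced" ;;
    Normal/No)  echo "reset: run gs_om -t switch --reset" ;;
    Degraded/*) echo "repair: run gs_replace for the failed instance, then gs_om -t switch --reset" ;;
    *)          echo "unknown: cluster_state=$state balanced=$balanced" ;;
  esac
}
```

A typical call would be `gs_om -t status --detail | suggest_action`. Any cluster_state other than Normal or Degraded is reported as unknown rather than guessed at, because this procedure does not cover other states.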
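Collect fault information.
- On FusionInsight Manager, choose "O&M > Log > Download".
- In the "Service" list, select "MPPDB".
- Click the icon in the upper-right corner, set the "Start Time" and "End Time" of log collection to 1 hour before and 1 hour after the time the alarm was generated, respectively, and click "Download".
- Contact technical support and send the collected fault logs.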
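The collection window in the steps above is the alarm time plus and minus 1 hour. As a small worked example, the sketch below computes both bounds from an alarm timestamp. `log_window` is a hypothetical name, the timestamp format is an assumption, and the function relies on `date -d`, which is available in GNU date but not in every date implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the log collection window for an alarm time.
# Usage: log_window "YYYY-MM-DD HH:MM:SS"   (requires date -d support)
log_window() {
  local alarm_epoch fmt='+%Y-%m-%d %H:%M:%S'
  # Convert to epoch seconds first, so the 1-hour offset is plain arithmetic.
  alarm_epoch=$(date -d "$1" +%s) || return 1
  echo "start: $(date -d "@$((alarm_epoch - 3600))" "$fmt")"
  echo "end:   $(date -d "@$((alarm_epoch + 3600))" "$fmt")"
}
```

Going through epoch seconds avoids relative expressions such as "-1 hour" appended to a timestamp, which some date parsers read as a time zone offset.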
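Alarm Clearance
After the fault is rectified, the system clears this alarm automatically. No manual clearing is required.
Related Information
None.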