
Huawei GaussDB A: ALM-37024 Abnormal Cluster Balance Status

Modb (墨天轮), 2019-10-12

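ALM-37024 Abnormal Cluster Balance Status

Alarm Description

This alarm is generated when the primary/standby relationship of instances in the cluster changes and the resulting roles no longer match the cluster's initial state.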

Alarm Attributes

    Alarm ID    Alarm Severity    Automatically Cleared
    37024       Major             Yes

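Alarm Parameters

    Parameter      Meaning
    Source         Name of the cluster that generated the alarm
    ServiceName    Name of the service that generated the alarm
    RoleName       Name of the role that generated the alarm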

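Impact on the System

This alarm indicates that the primary/standby relationship of a GTM or Datanode has changed and no longer matches the layout created at installation. Too many primary instances may have been switched to a single node. The cluster load then concentrates on that node, the cluster becomes unbalanced, and performance degrades.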
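
The imbalance described above can be illustrated with a small sketch. The layout, node names, and instance names below are hypothetical and are not taken from a real cluster; the function only counts where primaries run before and after a failover.

```python
from collections import Counter

# Hypothetical layout: instance pair -> (node of the installed primary,
# node of the installed standby). All names are illustrative only.
LAYOUT = {
    "dn_1": ("node-a", "node-b"),
    "dn_2": ("node-b", "node-c"),
    "dn_3": ("node-c", "node-a"),
}


def primaries_per_node(layout, failed_over=()):
    """Count primaries per node; pairs in `failed_over` run on their standby node."""
    counts = Counter({node: 0 for pair in layout.values() for node in pair})
    for name, (primary_node, standby_node) in layout.items():
        counts[standby_node if name in failed_over else primary_node] += 1
    return dict(counts)


print(primaries_per_node(LAYOUT))            # installed layout: one primary per node
print(primaries_per_node(LAYOUT, {"dn_1"}))  # after dn_1 fails over to node-b
```

After the failover, node-b carries two primaries while node-a carries none, which is the kind of skew this alarm reports.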

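Possible Causes

The primary/standby relationship of a Datanode instance is abnormal:

  • The primary Datanode instance has failed and cannot provide services.
  • The primary and standby Datanode instances are disconnected.
  • The primary and standby Datanode instances were switched over manually.

The primary/standby relationship of a GTM instance is abnormal:

  • The primary GTM instance has failed and cannot provide services.
  • The primary and standby GTM instances are disconnected.
  • The primary and standby GTM instances were switched over manually.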

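Procedure

Check the cause of the alarm.

  • On FusionInsight Manager, choose "O&M > Alarm > Alarms" and click the icon in the row containing this alarm. From "Location", obtain the cluster name, node host name, and instance name for which the alarm was generated.
  • Choose "Cluster > name of the cluster that generated the alarm > Services > MPPDB > Instance" to find the nodes where the MPPDB service is installed.
  • Log in as user omm to any node where the MPPDB service is installed, source the environment variables, and run gs_om -t status --detail to check the cluster status. (This example assumes the cluster installation directory is "/opt/huawei/Bigdata".)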

    source /opt/huawei/Bigdata/mppdb/.mppdbgs_profile

    gs_om -t status --detail

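  • If the cluster status is as follows, with cluster_state Normal and balanced No, a primary/standby switchover has occurred. In the Datanode State section of the output below, instance 6011 illustrates this: P means it was installed as the primary DN, but it has been switched to standby and its state is now Standby Normal. To repair this, see "Resetting Instance Status" in the product documentation.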

    [ CMServer State ]

    node node_ip instance state
    ----------------------------------------------------------------------------
    1 SZX1000071373 10.90.57.221 1 /opt/huawei/Bigdata/mppdb/cm/cm_server Primary
    2 SZX1000071374 10.90.57.222 2 /opt/huawei/Bigdata/mppdb/cm/cm_server Standby

    [ Cluster State ]

    cluster_state  : Normal
    redistributing : No
    balanced       : No

    [ Coordinator State ]

    node node_ip instance state
    ----------------------------------------------------------------------------
    1 SZX1000071373 10.90.57.221 5001 /srv/BigData/mppdb/data1/coordinator Normal
    2 SZX1000071374 10.90.57.222 5002 /srv/BigData/mppdb/data1/coordinator Normal
    3 SZX1000071375 10.90.57.223 5003 /srv/BigData/mppdb/data1/coordinator Normal

    [ Central Coordinator State ]

    node node_ip instance state
    ----------------------------------------------------------------------------
    2 SZX1000071374 10.90.57.222 5002 /srv/BigData/mppdb/data1/coordinator Normal

    [ GTM State ]

    node node_ip instance state sync_state
    ----------------------------------------------------------------------------
    2 SZX1000071374 10.90.57.222 1001 /opt/huawei/Bigdata/mppdb/gtm P Primary Connection ok Sync
    1 SZX1000071373 10.90.57.221 1002 /opt/huawei/Bigdata/mppdb/gtm S Standby Connection ok Sync

    [ Datanode State ]

    node node_ip instance state | node node_ip instance state | node node_ip instance state
    ----------------------------------------------------------------------------
    1 SZX1000071373 10.90.57.221 6001 /srv/BigData/mppdb/data1/master1 P Primary Normal | 2 SZX1000071374 10.90.57.222 6002 /srv/BigData/mppdb/data1/slave1 S Standby Normal | 3 SZX1000071375 10.90.57.223 3002 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal
    1 SZX1000071373 10.90.57.221 6003 /srv/BigData/mppdb/data2/master2 P Primary Normal | 3 SZX1000071375 10.90.57.223 6004 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 2 SZX1000071374 10.90.57.222 3003 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal
    1 SZX1000071373 10.90.57.221 6005 /srv/BigData/mppdb/data3/master3 P Primary Normal | 2 SZX1000071374 10.90.57.222 6006 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 3 SZX1000071375 10.90.57.223 3004 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal
    1 SZX1000071373 10.90.57.221 6007 /srv/BigData/mppdb/data4/master4 P Primary Normal | 3 SZX1000071375 10.90.57.223 6008 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 2 SZX1000071374 10.90.57.222 3005 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal
    2 SZX1000071374 10.90.57.222 6009 /srv/BigData/mppdb/data1/master1 P Primary Normal | 3 SZX1000071375 10.90.57.223 6010 /srv/BigData/mppdb/data1/slave1 S Standby Normal | 1 SZX1000071373 10.90.57.221 3006 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal
    2 SZX1000071374 10.90.57.222 6011 /srv/BigData/mppdb/data2/master2 P Standby Normal | 1 SZX1000071373 10.90.57.221 6012 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 3 SZX1000071375 10.90.57.223 3007 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal
    2 SZX1000071374 10.90.57.222 6013 /srv/BigData/mppdb/data3/master3 P Primary Normal | 3 SZX1000071375 10.90.57.223 6014 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 1 SZX1000071373 10.90.57.221 3008 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal
    2 SZX1000071374 10.90.57.222 6015 /srv/BigData/mppdb/data4/master4 P Primary Normal | 1 SZX1000071373 10.90.57.221 6016 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 3 SZX1000071375 10.90.57.223 3009 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal
    3 SZX1000071375 10.90.57.223 6017 /srv/BigData/mppdb/data1/master1 P Primary Normal | 1 SZX1000071373 10.90.57.221 6018 /srv/BigData/mppdb/data1/slave1 S Standby Normal | 2 SZX1000071374 10.90.57.222 3010 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal
    3 SZX1000071375 10.90.57.223 6019 /srv/BigData/mppdb/data2/master2 P Primary Normal | 2 SZX1000071374 10.90.57.222 6020 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 1 SZX1000071373 10.90.57.221 3011 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal
    3 SZX1000071375 10.90.57.223 6021 /srv/BigData/mppdb/data3/master3 P Primary Normal | 1 SZX1000071373 10.90.57.221 6022 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 2 SZX1000071374 10.90.57.222 3012 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal
    3 SZX1000071375 10.90.57.223 6023 /srv/BigData/mppdb/data4/master4 P Primary Normal | 2 SZX1000071374 10.90.57.222 6024 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 1 SZX1000071373 10.90.57.221 3013 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal

  • If the cluster status is as follows, with cluster_state Degraded, continue with the gs_replace repair steps that follow this output.

    [ CMServer State ]

    node node_ip instance state
    ----------------------------------------------------------------------------
    1 SZX1000071373 10.90.57.221 1 /opt/huawei/Bigdata/mppdb/cm/cm_server Primary
    2 SZX1000071374 10.90.57.222 2 /opt/huawei/Bigdata/mppdb/cm/cm_server Standby

    [ Cluster State ]

    cluster_state  : Degraded
    redistributing : No
    balanced       : No

    [ Coordinator State ]

    node node_ip instance state
    ----------------------------------------------------------------------------
    1 SZX1000071373 10.90.57.221 5001 /srv/BigData/mppdb/data1/coordinator Normal
    2 SZX1000071374 10.90.57.222 5002 /srv/BigData/mppdb/data1/coordinator Normal
    3 SZX1000071375 10.90.57.223 5003 /srv/BigData/mppdb/data1/coordinator Normal

    [ Central Coordinator State ]

    node node_ip instance state
    ----------------------------------------------------------------------------
    2 SZX1000071374 10.90.57.222 5002 /srv/BigData/mppdb/data1/coordinator Normal

    [ GTM State ]

    node node_ip instance state sync_state
    ----------------------------------------------------------------------------
    2 SZX1000071374 10.90.57.222 1001 /opt/huawei/Bigdata/mppdb/gtm P Primary Connection ok Sync
    1 SZX1000071373 10.90.57.221 1002 /opt/huawei/Bigdata/mppdb/gtm S Standby Connection ok Sync

    [ Datanode State ]

    node node_ip instance state | node node_ip instance state | node node_ip instance state
    ----------------------------------------------------------------------------
    1 SZX1000071373 10.90.57.221 6001 /srv/BigData/mppdb/data1/master1 P Primary Normal | 2 SZX1000071374 10.90.57.222 6002 /srv/BigData/mppdb/data1/slave1 S Standby Normal | 3 SZX1000071375 10.90.57.223 3002 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal
    1 SZX1000071373 10.90.57.221 6003 /srv/BigData/mppdb/data2/master2 P Primary Normal | 3 SZX1000071375 10.90.57.223 6004 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 2 SZX1000071374 10.90.57.222 3003 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal
    1 SZX1000071373 10.90.57.221 6005 /srv/BigData/mppdb/data3/master3 P Primary Normal | 2 SZX1000071374 10.90.57.222 6006 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 3 SZX1000071375 10.90.57.223 3004 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal
    1 SZX1000071373 10.90.57.221 6007 /srv/BigData/mppdb/data4/master4 P Primary Normal | 3 SZX1000071375 10.90.57.223 6008 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 2 SZX1000071374 10.90.57.222 3005 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal
    2 SZX1000071374 10.90.57.222 6009 /srv/BigData/mppdb/data1/master1 P Down Disk damaged | 3 SZX1000071375 10.90.57.223 6010 /srv/BigData/mppdb/data1/slave1 S Primary Normal | 1 SZX1000071373 10.90.57.221 3006 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal
    2 SZX1000071374 10.90.57.222 6011 /srv/BigData/mppdb/data2/master2 P Primary Normal | 1 SZX1000071373 10.90.57.221 6012 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 3 SZX1000071375 10.90.57.223 3007 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal
    2 SZX1000071374 10.90.57.222 6013 /srv/BigData/mppdb/data3/master3 P Primary Normal | 3 SZX1000071375 10.90.57.223 6014 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 1 SZX1000071373 10.90.57.221 3008 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal
    2 SZX1000071374 10.90.57.222 6015 /srv/BigData/mppdb/data4/master4 P Primary Normal | 1 SZX1000071373 10.90.57.221 6016 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 3 SZX1000071375 10.90.57.223 3009 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal
    3 SZX1000071375 10.90.57.223 6017 /srv/BigData/mppdb/data1/master1 P Primary Normal | 1 SZX1000071373 10.90.57.221 6018 /srv/BigData/mppdb/data1/slave1 S Standby Normal | 2 SZX1000071374 10.90.57.222 3010 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal
    3 SZX1000071375 10.90.57.223 6019 /srv/BigData/mppdb/data2/master2 P Primary Normal | 2 SZX1000071374 10.90.57.222 6020 /srv/BigData/mppdb/data2/slave2 S Standby Normal | 1 SZX1000071373 10.90.57.221 3011 /srv/BigData/mppdb/data2/dummyslave2 R Secondary Normal
    3 SZX1000071375 10.90.57.223 6021 /srv/BigData/mppdb/data3/master3 P Primary Normal | 1 SZX1000071373 10.90.57.221 6022 /srv/BigData/mppdb/data3/slave3 S Standby Normal | 2 SZX1000071374 10.90.57.222 3012 /srv/BigData/mppdb/data3/dummyslave3 R Secondary Normal
    3 SZX1000071375 10.90.57.223 6023 /srv/BigData/mppdb/data4/master4 P Primary Normal | 2 SZX1000071374 10.90.57.222 6024 /srv/BigData/mppdb/data4/slave4 S Standby Normal | 1 SZX1000071373 10.90.57.221 3013 /srv/BigData/mppdb/data4/dummyslave4 R Secondary Normal

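  • In the output above, dn_6009 is Down and its standby dn_6010 has been promoted to primary, which increases the number of primary DN instances on the node hosting dn_6010 (SZX1000071375 in the output above). First use gs_replace to repair the damaged dn_6009.

    Note:

    This example uses an abnormal Datanode switchover. An abnormal GTM switchover is handled the same way.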

    omm@SZX1000071374:/srv/BigData/mppdb/data2> gs_replace -t config -h SZX1000071374
    Fixing all the CMAgents instances.
    There are [0] CMAgents need to be repaired in cluster.
    Configuring replacement instances.
    Successfully configured replacement instances.
    Successfully fixed all the CMAgents instances.
    Configuring
    Waiting for promote peer instances.
    .
    Successfully upgraded standby instances.
    Deleting failed CN from pgxc_node.
    No CN needs to be fixed.
    Configuring replacement instances.
    Successfully configured replacement instances.
    Setting the SCTP.
    Successfully set the SCTP.
    Configuration succeeded.

  • Run the following command to start the instances on the host whose instance is being replaced.

    omm@SZX1000071374:/srv/BigData/mppdb/data2> gs_replace -t start -h SZX1000071374
    Starting.
    ======================================================================
    Successfully started instance process. Waiting to become Normal.
    ======================================================================
    .
    ======================================================================
    Start succeeded on all nodes.
    Start succeeded.

  • Then reset the instance status.

    omm@SZX1000071374:/srv/BigData/mppdb/data2> gs_om -t switch --reset
    Operating: Switch reset.
    cm_ctl: cmserver is rebalancing the cluster automatically.
    .....
    cm_ctl: switchover successfully.
    Operation succeeded: Switch reset.

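  • Wait for a while, then check whether the alarm persists.

    • If yes, go to the fault information collection steps below.
    • If no, no further action is required.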
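
The decision logic in the steps above can be sketched as follows. This is a minimal illustration, not an official tool: the regular expression is written against the sample output shown in this article and may not match the output of other versions, and the helper names (parse_cluster_state, switched_instances, plan_commands) are made up for this example. The function only builds the command lines used in the article; it does not run them. Treating the host of a Down instance as the gs_replace target is an assumption based on the dn_6009 example.

```python
import re

# One GTM/Datanode entry: node id, host, IP, instance id, data path,
# installed role (P/S/R), current role. Based on this article's sample only.
ENTRY = re.compile(
    r"(\d+)\s+(\S+)\s+(\d+(?:\.\d+){3})\s+(\d+)\s+(/\S+)\s+([PSR])\s+"
    r"(Primary|Standby|Secondary|Down)\b"
)
EXPECTED = {"P": "Primary", "S": "Standby", "R": "Secondary"}

# Shortened excerpt of the Degraded example from this article.
SAMPLE = (
    "[ Cluster State ] cluster_state : Degraded redistributing : No balanced : No "
    "[ Datanode State ] "
    "2 SZX1000071374 10.90.57.222 6009 /srv/BigData/mppdb/data1/master1 P Down Disk damaged | "
    "3 SZX1000071375 10.90.57.223 6010 /srv/BigData/mppdb/data1/slave1 S Primary Normal | "
    "1 SZX1000071373 10.90.57.221 3006 /srv/BigData/mppdb/data1/dummyslave1 R Secondary Normal"
)


def parse_cluster_state(text):
    """Read the three fields shown under [ Cluster State ]."""
    fields = {}
    for key in ("cluster_state", "redistributing", "balanced"):
        match = re.search(rf"\b{key}\s*:\s*(\S+)", text)
        if match:
            fields[key] = match.group(1)
    return fields


def switched_instances(text):
    """List (host, instance id, installed role, current role) for mismatches."""
    found = []
    for _node, host, _ip, inst, _path, init, role in ENTRY.findall(text):
        if EXPECTED[init] != role:
            found.append((host, int(inst), init, role))
    return found


def plan_commands(text):
    """Build, but do not run, the commands for the two cases in the article."""
    fields = parse_cluster_state(text)
    state = fields.get("cluster_state")
    if fields.get("balanced") != "No" or state not in ("Normal", "Degraded"):
        return []  # outside the two cases this article covers
    commands = []
    if state == "Degraded":
        # Assumption: repair the host of every instance reported as Down.
        down_hosts = sorted(
            {host for host, _inst, _init, role in switched_instances(text)
             if role == "Down"}
        )
        for host in down_hosts:
            commands.append(["gs_replace", "-t", "config", "-h", host])
            commands.append(["gs_replace", "-t", "start", "-h", host])
    commands.append(["gs_om", "-t", "switch", "--reset"])
    return commands


for command in plan_commands(SAMPLE):
    print(" ".join(command))
```

For the Degraded excerpt this yields the two gs_replace commands for SZX1000071374 followed by the reset. For a Normal but unbalanced cluster it yields only the reset.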

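Collect fault information.

  • On FusionInsight Manager, choose "O&M > Log > Download".
  • Select "MPPDB" in the "Service" list.
  • Click the icon in the upper right corner, set the log collection "Start Date" and "End Date" to 1 hour before and 1 hour after the time the alarm was generated, and then click "Download".
  • Contact technical support and send the collected fault logs.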
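
The log collection window described above is simple date arithmetic. As a minimal sketch, with a hypothetical alarm time chosen only for illustration:

```python
from datetime import datetime, timedelta


def log_window(alarm_time, margin=timedelta(hours=1)):
    """Return (start, end) covering `margin` before and after the alarm time."""
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm time; use the generation time shown for the alarm in Manager.
start, end = log_window(datetime(2019, 10, 12, 9, 30))
print(start.isoformat(sep=" "), "->", end.isoformat(sep=" "))
```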

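Alarm Clearing

After the fault is rectified, the system clears this alarm automatically. No manual clearing is required.

Related Information

None.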
