Requirement: a PanWeiDB cluster with one primary and two standby nodes. The current primary has a RAID controller configuration fault and its operating system must be reinstalled. Because the affected node is the current primary, it must first be switched over to standby and then removed from the cluster.
1. View the current cluster information
postgres=# select pw_version();
pw_version
-----------------------------------------------------------------------------
(PanWeiDB_V2.0-S3.0.1_B01) compiled at 2024-09-29 19:47:53 commit d086caf +
product name:PanWeiDB +
version:V2.0-S3.0.1_B01 +
commit:d086caf +
openGauss version:5.0.0 +
host:x86_64-pc-linux-gnu
(1 row)
[omm@pwtest1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
----------------------------------------------------------------------
1 pwtest1 192.168.61.57 1 /panwei/database/cm/cm_server Primary
2 pwtest2 192.168.61.71 2 /panwei/database/cm/cm_server Standby
3 pwtest3 192.168.53.74 3 /panwei/database/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------
1 pwtest1 192.168.61.57 6001 /panwei/database/data P Primary Normal
2 pwtest2 192.168.61.71 6002 /panwei/database/data S Standby Normal
3 pwtest3 192.168.53.74 6003 /panwei/database/data S Standby Normal
2. Primary/standby switchover
Switch the primary role over to node pwtest2:
[omm@pwtest1 ~]$ cm_ctl switchover -n 2 -D /panwei/database/data
.......................
cm_ctl: switchover successfully.
[omm@pwtest1 ~]$
Confirm the cluster state:
[omm@pwtest1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
----------------------------------------------------------------------
1 pwtest1 192.168.61.57 1 /panwei/database/cm/cm_server Primary
2 pwtest2 192.168.61.71 2 /panwei/database/cm/cm_server Standby
3 pwtest3 192.168.53.74 3 /panwei/database/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------
1 pwtest1 192.168.61.57 6001 /panwei/database/data P Standby Normal
2 pwtest2 192.168.61.71 6002 /panwei/database/data S Primary Normal
3 pwtest3 192.168.53.74 6003 /panwei/database/data S Standby Normal
The primary has been switched to node 2 and the cluster state is Normal (balanced is now No because the primary no longer sits on its original node).
3. cm_server primary/standby switchover
To make sure the removal of the target node is not disturbed, switch the cm_server primary off it as well. This operation does not affect the database cluster and completes quickly.
[omm@pwtest2 script]$ cm_ctl set --cmsPromoteMode=PRIMARY_F -I 2
cm_ctl: set CMS promote mode(1), nodeid: 2, instanceid 2.
cm_ctl: set CMS promote mode successfully.
[omm@pwtest2 script]$ cm_ctl set --cmsPromoteMode=AUTO -I 2
cm_ctl: set CMS promote mode(0), nodeid: 2, instanceid 2.
cm_ctl: set CMS promote mode successfully.
[omm@pwtest2 script]$
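Setting cmsPromoteMode to PRIMARY_F forces the cm_server instance on node 2 to promote to primary; resetting it to AUTO immediately afterwards restores normal arbitration so the forced mode does not persist. As a quick sanity check, the cm_server roles can also be queried directly (a sketch, assuming the standard openGauss cm_ctl query options):
# show cluster roles with node names and instance details
cm_ctl query -Cv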
# Confirm the cluster information:
[omm@pwtest1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
----------------------------------------------------------------------
1 pwtest1 192.168.61.57 1 /panwei/database/cm/cm_server Standby
2 pwtest2 192.168.61.71 2 /panwei/database/cm/cm_server Primary
3 pwtest3 192.168.53.74 3 /panwei/database/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------
1 pwtest1 192.168.61.57 6001 /panwei/database/data P Standby Normal
2 pwtest2 192.168.61.71 6002 /panwei/database/data S Primary Normal
3 pwtest3 192.168.53.74 6003 /panwei/database/data S Standby Normal
4. Node removal
Note: node removal must be performed on the current primary node.
Switch to the current primary:
[omm@pwtest1 ~]$ ssh pwtest2
You have logged onto a secured server..All accesses logged
Authorized users only. All activity may be monitored and reported
Last login: Wed Apr 16 14:36:03 2025 from 192.168.53.68
[omm@pwtest2 ~]$
Before scaling in, the following parameter settings must be in place:
logging_collector=on
uppercase_attribute_name=off
postgres=# show logging_collector;
logging_collector
-------------------
on
(1 row)
postgres=# show uppercase_attribute_name;
uppercase_attribute_name
--------------------------
off
(1 row)
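If either parameter is not at the required value, it can be changed cluster-wide with gs_guc before proceeding (a sketch, assuming the standard openGauss gs_guc syntax; logging_collector is a restart-level parameter, so a database restart may be needed for it to take effect):
# set the prerequisite parameters on all nodes and all instances
gs_guc set -N all -I all -c "logging_collector=on"
gs_guc set -N all -I all -c "uppercase_attribute_name=off"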
gs_dropnode reads the operating system character set, so before running the gs_dropnode command, make sure the OS character set is UTF-8:
export LANG=en_US.UTF-8
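To confirm the setting before running gs_dropnode, the current locale can be checked, for example:
# verify the session character set is UTF-8
echo $LANG
locale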
View the help:
[omm@pwtest2 ~]$ gs_dropnode --help
gs_dropnode is a utility to delete the standby node from a cluster, streaming cluster does not yet support.
Usage:
gs_dropnode -? | --help
gs_dropnode -V | --version
gs_dropnode -U USER -G GROUP -h nodeList
General options:
-U Cluster user.
-G Group of the cluster user.
-h The standby node backip list which need to be deleted
Separate multiple nodes with commas (,).
such as '-h 192.168.0.1,192.168.0.2'
-?, --help Show help information for this
utility, and exit the command line mode.
-V, --version Show version information.
Start the node removal:
[omm@pwtest2 ~]$ id omm
uid=1101(omm) gid=1101(dbgrp) groups=1101(dbgrp)
[omm@pwtest2 script]$ gs_dropnode -U omm -G dbgrp -h 192.168.61.57
The target node to be dropped is (['pwtest1'])
Do you want to continue to drop the target node (yes/no)?yes
Drop node start with CM node.
Drop node with CM node is running.
[gs_dropnode]Start to drop nodes of the cluster.
[gs_dropnode]Start to stop the target node pwtest1.
[gs_dropnode]End of stop the target node pwtest1.
[gs_dropnode]Start to backup parameter config file on pwtest2.
[gs_dropnode]End to backup parameter config file on pwtest2.
[gs_dropnode]The backup file of pwtest2 is /panwei/database/tmp/gs_dropnode_backup20250416160851/parameter_pwtest2.tar
[gs_dropnode]Start to parse parameter config file on pwtest2.
Command for Checking VIP mode: cm_ctl res --list | awk -F "|" '{print $2}' | grep -w ***
The current cluster does not support VIP.
[gs_dropnode]End to parse parameter config file on pwtest2.
[gs_dropnode]Start to parse backup parameter config file on pwtest2.
[gs_dropnode]End to parse backup parameter config file pwtest2.
[gs_dropnode]Start to set openGauss config file on pwtest2.
[gs_dropnode]End of set openGauss config file on pwtest2.
[gs_dropnode]Start adjusting replconninfo parameters on pwtest2.
[gs_dropnode]Successfully adjusted replconninfo parameters on pwtest2.
[gs_dropnode]Start to backup parameter config file on pwtest3.
[gs_dropnode]End to backup parameter config file on pwtest3.
[gs_dropnode]The backup file of pwtest3 is /panwei/database/tmp/gs_dropnode_backup20250416160852/parameter_pwtest3.tar
[gs_dropnode]Start to parse parameter config file on pwtest3.
Command for Checking VIP mode: cm_ctl res --list | awk -F "|" '{print $2}' | grep -w ***
The current cluster does not support VIP.
[gs_dropnode]End to parse parameter config file on pwtest3.
[gs_dropnode]Start to parse backup parameter config file on pwtest3.
[gs_dropnode]End to parse backup parameter config file pwtest3.
[gs_dropnode]Start to set openGauss config file on pwtest3.
[gs_dropnode]End of set openGauss config file on pwtest3.
[gs_dropnode]Start adjusting replconninfo parameters on pwtest3.
[gs_dropnode]Successfully adjusted replconninfo parameters on pwtest3.
[gs_dropnode]Start of set pg_hba config file on pwtest2.
[gs_dropnode]End of set pg_hba config file on pwtest2.
[gs_dropnode]Start of set pg_hba config file on pwtest3.
[gs_dropnode]End of set pg_hba config file on pwtest3.
[gs_dropnode]Start to set repl slot on pwtest2.
[gs_dropnode]Start to get repl slot on pwtest2.
[gs_dropnode]End of set repl slot on pwtest2.
Command for Checking VIP mode: cm_ctl res --list | awk -F "|" '{print $2}' | grep -w ***
The current cluster does not support VIP.
Stopping node.
=========================================
Successfully stopped node.
=========================================
End stop node.
Generate drop flag file on drop node pwtest1 successfully.
[gs_dropnode]Start to modify the cluster static conf.
[gs_dropnode]End of modify the cluster static conf.
Restarting cm_server cluster ...
Remove dynamic_config_file and CM metadata directory on all nodes.
All steps of drop have finished, but failed to wait cluster to be normal in 600s!
HINT: Maybe the cluster is continually being started in the background.
You can wait for a while and check whether the cluster starts.
# The output shows the node itself was removed successfully, but cm_server failed to return to normal within the 600 s timeout.
Cluster state during the node removal:
# The cm_server and datanode instances on the dropped node are Down
[omm@pwtest3 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
----------------------------------------------------------------------
1 pwtest1 192.168.61.57 1 /panwei/database/cm/cm_server Down
2 pwtest2 192.168.61.71 2 /panwei/database/cm/cm_server Primary
3 pwtest3 192.168.53.74 3 /panwei/database/cm/cm_server Standby
[ Cluster State ]
cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------
1 pwtest1 192.168.61.57 6001 /panwei/database/data P Down Unknown
2 pwtest2 192.168.61.71 6002 /panwei/database/data S Primary Normal
3 pwtest3 192.168.53.74 6003 /panwei/database/data S Standby Normal
# Then the whole cluster goes down:
[omm@pwtest3 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
----------------------------------------------------------------------
2 pwtest2 192.168.61.71 2 /panwei/database/cm/cm_server Down
3 pwtest3 192.168.53.74 3 /panwei/database/cm/cm_server Down
cm_ctl: can't connect to cm_server.
Maybe cm_server is not running, or timeout expired. Please try again.
At this point the cm_server cluster is down.
Check the state of the remaining nodes:
# Primary node:
[omm@pwtest2 ~]$ gs_ctl query
[2025-04-16 16:47:58.671][1782839][][gs_ctl]: gs_ctl query ,datadir is /panwei/database/data
HA state:
local_role : Primary
static_connections : 1
db_state : Normal
detail_information : Normal
Senders info:
No information
Receiver info:
No information
[omm@pwtest2 ~]$
Reads and writes on the current primary still work normally:
postgres=# create table test12 as select * from test11;
INSERT 0 0
postgres=# insert into test12 values (1);
INSERT 0 1
postgres=#
# Standby node:
[omm@pwtest3 ~]$ gs_ctl query
[2025-04-16 17:05:14.021][344448][][gs_ctl]: gs_ctl query ,datadir is /panwei/database/data
HA state:
local_role : Pending
static_connections : 1
db_state : Need repair
detail_information : Disconnected
Senders info:
No information
Receiver info:
No information
[omm@pwtest3 ~]$
The standby node is stuck in Pending state.
5. Resolution
5.1 Modify the cm_server configuration parameters
[omm@pwtest2 cm]$ cat /panwei/database/cm/cm_server/cm_server.conf
# Modify the following parameters:
third_party_gateway_ip= 192.168.61.126
cms_enable_failover_on2nodes=true
# Note: the third_party_gateway_ip address is obtained from the routing table, via the NIC bound to the host IP.
The NIC on this host is bond1:
[omm@pwtest2 ~]$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.61.126 0.0.0.0 UG 0 0 0 bond1
0.0.0.0 10.126.44.62 0.0.0.0 UG 1000 0 0 bond0
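With only two nodes left after the removal, cm_server can no longer form a majority on its own, so cms_enable_failover_on2nodes=true enables two-node failover and third_party_gateway_ip supplies the external arbitration address, here the default gateway. That gateway can be extracted in one line (a sketch using iproute2; the interface name bond1 is specific to this environment):
# print the default gateway that goes out via bond1
ip route show default | awk '/bond1/ {print $3}'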
5.2 Re-apply the XML file (with the dropped node's information removed)
Back up the configuration file, then edit it; the modified file is shown below:
[omm@pwtest2 ~]$ cd /panwei/database/tool/script
[omm@pwtest2 script]$ cp cluster_config.xml cluster_config_20250416.xml
[omm@pwtest2 script]$ cat /panwei/database/tool/script/cluster_config.xml
<?xml version="1.0" encoding="utf-8"?>
<ROOT>
<CLUSTER>
<PARAM name="clusterName" value="panweidb" />
<PARAM name="nodeNames" value="pwtest2,pwtest3"/>
<PARAM name="gaussdbAppPath" value="/panwei/database/app" />
<PARAM name="gaussdbLogPath" value="/panwei/database/log" />
<PARAM name="tmpMppdbPath" value="/panwei/database/tmp"/>
<PARAM name="gaussdbToolPath" value="/panwei/database/tool" />
<PARAM name="corePath" value="/panwei/database/corefile"/>
<PARAM name="backIp1s" value="192.168.61.71,192.168.53.74"/>
</CLUSTER>
<DEVICELIST>
<DEVICE sn="pwtest2">
<PARAM name="name" value="pwtest2"/>
<PARAM name="azName" value="AZ1"/>
<PARAM name="azPriority" value="1"/>
<PARAM name="backIp1" value="192.168.61.71"/>
<PARAM name="sshIp1" value="192.168.61.71"/>
<PARAM name="cmsNum" value="1"/>
<PARAM name="cmServerPortBase" value="18800"/>
<PARAM name="cmServerListenIp1" value="192.168.61.71,192.168.53.74"/>
<PARAM name="cmServerHaIp1" value="192.168.61.71,192.168.53.74"/>
<PARAM name="cmServerlevel" value="1"/>
<PARAM name="cmServerRelation" value="pwtest2,pwtest3"/>
<PARAM name="cmDir" value="/panwei/database/cm"/>
<PARAM name="dataNum" value="1"/>
<PARAM name="dataPortBase" value="17700"/>
<PARAM name="dataNode1" value="/panwei/database/data,pwtest3,/panwei/database/data" />
<PARAM name="dataNode1_syncNum" value="1"/>
</DEVICE>
<DEVICE sn="pwtest3">
<PARAM name="name" value="pwtest3"/>
<PARAM name="azName" value="AZ1"/>
<PARAM name="azPriority" value="1"/>
<PARAM name="backIp1" value="192.168.53.74"/>
<PARAM name="sshIp1" value="192.168.53.74"/>
<PARAM name="cmServerPortStandby" value="18800"/>
<PARAM name="cmDir" value="/panwei/database/cm"/>
</DEVICE>
</DEVICELIST>
</ROOT>
Back up and remove the cluster dynamic configuration file (on all nodes):
[omm@pwtest2 ~]$ cd $GAUSSHOME
[omm@pwtest2 app]$ cd bin/
[omm@pwtest2 bin]$ mv cluster_dynamic_config cluster_dynamic_config_bak
[omm@pwtest3 ~]$ cd $GAUSSHOME
[omm@pwtest3 app]$ cd bin/
[omm@pwtest3 bin]$ mv cluster_dynamic_config cluster_dynamic_config_bak
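The same rename can be pushed to all nodes at once from the primary with gs_ssh (a sketch; the $ is escaped so that $GAUSSHOME expands on the remote side, assuming gs_ssh sources the cluster environment there):
gs_ssh -c "mv \$GAUSSHOME/bin/cluster_dynamic_config \$GAUSSHOME/bin/cluster_dynamic_config_bak"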
Re-apply the configuration file:
This step regenerates the cluster static configuration file, $GAUSSHOME/bin/cluster_static_config.
[omm@pwtest2 script]$ gs_om -t generateconf -X /panwei/database/tool/script/cluster_config.xml --distribute
Generating static configuration files for all nodes.
Creating temp directory to store static configuration files.
Successfully created the temp directory.
Generating static configuration files.
Successfully generated static configuration files.
Static configuration files for all nodes are saved in /panwei/database/tool/script/static_config_files.
Distributing static configuration files to all nodes.
Successfully distributed static configuration files.
5.3 Comment out the scheduled cron jobs
Comment out the cron jobs on all nodes (a non-interactive sketch follows the listings below):
[omm@pwtest2 ~]$ crontab -l
#*/1 * * * * source ~/.bashrc;python3 /panwei/database/tool/script/local/CheckSshAgent.py >>/dev/null 2>&1 &
#*/1 * * * * source /etc/profile;(if [ -f ~/.profile ];then source ~/.profile;fi);source ~/.bashrc;nohup /panwei/database/app/bin/om_monitor -L /panwei/database/log/omm/cm/om_monitor >>/dev/null 2>&1 &
[omm@pwtest3 ~]$ crontab -l
#*/1 * * * * source ~/.bashrc;python3 /panwei/database/tool/script/local/CheckSshAgent.py >>/dev/null 2>&1 &
#*/1 * * * * source /etc/profile;(if [ -f ~/.profile ];then source ~/.profile;fi);source ~/.bashrc;nohup /panwei/database/app/bin/om_monitor -L /panwei/database/log/omm/cm/om_monitor >>/dev/null 2>&1 &
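If editing each crontab by hand is inconvenient, the entries can be commented out non-interactively (a sketch; it prefixes every line that is not already a comment with #):
# comment out all active crontab lines for the current user
crontab -l | sed 's/^\([^#]\)/#\1/' | crontab -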
5.4 Stop all CM-related processes on all nodes
This can be run on all nodes at once from the primary via the gs_ssh script:
[omm@pwtest2 ~]$ gs_ssh -c "ps -xo pid,command | grep -E 'om_monitor|cm_agent|cm_server|fenced UDF' | grep -v grep | awk '{print \$1}' | xargs kill -9"
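Afterwards it is worth confirming that no CM-related process survived on any node, for example:
# should print nothing if all CM processes are gone
gs_ssh -c "ps -ef | grep -E 'om_monitor|cm_agent|cm_server' | grep -v grep"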
5.5 Remove the dcf_data and gstor directories on all nodes
[omm@pwtest3 bin]$ cd /panwei/database/cm/
[omm@pwtest3 cm]$ ls -lrt
total 0
drwx------ 6 omm dbgrp 100 Nov 13 14:21 gstor
drwx------ 2 omm dbgrp 85 Apr 16 16:09 cm_agent
drwx------ 4 omm dbgrp 50 Apr 16 16:09 dcf_data
drwx------ 2 omm dbgrp 91 Apr 16 17:56 cm_server
[omm@pwtest3 cm]$ mv dcf_data dcf_data_20250416
[omm@pwtest3 cm]$ mv gstor gstor_20250416
## Alternatively, run it on all nodes from the primary with gs_ssh:
[omm@pwtest2 cm]$ gs_ssh -c "rm -rf /panwei/database/cm/dcf_data /panwei/database/cm/gstor"
5.6 Start om_monitor on all nodes
Uncomment the cron jobs on all nodes:
[omm@pwtest2 ~]$ crontab -l
*/1 * * * * source ~/.bashrc;python3 /panwei/database/tool/script/local/CheckSshAgent.py >>/dev/null 2>&1 &
*/1 * * * * source /etc/profile;(if [ -f ~/.profile ];then source ~/.profile;fi);source ~/.bashrc;nohup /panwei/database/app/bin/om_monitor -L /panwei/database/log/omm/cm/om_monitor >>/dev/null 2>&1 &
[omm@pwtest3 ~]$ crontab -l
*/1 * * * * source ~/.bashrc;python3 /panwei/database/tool/script/local/CheckSshAgent.py >>/dev/null 2>&1 &
*/1 * * * * source /etc/profile;(if [ -f ~/.profile ];then source ~/.profile;fi);source ~/.bashrc;nohup /panwei/database/app/bin/om_monitor -L /panwei/database/log/omm/cm/om_monitor >>/dev/null 2>&1 &
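Within about a minute cron respawns om_monitor on each node; om_monitor starts cm_agent, and cm_agent in turn brings cm_server and the datanodes back up. The recovery can be watched until the cluster returns to Normal, for example:
# poll the cluster state every 5 seconds
watch -n 5 "gs_om -t status --detail"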
5.7 Check the cluster state again:
[omm@pwtest2 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
----------------------------------------------------------------------
1 pwtest2 192.168.61.71 1 /panwei/database/cm/cm_server Primary
2 pwtest3 192.168.53.74 2 /panwei/database/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------
1 pwtest2 192.168.61.71 6001 /panwei/database/data P Primary Normal
2 pwtest3 192.168.53.74 6002 /panwei/database/data S Standby Normal
The cluster is back to Normal.
6. Failure summary
The failure during this node removal occurred because the node being dropped was node 1, i.e. the node that was primary when the cluster was initialized. PanWeiDB does not recommend dropping the initial primary node. If it must be dropped, first change that node's sequence number by rewriting the node order in the XML file and regenerating CM's static configuration file, and only then drop the node. Dropping a node that was a standby at initialization does not trigger this failure.