I. Disk full
1. Standby disk full
Initial state
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary Normal
[omm@testnode1 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 12K 16G 1% /dev/shm
tmpfs 6.2G 218M 6.0G 4% /run
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
/dev/mapper/hdvg-rootlv 56G 29G 28G 52% /
tmpfs 16G 5.3G 11G 34% /tmp
/dev/sda2 1014M 129M 886M 13% /boot
/dev/sda3 1022M 11M 1012M 2% /boot/efi
tmpfs 3.1G 0 3.1G 0% /run/user/0
tmpfs 3.1G 0 3.1G 0% /run/user/1000
tmpfs 3.1G 0 3.1G 0% /run/user/1006
tmpfs 3.1G 0 3.1G 0% /run/user/1003
On testnode1, use fio to write a 22G file so that usage of / reaches 90%:
fio -filename=/testfile -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=2k -size=22G -numjobs=100 -runtime=5 -group_reporting -name=mytest
[root@testnode1 yum.repos.d]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 12K 16G 1% /dev/shm
tmpfs 6.2G 218M 6.0G 4% /run
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
/dev/mapper/hdvg-rootlv 56G 51G 5.2G 91% /
tmpfs 16G 5.3G 11G 34% /tmp
/dev/sda2 1014M 129M 886M 13% /boot
/dev/sda3 1022M 11M 1012M 2% /boot/efi
tmpfs 3.1G 0 3.1G 0% /run/user/0
tmpfs 3.1G 0 3.1G 0% /run/user/1000
tmpfs 3.1G 0 3.1G 0% /run/user/1006
tmpfs 3.1G 0 3.1G 0% /run/user/1003
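The 22G file size follows from the initial df output (56G total, 29G used, 90% target). A minimal sketch of that arithmetic, assuming whole-GB figures as df -h reports them; the function name fill_needed_gb is hypothetical:

```shell
#!/bin/sh
# fill_needed_gb SIZE_GB USED_GB TARGET_PCT
# Prints how many whole GB must be written so that usage reaches TARGET_PCT.
# The target is rounded up so the result is never short of the threshold.
fill_needed_gb() {
    size=$1
    used=$2
    pct=$3
    target=$(( (size * pct + 99) / 100 ))   # ceil(size * pct / 100)
    echo $(( target - used ))
}
```

With the values above, fill_needed_gb 56 29 90 prints 22, which matches the -size=22G passed to fio.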
Check the status again: testnode1 is now ReadOnly.
[omm@testnode1 om]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary
[ Cluster State ]
cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby ReadOnly
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary Normal
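The database on testnode1 is now read-only. It can still replicate data from the primary, but if it were promoted to primary it would still be unable to accept writes, so watch for this condition in production.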
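To catch this before a node turns ReadOnly, disk usage can be checked against a threshold. A rough sketch, assuming the 90% figure from this test (the threshold actually configured in CM may differ and is not confirmed here); usage_pct and over_threshold are hypothetical helper names:

```shell
#!/bin/sh
# usage_pct: read `df -P <path>` output on stdin and print the usage
# percentage of the first filesystem as a bare integer.
# df -P keeps each filesystem on a single line, so field 5 is the percentage.
usage_pct() {
    awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# over_threshold PCT LIMIT: succeed when PCT is at or above LIMIT.
over_threshold() {
    [ "$1" -ge "$2" ]
}
```

For example, `over_threshold "$(df -P / | usage_pct)" 90 && echo "disk nearly full"` could run from cron on each node.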
2. Primary disk full
Initial state
[root@testnode1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 12K 16G 1% /dev/shm
tmpfs 6.2G 218M 6.0G 4% /run
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
/dev/mapper/hdvg-rootlv 56G 29G 28G 52% /
tmpfs 16G 5.3G 11G 34% /tmp
/dev/sda2 1014M 129M 886M 13% /boot
/dev/sda3 1022M 11M 1012M 2% /boot/efi
tmpfs 3.1G 0 3.1G 0% /run/user/0
tmpfs 3.1G 0 3.1G 0% /run/user/1000
tmpfs 3.1G 0 3.1G 0% /run/user/1006
tmpfs 3.1G 0 3.1G 0% /run/user/1003
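After the disk is filled so that usage exceeds 90%, the primary can no longer accept writes; writes resume only after space has been freed.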
[omm@testnode1 om]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary
[ Cluster State ]
cluster_state : Degraded
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary ReadOnly
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal
[omm@testnode1 om]$ gsql -d zxc4 -p 15400 -r
gsql ((openGauss 5.0.0 build a07d57c3) compiled at 2023-03-29 03:37:13 commit 0 last mr )
Non-SSL connection (SSL connection is recommended when requiring high-security)
Type "help" for help.
zxc4=# insert into t2 values(10,10);
ERROR: cannot execute INSERT in a read-only transaction
zxc4=#
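After a short wait, the cluster automatically switches the primary to a host with sufficient free space. No manual action was taken during this time.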
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary
[ Cluster State ]
cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby ReadOnly
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Primary Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal
If the disks on all hosts are full, no switchover is possible and the cluster is left without a writable primary.
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary
[ Cluster State ]
cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby ReadOnly
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby ReadOnly
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary ReadOnly
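Whether the cluster still has a writable primary can be read from the Datanode State section. The sketch below assumes the column layout shown in these transcripts (node id, name, IP, instance, data path, P/S flag, role, state); other versions of gs_om may print extra columns. writable_primary is a hypothetical name:

```shell
#!/bin/sh
# writable_primary: read `gs_om -t status --detail` output on stdin.
# Prints the name of each datanode whose role is Primary and state is Normal.
# Returns non-zero when no such node exists.
writable_primary() {
    awk '
        # Section headers look like "[ Datanode State ]".
        $1 == "[" { in_dn = ($2 == "Datanode"); next }
        in_dn && $1 ~ /^[0-9]+$/ && $7 == "Primary" && $8 == "Normal" {
            print $2
            found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

Run as `gs_om -t status --detail | writable_primary`; for the last status above it prints nothing and fails, because the only Primary is ReadOnly.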
II. Host crash
1. Node crashes unexpectedly and the database restarts on its own (failure scenario)
Initial state
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary Normal
On testnode3, crash the kernel with: echo 1 > /proc/sys/kernel/sysrq;echo c > /proc/sysrq-trigger
Then check the cluster status:
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Primary
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Down
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Down Unknown
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Down Normal
A failover happens automatically.
After a while, once the old primary host has come back up, check again:
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Primary
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal
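testnode3 has finished rebooting, and the database started automatically and rejoined the cluster, so no manual action is needed. Consider manual intervention only if the node cannot rejoin the cluster or the database fails to start.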
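Instead of rerunning the status command by hand, the wait for recovery can be scripted. A minimal sketch that polls until cluster_state is Normal; the function names and the default of 30 tries at 10-second intervals are assumptions, not values from this test:

```shell
#!/bin/sh
# cluster_state: read `gs_om -t status --detail` output on stdin and
# print the value of the "cluster_state : <value>" line.
cluster_state() {
    awk '$1 == "cluster_state" { print $3 }'
}

# get_status: wrapper around the status command so it can be replaced in tests.
get_status() {
    gs_om -t status --detail
}

# wait_for_normal [TRIES] [DELAY_SECONDS]
# Succeeds as soon as the cluster reports Normal; fails after TRIES attempts.
wait_for_normal() {
    tries=${1:-30}
    delay=${2:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if [ "$(get_status | cluster_state)" = "Normal" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

If wait_for_normal fails, that is the point at which the manual steps in the later sections come into play.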
2. Node taken down manually, database must be started manually (planned operation)
Initial state
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Primary
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal
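If the primary needs to be restarted manually, first switch the primary role to another host. In the test environment, for example, switch the primary to testnode2 by running the following command on testnode2: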
gs_ctl switchover -D /home/omm/huawei/install/data/dn
[omm@testnode2 ~]$ gs_ctl switchover -D /home/omm/huawei/install/data/dn
[2023-08-02 16:05:47.130][1641788][][gs_ctl]: gs_ctl switchover ,datadir is /home/omm/huawei/install/data/dn
[2023-08-02 16:05:47.130][1641788][][gs_ctl]: switchover term (1)
[2023-08-02 16:05:47.138][1641788][][gs_ctl]: waiting for server to switchover........
[2023-08-02 16:05:52.186][1641788][][gs_ctl]: done
[2023-08-02 16:05:52.186][1641788][][gs_ctl]: switchover completed (/home/omm/huawei/install/data/dn)
[omm@testnode2 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Primary
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Primary Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal
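Before running the switchover it is worth confirming that the target is a healthy standby. A sketch of such a pre-check, again assuming the status column layout shown in these notes; dn_role_state and can_switch_to are hypothetical names:

```shell
#!/bin/sh
# dn_role_state NODE: read `gs_om -t status --detail` output on stdin and
# print the role and state of NODE's datanode, e.g. "Standby Normal".
dn_role_state() {
    awk -v node="$1" '
        $1 == "[" { in_dn = ($2 == "Datanode"); next }
        in_dn && $2 == node {
            out = $7
            # The state can span several words, e.g. "Need repair(Disconnected)".
            for (i = 8; i <= NF; i++) out = out " " $i
            print out
        }
    '
}

# can_switch_to NODE: succeed only if NODE is currently "Standby Normal".
can_switch_to() {
    [ "$(dn_role_state "$1")" = "Standby Normal" ]
}
```

On testnode2 this could guard the command used above: `gs_om -t status --detail | can_switch_to testnode2 && gs_ctl switchover -D /home/omm/huawei/install/data/dn`.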
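III. Network unreachable
A short outage recovers automatically, but a long outage may still require manual intervention.
1. Automatic recovery
Initial environment: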
[omm@testnode3 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary Normal
On testnode3, run iptables -I INPUT -s 10.1.62.240 -j DROP;iptables -I INPUT -s 10.1.62.241 -j DROP
On the other two hosts, run iptables -I INPUT -s 10.1.60.217 -j DROP
Intermediate state:
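The isolation rules can be generated per host so that the same list of peers is used to add and later remove them. A small sketch that only prints the commands; drop_rules is a hypothetical helper, and the IPs are the ones shown in the status output:

```shell
#!/bin/sh
# drop_rules ACTION PEER_IP...
# ACTION is -I (insert the rule) or -D (delete the same rule again).
# Prints one iptables command per peer instead of executing it, so the
# output can be reviewed before being piped to a root shell.
drop_rules() {
    action=$1
    shift
    for ip in "$@"; do
        echo "iptables $action INPUT -s $ip -j DROP"
    done
}
```

On testnode3, `drop_rules -I 10.1.62.240 10.1.62.241` prints the two commands used here, and `drop_rules -D 10.1.62.240 10.1.62.241` prints their removal. Deleting with -D touches only these rules, whereas iptables -F flushes every rule in the table.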
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Primary
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Down
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Need repair(Disconnected)
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Need repair(Disconnected)
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Down Unknown
Final state:
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Primary
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Down
[ Cluster State ]
cluster_state : Degraded
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Down Unknown
A new primary is elected. At this point, run iptables -F on all hosts to clear the restrictions, and db3 (testnode3) rejoins the cluster:
[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Primary
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal
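One point to note: while db3 and db1/db2 cannot reach each other, db1/db2 elect a primary that accepts writes. Viewed from its own host, db3 also reports the CM manager as abnormal, yet db3 can in fact still accept writes. A network partition like this can therefore leave the old and new primaries serving traffic at the same time. (This is unlikely in production, because it requires the database nodes to be cut off from each other while all of them remain reachable from outside.) If it does happen, once the network recovers the old primary automatically synchronizes from the new primary and joins the cluster as a standby.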
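Because each side of the partition sees only itself, a dual-primary situation has to be detected by asking every node for its own role. A rough sketch, assuming gs_ctl query -D <datadir> prints a line of the form "local_role : Primary" (verify this against your version); local_role and count_primaries are hypothetical names:

```shell
#!/bin/sh
# local_role: read `gs_ctl query -D <datadir>` output on stdin and print
# the value of the local_role line (assumed format: "local_role : <role>").
local_role() {
    awk '$1 == "local_role" { print $3 }'
}

# count_primaries ROLE...: print how many of the given roles are "Primary".
# A result greater than 1 means two nodes both believe they are primary.
count_primaries() {
    n=0
    for r in "$@"; do
        if [ "$r" = "Primary" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

The roles would be collected from outside the cluster network, for example over ssh from a monitoring host that can still reach all three nodes, and then passed to count_primaries.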
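2. Automatic recovery fails
Continuing the test above: if db3 does not rejoin successfully, it has to be restored from a backup. Assume that db3 did not rejoin the cluster this time.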
[omm@testnode3 ~]$ cm_ctl stop -n 3 -D /home/omm/huawei/install/data/dn
cm_ctl: stop the node: 3, datapath: /home/omm/huawei/install/data/dn.
..
cm_ctl: stop instance successfully.
[omm@testnode3 ~]$ rm -rf /home/omm/gs_basebak
[omm@testnode3 ~]$ mkdir -p /home/omm/gs_basebak
[omm@testnode3 ~]$ gs_basebackup -D /home/omm/gs_basebak -X fetch -F t -p 15400
INFO: The starting position of the xlog copy of the full build is: 0/5EC9CD8. The slot minimum LSN is: 0/0. The disaster slot minimum LSN is: 0/0. The logical slot minimum LSN is: 0/0.
[2023-08-02 17:09:44]:begin build tablespace list
[2023-08-02 17:09:44]:finish build tablespace list
[2023-08-02 17:09:45]:gs_basebackup: base backup successfully
[omm@testnode3 ~]$ mkdir /home/omm/testnode1bak
[omm@testnode3 ~]$ gs_basebackup -D /home/omm/testnode1bak -X fetch -F t -h testnode1 -p 15400
INFO: The starting position of the xlog copy of the full build is: 0/6000028. The slot minimum LSN is: 0/6000148. The disaster slot minimum LSN is: 0/0. The logical slot minimum LSN is: 0/0.
[2023-08-02 17:10:50]:begin build tablespace list
[2023-08-02 17:10:50]:finish build tablespace list
[2023-08-02 17:10:53]:gs_basebackup: base backup successfully
[omm@testnode3 ~]$ gs_om -t stop -h testnode3
Stopping node.
=========================================
Successfully stopped node.
=========================================
End stop node.
[omm@testnode3 ~]$ rm -rf /home/omm/huawei/install/data/dn/*
[omm@testnode3 ~]$ gs_tar -D /home/omm/huawei/install/data/dn/ -F /home/omm/testnode1bak/base.tar
[omm@testnode3 ~]$ gs_backup -t restore --backup-dir=/home/omm/gs_basebak --parameter -h testnode3
Parsing configuration files.
Successfully parsed the configuration file.
Performing remote restoration.
Successfully restored cluster files.
[omm@testnode3 ~]$ cm_ctl start -n 3 -D /home/omm/huawei/install/data/dn
cm_ctl: start the node:3,datapath:/home/omm/huawei/install/data/dn.
.....
cm_ctl: start instance successfully.
[omm@testnode3 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Primary
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal
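The rm -rf on the data directory in the procedure above is the one irreversible step, so a scripted version should refuse obviously wrong paths. A minimal sketch of such a guard; safe_to_wipe is a hypothetical helper, not part of the openGauss tools:

```shell
#!/bin/sh
# safe_to_wipe DIR
# Succeeds only if DIR is an absolute path to an existing directory
# and is not the filesystem root. Empty or relative paths are rejected.
safe_to_wipe() {
    dir=$1
    case "$dir" in
        /*) ;;             # absolute path: keep checking
        *) return 1 ;;     # empty or relative: refuse
    esac
    # Strip trailing slashes so that "/" and "//" collapse to an empty string.
    while [ "${dir%/}" != "$dir" ]; do
        dir=${dir%/}
    done
    [ -n "$dir" ] && [ -d "$dir" ]
}
```

A possible use is `safe_to_wipe "$DATADIR" && rm -rf "${DATADIR:?}"/*`, where DATADIR is /home/omm/huawei/install/data/dn in this environment.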
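IV. VM rebuilt (no local backup of the configuration files)
Make sure that every host keeps a configuration backup of all hosts, so that the configuration files are never missing. Recovery then only requires a full restore from the primary, followed by restoring the configuration from the backup held on another host.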
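Whether every host really holds a backup for all hosts can be verified with a simple check. The sketch below assumes a hypothetical layout in which each host's configuration backup sits in its own subdirectory named after the host; neither this layout nor the name missing_backups comes from the notes:

```shell
#!/bin/sh
# missing_backups ROOT HOST...
# Prints every HOST whose backup directory ROOT/HOST is missing or empty.
# Prints nothing when all hosts are covered.
missing_backups() {
    root=$1
    shift
    for host in "$@"; do
        # A host counts as backed up only if its directory exists and has content.
        if [ ! -d "$root/$host" ] || [ -z "$(ls -A "$root/$host")" ]; then
            echo "$host"
        fi
    done
}
```

Running `missing_backups /home/omm/conf_backup testnode1 testnode2 testnode3` on each node (the backup root here is only an example path) shows which hosts' configuration could not be restored from that node.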




