Some openGauss Cases

是赐赐啊!🦄 2024-11-04


I. Disk Full

1. Standby disk full

Initial state

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary

[ Cluster State ]

cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary Normal

[omm@testnode1 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 12K 16G 1% /dev/shm
tmpfs 6.2G 218M 6.0G 4% /run
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
/dev/mapper/hdvg-rootlv 56G 29G 28G 52% /
tmpfs 16G 5.3G 11G 34% /tmp
/dev/sda2 1014M 129M 886M 13% /boot
/dev/sda3 1022M 11M 1012M 2% /boot/efi
tmpfs 3.1G 0 3.1G 0% /run/user/0
tmpfs 3.1G 0 3.1G 0% /run/user/1000
tmpfs 3.1G 0 3.1G 0% /run/user/1006
tmpfs 3.1G 0 3.1G 0% /run/user/1003

On testnode1, use fio to write a 22 GB file so that usage of / climbs past 90%:

fio -filename=/testfile -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=2k -size=22G -numjobs=100 -runtime=5 -group_reporting -name=mytest

[root@testnode1 yum.repos.d]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 12K 16G 1% /dev/shm
tmpfs 6.2G 218M 6.0G 4% /run
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
/dev/mapper/hdvg-rootlv 56G 51G 5.2G 91% /
tmpfs 16G 5.3G 11G 34% /tmp
/dev/sda2 1014M 129M 886M 13% /boot
/dev/sda3 1022M 11M 1012M 2% /boot/efi
tmpfs 3.1G 0 3.1G 0% /run/user/0
tmpfs 3.1G 0 3.1G 0% /run/user/1000
tmpfs 3.1G 0 3.1G 0% /run/user/1006
tmpfs 3.1G 0 3.1G 0% /run/user/1003

Checking the cluster again, the datanode on testnode1 is now ReadOnly:

[omm@testnode1 om]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary

[ Cluster State ]

cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby ReadOnly
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary Normal

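The database on this node is now read-only. It can still replicate data from the primary, but if it were promoted to primary it would still be unable to accept writes, so watch for this situation in production.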
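Since the standby turned ReadOnly once / crossed roughly 90% in this test, it helps to raise an alert before that point. The following is a minimal sketch, assuming the 90% figure observed above (the real threshold is a cluster setting and may differ in your environment). `usage_pct` and `over_threshold` are hypothetical helper names, and the parser expects `df -P` or `df -h` style output on stdin.

```shell
#!/bin/sh
# usage_pct MOUNTPOINT: read `df -P` (or `df -h`) output on stdin and print the
# Use% value of MOUNTPOINT as a bare integer. Prints nothing if it is not found.
usage_pct() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { sub(/%$/, "", $(NF - 1)); print $(NF - 1) }'
}

# over_threshold PCT LIMIT: succeed when PCT is greater than or equal to LIMIT.
over_threshold() {
    [ -n "$1" ] && [ "$1" -ge "$2" ]
}

# Example: warn when / (which holds the data directory in this test) reaches 90%.
# The limit of 90 is an assumption taken from the observation in this article.
pct="$(df -P / | usage_pct /)"
if over_threshold "$pct" 90; then
    echo "WARNING: / is at ${pct}%, the datanode may turn ReadOnly" >&2
fi
```

Run from cron on every node, this gives a warning while there is still time to free space, before a primary or standby goes read-only.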

2. Primary disk full

Initial state

[root@testnode1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 12K 16G 1% /dev/shm
tmpfs 6.2G 218M 6.0G 4% /run
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
/dev/mapper/hdvg-rootlv 56G 29G 28G 52% /
tmpfs 16G 5.3G 11G 34% /tmp
/dev/sda2 1014M 129M 886M 13% /boot
/dev/sda3 1022M 11M 1012M 2% /boot/efi
tmpfs 3.1G 0 3.1G 0% /run/user/0
tmpfs 3.1G 0 3.1G 0% /run/user/1000
tmpfs 3.1G 0 3.1G 0% /run/user/1006
tmpfs 3.1G 0 3.1G 0% /run/user/1003

After filling the disk so that usage exceeds 90%, the primary can no longer accept writes; it stays that way until space is freed.

[omm@testnode1 om]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary

[ Cluster State ]

cluster_state : Degraded
redistributing : No
balanced : Yes
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary ReadOnly
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal

[omm@testnode1 om]$ gsql -d zxc4 -p 15400 -r
gsql ((openGauss 5.0.0 build a07d57c3) compiled at 2023-03-29 03:37:13 commit 0 last mr )
Non-SSL connection (SSL connection is recommended when requiring high-security)
Type "help" for help.

zxc4=# insert into t2 values(10,10);
ERROR: cannot execute INSERT in a read-only transaction
zxc4=#

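After a short wait, the cluster automatically switched the primary role to a host with enough free space. No manual action was taken during this time.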

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary

[ Cluster State ]

cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby ReadOnly
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Primary Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal

If the disks on all hosts are full, no switchover is possible and the cluster is left without a writable primary.

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary

[ Cluster State ]

cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby ReadOnly
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby ReadOnly
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary ReadOnly
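Whether the cluster still has a node that accepts writes can be read off the `[ Datanode State ]` block. Below is a rough sketch, assuming the column layout shown in the outputs above (node name in the second whitespace-separated field, role in the seventh, state in the eighth). `writable_primary` is a hypothetical helper name, not part of openGauss.

```shell
#!/bin/sh
# writable_primary: read `gs_om -t status --detail` output on stdin and print the
# node name of every datanode whose role is Primary and whose state is Normal.
writable_primary() {
    awk '
        /^\[ Datanode State \]/ { in_dn = 1; next }   # enter the datanode table
        /^\[/                   { in_dn = 0 }         # any other section header leaves it
        in_dn && $7 == "Primary" && $8 == "Normal" { print $2 }
    '
}
```

Piping `gs_om -t status --detail` into `writable_primary` would print testnode2 for the post-failover output above and nothing for the last output, where every datanode is ReadOnly.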

II. Host Crash

1. Node crashes unexpectedly; the database restarts on its own (abnormal failure)

Initial state

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary

[ Cluster State ]

cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary Normal

On testnode3, crash the kernel with: echo 1 > /proc/sys/kernel/sysrq;echo c > /proc/sysrq-trigger

Then check the cluster state again:

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Primary
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Down

[ Cluster State ]

cluster_state : Unavailable
redistributing : No
balanced : Yes
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Down Unknown
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Down Normal

A failover happens automatically.

After waiting a while for the old primary host to boot, check again:

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Primary
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby

[ Cluster State ]

cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal

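The testnode3 host has finished rebooting, and the database came up on its own and rejoined the cluster automatically, so no manual action is needed. Intervene manually only if the node cannot rejoin the cluster or the database fails to start.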
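Waiting for the crashed node to come back can be scripted as a polling loop with a timeout, so that manual intervention starts only when the cluster has not returned to Normal in time. This is a minimal sketch; `cluster_state` and `wait_until_normal` are hypothetical helper names, and the parser assumes the `cluster_state : ...` line format shown in the outputs above.

```shell
#!/bin/sh
# cluster_state: read status output on stdin and print the value of the
# cluster_state line (for example Normal, Degraded or Unavailable).
cluster_state() {
    awk -F: '/^cluster_state/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# wait_until_normal TRIES DELAY CMD...: run CMD up to TRIES times, DELAY seconds
# apart, and succeed as soon as its output reports cluster_state Normal.
wait_until_normal() {
    tries="$1"; delay="$2"; shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if [ "$("$@" | cluster_state)" = "Normal" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_until_normal 60 10 gs_om -t status --detail || echo "manual intervention needed"` polls for up to about ten minutes; the retry count and delay are arbitrary illustration values.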

2. Node taken down manually; the database must be started manually (planned operation)

Initial state

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Primary
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby

[ Cluster State ]

cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal

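If the primary needs to be restarted manually, first switch the primary role to another host. In this test environment, for example, switch the primary to testnode2 by running the following command on testnode2: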

gs_ctl switchover -D /home/omm/huawei/install/data/dn

[omm@testnode2 ~]$ gs_ctl switchover -D /home/omm/huawei/install/data/dn
[2023-08-02 16:05:47.130][1641788][][gs_ctl]: gs_ctl switchover ,datadir is /home/omm/huawei/install/data/dn
[2023-08-02 16:05:47.130][1641788][][gs_ctl]: switchover term (1)
[2023-08-02 16:05:47.138][1641788][][gs_ctl]: waiting for server to switchover........
[2023-08-02 16:05:52.186][1641788][][gs_ctl]: done
[2023-08-02 16:05:52.186][1641788][][gs_ctl]: switchover completed (/home/omm/huawei/install/data/dn)
[omm@testnode2 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Primary
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby

[ Cluster State ]

cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Primary Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal
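Before a planned switchover it is reasonable to confirm that the target is a healthy standby. The sketch below assumes the same `[ Datanode State ]` column layout as the outputs above; `is_healthy_standby` is a hypothetical helper name.

```shell
#!/bin/sh
# is_healthy_standby NODE: read `gs_om -t status --detail` output on stdin and
# succeed only if NODE is listed in the datanode table as Standby and Normal.
is_healthy_standby() {
    awk -v node="$1" '
        /^\[ Datanode State \]/ { in_dn = 1; next }
        /^\[/                   { in_dn = 0 }
        in_dn && $2 == node && $7 == "Standby" && $8 == "Normal" { ok = 1 }
        END { exit (ok ? 0 : 1) }
    '
}

# Intended use on testnode2 (commented out so the sketch has no side effects):
# if gs_om -t status --detail | is_healthy_standby testnode2; then
#     gs_ctl switchover -D /home/omm/huawei/install/data/dn
# fi
```

A standby in a state such as ReadOnly or Need repair fails the check, which avoids promoting a node that could not accept writes anyway.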

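III. Network Unreachable

A short outage recovers automatically, but a long one may still require manual intervention.

1. Automatic recovery

Initial environment: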

[omm@testnode3 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Standby
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Primary

[ Cluster State ]

cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Primary Normal

On testnode3, run: iptables -I INPUT -s 10.1.62.240 -j DROP;iptables -I INPUT -s 10.1.62.241 -j DROP

On the other two hosts, drop traffic from testnode3 (10.1.60.217 in the status output): iptables -I INPUT -s 10.1.60.217 -j DROP
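The test later clears the rules with iptables -F, which flushes every rule in the filter table. On a host that carries other firewall rules, removing only the rules added for the test is gentler. The sketch below only prints the commands; `drop_rules` is a hypothetical helper name, and the peer addresses are passed in as arguments.

```shell
#!/bin/sh
# drop_rules ACTION PEER...: print one iptables command per PEER.
# ACTION is "insert" (-I) to start the partition, or "delete" (-D) to remove
# exactly the rules that "insert" added, leaving other firewall rules untouched.
drop_rules() {
    case "$1" in
        insert) flag="-I" ;;
        delete) flag="-D" ;;
        *) echo "usage: drop_rules insert|delete PEER..." >&2; return 2 ;;
    esac
    shift
    for peer in "$@"; do
        echo "iptables $flag INPUT -s $peer -j DROP"
    done
}
```

On testnode3, `drop_rules insert 10.1.62.240 10.1.62.241 | sh` (as root) would start the partition and `drop_rules delete 10.1.62.240 10.1.62.241 | sh` would end it. Printing first and piping to `sh` afterwards makes it easy to review the commands before they run.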

Intermediate state:

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Primary
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Down

[ Cluster State ]

cluster_state : Unavailable
redistributing : No
balanced : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Standby Need repair(Disconnected)
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Need repair(Disconnected)
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Down Unknown

Final state

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Primary
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Down

[ Cluster State ]

cluster_state : Degraded
redistributing : No
balanced : Yes
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Down Unknown

A new primary is elected. At this point, run iptables -F on all hosts to clear the restrictions, and db3 (testnode3) rejoins the cluster:

[omm@testnode1 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Primary
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby

[ Cluster State ]

cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal

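One point to note here: while db3 and db1/db2 cannot reach each other, db1/db2 elect a primary that accepts writes. On db3 itself, the local status also shows the CM manager as abnormal, yet db3 can in fact still be written to. A network partition like this can therefore leave the old and the new primary serving traffic at the same time (unlikely in production, since it requires the database hosts to be cut off from each other while all of them remain reachable from outside). If it does happen, once the network recovers the old primary automatically synchronizes data from the new primary and rejoins the cluster as a standby.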
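The double-primary condition described above cannot be seen from one side of the partition, so detecting it means collecting each host's own view of its role and comparing. The sketch below assumes the roles have already been gathered into lines of the form "HOST ROLE" by some out-of-band means (for example by running a local role query on each host; how that is collected is an assumption and not shown). `split_brain` is a hypothetical helper name.

```shell
#!/bin/sh
# split_brain: read lines of the form "HOST ROLE" on stdin and succeed (exit 0)
# when more than one host reports its own role as Primary.
split_brain() {
    awk '$2 == "Primary" { n++ } END { exit (n > 1 ? 0 : 1) }'
}

# Example with hand-written input that mirrors the partition in this test:
if printf 'testnode1 Primary\ntestnode2 Standby\ntestnode3 Primary\n' | split_brain; then
    echo "more than one primary is accepting writes"
fi
```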

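2. No automatic recovery

Continuing the test above: if db3 does not rejoin successfully, it has to be restored from a backup. Assume here that db3 did not rejoin the cluster.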

[omm@testnode3 ~]$ cm_ctl stop -n 3 -D /home/omm/huawei/install/data/dn
cm_ctl: stop the node: 3, datapath: /home/omm/huawei/install/data/dn.
..
cm_ctl: stop instance successfully.


[omm@testnode3 ~]$ rm -rf /home/omm/gs_basebak
[omm@testnode3 ~]$ mkdir -p /home/omm/gs_basebak
[omm@testnode3 ~]$ gs_basebackup -D /home/omm/gs_basebak -X fetch -F t -p 15400
INFO: The starting position of the xlog copy of the full build is: 0/5EC9CD8. The slot minimum LSN is: 0/0. The disaster slot minimum LSN is: 0/0. The logical slot minimum LSN is: 0/0.
[2023-08-02 17:09:44]:begin build tablespace list
[2023-08-02 17:09:44]:finish build tablespace list
[2023-08-02 17:09:45]:gs_basebackup: base backup successfully

[omm@testnode3 ~]$ mkdir /home/omm/testnode1bak
[omm@testnode3 ~]$ gs_basebackup -D /home/omm/testnode1bak -X fetch -F t -h testnode1 -p 15400
INFO: The starting position of the xlog copy of the full build is: 0/6000028. The slot minimum LSN is: 0/6000148. The disaster slot minimum LSN is: 0/0. The logical slot minimum LSN is: 0/0.
[2023-08-02 17:10:50]:begin build tablespace list
[2023-08-02 17:10:50]:finish build tablespace list
[2023-08-02 17:10:53]:gs_basebackup: base backup successfully

[omm@testnode3 ~]$ gs_om -t stop -h testnode3
Stopping node.
=========================================
Successfully stopped node.
=========================================
End stop node.
[omm@testnode3 ~]$ rm -rf /home/omm/huawei/install/data/dn/*

[omm@testnode3 ~]$ gs_tar -D /home/omm/huawei/install/data/dn/ -F /home/omm/testnode1bak/base.tar
[omm@testnode3 ~]$ gs_backup -t restore --backup-dir=/home/omm/gs_basebak --parameter -h testnode3
Parsing configuration files.
Successfully parsed the configuration file.
Performing remote restoration.
Successfully restored cluster files.

[omm@testnode3 ~]$ cm_ctl start -n 3 -D /home/omm/huawei/install/data/dn
cm_ctl: start the node:3,datapath:/home/omm/huawei/install/data/dn.
.....
cm_ctl: start instance successfully.

[omm@testnode3 ~]$ gs_om -t status --detail
[ CMServer State ]

node node_ip instance state
-------------------------------------------------------------------------------
1 testnode1 10.1.62.240 1 /home/omm/huawei/install/cm/cm_server Standby
2 testnode2 10.1.62.241 2 /home/omm/huawei/install/cm/cm_server Primary
3 testnode3 10.1.60.217 3 /home/omm/huawei/install/cm/cm_server Standby

[ Cluster State ]

cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state
------------------------------------------------------------------------------------
1 testnode1 10.1.62.240 6001 /home/omm/huawei/install/data/dn P Primary Normal
2 testnode2 10.1.62.241 6002 /home/omm/huawei/install/data/dn S Standby Normal
3 testnode3 10.1.60.217 6003 /home/omm/huawei/install/data/dn S Standby Normal
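The riskiest step in the procedure above is the `rm -rf /home/omm/huawei/install/data/dn/*`: if the path is mistyped or comes from an empty variable, the wrong directory is wiped. Below is a guarded variant as a minimal sketch. `clear_data_dir` is a hypothetical helper name, and unlike the `dn/*` glob it also removes hidden files, which is a deliberate difference from the command used in the test. It relies on the `-mindepth` and `-delete` options of GNU find.

```shell
#!/bin/sh
# clear_data_dir DIR: remove everything inside DIR but keep DIR itself.
# Refuses to run on an empty argument, a missing directory, or the root directory.
clear_data_dir() {
    dir="$1"
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "refusing to clear '$dir': not an existing directory" >&2
        return 1
    fi
    real="$(cd "$dir" && pwd -P)" || return 1   # resolve symlinks and ../ components
    if [ "$real" = "/" ]; then
        echo "refusing to clear the root directory" >&2
        return 1
    fi
    find "$real" -mindepth 1 -delete
}
```

In the restore sequence this would replace the `rm -rf` line with `clear_data_dir /home/omm/huawei/install/data/dn`, run after the node has been stopped and before `gs_tar` unpacks the base backup.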

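IV. VM Rebuilt (Local Configuration Files Not Backed Up)

Make sure that every host keeps configuration-file backups for all hosts, so that you are never left without configuration files. Then you only need to restore the full data set from the primary and restore the configuration from the backup kept on another host.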
