Original author: 肖杰 (Xiao Jie)
- Problem overview
- Primary/standby environment
- Simulating the split-brain
- Repair steps
Problem overview
A network failure in a MogDB primary/standby environment led to a split-brain.
Primary/standby environment
[root@mogdb1 ~]# ptk cluster -n mogdb_cluster status
[ Cluster State ]
cluster_name : mogdb_cluster
cluster_state : Normal
database_version : MogDB 3.0.4 (build cc068866)
[ Datanode State ]
cluster_name | id | ip | port | user | nodename | db_role | state | upstream
----------------+------+----------------+-------+------+----------+---------+--------+-----------
mogdb_cluster | 6001 | 192.168.56.180 | 26000 | omm | dn_6001 | primary | Normal | -
| 6002 | 192.168.56.181 | 26000 | omm | dn_6002 | standby | Normal | -
[root@mogdb1 ~]# systemctl status mogha
● mogha.service - MogHA High Available Service
Loaded: loaded (/usr/lib/systemd/system/mogha.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2023-06-01 16:58:01 CST; 5min ago
Docs: https://docs.mogdb.io/zh/mogha/latest/overview
Main PID: 3334 (mogha)
CGroup: /system.slice/mogha.service
├─3334 /mogha/mogha/mogha -c /mogha/mogha/node.conf
├─3523 mogha: watchdog
├─3708 mogha: http-server
├─3709 mogha: heartbeat
├─3816 /opt/mogdb/app/bin/mogdb -D /opt/mogdb/data -M primary
├─7085 ping -c 3 -i 0.5 192.168.56.181
└─7087 ping -c 3 -i 0.5 192.168.56.1
Jun 01 16:58:01 mogdb1 systemd[1]: Started MogHA High Available Service.
Jun 01 16:58:02 mogdb1 mogha[3334]: MogHA Version: Version: 2.4.8
Jun 01 16:58:02 mogdb1 mogha[3334]: GitHash: 56c62c1
Jun 01 16:58:02 mogdb1 mogha[3334]: config loaded successfully
mogdb2 (192.168.56.181) is the primary and mogdb1 (192.168.56.180) is the standby, and MogHA is running normally.
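Before injecting the fault, each instance's actual role can also be confirmed from inside the database. A minimal check, assuming the instances listen on port 26000 as configured above (pg_stat_get_stream_replications() is a built-in openGauss/MogDB function):

# Run as omm on each node; the primary reports local_role = Primary,
# the standby reports Standby.
gsql -d postgres -p 26000 -c "select local_role, db_state from pg_stat_get_stream_replications();"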
Simulating the split-brain
On the primary, mogdb2, check the network interfaces and then take down enp0s8, the 192.168.56.x interface that carries the primary/standby traffic:
[root@mogdb2 mogha]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 08:00:27:32:ad:72 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic enp0s3
valid_lft 78213sec preferred_lft 78213sec
inet6 fe80::16f5:b021:40d7:590b/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 08:00:27:a3:00:d2 brd ff:ff:ff:ff:ff:ff
inet 192.168.56.181/24 brd 192.168.56.255 scope global noprefixroute enp0s8
valid_lft forever preferred_lft forever
inet6 fe80::1481:4086:efcb:617b/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[root@mogdb2 mogha]# ifdown enp0s8
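ifdown removes the 192.168.56.181 address entirely. If you want to keep the address configured and only cut traffic to the standby, a firewall rule is an alternative fault injection (a sketch, assuming root on mogdb2):

# Drop all traffic to/from the standby instead of downing the NIC:
iptables -A INPUT -s 192.168.56.180 -j DROP
iptables -A OUTPUT -d 192.168.56.180 -j DROP
# Undo after the test:
iptables -D INPUT -s 192.168.56.180 -j DROP
iptables -D OUTPUT -d 192.168.56.180 -j DROP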
Check the MogHA logs on the primary and standby:
--mogdb1:
2023-05-06 16:37:34,270 ERROR [standby.py:34]: not found primary in cluster
2023-05-06 16:37:34,638 WARNING [client.py:88]: maybe mogha on node 192.168.56.181 not started, please check
2023-05-06 16:37:34,638 ERROR [standby.py:243]: get standbys failed from primary 192.168.56.181. err: [10004] connection error: request /db/standbys failed, errs: {'192.168.56.181': <ConnectError '[10004] connection error: HTTPConnectionPool(host='192.168.56.181', port=8081): Max retries exceeded with url: /db/standbys (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3890897e50>: Failed to establish a new connection: [Errno 113] No route to host'))'>}
2023-05-06 16:37:39,648 INFO [__init__.py:90]: ping result: {'192.168.56.1': True, '192.168.56.181': False}
2023-05-06 16:37:39,692 INFO [__init__.py:100]: local instance is alive Standby, state: Need repair
2023-05-06 16:37:40,840 INFO [standby.py:78]: check if primary lost by ping: {'192.168.56.181': False, '192.168.56.1': True}
2023-05-06 16:37:40,841 INFO [standby.py:93]: primary lost check :1s
2023-05-06 16:37:43,843 INFO [standby.py:78]: check if primary lost by ping: {'192.168.56.181': False, '192.168.56.1': True}
2023-05-06 16:37:43,843 INFO [standby.py:93]: primary lost check :4s
2023-05-06 16:37:46,847 INFO [standby.py:78]: check if primary lost by ping: {'192.168.56.181': False, '192.168.56.1': True}
2023-05-06 16:37:46,847 INFO [standby.py:93]: primary lost check :7s
2023-05-06 16:37:49,858 INFO [standby.py:78]: check if primary lost by ping: {'192.168.56.181': False, '192.168.56.1': True}
2023-05-06 16:37:49,859 INFO [standby.py:93]: primary lost check :10s
2023-05-06 16:37:50,863 ERROR [standby.py:300]: primary lost confirmed
2023-05-06 16:37:50,863 INFO [standby.py:191]: start failover...
2023-05-06 16:37:50,914 INFO [standby.py:197]: current lsn: (1,21/9A852568)
2023-05-06 16:37:54,003 INFO [standby.py:203]: [2023-05-06 16:37:50.951][1172][][gs_ctl]: gs_ctl failover ,datadir is /opt/mogdb/data
[2023-05-06 16:37:50.951][1172][][gs_ctl]: failover term (1)
[2023-05-06 16:37:50.954][1172][][gs_ctl]: waiting for server to failover...
...[2023-05-06 16:37:54.002][1172][][gs_ctl]: done
[2023-05-06 16:37:54.002][1172][][gs_ctl]: failover completed (/opt/mogdb/data)
2023-05-06 16:37:54,029 INFO [standby.py:216]: alter system set most_available_sync on
2023-05-06 16:37:54,044 INFO [standby.py:219]: confirm switch to primary, mount vip
2023-05-06 16:37:54,045 INFO [standby.py:222]: failover success
2023-05-06 16:37:54,045 INFO [standby.py:135]: write primary info to /mogha/mogha/primary_info
2023-05-06 16:37:54,045 INFO [standby.py:141]: write primary info success
2023-05-06 16:37:59,051 INFO [__init__.py:90]: ping result: {'192.168.56.1': True, '192.168.56.181': False}
2023-05-06 16:37:59,110 INFO [__init__.py:100]: local instance is alive Primary, state: Normal
2023-05-06 16:38:03,128 WARNING [client.py:88]: maybe mogha on node 192.168.56.181 not started, please check
2023-05-06 16:38:03,143 ERROR [primary.py:189]: not found any sync backup instance. []
2023-05-06 16:38:08,163 INFO [__init__.py:90]: ping result: {'192.168.56.1': True, '192.168.56.181': False}
2023-05-06 16:38:08,211 INFO [__init__.py:100]: local instance is alive Primary, state: Normal
--mogdb2:
2023-05-06 16:37:35,719 ERROR [__init__.py:55]: failed to get host1 status, err: [20002] heartbeat error: request /node/status failed, errs: {'192.168.56.180': "HTTPConnectionPool(host='192.168.56.180', port=8081): Read timed out. (read timeout=60)"}
2023-05-06 16:37:40,757 INFO [__init__.py:100]: local instance is alive Primary, state: Normal
2023-05-06 16:37:46,850 ERROR [primary.py:189]: not found any sync backup instance. []
2023-05-06 16:37:50,880 INFO [__init__.py:90]: ping result: {'192.168.56.1': True, '192.168.56.180': True}
2023-05-06 16:37:56,758 INFO [__init__.py:100]: local instance is alive Primary, state: Normal
2023-05-06 16:38:02,822 ERROR [primary.py:130]: other primaries found: ['192.168.56.180']
2023-05-06 16:38:08,875 ERROR [primary.py:130]: other primaries found: ['192.168.56.180']
2023-05-06 16:38:09,884 INFO [primary.py:286]: real primary is local instance: 192.168.56.181
2023-05-06 16:38:13,901 INFO [__init__.py:90]: ping result: {'192.168.56.1': True, '192.168.56.180': True}
2023-05-06 16:38:18,949 INFO [__init__.py:100]: local instance is alive Primary, state: Normal
2023-05-06 16:38:25,034 ERROR [primary.py:130]: other primaries found: ['192.168.56.180']
2023-05-06 16:38:31,101 ERROR [primary.py:130]: other primaries found: ['192.168.56.180']
2023-05-06 16:38:32,109 INFO [primary.py:286]: real primary is local instance: 192.168.56.181
2023-05-06 16:38:36,120 INFO [__init__.py:90]: ping result: {'192.168.56.1': True, '192.168.56.180': True}
As the logs show, the standby automatically promoted itself to primary while the original primary still holds the primary role as well: a split-brain has occurred.
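A quick way to confirm the dual-primary state is to ask both nodes for their role from a host that can reach them. A sketch, assuming a database user with remote access configured in pg_hba.conf (the 'monitor' user here is hypothetical); during the split-brain both IPs answer Primary:

for ip in 192.168.56.180 192.168.56.181; do
  echo -n "$ip: "
  # -t prints tuples only, so each node answers with just its role
  gsql -h $ip -p 26000 -d postgres -U monitor -t -c "select local_role from pg_stat_get_stream_replications();"
done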
Repair steps
The split-brain repair procedure follows the official documentation: https://docs.mogdb.io/zh/mogdb/v3.1/primary-and-standby-management
Note: stop MogHA while repairing, otherwise MogHA will keep trying to restart the instance automatically.
[omm@mogdb2 ~]$ sudo systemctl stop mogha
[omm@mogdb2 ~]$ gs_ctl stop -D /opt/mogdb/data/
[2023-05-06 16:44:44.199][23979][][gs_ctl]: gs_ctl stopped ,datadir is /opt/mogdb/data
waiting for server to shut down................ done
server stopped
[omm@mogdb2 ~]$ ps -ef | grep mogdb
omm 23982 22047 0 16:44 pts/0 00:00:00 grep --color=auto mogdb
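MogHA is also running on mogdb1; stopping it there as well for the duration of the repair is an extra precaution (not part of the original run), so that neither side intervenes:

# Run on mogdb1 as root; same rationale as on mogdb2 above.
systemctl stop mogha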
Start mogdb2 in standby mode:
[omm@mogdb2 ~]$ gs_ctl start -D /opt/mogdb/data/ -M standby
[2023-05-06 16:45:50.529][23983][][gs_ctl]: gs_ctl started,datadir is /opt/mogdb/data
[2023-05-06 16:45:50.575][23983][][gs_ctl]: waiting for server to start...
.0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
0 LOG: [Alarm Module]Host Name: mogdb2
0 LOG: [Alarm Module]Host IP: 192.168.56.181
0 LOG: [Alarm Module]Cluster Name: mogdb_cluster
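Before refreshing the configuration, you can verify that mogdb2 really came up as a standby (gs_ctl query prints the instance's HA state; until the rebuild below completes, db_state will typically show Need repair, as it did in the mogdb1 log earlier):

[omm@mogdb2 ~]$ gs_ctl query -D /opt/mogdb/data/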
Re-save the primary/standby information:
[2023-05-06 16:46:36] [omm@mogdb2 ~]$ gs_om -t refreshconf
[2023-05-06 16:47:01] [GAUSS-50204] : Failed to read /opt/mogdb/app/bin/cluster_static_config. Error:
[2023-05-06 16:47:02] The content is not correct.
The refresh step reported an error. Investigation showed it was caused by the PTK version: this version has a bug, and upgrading PTK fixes it:
[root@mogdb1 mogha]# ptk self upgrade -V 0.7.4
INFO[2023-05-06T16:50:53.551] current version: 0.7.0 release, target version: v0.7.4
INFO[2023-05-06T16:50:53.551] download package from http://cdn-mogdb.enmotech.com/ptk/v0.7.4/ptk_0.7.4_linux_x86_64.tar.gz
INFO[2023-05-06T16:50:53.551] downloading ptk_0.7.4_linux_x86_64.tar.gz ...
> ptk_0.7.4_linux_x86_64.tar....: 6.08 MiB / 6.27 MiB [----------------------------------------------------------------------->__] 96.96% 3.95 MiB p/s ETA 0s
> ptk_0.7.4_linux_x86_64.tar....: 6.27 MiB / 6.27 MiB [---------------------------------------------------------------------------] 100.00% 4.81 MiB p/s 1.5s
INFO[2023-05-06T16:50:55.803] upgrade ptk successfully
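To confirm the upgrade took effect (assuming the version subcommand is available in your PTK build):

[root@mogdb1 mogha]# ptk version   # should now report 0.7.4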
Distribute the updated config.yaml to both nodes:
[root@mogdb1 ~]# ptk distribute -f config.yaml -H 192.168.56.180
[root@mogdb1 ~]# ptk distribute -f config.yaml -H 192.168.56.181
Then repeat the split-brain repair steps from the official documentation:
gs_ctl stop -D /opt/mogdb/data/
gs_ctl start -D /opt/mogdb/data/ -M standby
gs_om -t refreshconf
gs_ctl build -D /opt/mogdb/data/ -b full   # Note: incremental rebuild has a bug in the current version (fix pending), so a full rebuild is used here
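Once the full rebuild completes, check that the cluster is back to Normal, for example with the same status command used in the environment description above (gs_ctl query is an additional check, not shown in the original run):

ptk cluster -n mogdb_cluster status
gs_ctl query -D /opt/mogdb/data/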