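Original author: Chen Kun (陈坤)
Database environment
OS: Kylin Linux V10 SP1
MogDB: MogDB 3.0.2
Database architecture: one primary, two standbys, two cascade standbys
PTK: 0.5.8
Problem overview
Timeline of events
2023-10-07 09:02 A customer carried out a scheduled restart of the database. After stopping the application and confirming that no application connections remained, they ran ptk cluster stop -n aaadb to shut the database down. The log shows that only the standbys and cascade standbys were stopped; the primary 172.18.40.111 was not. The ptk log is as follows.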
time=2023-10-07T09:02:02.971 level=info msg=operation: stop
time=2023-10-07T09:02:02.971 level=info msg=========================================
time=2023-10-07T09:02:02.971 level=info msg=stop db [192.168.90.112:26000] …
time=2023-10-07T09:02:04.019 level=info msg=stop db [192.168.90.112:26000] successfully
time=2023-10-07T09:02:04.019 level=info msg=stop db [172.23.40.111:26000] …
time=2023-10-07T09:02:05.107 level=info msg=stop db [172.23.40.111:26000] successfully
time=2023-10-07T09:02:05.107 level=info msg=stop db [192.168.90.111:26000] …
time=2023-10-07T09:02:06.158 level=info msg=stop db [192.168.90.111:26000] successfully
time=2023-10-07T09:02:06.158 level=info msg=stop db [172.18.40.112:26000] …
time=2023-10-07T09:02:07.204 level=info msg=stop db [172.18.40.112:26000] successfully
time=2023-10-07T09:02:07.204 level=info msg=========================================
time=2023-10-07T09:02:07.204 level=info msg=stop successfully
2023-10-07 09:07 The primary node 172.18.40.111 was then stopped with gs_om -t stop. The gs_om log is as follows:
[2023-10-07 09:07:23.571625][2169922][OmImpl.py(doStop:96)][stop][DEBUG]:Operating: Stopping.
[2023-10-07 09:07:23.571794][2169922][OmImplOLAP.py(doStopCluster:288)][stop][DEBUG]:Operating: Stopping.
[2023-10-07 09:07:23.571949][2169922][OmImplOLAP.py(doStopCluster:297)][stop][LOG]:Stopping cluster.
[2023-10-07 09:07:23.572060][2169922][OmImplOLAP.py(doStopCluster:298)][stop][LOG]:=========================================
[2023-10-07 09:07:33.480631][2169922][OmImplOLAP.py(doStopCluster:323)][stop][LOG]:Successfully stopped cluster.
[2023-10-07 09:07:33.480828][2169922][OmImplOLAP.py(doStopCluster:325)][stop][LOG]:=========================================
[2023-10-07 09:07:33.480941][2169922][OmImplOLAP.py(doStopCluster:326)][stop][LOG]:End stop cluster.
[2023-10-07 09:07:33.481058][2169922][OmImplOLAP.py(doStopCluster:327)][stop][DEBUG]:Operation succeeded: Stop.
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
------------------------------------------------------------------------------------
1 aaadb1zj 172.18.40.111 26000 6001 /data/mogdb/data P Down Manually stopped
2 aaadb2zj 172.18.40.112 26000 6002 /data/mogdb/data P Down Manually stopped
3 aaadb1pd 192.168.90.111 26000 6003 /data/mogdb/data S Down Manually stopped
4 aaadb2pd 192.168.90.112 26000 6004 /data/mogdb/data C Down Manually stopped
5 aaadb1bj 172.23.40.111 26000 6005 /data/mogdb/data C Down Manually stopped
[2023-10-07 09:07:36.923902][2170580][OmImpl.py(doStatus:259)][status][DEBUG]:Successfully obtained the cluster status.
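2023-10-07 09:18 The database was started with ptk cluster start -n aaadb. The start command was sent only to the four nodes other than the primary 172.18.40.111, and the standby 172.18.40.112 came up as the primary.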
time=2023-10-07T09:18:52.084 level=info msg=operation: start
time=2023-10-07T09:18:52.084 level=info msg=========================================
time=2023-10-07T09:18:52.084 level=info msg=start db [172.18.40.112:26000] …
time=2023-10-07T09:19:00.235 level=info msg=start db [172.18.40.112:26000] successfully
time=2023-10-07T09:19:00.235 level=info msg=start db [192.168.90.111:26000] …
time=2023-10-07T09:19:08.386 level=info msg=start db [192.168.90.111:26000] successfully
time=2023-10-07T09:19:08.386 level=info msg=start db [192.168.90.112:26000] …
time=2023-10-07T09:19:16.548 level=info msg=start db [192.168.90.112:26000] successfully
time=2023-10-07T09:19:16.548 level=info msg=start db [172.23.40.111:26000] …
time=2023-10-07T09:19:24.749 level=info msg=start db [172.23.40.111:26000] successfully
time=2023-10-07T09:19:24.937 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:30.131 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:35.325 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:40.520 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:45.713 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:50.910 level=info msg=waiting for check cluster state…
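2023-10-07 09:20 The primary 172.18.40.111 was brought up with gs_om -t start. The Beijing cascade standby (aaadb1bj), however, could not be started; gs_om reported that the host is not in the cluster.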
[2023-10-07 09:20:44.712782][18765][OmImplOLAP.py(doStartCluster:180)][start][DEBUG]:Operating: Starting.
[2023-10-07 09:20:44.713012][18765][OmImplOLAP.py(doStartCluster:189)][start][LOG]:Starting cluster.
[2023-10-07 09:20:44.713127][18765][OmImplOLAP.py(doStartCluster:190)][start][LOG]:=========================================
[2023-10-07 09:20:54.396143][18765][OmImplOLAP.py(doStartCluster:228)][start][LOG]:[SUCCESS] aaadb1zj
2023-10-07 09:20:46.454 [unknown] [unknown] localhost 23318227040448 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
[2023-10-07 09:21:03.249690][18765][OmImplOLAP.py(doStartCluster:228)][start][LOG]:[SUCCESS] aaadb2zj
2023-10-07 09:20:55.182 [unknown] [unknown] localhost 23438950308032 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
[2023-10-07 09:21:12.045905][18765][OmImplOLAP.py(doStartCluster:228)][start][LOG]:[SUCCESS] aaadb1pd
2023-10-07 09:21:04.020 [unknown] [unknown] localhost 22662416832704 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
[2023-10-07 09:21:20.988588][18765][OmImplOLAP.py(doStartCluster:228)][start][LOG]:[SUCCESS] aaadb2pd
2023-10-07 09:21:12.959 [unknown] [unknown] localhost 22727149168832 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
[2023-10-07 09:21:22.121234][18765][OmImplOLAP.py(doStartCluster:230)][start][LOG]:=========================================
[2023-10-07 09:21:22.121492][18765][gs_om(main:823)][start][ERROR]:[GAUSS-53600]: Can not start the database, the cmd is . /home/omm/.bashrc; python3 '/data/mogdb/mogdb/tool/script/local/StartInstance.py' -U omm -R /data/mogdb/app -t 300 --security-mode=off, Error:
[FAILURE] aaadb1bj:
[GAUSS-51619] : The host name [aaadb1bj] is not in the cluster.
A check of the database status showed that two standbys needed to be rebuilt and that the cascade standby aaadb1bj had not started.
[ Cluster State ]
cluster_state : Degraded
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
------------------------------------------------------------------------------------
1 aaadb1zj 172.18.40.111 26000 6001 /data/mogdb/data P Primary Normal
2 aaadb2zj 172.18.40.112 26000 6002 /data/mogdb/data P Standby Need repair(WAL)
3 aaadb1pd 192.168.90.111 26000 6003 /data/mogdb/data S Standby Need repair(WAL)
4 aaadb2pd 192.168.90.112 26000 6004 /data/mogdb/data C Cascade Normal
5 aaadb1bj 172.23.40.111 26000 6005 /data/mogdb/data C Down Manually stopped
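2023-10-07 09:23 The standbys were rebuilt one at a time with gs_ctl build. For aaadb1bj, the database was first started with gs_ctl start after logging in to that host, and once it was up the same build operation was run.
2023-10-07 10:20 All standbys had been rebuilt and the cluster state was back to normal.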
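The skipped node is already visible in the ptk log at 09:02: 172.18.40.111 never appears in a "stop db ... successfully" line. The following is a minimal sketch of how that check could be automated, assuming the log line format shown in this timeline. The function names and the hard-coded node list are illustrative only and are not part of ptk.

```python
import re

# Matches lines such as: msg=stop db [172.18.40.112:26000] successfully
DONE = re.compile(r"(stop|start) db \[([\d.]+):(\d+)\] successfully")


def handled_hosts(log_text, operation):
    """Return the set of IPs the log reports as successfully stopped/started."""
    return {m.group(2) for m in DONE.finditer(log_text) if m.group(1) == operation}


def skipped_hosts(log_text, operation, expected):
    """Return expected IPs that have no '<operation> db [...] successfully' line."""
    return sorted(set(expected) - handled_hosts(log_text, operation))


if __name__ == "__main__":
    # Excerpt of the 09:02 stop log from this incident.
    log = """\
time=2023-10-07T09:02:04.019 level=info msg=stop db [192.168.90.112:26000] successfully
time=2023-10-07T09:02:05.107 level=info msg=stop db [172.23.40.111:26000] successfully
time=2023-10-07T09:02:06.158 level=info msg=stop db [192.168.90.111:26000] successfully
time=2023-10-07T09:02:07.204 level=info msg=stop db [172.18.40.112:26000] successfully
"""
    # The five nodes of the cluster, taken from the gs_om status output.
    nodes = ["172.18.40.111", "172.18.40.112", "192.168.90.111",
             "192.168.90.112", "172.23.40.111"]
    print(skipped_hosts(log, "stop", nodes))  # ['172.18.40.111']
```

Running such a comparison after every ptk stop or start would have flagged the problem before gs_om was brought in.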
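Root cause
Fault 1: PTK skipped aaadb1zj when stopping and starting the database
The cause is that in the ptk configuration file /root/.ptk/data/aaadb/topology.yml, both aaadb1zj and aaadb2zj had the role primary. ptk 0.5.8 does not check how many primaries the topology file contains. On startup it reads the instance information from the configuration file and starts the instances in groups: primary, standby, cascade standby. Only one primary is ever handled. When the topology file records more than one primary, ptk reads both primary entries in turn and the second one overwrites the first. As a result, ptk cluster stop skipped aaadb1zj and stopped only aaadb2zj. The same happened on start: aaadb1zj was skipped again and aaadb2zj was started as the primary, which scrambled the primary/standby relationships.
To summarize, two conditions had to hold for this fault to occur:
- /root/.ptk/data/aaadb/topology.yml was misconfigured and contained two primary roles.
- ptk 0.5.8 has a defect: it does not validate the number of primaries.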
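The overwrite behaviour described above can be illustrated with a small model. This is not ptk source code; it is a hypothetical sketch in which the primary is kept in a single variable, so a second primary entry silently replaces the first. The dictionary keys, the role name "cascade" and both function names are assumptions made for the illustration.

```python
def plan_without_check(instances):
    """Group instances by role, keeping a single slot for the primary."""
    primary = None
    standbys, cascades = [], []
    for inst in instances:
        if inst["role"] == "primary":
            primary = inst          # a second primary overwrites the first
        elif inst["role"] == "standby":
            standbys.append(inst)
        else:
            cascades.append(inst)
    return ([primary] if primary else []) + standbys + cascades


def plan_with_check(instances):
    """Same grouping, but refuse a topology that does not have exactly one primary."""
    primaries = [i["host"] for i in instances if i["role"] == "primary"]
    if len(primaries) != 1:
        raise ValueError(f"expected exactly one primary, found {primaries}")
    return plan_without_check(instances)


if __name__ == "__main__":
    # Roles as they were recorded in the faulty topology.yml.
    topology = [
        {"host": "aaadb1zj", "role": "primary"},
        {"host": "aaadb2zj", "role": "primary"},   # should have been standby
        {"host": "aaadb1pd", "role": "standby"},
        {"host": "aaadb2pd", "role": "cascade"},
        {"host": "aaadb1bj", "role": "cascade"},
    ]
    print([i["host"] for i in plan_without_check(topology)])
    # aaadb1zj is missing from the printed plan
```

The model only reproduces which nodes are handled, not the order in which ptk contacts them. A count check like the one in plan_with_check is the kind of validation whose absence made the misconfiguration harmful.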
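Fault 2: gs_om skipped the cascade standby aaadb1bj when starting the database
The cause dates back to the early period after the database was installed. A switchover was performed with gs_ctl, and once it had completed, gs_om -t refreshconf was run. That command generates a cluster_dynamic_config file in app/bin/ under the MogDB software directory. When cluster_dynamic_config exists, gs_om reads the cluster configuration from it instead of from cluster_static_config. aaadb1bj was added later through a ptk scale-out, and ptk updates only the static file cluster_static_config, not cluster_dynamic_config. Consequently, the next gs_om -t start could not find the aaadb1bj host that ptk had added and failed with "The host name [aaadb1bj] is not in the cluster".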
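The file precedence described above can be modelled in a few lines. This is a simplified sketch, not gs_om code: the real configuration files are not plain host lists, and the three function names are hypothetical. Only the two file names and the precedence rule come from the analysis above.

```python
def hosts_seen_by_gs_om(files):
    """Return the host list gs_om would use: the dynamic file wins if it exists."""
    if "cluster_dynamic_config" in files:
        return files["cluster_dynamic_config"]
    return files["cluster_static_config"]


def ptk_scale_out(files, new_host):
    """Model of a ptk scale-out: only the static file learns about the new host."""
    files["cluster_static_config"].append(new_host)


def refreshconf(files):
    """Model of gs_om -t refreshconf: regenerate the dynamic file from current state."""
    files["cluster_dynamic_config"] = list(files["cluster_static_config"])


if __name__ == "__main__":
    original = ["aaadb1zj", "aaadb2zj", "aaadb1pd", "aaadb2pd"]
    files = {"cluster_static_config": list(original)}

    refreshconf(files)                  # run once after the early switchover
    ptk_scale_out(files, "aaadb1bj")    # later scale-out through ptk

    print("aaadb1bj" in hosts_seen_by_gs_om(files))  # False

    refreshconf(files)                  # refresh again after the scale-out
    print("aaadb1bj" in hosts_seen_by_gs_om(files))  # True
```

The first lookup fails because the stale dynamic file shadows the up-to-date static file; regenerating the dynamic file after the scale-out removes the mismatch.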
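Solutions
Solutions for the PTK fault:
- Upgrade ptk. From ptk v0.6.0 onward, querying the cluster status automatically updates the topology yml file, which also corrects errors in it promptly. In versions 1.0 and later, if ptk finds two primary roles in topology.yml at startup, it prompts interactively for which one to start as the primary and updates topology.yml according to the choice. Upgrading to 1.0 or later, or to the latest version, is recommended.
- Alternatively, after the standbys have been repaired by the manual build, keep the current ptk version, correct the role information in the yml file by hand, and check the contents of topology.yml before every future restart.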
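For the second option, the check of topology.yml before a restart can be scripted. The sketch below counts role entries with a regular expression so that it needs only the Python standard library. It assumes each instance has a line of the form "role: <name>", as in the topology excerpt in this article; the script and its function names are illustrative and not part of ptk.

```python
import re
import sys

# Matches "role: primary", optionally as the first key of a list item ("- role: ...").
ROLE_LINE = re.compile(r"^\s*(?:-\s+)?role:\s*['\"]?([A-Za-z_]+)")


def count_roles(topology_text):
    """Return a dict mapping each role name to the number of instances that have it."""
    counts = {}
    for line in topology_text.splitlines():
        match = ROLE_LINE.match(line)
        if match:
            role = match.group(1)
            counts[role] = counts.get(role, 0) + 1
    return counts


def has_single_primary(topology_text):
    """True only if exactly one instance is recorded as primary."""
    return count_roles(topology_text).get("primary", 0) == 1


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is an example): python3 check_topology.py /root/.ptk/data/aaadb/topology.yml
    with open(sys.argv[1], encoding="utf-8") as handle:
        text = handle.read()
    print(count_roles(text))
    if not has_single_primary(text):
        sys.exit("topology.yml does not contain exactly one primary; fix it before restarting")
```

A non-zero exit status makes it easy to put the check in front of ptk cluster stop in a maintenance script.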
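Solutions for the gs_om fault:
1. After a ptk scale-out, run gs_om -t refreshconf again to update the dynamic configuration file manually.
2. This issue is awaiting confirmation by the kernel development team and will be fixed in a later release.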
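Steps to reproduce
Install ptk 0.5.8.
Use ptk to install a one-primary, one-standby MogDB 3.0.2 environment for the simulation.
Edit the topology file so that it contains two primary roles.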
vi topology.yml
global:
  cluster_name: testdb
  user: test
  group: test
  …
db_servers:
  - inst_id: 6001
    node_id: 1
    …
    role: primary
  - inst_id: 6002
    node_id: 2
    …
    role: primary
…
Check the cluster status.
[root@mogdbt02 testdb]# ptk cluster status -n testdb
[ Cluster State ]
database_version : MogDB 3.0.2 (build 9bc79be5)
cluster_name : testdb
cluster_state : Normal
[ Datanode State ]
cluster_name | id | ip | port | user | nodename | db_role | state | upstream
---------------+------+-----------------+-------+------+----------+---------+--------+-----------
testdb | 6001 | 192.168.182.136 | 27000 | test | dn_6001 | primary | Normal | -
| 6002 | 192.168.182.137 | 27000 | test | dn_6002 | standby | Normal | -
Stop the cluster with ptk and check the cluster status. Only the standby was stopped.
[root@mogdbt02 testdb]# ptk cluster stop -n testdb
INFO[2023-10-08T11:56:56.253] operation: stop
INFO[2023-10-08T11:56:56.254] ========================================
INFO[2023-10-08T11:56:56.254] stop db [192.168.182.137:27000] …
INFO[2023-10-08T11:56:57.320] stop db [192.168.182.137:27000] successfully
INFO[2023-10-08T11:56:57.320] ========================================
INFO[2023-10-08T11:56:57.320] stop successfully
[root@mogdbt02 testdb]# ptk cluster status -n testdb
[ Cluster State ]
database_version : MogDB 3.0.2 (build 9bc79be5)
cluster_name : testdb
cluster_state : Degraded
[ Datanode State ]
cluster_name | id | ip | port | user | nodename | db_role | state | upstream
---------------+------+-----------------+-------+------+----------+-------------------+---------+-----------
testdb | 6001 | 192.168.182.136 | 27000 | test | dn_6001 | primary | Normal | -
| 6002 | 192.168.182.137 | 27000 | test | dn_6002 | primary(previous) | Stopped | -
Now stop the primary with gs_om and check again. Both nodes are stopped.
[root@mogdbt02 testdb]# su - test
Last login: Sun Oct 8 11:57:11 EDT 2023
[test@mogdbt02 ~]$ gs_om -t stop
Stopping cluster.
Successfully stopped cluster.
End stop cluster.
[test@mogdbt02 ~]$ exit
logout
[root@mogdbt02 testdb]# ptk cluster status -n testdb
[ Cluster State ]
database_version : MogDB 3.0.2 (build 9bc79be5)
cluster_name : testdb
cluster_state : Stopped
[ Datanode State ]
cluster_name | id | ip | port | user | nodename | db_role | state | upstream
---------------+------+-----------------+-------+------+----------+-------------------+---------+-----------
testdb | 6001 | 192.168.182.136 | 27000 | test | dn_6001 | primary(previous) | Stopped | -
| 6002 | 192.168.182.137 | 27000 | test | dn_6002 | primary(previous) | Stopped | -
Start the cluster with ptk. Only the host 192.168.182.137, the original standby, is brought up, and it is now the primary.
[root@mogdbt02 testdb]# ptk cluster start -n testdb
INFO[2023-10-08T12:08:43.806] operation: start
INFO[2023-10-08T12:08:43.806] ========================================
INFO[2023-10-08T12:08:43.806] start db [192.168.182.137:27000] …
INFO[2023-10-08T12:08:46.073] start db [192.168.182.137:27000] successfully
INFO[2023-10-08T12:08:46.212] waiting for check cluster state…
INFO[2023-10-08T12:08:51.347] waiting for check cluster state…
INFO[2023-10-08T12:08:56.484] waiting for check cluster state…
INFO[2023-10-08T12:09:01.622] waiting for check cluster state…
INFO[2023-10-08T12:09:06.765] waiting for check cluster state…
INFO[2023-10-08T12:09:11.908] waiting for check cluster state…
cluster latest state is Degraded, please check manually
[root@mogdbt02 testdb]# ptk cluster status -n testdb
[ Cluster State ]
database_version : MogDB 3.0.2 (build 9bc79be5)
cluster_name : testdb
cluster_state : Degraded
[ Datanode State ]
cluster_name | id | ip | port | user | nodename | db_role | state | upstream
---------------+------+-----------------+-------+------+----------+-------------------+---------+-----------
testdb | 6001 | 192.168.182.136 | 27000 | test | dn_6001 | primary(previous) | Stopped | -
| 6002 | 192.168.182.137 | 27000 | test | dn_6002 | primary | Normal | -
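The problem is reproduced.
Fixing the problem by editing the topology file manually
In topology.yml, change the role of the host 192.168.182.137 back to the correct value, standby.
Then restart with ptk; the cluster now restarts normally.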




