
Troubleshooting a MogDB Restart Failure Caused by a PTK Topology File Error

由迪 2024-03-22

Original author: 陈坤

Database Environment

OS: Kylin Linux V10 SP1
MogDB: MogDB 3.0.2
Database architecture: one primary, two standbys, two cascade standbys
PTK: 0.5.8

Problem Overview

Event Timeline

2023-10-07 09:02 During scheduled restart maintenance, the customer stopped the application, confirmed there were no remaining application connections, and ran ptk cluster stop -n aaadb to shut down the database. The log shows that only the standby and cascade standby instances were stopped; the primary 172.18.40.111 was never shut down. The PTK log follows.

time=2023-10-07T09:02:02.971 level=info msg=operation: stop
time=2023-10-07T09:02:02.971 level=info msg=========================================
time=2023-10-07T09:02:02.971 level=info msg=stop db [192.168.90.112:26000] …
time=2023-10-07T09:02:04.019 level=info msg=stop db [192.168.90.112:26000] successfully
time=2023-10-07T09:02:04.019 level=info msg=stop db [172.23.40.111:26000] …
time=2023-10-07T09:02:05.107 level=info msg=stop db [172.23.40.111:26000] successfully
time=2023-10-07T09:02:05.107 level=info msg=stop db [192.168.90.111:26000] …
time=2023-10-07T09:02:06.158 level=info msg=stop db [192.168.90.111:26000] successfully
time=2023-10-07T09:02:06.158 level=info msg=stop db [172.18.40.112:26000] …
time=2023-10-07T09:02:07.204 level=info msg=stop db [172.18.40.112:26000] successfully
time=2023-10-07T09:02:07.204 level=info msg=========================================
time=2023-10-07T09:02:07.204 level=info msg=stop successfully

2023-10-07 09:07 gs_om -t stop was used to shut down the primary node 172.18.40.111. The gs_om log follows:

[2023-10-07 09:07:23.571625][2169922][OmImpl.py(doStop:96)][stop][DEBUG]:Operating: Stopping.
[2023-10-07 09:07:23.571794][2169922][OmImplOLAP.py(doStopCluster:288)][stop][DEBUG]:Operating: Stopping.
[2023-10-07 09:07:23.571949][2169922][OmImplOLAP.py(doStopCluster:297)][stop][LOG]:Stopping cluster.
[2023-10-07 09:07:23.572060][2169922][OmImplOLAP.py(doStopCluster:298)][stop][LOG]:=========================================
[2023-10-07 09:07:33.480631][2169922][OmImplOLAP.py(doStopCluster:323)][stop][LOG]:Successfully stopped cluster.
[2023-10-07 09:07:33.480828][2169922][OmImplOLAP.py(doStopCluster:325)][stop][LOG]:=========================================
[2023-10-07 09:07:33.480941][2169922][OmImplOLAP.py(doStopCluster:326)][stop][LOG]:End stop cluster.
[2023-10-07 09:07:33.481058][2169922][OmImplOLAP.py(doStopCluster:327)][stop][DEBUG]:Operation succeeded: Stop.
[ Cluster State ]

cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip port instance state
------------------------------------------------------------------------------------
1 aaadb1zj 172.18.40.111 26000 6001 /data/mogdb/data P Down Manually stopped
2 aaadb2zj 172.18.40.112 26000 6002 /data/mogdb/data P Down Manually stopped
3 aaadb1pd 192.168.90.111 26000 6003 /data/mogdb/data S Down Manually stopped
4 aaadb2pd 192.168.90.112 26000 6004 /data/mogdb/data C Down Manually stopped
5 aaadb1bj 172.23.40.111 26000 6005 /data/mogdb/data C Down Manually stopped
[2023-10-07 09:07:36.923902][2170580][OmImpl.py(doStatus:259)][status][DEBUG]:Successfully obtained the cluster status.

2023-10-07 09:18 ptk cluster start -n aaadb was used to start the database. The start command was only sent to the three instances other than the primary node 172.18.40.111, and the former standby 172.18.40.112 was now running as the primary.

time=2023-10-07T09:18:52.084 level=info msg=operation: start
time=2023-10-07T09:18:52.084 level=info msg=========================================
time=2023-10-07T09:18:52.084 level=info msg=start db [172.18.40.112:26000] …
time=2023-10-07T09:19:00.235 level=info msg=start db [172.18.40.112:26000] successfully
time=2023-10-07T09:19:00.235 level=info msg=start db [192.168.90.111:26000] …
time=2023-10-07T09:19:08.386 level=info msg=start db [192.168.90.111:26000] successfully
time=2023-10-07T09:19:08.386 level=info msg=start db [192.168.90.112:26000] …
time=2023-10-07T09:19:16.548 level=info msg=start db [192.168.90.112:26000] successfully
time=2023-10-07T09:19:16.548 level=info msg=start db [172.23.40.111:26000] …
time=2023-10-07T09:19:24.749 level=info msg=start db [172.23.40.111:26000] successfully
time=2023-10-07T09:19:24.937 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:30.131 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:35.325 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:40.520 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:45.713 level=info msg=waiting for check cluster state…
time=2023-10-07T09:19:50.910 level=info msg=waiting for check cluster state…

2023-10-07 09:20 gs_om -t start was used to bring up the primary 172.18.40.111. The Beijing cascade standby, however, could not be started and failed with an error saying that the host is not in the cluster.

[2023-10-07 09:20:44.712782][18765][OmImplOLAP.py(doStartCluster:180)][start][DEBUG]:Operating: Starting.
[2023-10-07 09:20:44.713012][18765][OmImplOLAP.py(doStartCluster:189)][start][LOG]:Starting cluster.
[2023-10-07 09:20:44.713127][18765][OmImplOLAP.py(doStartCluster:190)][start][LOG]:=========================================
[2023-10-07 09:20:54.396143][18765][OmImplOLAP.py(doStartCluster:228)][start][LOG]:[SUCCESS] aaadb1zj
2023-10-07 09:20:46.454 [unknown] [unknown] localhost 23318227040448 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
[2023-10-07 09:21:03.249690][18765][OmImplOLAP.py(doStartCluster:228)][start][LOG]:[SUCCESS] aaadb2zj
2023-10-07 09:20:55.182 [unknown] [unknown] localhost 23438950308032 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
[2023-10-07 09:21:12.045905][18765][OmImplOLAP.py(doStartCluster:228)][start][LOG]:[SUCCESS] aaadb1pd
2023-10-07 09:21:04.020 [unknown] [unknown] localhost 22662416832704 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
[2023-10-07 09:21:20.988588][18765][OmImplOLAP.py(doStartCluster:228)][start][LOG]:[SUCCESS] aaadb2pd
2023-10-07 09:21:12.959 [unknown] [unknown] localhost 22727149168832 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
[2023-10-07 09:21:22.121234][18765][OmImplOLAP.py(doStartCluster:230)][start][LOG]:=========================================
[2023-10-07 09:21:22.121492][18765][gs_om(main:823)][start][ERROR]:[GAUSS-53600]: Can not start the database, the cmd is . /home/omm/.bashrc; python3 ‘/data/mogdb/mogdb/tool/script/local/StartInstance.py’ -U omm -R /data/mogdb/app -t 300 --security-mode=off, Error:
[FAILURE] aaadb1bj:
[GAUSS-51619] : The host name [aaadb1bj] is not in the cluster.

Checking the database status showed that the two standbys needed a rebuild (gs_ctl build) and that the cascade standby had not started.

[ Cluster State ]

cluster_state : Degraded
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip port instance state
------------------------------------------------------------------------------------
1 aaadb1zj 172.18.40.111 26000 6001 /data/mogdb/data P Primary Normal
2 aaadb2zj 172.18.40.112 26000 6002 /data/mogdb/data P Standby Need repair(WAL)
3 aaadb1pd 192.168.90.111 26000 6003 /data/mogdb/data S Standby Need repair(WAL)
4 aaadb2pd 192.168.90.112 26000 6004 /data/mogdb/data C Cascade Normal
5 aaadb1bj 172.23.40.111 26000 6005 /data/mogdb/data C Down Manually stopped
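
For reference, the cluster view above is the kind of output produced by gs_om's status query; the exact invocation below is an assumption, not taken from the original logs.

gs_om -t status --detail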

2023-10-07 09:23 Started running gs_ctl build on the standbys one by one. Logged in to aaadb1bj and started that instance with gs_ctl start; once it was open, the same build was run there as well.
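
The rebuild boils down to a couple of gs_ctl calls. The following is a minimal sketch based on standard MogDB/openGauss gs_ctl usage: the data directory and the omm user come from the logs above, while the -b full build mode is an assumption (an incremental build may also be possible).

# run as omm on each standby that reports "Need repair(WAL)"; -b full is an assumed choice
gs_ctl build -D /data/mogdb/data -b full

# on the cascade standby aaadb1bj: start the instance first, then rebuild it the same way
gs_ctl start -D /data/mogdb/data
gs_ctl build -D /data/mogdb/data -b full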

2023-10-07 10:20 All standbys finished rebuilding and the cluster state returned to normal.

Root Cause

Fault 1: PTK skipped aaadb1zj when stopping and starting the database

The cause of this fault is that in the PTK configuration file /root/.ptk/data/aaadb/topology.yml, both aaadb1zj and aaadb2zj had the role primary. PTK 0.5.8 does not validate how many primaries the topology file declares. When starting or stopping, it reads the instance information from the file, groups the instances into primary, standby, and cascade standby, and handles each group separately, but it only ever keeps a single primary. With two primary entries in the topology file, the second entry read overwrites the first, so the cluster stop skipped aaadb1zj and only stopped aaadb2zj. The same happened at start time: aaadb1zj was skipped again and aaadb2zj was started as the primary, which scrambled the primary/standby relationships.

In summary, two conditions had to hold for this fault to occur:

  • The /root/.ptk/data/aaadb/topology.yml file was misconfigured, with two primary roles.
  • PTK 0.5.8 has this defect: it does not validate the number of primaries in the topology file.

Fault 2: gs_om skipped the cascade standby aaadb1bj when starting the database

The cause of this fault goes back to the initial deployment: a switchover had been performed with gs_ctl, and after the switch, gs_om -t refreshconf was run. This command generates a cluster_dynamic_config file under app/bin/ in the MogDB software directory. When cluster_dynamic_config exists, gs_om reads the cluster configuration from it instead of from cluster_static_config. Later, when aaadb1bj was added by a PTK scale-out, PTK only updated the static file cluster_static_config and did not touch cluster_dynamic_config. As a result, the next gs_om -t start could not find the PTK-added host aaadb1bj and failed with the error hostname aaadb1bj is not in cluster.
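
A quick way to confirm which configuration file gs_om will read is to look for the dynamic file under the application directory. The path below is an assumption derived from the /data/mogdb/app directory visible in the gs_om logs above.

# assumed location under the app directory's bin/; if the dynamic file exists,
# gs_om prefers it over cluster_static_config
ls -l /data/mogdb/app/bin/cluster_dynamic_config /data/mogdb/app/bin/cluster_static_config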

Solutions

Solution for the PTK fault:

  1. Upgrade PTK. From PTK v0.6.0, the status query automatically refreshes the topology yml, which also repairs errors in the topology file in time. From version 1.0 onward, if PTK finds two primary roles in topology.yml at startup, it interactively prompts you to choose one to start as the primary and updates topology.yml according to that choice. Upgrading to 1.0 or later, ideally the latest release, is recommended.
  2. After repairing the standbys with the manual build this time, keep the current PTK version, correct the role information in the yml by hand, and check the contents of topology.yml before every future restart (see the check after this list).
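
A minimal pre-restart check could look like the following; the grep pattern assumes the role field is written exactly as in the topology excerpt shown later in this article.

# the count must be exactly 1; any other value means topology.yml needs to be corrected first
grep -c "role: primary" /root/.ptk/data/aaadb/topology.yml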

Solution for the gs_om fault:

  1. After every PTK scale-out, run gs_om -t refreshconf again to manually bring the dynamic configuration file up to date (see the command after this list).
  2. This behavior is pending confirmation by kernel R&D and is expected to be fixed in a later version.
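
The refresh itself is a single command, run as the cluster owner; it regenerates the dynamic configuration from the current static configuration, as described in the root-cause analysis above.

# rebuild cluster_dynamic_config so it includes nodes added by the PTK scale-out
gs_om -t refreshconf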

Reproduction Steps

Install PTK 0.5.8.

Use PTK to install a one-primary, one-standby MogDB 3.0.2 environment for the simulation.

Edit the topology file so that both instances have the role primary:

vi topology.yml

global:
  cluster_name: testdb
  user: test
  group: test

db_servers:
  - inst_id: 6001
    node_id: 1
    role: primary
  - inst_id: 6002
    node_id: 2
    role: primary

Check the cluster state:

[root@mogdbt02 testdb]# ptk cluster status -n testdb
[ Cluster State ]
database_version : MogDB 3.0.2 (build 9bc79be5)
cluster_name : testdb
cluster_state : Normal

[ Datanode State ]
cluster_name | id | ip | port | user | nodename | db_role | state | upstream
---------------+------+-----------------+-------+------+----------+---------+--------+-----------
testdb | 6001 | 192.168.182.136 | 27000 | test | dn_6001 | primary | Normal | -
| 6002 | 192.168.182.137 | 27000 | test | dn_6002 | standby | Normal | -

Stop the cluster with PTK and check the cluster state: only the standby has been stopped.

[root@mogdbt02 testdb]# ptk cluster stop -n testdb
INFO[2023-10-08T11:56:56.253] operation: stop
INFO[2023-10-08T11:56:56.254] ========================================
INFO[2023-10-08T11:56:56.254] stop db [192.168.182.137:27000] …
INFO[2023-10-08T11:56:57.320] stop db [192.168.182.137:27000] successfully
INFO[2023-10-08T11:56:57.320] ========================================
INFO[2023-10-08T11:56:57.320] stop successfully
[root@mogdbt02 testdb]# ptk cluster status -n testdb
[ Cluster State ]
database_version : MogDB 3.0.2 (build 9bc79be5)
cluster_name : testdb
cluster_state : Degraded

[ Datanode State ]
cluster_name | id | ip | port | user | nodename | db_role | state | upstream
---------------+------+-----------------+-------+------+----------+-------------------+---------+-----------
testdb | 6001 | 192.168.182.136 | 27000 | test | dn_6001 | primary | Normal | -
| 6002 | 192.168.182.137 | 27000 | test | dn_6002 | primary(previous) | Stopped | -

Now stop the primary with gs_om and check again: both nodes are now stopped.

[root@mogdbt02 testdb]# su - test
Last login: Sun Oct 8 11:57:11 EDT 2023
[test@mogdbt02 ~]$ gs_om -t stop
Stopping cluster.

Successfully stopped cluster.

End stop cluster.
[test@mogdbt02 ~]$ exit
logout
[root@mogdbt02 testdb]# ptk cluster status -n testdb
[ Cluster State ]
database_version : MogDB 3.0.2 (build 9bc79be5)
cluster_name : testdb
cluster_state : Stopped

[ Datanode State ]
cluster_name | id | ip | port | user | nodename | db_role | state | upstream
---------------+------+-----------------+-------+------+----------+-------------------+---------+-----------
testdb | 6001 | 192.168.182.136 | 27000 | test | dn_6001 | primary(previous) | Stopped | -
| 6002 | 192.168.182.137 | 27000 | test | dn_6002 | primary(previous) | Stopped | -

Start the cluster with PTK: only host 192.168.182.137, the original standby, is brought up, and it is now running as the primary.

[root@mogdbt02 testdb]# ptk cluster start -n testdb
INFO[2023-10-08T12:08:43.806] operation: start
INFO[2023-10-08T12:08:43.806] ========================================
INFO[2023-10-08T12:08:43.806] start db [192.168.182.137:27000] …
INFO[2023-10-08T12:08:46.073] start db [192.168.182.137:27000] successfully
INFO[2023-10-08T12:08:46.212] waiting for check cluster state…
INFO[2023-10-08T12:08:51.347] waiting for check cluster state…
INFO[2023-10-08T12:08:56.484] waiting for check cluster state…
INFO[2023-10-08T12:09:01.622] waiting for check cluster state…
INFO[2023-10-08T12:09:06.765] waiting for check cluster state…
INFO[2023-10-08T12:09:11.908] waiting for check cluster state…
cluster latest state is Degraded, please check manually
[root@mogdbt02 testdb]# ptk cluster status -n testdb
[ Cluster State ]
database_version : MogDB 3.0.2 (build 9bc79be5)
cluster_name : testdb
cluster_state : Degraded

[ Datanode State ]
cluster_name | id | ip | port | user | nodename | db_role | state | upstream
---------------+------+-----------------+-------+------+----------+-------------------+---------+-----------
testdb | 6001 | 192.168.182.136 | 27000 | test | dn_6001 | primary(previous) | Stopped | -
| 6002 | 192.168.182.137 | 27000 | test | dn_6002 | primary | Normal | -

This reproduces the problem.

Fixing the problem by editing the topology file manually

Manually change the role of host 192.168.182.137 in topology.yml to the correct value, standby.
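
The corrected db_servers section then looks like this (only the fields shown in the excerpt above are listed):

db_servers:
  - inst_id: 6001
    node_id: 1
    role: primary
  - inst_id: 6002
    node_id: 2
    role: standby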

Then restart with PTK; this time the cluster restarts normally.
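
The restart uses the same PTK commands as before; this time both nodes show up in the stop and start output:

ptk cluster stop -n testdb
ptk cluster start -n testdb
ptk cluster status -n testdb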
