Quickly Restoring a Standby Node
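If a master fails and the failover command is executed, the slave is promoted to master. To give the new master a new slave, there are two options:
Option 1: use the append command to add a brand-new slave node. This requires a data synchronization to bring the slave in line with the master.
Option 2: add the old master, which was removed from the cluster, back to the cluster as a slave node.
Because the old master already holds a large amount of data, the second option needs only a small amount of data synchronization after the node rejoins the cluster, so it takes little time. This is exactly what the rewind feature does: it restores a standby node quickly. See the rewind chapter for details.
Note: to use the rewind feature, wal_log_hints and full_page_writes must both be set to on on the datanodes.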
SET DATANODE ALL(wal_log_hints=on, full_page_writes=on);
REWIND DATANODE SLAVE dm1;
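Before running a rewind, it can help to confirm that both settings are really in effect on a datanode. The following is a minimal sketch, not part of AntDB: it assumes the datanode's configuration file follows the usual postgresql.conf syntax and that the defaults match stock PostgreSQL (wal_log_hints off, full_page_writes on). The function name `settings_not_on` and the sample text are hypothetical, and only the common spellings of "on" are recognized.

```python
# Stock PostgreSQL defaults for the two settings (assumed unchanged in AntDB).
DEFAULTS = {"wal_log_hints": "off", "full_page_writes": "on"}
# Common spellings of a true boolean value; a simplification of the full rules.
TRUTHY = {"on", "true", "yes", "1"}


def settings_not_on(conf_text):
    """Return the rewind-related settings that are not effectively 'on'."""
    effective = dict(DEFAULTS)
    for raw in conf_text.splitlines():
        # Drop comments and blank lines (naive: ignores '#' inside quotes).
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # The file format allows both "name = value" and "name value".
        if "=" in line:
            name, value = line.split("=", 1)
        else:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            name, value = parts
        name = name.strip().lower()
        if name in effective:
            # A later line overrides an earlier one.
            effective[name] = value.strip().strip("'").lower()
    return sorted(n for n, v in effective.items() if v not in TRUTHY)


# Hypothetical configuration text used only for illustration.
sample = """
# hypothetical datanode configuration
wal_log_hints = off
full_page_writes = on
wal_log_hints = 'on'   # later entry wins
"""
print(settings_not_on(sample))
print(settings_not_on("port = 52531"))
```

An empty list means both settings are on. A file that never mentions wal_log_hints is reported as missing it, because the assumed default is off.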
Example:
antdb=# ADD DATANODE SLAVE dn1_1 for dn1_3 (host=adb01,port=52531,path='/data/antdb/data/adb50/d1/dn1_1');
ADD NODE
antdb=# REWIND DATANODE SLAVE dn1_1;
NOTICE: pg_ctl restart datanode slave "dn1_1"
NOTICE: 10.21.20.175, pg_ctl restart -D /data/antdb/data/adb50/d1/dn1_1 -Z datanode -m fast -o -i -w -c -l /data/antdb/data/adb50/d1/dn1_1/logfile
NOTICE: wait max 90 seconds to check datanode slave "dn1_1" running normal
NOTICE: pg_ctl stop datanode slave "dn1_1" with fast mode
NOTICE: 10.21.20.175, pg_ctl stop -D /data/antdb/data/adb50/d1/dn1_1 -Z datanode -m fast -o -i -w -c
NOTICE: wait max 90 seconds to check datanode slave "dn1_1" stop complete
NOTICE: update gtmcoord master "gcn1" pg_hba.conf for the rewind node dn1_1
NOTICE: update gtmcoord slave "gcn2" pg_hba.conf for the rewind node dn1_1
NOTICE: update datanode master "dn1_3" pg_hba.conf for the rewind node dn1_1
NOTICE: update datanode slave "dn1_2" pg_hba.conf for the rewind node dn1_1
NOTICE: on datanode master "dn1_3" execute "checkpoint"
NOTICE: 10.21.20.175, /data/antdb/app/adb50/bin/pg_controldata '/data/antdb/data/adb50/d2/dn1_3' | grep 'Minimum recovery ending location:' |awk '{print $5}'
NOTICE: receive msg: {"result":"0/0"}
NOTICE: 10.21.20.175, /data/antdb/app/adb50/bin/pg_controldata '/data/antdb/data/adb50/d2/dn1_3' |grep 'Min recovery ending loc' |awk '{print $6}'
NOTICE: receive msg: {"result":"0"}
NOTICE: 10.21.20.175, adb_rewind --target-pgdata /data/antdb/data/adb50/d1/dn1_1 --source-server='host=10.21.20.175 port=52533 user=antdb dbname=postgres' -T dn1_1 -S dn1_3
NOTICE: receive msg: servers diverged at WAL location 0/40001B0 on timeline 1
rewinding from last common checkpoint at 0/4000140 on timeline 1
Done!
NOTICE: refresh mastername of datanode slave "dn1_1" in the node table
NOTICE: set parameters in postgresql.conf of datanode slave "dn1_1"
NOTICE: refresh recovery.conf of datanode slave "dn1_1"
NOTICE: pg_ctl start -Z datanode -D /data/antdb/data/adb50/d1/dn1_1 -o -i -w -c -l /data/antdb/data/adb50/d1/dn1_1/logfile
NOTICE: 10.21.20.175, pg_ctl start -Z datanode -D /data/antdb/data/adb50/d1/dn1_1 -o -i -w -c -l /data/antdb/data/adb50/d1/dn1_1/logfile
NOTICE: refresh datanode master "dn1_3" synchronous_standby_names='1 (dn1_2,dn1_1)'
mgr_failover_manual_rewind_func
---------------------------------
t
(1 row)
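The adb_rewind message in the output above gives two WAL locations: the point where the two servers diverged and the last common checkpoint. As a rough illustration of how close the two histories are, the sketch below converts both locations to byte positions and subtracts them. It assumes the locations use the PostgreSQL LSN notation (two hexadecimal numbers, the first being the high 32 bits). The helper names are hypothetical, and the result is only an indicator, not the amount of data the rewind actually copies.

```python
import re

LSN = r"([0-9A-Fa-f]+/[0-9A-Fa-f]+)"


def parse_lsn(text):
    """Convert an LSN such as '0/40001B0' to a byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def rewind_gap(message):
    """Bytes between the last common checkpoint and the divergence point."""
    diverged = re.search(r"diverged at WAL location " + LSN, message)
    checkpoint = re.search(r"last common checkpoint at " + LSN, message)
    if not diverged or not checkpoint:
        raise ValueError("message does not contain both WAL locations")
    return parse_lsn(diverged.group(1)) - parse_lsn(checkpoint.group(1))


# The two message lines are copied from the rewind output in this example.
output = (
    "servers diverged at WAL location 0/40001B0 on timeline 1\n"
    "rewinding from last common checkpoint at 0/4000140 on timeline 1\n"
)
print(rewind_gap(output))  # 112
```

Here the divergence point is only 0x70 (112) bytes past the last common checkpoint, which fits the claim that rejoining the old master needs little synchronization.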
antdb=# list node dn1_3;
 name  | host  |      type       | mastername | port  | sync_state |              path               | initialized | incluster | readonly | zone
-------+-------+-----------------+------------+-------+------------+---------------------------------+-------------+-----------+----------+-------
 dn1_3 | adb01 | datanode master |            | 52533 |            | /data/antdb/data/adb50/d2/dn1_3 | t           | t         | f        | local
(1 row)
antdb=# list node dn1_1;
 name  | host  |      type      | mastername | port  | sync_state |              path               | initialized | incluster | readonly | zone
-------+-------+----------------+------------+-------+------------+---------------------------------+-------------+-----------+----------+-------
 dn1_1 | adb01 | datanode slave | dn1_3      | 52531 | potential  | /data/antdb/data/adb50/d1/dn1_1 | t           | t         | f        | local
(1 row)
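The final check in the example is done by eye: the list node output should show dn1_1 as a datanode slave whose mastername is dn1_3. For a scripted check, the same test could look like the sketch below. It assumes the output has been captured as text in the pipe-separated layout shown above. The function names `parse_list_node` and `is_slave_of` are hypothetical and not part of adbmgr.

```python
def parse_list_node(text):
    """Parse psql-style 'list node' output into a list of row dicts."""
    # Header and data lines contain '|'; the separator line uses '+' instead.
    lines = [ln for ln in text.splitlines() if "|" in ln]
    if not lines:
        return []
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [cell.strip() for cell in ln.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def is_slave_of(row, master):
    """True if the row describes an in-cluster datanode slave of `master`."""
    return (
        row.get("type") == "datanode slave"
        and row.get("mastername") == master
        and row.get("incluster") == "t"
    )


# Output copied from the "list node dn1_1" command in this example.
listing = """
name | host | type | mastername | port | sync_state | path | initialized | incluster | readonly | zone
-------+-------+----------------+------------+-------+------------+----------------------------------+-------------+-----------+----------+-------
dn1_1 | adb01 | datanode slave | dn1_3 | 52531 | potential | /data/antdb/data/adb50/d1/dn1_1 | t | t | f | local
(1 row)
"""
rows = parse_list_node(listing)
print(is_slave_of(rows[0], "dn1_3"))
```

The sync_state column is deliberately left out of the check, since the example shows the value potential for a node that has rejoined successfully.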
To reiterate: in the latest version, if the self-healing doctor is configured, none of the operations above require manual intervention.