PostgreSQL高可用之Repmgr自动failover

原创多米爸比 2020-02-05

5348

repmgr搭建参考上一篇 https://www.modb.pro/db/15358

1.每台服务器修改repmgr文件,增加以下内容

$ vi /opt/repmgr.conf
failover='automatic'
connection_check_type=ping
reconnect_attempts=6
reconnect_interval=10 
#如果主节点失败，则要进行切换的工作，并记录到日志
promote_command='/opt/pgsql/bin/repmgr standby promote -f /opt/repmgr.conf --log-to-file'
#如果有连接到主节点的其他从节点，会根据命令连接到新主的节点
follow_command='/opt/pgsql/bin/repmgr standby follow -f /opt/repmgr.conf --log-to-file --upstream-node-id=%n'

2.每台服务器postgresql.conf增加配置

$ vi data6000/postgresql.conf
shared_preload_libraries = 'repmgr'

3.每台服务器启动repmgrd守护进程

$ /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
--通过事件记录查看repmgrd的开启情况
[postgres@ha1 ~]$  /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster event --event=repmgrd_start
 Node ID | Name  | Event         | OK | Timestamp           | Details                                               
---------+-------+---------------+----+---------------------+--------------------------------------------------------
 1       | node1 | repmgrd_start | t  | 2020-02-02 22:33:30 | monitoring cluster primary "node1" (ID: 1)            
 2       | node2 | repmgrd_start | t  | 2020-02-02 22:33:25 | monitoring connection to upstream node "node1" (ID: 1)

--查看repmgr和repmgrd进程
[postgres@ha1 ~]$ ps f -u postgres
  PID TTY      STAT   TIME COMMAND
13124 pts/0    S      0:00 -bash
13389 pts/0    R+     0:00  \_ ps f -u postgres
13191 ?        S      0:00 /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
 3247 ?        Ss     0:00 /opt/pg12/bin/postgres -D /opt/data6000
 3272 ?        Ss     0:00  \_ postgres: logger   
 3680 ?        Ss     0:00  \_ postgres: checkpointer   
 3681 ?        Ss     0:00  \_ postgres: background writer   
 3682 ?        Ss     0:00  \_ postgres: walwriter   
 3683 ?        Ss     0:00  \_ postgres: autovacuum launcher   
 3684 ?        Ss     0:00  \_ postgres: archiver   
 3685 ?        Ss     0:00  \_ postgres: stats collector   
 3686 ?        Ss     0:00  \_ postgres: logical replication launcher  
13184 ?        Ss     0:00  \_ postgres: repmgr repmgr 192.168.99.202(41551) idle
13189 ?        Ss     0:00  \_ postgres: repmgr repmgr 192.168.99.200(49369) idle
13289 ?        Ss     0:00  \_ postgres: walsender repmgr 192.168.99.202(41735) streaming 0/17003308

[postgres@ha2 ~]$ ps f -u postgres
  PID TTY      STAT   TIME COMMAND
 3926 pts/0    S      0:00 -bash
 4056 pts/0    R+     0:00  \_ ps f -u postgres
 3950 ?        S      0:00 /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
 3241 ?        Ss     0:00 /opt/pg12/bin/postgres -D /opt/data6000
 3259 ?        Ss     0:00  \_ postgres: logger   
 3260 ?        Ss     0:00  \_ postgres: startup   recovering 000000070000000000000017
 3674 ?        Ss     0:00  \_ postgres: checkpointer   
 3675 ?        Ss     0:00  \_ postgres: background writer   
 3676 ?        Ss     0:00  \_ postgres: stats collector   
 3948 ?        Ss     0:00  \_ postgres: repmgr repmgr 192.168.99.202(62092) idle
 4002 ?        Ss     0:00  \_ postgres: walreceiver   streaming 0/17003308

4.关闭主节点HA1

[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop
waiting for server to shut down.... done
server stopped

--查看备库HA2日志，可以看到1 分钟后，无法连接到主库HA1，程序已经开始了切库的服务
[postgres@ha2 ~]$ cat /opt/data6000/repmgr.log
...
[2020-02-03 11:25:34] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state
[2020-02-03 11:25:36] [WARNING] unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2"
[2020-02-03 11:25:36] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-03 11:25:36] [WARNING] unable to connect to upstream node "node1" (ID: 1)
[2020-02-03 11:25:36] [INFO] checking state of node 1, 1 of 6 attempts
[2020-02-03 11:25:36] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:25:36] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-03 11:25:36] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:25:46] [INFO] checking state of node 1, 2 of 6 attempts
[2020-02-03 11:25:46] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:25:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-03 11:25:46] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:25:56] [INFO] checking state of node 1, 3 of 6 attempts
[2020-02-03 11:25:56] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:25:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-03 11:25:56] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:26:06] [INFO] checking state of node 1, 4 of 6 attempts
[2020-02-03 11:26:06] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:26:06] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-03 11:26:06] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:26:16] [INFO] checking state of node 1, 5 of 6 attempts
[2020-02-03 11:26:16] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:26:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-03 11:26:16] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:26:26] [INFO] checking state of node 1, 6 of 6 attempts
[2020-02-03 11:26:26] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:26:26] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-03 11:26:26] [WARNING] unable to reconnect to node 1 after 6 attempts
[2020-02-03 11:26:26] [INFO] 0 active sibling nodes registered
[2020-02-03 11:26:26] [INFO] primary and this node have the same location ("default")
[2020-02-03 11:26:26] [INFO] no other sibling nodes - we win by default
[2020-02-03 11:26:26] [NOTICE] this node is the only available candidate and will now promote itself
[2020-02-03 11:26:26] [INFO] promote_command is:
  "/opt/pgsql/bin/repmgr standby promote -f /opt/repmgr.conf --log-to-file"
[2020-02-03 11:26:26] [NOTICE] redirecting logging output to "/opt/data6000/repmgr.log"

[2020-02-03 11:26:26] [NOTICE] promoting standby to primary
[2020-02-03 11:26:26] [DETAIL] promoting server "node2" (ID: 2) using pg_promote()
[2020-02-03 11:26:26] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2020-02-03 11:26:27] [NOTICE] STANDBY PROMOTE successful
[2020-02-03 11:26:27] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2020-02-03 11:26:27] [INFO] 0 followers to notify
[2020-02-03 11:26:27] [INFO] switching to primary monitoring mode
[2020-02-03 11:26:27] [NOTICE] monitoring cluster primary "node2" (ID: 2)

--HA2查看集群状态，HA2已经promote成功
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                        
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 14       | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)

5.HA1(故障服务未启动)重新加入集群

[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2'
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node's recovery point: 0/20000028; rejoin target node's fork point: 0/200000A0
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/opt/pgsql/bin/pg_ctl  -w -D '/opt/data6000' start"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

--HA1查看集群状态
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                        
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 13       | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 14       | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 
 --备库HA1执行checkpoint确保时间基线一致
 [postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres
postgres=# delete from t;
ERROR:  cannot execute DELETE in a read-only transaction
postgres=# checkpoint;
CHECKPOINT
postgres=# \q
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                        
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 14       | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 14       | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2

从上面可以看出通过node rejoin将刚才已经关闭的主库再次上线，并以备库的身份连到新主，加入集群成功。

6.再次测试

[postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres
postgres=# insert into t values(5,'from node2');
INSERT 0 1
postgres=# select * from t;
 id |             info              
----+-------------------------------
  1 | 2020-02-02 21:11:51.66735+08
  2 | 2020-02-02 22:05:21.757823+08
  3 | 2020-02-02 22:07:03.546043+08
  4 | 2020-02-03 11:03:42.050227+08
  5 | from node2
(5 rows)

--关闭主库HA2
[postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop

--1分钟后查看备库HA1,可以看到已经切换为主库
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                        
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 15       | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | - failed  |          | default  | 100      | ?        | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node2" (ID: 2)

--此时HA2先不直接使用node rejoin

--HA1主库插入一些数据，并执行checkpoint
[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres
postgres=# insert into t values(6,'from node1');
INSERT 0 1
postgres=# select * from t;
 id |             info              
----+-------------------------------
  1 | 2020-02-02 21:11:51.66735+08
  2 | 2020-02-02 22:05:21.757823+08
  3 | 2020-02-02 22:07:03.546043+08
  4 | 2020-02-03 11:03:42.050227+08
  5 | from node2
  6 | from node1
(6 rows)
postgres=# checkpoint;
CHECKPOINT


--HA2手工启动服务
[postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ start &
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
 ID | Name  | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string                                                        
----+-------+---------+----------------------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | node1 | standby | ! running as primary |          | default  | 100      | 15       | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running            |          | default  | 100      | 14       | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
  - node "node1" (ID: 1) is registered as standby but running as primary

--HA2插入一些数据，并执行checkpoint
[postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres
postgres=# insert into t values(7,'from node2');
INSERT 0 1
postgres=# select * from t;
 id |             info              
----+-------------------------------
  1 | 2020-02-02 21:11:51.66735+08
  2 | 2020-02-02 22:05:21.757823+08
  3 | 2020-02-02 22:07:03.546043+08
  4 | 2020-02-03 11:03:42.050227+08
  5 | from node2
  7 | from node2
(6 rows)

--HA2关闭服务
[postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop

--HA2执行node rejoin
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2'
ERROR: this node cannot attach to rejoin target node 1
DETAIL: rejoin target server's timeline 15 forked off current database system timeline 14 before current recovery point 0/22000028
HINT: use --force-rewind to execute pg_rewind

[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 15 forked off current database system timeline 14 before current recovery point 0/22000028
ERROR: --force-rewind specified but pg_rewind cannot be used
DETAIL: "wal_log_hints" is set to "off" and data checksums are disabled

从上面可以看出node rejoin操作失败，因为两边库的数据已经不一致。

7.增加wal_log_hints并恢复环境

[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf primary unregister --node-id=2
INFO: node "node2" (ID: 2) was successfully unregistered

[postgres@ha1 ~]$ vi /opt/data6000/postgresql.conf
wal_log_hints=on

[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ restart &

[postgres@ha2 ~]$ rm -rf /opt/archive6000/*
[postgres@ha2 ~]$ rm -rf /opt/data6000/*

[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -h 192.168.99.200 -p6000 -U repmgr -d repmgr -f /opt/repmgr.conf standby clone

[postgres@ha2 ~]$ vi /opt/data6000/postgresql.conf 
primary_conninfo='host=192.168.99.200 port=6000 user=repuser'

[postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ start &

[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf standby register

[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                        
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 15       | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 15       | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 
 [postgres@ha1 ~]$ > /opt/data6000/repmgr.log
 [postgres@ha2 ~]$ > /opt/data6000/repmgr.log
 
 [postgres@ha1 ~]$ /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
 [postgres@ha2 ~]$ /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf

8.再次测试

[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres
postgres=# insert into t values(101,'from node1 '||now());
INSERT 0 1
postgres=# select * from t;
 id  |                   info                   
-----+------------------------------------------
   1 | 2020-02-02 21:11:51.66735+08
   2 | 2020-02-02 22:05:21.757823+08
   3 | 2020-02-02 22:07:03.546043+08
   4 | 2020-02-03 11:03:42.050227+08
   5 | from node2
   6 | from node1
 100 | from node1 2020-02-03 12:20:27.444394+08
 101 | from node1 2020-02-03 12:33:26.365621+08
(8 rows)

--关闭主库HA1
[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop

--1分钟后查看备库HA2,可以看到已经切换为主库
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                        
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 16       | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)

--HA2新主库插入一些数据，并执行checkpoint
[postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres
postgres=# insert into t values(200,'from node2 '||now());
INSERT 0 1
postgres=# select * from t;
 id  |                   info                   
-----+------------------------------------------
   1 | 2020-02-02 21:11:51.66735+08
   2 | 2020-02-02 22:05:21.757823+08
   3 | 2020-02-02 22:07:03.546043+08
   4 | 2020-02-03 11:03:42.050227+08
   5 | from node2
   6 | from node1
 100 | from node1 2020-02-03 12:20:27.444394+08
 101 | from node1 2020-02-03 12:33:26.365621+08
 200 | from node2 2020-02-03 12:39:04.007665+08
(9 rows)
postgres=# checkpoint;
CHECKPOINT

--HA1手工启动服务
[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ start &
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
 ID | Name  | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string                                                        
----+-------+---------+----------------------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | node1 | primary | * running            |          | default  | 100      | 15       | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby | ! running as primary |          | default  | 100      | 16       | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
  - node "node2" (ID: 2) is registered as standby but running as primary

--HA1插入一些数据，并执行checkpoint
[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres
postgres=# insert into t values(300,'from node1 '||now());
INSERT 0 1
postgres=# select * from t;
 id  |                   info                   
-----+------------------------------------------
   1 | 2020-02-02 21:11:51.66735+08
   2 | 2020-02-02 22:05:21.757823+08
   3 | 2020-02-02 22:07:03.546043+08
   4 | 2020-02-03 11:03:42.050227+08
   5 | from node2
   6 | from node1
 100 | from node1 2020-02-03 12:20:27.444394+08
 101 | from node1 2020-02-03 12:33:26.365621+08
 300 | from node1 2020-02-03 12:42:21.438444+08
(9 rows)

--HA1关闭服务
[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop

--HA1执行node rejoin
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 16 forked off current database system timeline 15 before current recovery point 0/27000028
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/opt/pgsql/bin/pg_rewind -D '/opt/data6000' --source-server='host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2'"
pg_rewind: servers diverged at WAL location 0/260000A0 on timeline 15
pg_rewind: rewinding from last common checkpoint at 0/26000028 on timeline 15
pg_rewind: Done!
NOTICE: 0 files copied to /opt/data6000
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/opt/pgsql/bin/pg_ctl  -w -D '/opt/data6000' start"
WARNING: unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

--HA1查看集群状态
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                        
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 15       | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 16       | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2

--HA1执行checkpoint或者服务restart,timeline保持一致

--验证数据
[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres
postgres=# select * from t;
 id  |                   info                   
-----+------------------------------------------
   1 | 2020-02-02 21:11:51.66735+08
   2 | 2020-02-02 22:05:21.757823+08
   3 | 2020-02-02 22:07:03.546043+08
   4 | 2020-02-03 11:03:42.050227+08
   5 | from node2
   6 | from node1
 100 | from node1 2020-02-03 12:20:27.444394+08
 101 | from node1 2020-02-03 12:33:26.365621+08
 200 | from node2 2020-02-03 12:39:04.007665+08
(9 rows)
postgres=# delete from t;
ERROR:  cannot execute DELETE in a read-only transaction

[postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres
postgres=# select * from t;
 id  |                   info                   
-----+------------------------------------------
   1 | 2020-02-02 21:11:51.66735+08
   2 | 2020-02-02 22:05:21.757823+08
   3 | 2020-02-02 22:07:03.546043+08
   4 | 2020-02-03 11:03:42.050227+08
   5 | from node2
   6 | from node1
 100 | from node1 2020-02-03 12:20:27.444394+08
 101 | from node1 2020-02-03 12:33:26.365621+08
 200 | from node2 2020-02-03 12:39:04.007665+08
(9 rows)
postgres=# insert into t values(400,'from node2 '||now());
INSERT 0 1

从上面测试可以看出加上wal_log_hints后，主备库数据如果不一致时仍然可以使用node rejoin操作成功，新主库上线至旧主库rejoin之间时间段内，旧主库的写数据会丢失。

常见错误1

[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d’host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2’
ERROR: this node cannot attach to rejoin target node 2
DETAIL: rejoin target server’s timeline 8 forked off current database system timeline 7 before current recovery point 0/1A000028
HINT: use --force-rewind to execute pg_rewind
–分析：主备库时间线不一致

常见错误2

[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d’host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2’ --force-rewind
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server’s timeline 8 forked off current database system timeline 7 before current recovery point 0/1A000028
ERROR: --force-rewind specified but pg_rewind cannot be used
DETAIL: “wal_log_hints” is set to “off” and data checksums are disabled
–分析：数据库未开启wal_log_hints,数据目录初始化时也未打开checksums