repmgr测试

原创郭鑫 2020-03-01

4196

1 概述
repmgr是一套开源工具，用于管理PostgreSQL服务器集群中的复制管理和故障转移，它扩展了PostgresSQL内建的hot-standby能力，可以监控复制和执行管理任务（自动故障转移，手工切换）。
repmgr提供两个主要工具：
repmgr是一个用于执行管理任务的命令行工具，主要用来设置备用服务器，切换主服务器和备服务器，显示复制群集中的服务器状态。
repmgrd 是一个守护程序，它主动监视复制集群中的服务器并监控和记录复制性能，通过检测主服务器的故障并选择最合适的备用服务器来执行自动主备库切换。
2 repmgr安装
2.1 源码编译安装PostgreSQL
主备库ip如下：
197.0.1.104（主） HWFBSB5
197.0.1.105（备） HWFBSB6
需要在主库安装11.3版本PostgresSQL数据库，备库可仅安装数据库软件，不用建库，此步过程略过。
2.2 源码编译安装repmgr工具
以下操作需要在主备库两个节点执行
postgres@HWFBSB6:/>cd /pgsql/
postgres@HWFBSB6:/pgsql>unzip repmgr-master.zip
postgres@HWFBSB6:/pgsql>cd repmgr-master
postgres@HWFBSB6:/pgsql/repmgr-master>time ./configure --prefix=/pgsql/pg113 && make install
两节点下验证编译安装：
postgres@HWFBSB6:/pgsql/repmgr-master> repmgr --version
repmgr 4.5dev
postgres@HWFBSB6:/pgsql/repmgr-master> repmgrd --version
repmgrd 4.5dev
可以看到repmgr工具已经编译安装完成了。

2.3 两台主机配置免密互信
分别在主备库执行如下命令
ssh-keygen -t rsa
主库拷贝秘钥文件到备库：
ssh-copy-id postgres@197.0.1.105
备库拷贝秘钥文件到主库：
ssh-copy-id postgres@197.0.1.104
主库连接备库验证互信：
ssh 197.0.1.105 hostname
备库连接主库验证互信：
ssh 197.0.1.104 hostname

2.4 主库修改配置文件
在主库197.0.1.104上修改postgresql.conf配置文件
postgres@HWFBSB5:/pgdata/pg107> vi /pgdata/pg113/postgresql.conf
修改如下参数：
max_wal_senders = 10
wal_level = ‘hot_standby’
hot_standby = on
archive_mode = on
archive_command = ‘/bin/true’

配置pg_hba.conf
postgres@HWFBSB5:/pgdata/pg107> vi /pgdata/pg113/pg_hba.conf
在文件末尾添加如下：
local repmgr repmgr trust
host repmgr repmgr 127.0.0.1/32 trust
host repmgr repmgr 197.0.1.105/32 trust
host repmgr repmgr 197.0.1.104/32 trust
host postgres repmgr 197.0.1.105/32 trust

local replication repmgr trust
host replication repmgr 127.0.0.1/32 trust
host replication repmgr 197.0.1.105/32 trust
host replication repmgr 197.0.1.104/32 trust

备注：md5是常用的密码认证方式，较安全，且不需建立同名的操作系统用户，trust是只要知道数据库用户名就不需要密码或ident就能登录。

2.5 主库创建repmgr用户和repmgr数据库
在主库postgres用户家目录下执行如下创建用户和数据库命令
postgres@HWFBSB5:~> createuser -s repmgr
postgres@HWFBSB5:~> createdb repmgr -O repmgr
postgres@HWFBSB5:~> psql -c “alter user repmgr with password ‘repmgr’”
ALTER ROLE

利用备库远程登录主库验证：
postgres@HWFBSB6:/pgdata> psql ‘host=197.0.1.104 user=repmgr dbname=repmgr connect_timeout=2’
psql (11.3)
Type “help” for help.

2.6 主库创建repmgr.conf配置文件
postgres@HWFBSB5:~> pwd
/home/postgres
postgres@HWFBSB5:~> cat repmgr.conf
cluster=test
node_id=1
node_name=HWFBSB5
conninfo=‘host=HWFBSB5 user=repmgr dbname=repmgr’
data_directory=/pgdata/pg113

2.7 主库修改repmgr的搜索路径
postgres=# ALTER USER repmgr SET search_path TO repmgr_test, “$user”, public;
ALTER ROLE

2.8 主库节点初始化
postgres@HWFBSB5:~> repmgr -f repmgr.conf master register
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
INFO: connecting to primary database…
NOTICE: attempting to install extension “repmgr”
NOTICE: “repmgr” extension successfully installed
NOTICE: primary node record (ID: 1) registered

2.9 备库节点创建repmgr.conf
postgres@HWFBSB6:~> pwd
/home/postgres
postgres@HWFBSB6:~> cat repmgr.conf
cluster=test
node_id=2
node_name=HWFBSB6
conninfo=‘host=HWFBSB6 user=repmgr dbname=repmgr’
data_directory=/pgdata/pg113

2.10 克隆备库
postgres@HWFBSB6:~> repmgr -h 197.0.1.104 -U repmgr -d repmgr -D /pgdata/pg113 -f /home/postgres/repmgr.conf standby clone
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
NOTICE: destination directory “/pgdata/pg113” provided
INFO: connecting to source node
DETAIL: connection string is: host=197.0.1.104 user=repmgr dbname=repmgr
DETAIL: current installation size is 30 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: creating directory “/pgdata/pg113”…
NOTICE: starting backup (using pg_basebackup)…
HINT: this may take some time; consider using the -c/–fast-checkpoint option
INFO: executing:
pg_basebackup -l “repmgr base backup” -D /pgdata/pg113 -h 197.0.1.104 -p 5432 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /pgdata/pg113 start
HINT: after starting the server, you need to register this standby with “repmgr standby register”
启动备库
postgres@HWFBSB6:~>pg_ctl -D /pgdata/pg113 -l /pgdata/pg113/server.log start

2.11 备库节点初始化
postgres@HWFBSB6:~> repmgr -f repmgr.conf standby register
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
INFO: connecting to local node “HWFBSB6” (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node “HWFBSB6” (ID: 2) successfully registered

3 附录一（repmgr相关测试）
3.1 服务命令设置
repmgr命令附加选项说明
–dry-run
预检查选项，显示将要执行的命令，记录将要执行命令的日志，但不会实际执行命令。

-action
显示操作命令（启动，停止，重启，切换等），如果使用–list-actions参数，则输出具体的命令。

–checkpoint
在停止或重新启动节点之前发出CHECKPOINT。

例：
postgres@HWFBSB5:~> repmgr -f /home/postgres/repmgr.conf node service --list-actions
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
Following commands would be executed for each action:

 start: "pg_ctl  -w -D '/pgdata/pg113' start"
  stop: "pg_ctl  -D '/pgdata/pg113' -W -m fast stop"

restart: “pg_ctl -w -D ‘/pgdata/pg113’ restart”
reload: “pg_ctl -w -D ‘/pgdata/pg113’ reload”
promote: “pg_ctl -w -D ‘/pgdata/pg113’ promote”

postgres@HWFBSB5:~> repmgr -f /home/postgres/repmgr.conf node service --action=restart --checkpoint --dry-run
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
INFO: a CHECKPOINT would be issued here
INFO: would execute server command “pg_ctl -w -D ‘/pgdata/pg113’ restart”
3.2 使用repmgr执行switch over切换

– siblings-follow
自动指定所有备库节点的新同步源，即新主库。

以下命令为执行测试切换，并不实际更改任意节点状态
postgres@HWFBSB6:~> repmgr -f repmgr.conf standby switchover --siblings-follow --dry-run
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
NOTICE: checking switchover on node “HWFBSB6” (ID: 2) in --dry-run mode
INFO: SSH connection to host “HWFBSB5” succeeded
INFO: able to execute “repmgr” on remote host “localhost”
WARNING: option “–sibling-nodes” specified, but no sibling nodes exist
INFO: 1 walsenders required, 10 available
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: local node “HWFBSB6” (ID: 2) would be promoted to primary; current primary “HWFBSB5” (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node “HWFBSB5”:
“/pgsql/pg113/bin/pg_ctl -D ‘/pgdata/pg113’ -W -m fast stop”
INFO: prerequisites for executing STANDBY SWITCHOVER are met

正式进行主备切换：
postgres@HWFBSB6:~> repmgr -f repmgr.conf standby switchover
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
NOTICE: executing switchover on node “HWFBSB6” (ID: 2)
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
NOTICE: local node “HWFBSB6” (ID: 2) will be promoted to primary; current primary “HWFBSB5” (ID: 1) will be demoted to standby
NOTICE: stopping current primary node “HWFBSB5” (ID: 1)
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
NOTICE: issuing CHECKPOINT
DETAIL: executing server command “/pgsql/pg113/bin/pg_ctl -D ‘/pgdata/pg113’ -W -m fast stop”
INFO: checking for primary shutdown; 1 of 60 attempts (“shutdown_check_timeout”)
INFO: checking for primary shutdown; 2 of 60 attempts (“shutdown_check_timeout”)
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
NOTICE: current primary has been cleanly shut down at location 0/6000028
NOTICE: promoting standby to primary
DETAIL: promoting server “HWFBSB6” (ID: 2) using “/pgsql/pg113/bin/pg_ctl -w -D ‘/pgdata/pg113’ promote”
waiting for server to promote… done
server promoted
NOTICE: waiting up to 60 seconds (parameter “promote_check_timeout”) for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server “HWFBSB6” (ID: 2) was successfully promoted to primary
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node’s recovery point: 0/6000028; rejoin target node’s fork point: 0/6000098
NOTICE: setting node 1’s upstream to node 2
WARNING: unable to ping “host=HWFBSB5 user=repmgr dbname=repmgr”
DETAIL: PQping() returned “PQPING_NO_RESPONSE”
NOTICE: starting server using “/pgsql/pg113/bin/pg_ctl -w -D ‘/pgdata/pg113’ start”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 1 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 6 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 11 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 16 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 21 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 26 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 31 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 36 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 41 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 46 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 51 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
INFO: waiting for node “HWFBSB5” (ID: 1) to connect to new primary; 56 of max 60 attempts
DETAIL: checking for record in node “HWFBSB6”'s “pg_stat_replication” table where “application_name” is “HWFBSB5”
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
NOTICE: switchover was successful
DETAIL: node “HWFBSB6” is now primary and node “HWFBSB5” is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

postgres@HWFBSB5:~> repmgr -f repmgr.conf standby register --force
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
INFO: connecting to local node “HWFBSB5” (ID: 1)
INFO: connecting to primary database
INFO: standby registration complete
NOTICE: standby node “HWFBSB5” (ID: 1) successfully registered

WARNING: following issues were detected

unable to connect to node “HWFBSB5” (ID: 1)
node “HWFBSB5” (ID: 1) is registered as an active primary but is unreachable
unable to connect to node “HWFBSB6” (ID: 2)'s upstream node “HWFBSB5” (ID: 1)
unable to determine if node “HWFBSB6” (ID: 2) is attached to its upstream node “HWFBSB5” (ID: 1)
备库强制提升为主库：
postgres@HWFBSB6:~> repmgr -f repmgr.conf standby promote
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
NOTICE: promoting standby to primary
DETAIL: promoting server “HWFBSB6” (ID: 2) using “/pgsql/pg113/bin/pg_ctl -w -D ‘/pgdata/pg113’ promote”
waiting for server to promote… done
server promoted
NOTICE: waiting up to 60 seconds (parameter “promote_check_timeout”) for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server “HWFBSB6” (ID: 2) was successfully promoted to primary

WARNING: following issues were detected

unable to connect to node “HWFBSB5” (ID: 1)
启动原主库：
postgres@HWFBSB5:~> pg_ctl -D /pgdata/pg113 -l /pgdata/pg113/server.log start
waiting for server to start… done
server started

WARNING: following issues were detected

node “HWFBSB6” (ID: 2) is registered as standby but running as primary
新主库查看主备库状态：
postgres@HWFBSB6:~> repmgr -f repmgr.conf cluster show
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----±--------±--------±----------±---------±---------±---------±---------±---------------------------------------
1 | HWFBSB5 | primary | ! running | | default | 100 | 1 | host=HWFBSB5 user=repmgr dbname=repmgr
2 | HWFBSB6 | primary | * running | | default | 100 | 2 | host=HWFBSB6 user=repmgr dbname=repmgr

WARNING: following issues were detected

node “HWFBSB5” (ID: 1) is running but the repmgr node record is inactive
可以看到从主备库查看的状态都带有警告，这时如果要把剔除掉的原主库作为备库状态重新加入，需要repmgr node rejoin操作

首先关闭原主库：
postgres@HWFBSB5:~> pg_ctl -D /pgdata/pg113 -l /pgdata/pg113/server.log stop
waiting for server to shut down… done
server stopped

使用repmgr node rejoin重新添加原主库
附加参数说明如下：
–force-rewind重新同步两个服务器（相当于增量恢复）
–verbose查看输出的详细信息
进行预执行命令：
postgres@HWFBSB5:~> repmgr -f repmgr.conf node rejoin -d ‘host=197.0.1.105 user=repmgr dbname=repmgr’ --force-rewind --dry-run --verbose
NOTICE: using provided configuration file “repmgr.conf”
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 6719751146104172248
INFO: prerequisites for using pg_rewind are met
INFO: 0 files would have been copied to “/tmp/repmgr-config-archive-HWFBSB5”
INFO: temporary archive directory “/tmp/repmgr-config-archive-HWFBSB5” deleted
INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is:
/pgsql/pg113/bin/pg_rewind -D ‘/pgdata/pg113’ --source-server=‘host=HWFBSB6 user=repmgr dbname=repmgr’
INFO: prerequisites for executing NODE REJOIN are met

正式执行repmgr node rejoin重新添加原主库命令：
postgres@HWFBSB5:~> repmgr -f repmgr.conf node rejoin -d ‘host=197.0.1.105 user=repmgr dbname=repmgr’ --force-rewind --verbose
NOTICE: using provided configuration file “repmgr.conf”
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
INFO: prerequisites for using pg_rewind are met
INFO: 0 files copied to “/tmp/repmgr-config-archive-HWFBSB5”
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is “/pgsql/pg113/bin/pg_rewind -D ‘/pgdata/pg113’ --source-server=‘host=HWFBSB6 user=repmgr dbname=repmgr’”
NOTICE: 0 files copied to /pgdata/pg113
INFO: directory “/tmp/repmgr-config-archive-HWFBSB5” deleted
NOTICE: setting node 1’s upstream to node 2
WARNING: unable to ping “host=HWFBSB5 user=repmgr dbname=repmgr”
DETAIL: PQping() returned “PQPING_NO_RESPONSE”
NOTICE: starting server using “/pgsql/pg113/bin/pg_ctl -w -D ‘/pgdata/pg113’ start”
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

- Shared Library Preloading -

shared_preload_libraries = ‘repmgr’ # (change requires restart)
postgres@HWFBSB5:/pgsql> vi /pgdata/pg113/postgresql.conf

- Shared Library Preloading -

shared_preload_libraries = ‘repmgr’ # (change requires restart)

修改主备库/home/postgres/repmgr.conf文件，添加如下参数：
log_level=INFO
log_status_interval=10
log_file=/home/postgres/repmgrd.log
monitoring_history=true （启用监控参数）
monitor_interval_secs=5（定义监视数据间隔写入时间参数）
reconnect_attempts=10（故障转移之前，尝试重新连接主库次数（默认为6）参数）
reconnect_interval=5（每间隔5s尝试重新连接一次参数）

如下参数为自动故障转移所需配置的重要参数：
failover=automatic
promote_command=’/pgsql/pg113/bin/repmgr standby promote -f /home/postgres/repmgr.conf --log-to-file’
follow_command=’/pgsql/pg113/bin/repmgr standby follow -f /home/postgres/repmgr.conf --log-to-file --upstream-node-id=%n’

*附加参数：
replication_lag_critical（如果备用数据库的复制延迟（已秒为单位）超过此值，则将终止切换）

备：参数配置完成后需要重启主备库使其生效

启动主备库的repmgrd：
postgres@HWFBSB6:~> repmgrd –f /home/postgres/repmgr.conf --pid-file /home/postgres/repmgrd.pid
[2019-08-12 10:23:49] [WARNING] the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
[2019-08-12 10:23:49] [NOTICE] redirecting logging output to “/home/postgres/repmgrd.log”

postgres@HWFBSB5:~> repmgrd –f /home/postgres/repmgr.conf --pid-file /home/postgres/repmgrd.pid
[2019-08-12 12:22:06] [WARNING] the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
[2019-08-12 12:22:06] [NOTICE] redirecting logging output to “/home/postgres/repmgrd.log”

主备库的日志输出信息如下：
主库：
postgres@HWFBSB6:~> more /home/postgres/repmgrd.log
[2019-08-12 10:23:49] [NOTICE] repmgrd (repmgrd 4.5dev) starting up
[2019-08-12 10:23:49] [INFO] connecting to database “host=HWFBSB6 user=repmgr dbname=repmgr”
INFO: set_repmgrd_pid(): provided pidfile is /home/postgres/repmgrd.pid
[2019-08-12 10:23:49] [NOTICE] starting monitoring of node “HWFBSB6” (ID: 2)
[2019-08-12 10:23:49] [INFO] “connection_check_type” set to “ping”
[2019-08-12 10:23:49] [NOTICE] monitoring cluster primary “HWFBSB6” (ID: 2)
[2019-08-12 10:23:49] [INFO] child node “HWFBSB5” (ID: 1) is attached
[2019-08-12 10:23:59] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
[2019-08-12 10:24:09] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
[2019-08-12 10:24:19] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
[2019-08-12 10:24:29] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
[2019-08-12 10:24:39] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
[2019-08-12 10:24:49] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
备库：
postgres@HWFBSB5:~> more /home/postgres/repmgrd.log
[2019-08-12 12:22:06] [NOTICE] repmgrd (repmgrd 4.5dev) starting up
[2019-08-12 12:22:06] [INFO] connecting to database “host=HWFBSB5 user=repmgr dbname=repmgr”
INFO: set_repmgrd_pid(): provided pidfile is /home/postgres/repmgrd.pid
[2019-08-12 12:22:06] [NOTICE] starting monitoring of node “HWFBSB5” (ID: 1)
[2019-08-12 12:22:06] [INFO] “connection_check_type” set to “ping”
[2019-08-12 12:22:06] [INFO] monitoring connection to upstream node “HWFBSB6” (ID: 2)
[2019-08-12 12:22:16] [INFO] node “HWFBSB5” (ID: 1) monitoring upstream node “HWFBSB6” (ID: 2) in normal state
[2019-08-12 12:22:16] [DETAIL] last monitoring statistics update was 5 seconds ago
[2019-08-12 12:22:26] [INFO] node “HWFBSB5” (ID: 1) monitoring upstream node “HWFBSB6” (ID: 2) in normal state

postgres@HWFBSB6:~> pg_ctl -D /pgdata/pg113 -l /pgdata/pg113/server.log stop
waiting for server to shut down… done
server stopped

主库另开一个窗口观察repmgrd日志输出:
postgres@HWFBSB6:~> tail -f /home/postgres/repmgrd.log
[2019-08-12 11:01:51] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
[2019-08-12 11:02:01] [WARNING] unable to ping “host=HWFBSB6 user=repmgr dbname=repmgr”
[2019-08-12 11:02:01] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 11:02:01] [WARNING] connection to node “HWFBSB6” (ID: 2) lost
[2019-08-12 11:02:01] [DETAIL]
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

[2019-08-12 11:02:01] [INFO] attempting to reconnect to node “HWFBSB6” (ID: 2)
[2019-08-12 11:02:01] [ERROR] connection to database failed
[2019-08-12 11:02:01] [DETAIL]
could not connect to server: Connection refused
Is the server running on host “HWFBSB6” (197.0.1.105) and accepting
TCP/IP connections on port 5432?

[2019-08-12 11:02:01] [DETAIL] attempted to connect using:
user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr
[2019-08-12 11:02:01] [WARNING] reconnection to node “HWFBSB6” (ID: 2) failed
[2019-08-12 11:02:01] [WARNING] unable to connect to local node
[2019-08-12 11:02:01] [INFO] checking state of node 2, 1 of 10 attempts
[2019-08-12 11:02:01] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 11:02:01] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 11:02:01] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 11:02:06] [INFO] checking state of node 2, 2 of 10 attempts
[2019-08-12 11:02:06] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 11:02:06] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 11:02:06] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 11:02:11] [INFO] checking state of node 2, 3 of 10 attempts
[2019-08-12 11:02:11] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 11:02:11] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”

观察备库repmgrd日志输出:
postgres@HWFBSB5:~> tail -f /home/postgres/repmgrd.log
[2019-08-12 13:00:00] [INFO] node “HWFBSB5” (ID: 1) monitoring upstream node “HWFBSB6” (ID: 2) in normal state
[2019-08-12 13:00:00] [DETAIL] last monitoring statistics update was 5 seconds ago
[2019-08-12 13:00:05] [WARNING] unable to ping “host=HWFBSB6 user=repmgr dbname=repmgr”
[2019-08-12 13:00:05] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:05] [WARNING] unable to connect to upstream node “HWFBSB6” (ID: 2)
[2019-08-12 13:00:05] [INFO] checking state of node 2, 1 of 10 attempts
[2019-08-12 13:00:05] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:05] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:05] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 13:00:10] [INFO] checking state of node 2, 2 of 10 attempts
[2019-08-12 13:00:10] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:10] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:10] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 13:00:15] [INFO] checking state of node 2, 3 of 10 attempts
[2019-08-12 13:00:15] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:15] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:15] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 13:00:20] [INFO] checking state of node 2, 4 of 10 attempts
[2019-08-12 13:00:20] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:20] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:20] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 13:00:25] [INFO] checking state of node 2, 5 of 10 attempts
[2019-08-12 13:00:25] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:25] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:25] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 13:00:30] [INFO] checking state of node 2, 6 of 10 attempts
[2019-08-12 13:00:30] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:30] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:30] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 13:00:35] [INFO] checking state of node 2, 7 of 10 attempts
[2019-08-12 13:00:35] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:35] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:35] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 13:00:40] [INFO] checking state of node 2, 8 of 10 attempts
[2019-08-12 13:00:40] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:40] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:40] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 13:00:45] [INFO] checking state of node 2, 9 of 10 attempts
[2019-08-12 13:00:45] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:45] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:45] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-12 13:00:50] [INFO] checking state of node 2, 10 of 10 attempts
[2019-08-12 13:00:50] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB6 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-12 13:00:50] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-12 13:00:50] [WARNING] unable to reconnect to node 2 after 10 attempts
[2019-08-12 13:00:50] [INFO] 0 active sibling nodes registered
[2019-08-12 13:00:50] [INFO] primary and this node have the same location (“default”)
[2019-08-12 13:00:50] [INFO] no other sibling nodes - we win by default
[2019-08-12 13:00:50] [NOTICE] this node is the only available candidate and will now promote itself
[2019-08-12 13:00:50] [INFO] promote_command is:
“/pgsql/pg113/bin/repmgr standby promote -f /home/postgres/repmgr.conf --log-to-file”
[2019-08-12 13:00:50] [WARNING] the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
[2019-08-12 13:00:50] [NOTICE] redirecting logging output to “/home/postgres/repmgrd.log”

[2019-08-12 13:00:50] [NOTICE] promoting standby to primary
[2019-08-12 13:00:50] [DETAIL] promoting server “HWFBSB5” (ID: 1) using “/pgsql/pg113/bin/pg_ctl -w -D ‘/pgdata/pg113’ promote”
[2019-08-12 13:00:50] [NOTICE] waiting up to 60 seconds (parameter “promote_check_timeout”) for promotion to complete
[2019-08-12 13:00:50] [NOTICE] STANDBY PROMOTE successful
[2019-08-12 13:00:50] [DETAIL] server “HWFBSB5” (ID: 1) was successfully promoted to primary
[2019-08-12 13:00:50] [INFO] 0 followers to notify
[2019-08-12 13:00:50] [INFO] switching to primary monitoring mode
[2019-08-12 13:00:50] [NOTICE] monitoring cluster primary “HWFBSB5” (ID: 1)
[2019-08-12 13:01:00] [INFO] monitoring primary node “HWFBSB5” (ID: 1) in normal state
[2019-08-12 13:01:10] [INFO] monitoring primary node “HWFBSB5” (ID: 1) in normal state

手动关闭主库模拟异常：
postgres@HWFBSB5:~> pg_ctl -D /pgdata/pg113 -l /pgdata/pg113/server.log stop
waiting for server to shut down… done
server stopped

主库repmgrd日志输出：
postgres@HWFBSB5:~> tail -f repmgrd.log
[2019-08-13 11:59:15] [INFO] checking state of node 1, 10 of 10 attempts
[2019-08-13 11:59:15] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB5 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-13 11:59:15] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 11:59:15] [WARNING] unable to reconnect to node 1 after 10 attempts
[2019-08-13 11:59:15] [NOTICE] unable to connect to local node, falling back to degraded monitoring
[2019-08-13 11:59:15] [WARNING] unable to ping “host=HWFBSB5 user=repmgr dbname=repmgr”
[2019-08-13 11:59:15] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 11:59:15] [ERROR] unable to determine if server is in recovery
[2019-08-13 11:59:15] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2019-08-13 11:59:15] [WARNING] unable to determine node recovery status
[2019-08-13 11:59:15] [INFO] monitoring primary node “HWFBSB5” (ID: 1) in degraded state
[2019-08-13 11:59:15] [DETAIL] waiting for the node to become available
[2019-08-13 11:59:20] [WARNING] unable to ping “host=HWFBSB5 user=repmgr dbname=repmgr”
[2019-08-13 11:59:20] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 11:59:20] [WARNING] connection to node “HWFBSB5” (ID: 1) lost
[2019-08-13 11:59:20] [DETAIL]
connection pointer is NULL

[2019-08-13 11:59:20] [INFO] attempting to reconnect to node “HWFBSB5” (ID: 1)
[2019-08-13 11:59:20] [ERROR] connection to database failed
[2019-08-13 11:59:20] [DETAIL]
could not connect to server: Connection refused
Is the server running on host “HWFBSB5” (197.0.1.104) and accepting
TCP/IP connections on port 5432?

[2019-08-13 11:59:20] [DETAIL] attempted to connect using:
user=repmgr dbname=repmgr host=HWFBSB5 connect_timeout=2 fallback_application_name=repmgr
[2019-08-13 11:59:20] [WARNING] reconnection to node “HWFBSB5” (ID: 1) failed
[2019-08-13 11:59:20] [WARNING] unable to ping “host=HWFBSB5 user=repmgr dbname=repmgr”
[2019-08-13 11:59:20] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 11:59:20] [ERROR] unable to determine if server is in recovery
[2019-08-13 11:59:20] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2019-08-13 11:59:20] [WARNING] unable to determine node recovery status
[2019-08-13 11:59:25] [WARNING] unable to ping “host=HWFBSB5 user=repmgr dbname=repmgr”
[2019-08-13 11:59:25] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 11:59:25] [WARNING] connection to node “HWFBSB5” (ID: 1) lost
[2019-08-13 11:59:25] [DETAIL]
connection pointer is NULL
备库1 repmgrd日志输出：
postgres@HWFBSB6:~> tail -f repmgrd.log
[2019-08-13 10:01:08] [INFO] checking state of node 1, 10 of 10 attempts
[2019-08-13 10:01:08] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB5 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-13 10:01:08] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 10:01:08] [WARNING] unable to reconnect to node 1 after 10 attempts
[2019-08-13 10:01:08] [INFO] 1 active sibling nodes registered
[2019-08-13 10:01:08] [INFO] primary and this node have the same location (“default”)
[2019-08-13 10:01:08] [INFO] local node’s last receive lsn: 0/E000098
[2019-08-13 10:01:08] [INFO] checking state of sibling node “JCFW1COM” (ID: 3)
[2019-08-13 10:01:08] [INFO] node “JCFW1COM” (ID: 3) reports its upstream is node 1, last seen 51 second(s) ago
[2019-08-13 10:01:08] [INFO] node 3 last saw primary node 51 second(s) ago
[2019-08-13 10:01:08] [INFO] last receive LSN for sibling node “JCFW1COM” (ID: 3) is: 0/E000098
[2019-08-13 10:01:08] [INFO] node “JCFW1COM” (ID: 3) has same LSN as current candidate “HWFBSB6” (ID: 2)
[2019-08-13 10:01:08] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds
[2019-08-13 10:01:08] [NOTICE] promotion candidate is “HWFBSB6” (ID: 2)
[2019-08-13 10:01:08] [NOTICE] this node is the winner, will now promote itself and inform other nodes
[2019-08-13 10:01:08] [INFO] promote_command is:
“/pgsql/pg113/bin/repmgr standby promote -f /home/postgres/repmgr.conf --log-to-file”
[2019-08-13 10:01:08] [WARNING] the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
[2019-08-13 10:01:08] [NOTICE] redirecting logging output to “/home/postgres/repmgrd.log”

[2019-08-13 10:01:08] [WARNING] 1 sibling nodes found, but option “–siblings-follow” not specified
[2019-08-13 10:01:08] [DETAIL] these nodes will remain attached to the current primary:
JCFW1COM (node ID: 3)
[2019-08-13 10:01:08] [NOTICE] promoting standby to primary
[2019-08-13 10:01:08] [DETAIL] promoting server “HWFBSB6” (ID: 2) using “/pgsql/pg113/bin/pg_ctl -w -D ‘/pgdata/pg113’ promote”
[2019-08-13 10:01:08] [NOTICE] waiting up to 60 seconds (parameter “promote_check_timeout”) for promotion to complete
[2019-08-13 10:01:08] [NOTICE] STANDBY PROMOTE successful
[2019-08-13 10:01:08] [DETAIL] server “HWFBSB6” (ID: 2) was successfully promoted to primary
[2019-08-13 10:01:08] [INFO] 1 followers to notify
[2019-08-13 10:01:08] [NOTICE] notifying node “JCFW1COM” (ID: 3) to follow node 2
INFO: node 3 received notification to follow node 2
[2019-08-13 10:01:08] [INFO] switching to primary monitoring mode
[2019-08-13 10:01:08] [NOTICE] monitoring cluster primary “HWFBSB6” (ID: 2)
[2019-08-13 10:01:18] [NOTICE] new standby “JCFW1COM” (ID: 3) has connected
[2019-08-13 10:01:18] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
[2019-08-13 10:01:28] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
[2019-08-13 10:01:38] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
可以看到备库1自动提升成为了主库

备库2 repmgrd日志输出：
[postgres@JCFW1COM ~]$ tail -f repmgrd.log
[2019-08-13 10:01:16] [INFO] checking state of node 1, 10 of 10 attempts
[2019-08-13 10:01:16] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB5 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-13 10:01:16] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 10:01:16] [WARNING] unable to reconnect to node 1 after 10 attempts
[2019-08-13 10:01:16] [INFO] 1 active sibling nodes registered
[2019-08-13 10:01:16] [INFO] primary and this node have the same location (“default”)
[2019-08-13 10:01:16] [INFO] local node’s last receive lsn: 0/E000098
[2019-08-13 10:01:16] [INFO] checking state of sibling node “HWFBSB6” (ID: 2)
[2019-08-13 10:01:16] [INFO] node “HWFBSB6” (ID: 2) reports its upstream is node 1, last seen 48 second(s) ago
[2019-08-13 10:01:16] [INFO] node 2 last saw primary node 48 second(s) ago
[2019-08-13 10:01:16] [INFO] last receive LSN for sibling node “HWFBSB6” (ID: 2) is: 0/E000098
[2019-08-13 10:01:16] [INFO] node “HWFBSB6” (ID: 2) has same LSN as current candidate “JCFW1COM” (ID: 3)
[2019-08-13 10:01:16] [INFO] node “HWFBSB6” (ID: 2) has same priority but lower node_id than current candidate “JCFW1COM” (ID: 3)
[2019-08-13 10:01:16] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds
[2019-08-13 10:01:16] [NOTICE] promotion candidate is “HWFBSB6” (ID: 2)
[2019-08-13 10:01:16] [INFO] follower node awaiting notification from a candidate node
[2019-08-13 10:01:18] [WARNING] the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
[2019-08-13 10:01:18] [NOTICE] redirecting logging output to “/home/postgres/repmgrd.log”

[2019-08-13 10:01:18] [INFO] local node 3 can attach to follow target node 2
[2019-08-13 10:01:18] [DETAIL] local node’s recovery point: 0/E000098; follow target node’s fork point: 0/E000098
[2019-08-13 10:01:18] [NOTICE] setting node 3’s upstream to node 2
[2019-08-13 10:01:18] [NOTICE] stopping server using “/pgsql/pg113/bin/pg_ctl -D ‘/pgdata/pg113’ -w -m fast stop”
[2019-08-13 10:01:23] [NOTICE] starting server using “/pgsql/pg113/bin/pg_ctl -w -D ‘/pgdata/pg113’ start”
[2019-08-13 10:01:23] [NOTICE] STANDBY FOLLOW successful
[2019-08-13 10:01:23] [DETAIL] standby attached to upstream node “HWFBSB6” (ID: 2)
INFO: set_repmgrd_pid(): provided pidfile is /home/postgres/repmgrd.pid
[2019-08-13 10:01:23] [NOTICE] node 3 now following new upstream node 2
[2019-08-13 10:01:23] [INFO] resuming standby monitoring mode
[2019-08-13 10:01:23] [DETAIL] following new primary “HWFBSB6” (ID: 2)
[2019-08-13 10:01:33] [INFO] node “JCFW1COM” (ID: 3) monitoring upstream node “HWFBSB6” (ID: 2) in normal state
[2019-08-13 10:01:33] [DETAIL] last monitoring statistics update was 5 seconds ago
[2019-08-13 10:01:43] [INFO] node “JCFW1COM” (ID: 3) monitoring upstream node “HWFBSB6” (ID: 2) in normal state
[2019-08-13 10:01:43] [DETAIL] last monitoring statistics update was 5 seconds ago

2：测试模拟主库网络故障下的自动failover

主库repmgrd日志输出：
postgres@HWFBSB5:~> tail -f repmgrd.log
[2019-08-13 16:53:42] [DETAIL] attempted to connect using:
user=repmgr dbname=repmgr host=HWFBSB5 connect_timeout=2 fallback_application_name=repmgr
[2019-08-13 16:53:42] [WARNING] reconnection to node “HWFBSB5” (ID: 1) failed
[2019-08-13 16:53:42] [WARNING] unable to ping “host=HWFBSB5 user=repmgr dbname=repmgr”
[2019-08-13 16:53:42] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 16:53:42] [ERROR] unable to determine if server is in recovery
[2019-08-13 16:53:42] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2019-08-13 16:53:42] [WARNING] unable to determine node recovery status
[2019-08-13 16:53:47] [WARNING] unable to ping “host=HWFBSB5 user=repmgr dbname=repmgr”
[2019-08-13 16:53:47] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 16:53:47] [WARNING] connection to node “HWFBSB5” (ID: 1) lost
[2019-08-13 16:53:47] [DETAIL]
connection pointer is NULL

[2019-08-13 16:53:47] [INFO] attempting to reconnect to node “HWFBSB5” (ID: 1)
[2019-08-13 16:53:47] [ERROR] connection to database failed
[2019-08-13 16:53:47] [DETAIL]
could not connect to server: Network is unreachable
Is the server running on host “HWFBSB5” (197.0.1.104) and accepting
TCP/IP connections on port 5432?

备库1 repmgrd日志输出：
postgres@HWFBSB6:~> tail -f repmgrd.log
[2019-08-13 13:45:23] [INFO] checking state of node 1, 10 of 10 attempts
[2019-08-13 13:45:24] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB5 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-13 13:45:24] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 13:45:24] [WARNING] unable to reconnect to node 1 after 10 attempts
[2019-08-13 13:45:24] [INFO] 1 active sibling nodes registered
[2019-08-13 13:45:24] [INFO] primary and this node have the same location (“default”)
[2019-08-13 13:45:24] [INFO] local node’s last receive lsn: 0/14001070
[2019-08-13 13:45:24] [INFO] checking state of sibling node “JCFW1COM” (ID: 3)
[2019-08-13 13:45:24] [INFO] node “JCFW1COM” (ID: 3) reports its upstream is node 1, last seen 130 second(s) ago
[2019-08-13 13:45:24] [INFO] node 3 last saw primary node 130 second(s) ago
[2019-08-13 13:45:24] [INFO] last receive LSN for sibling node “JCFW1COM” (ID: 3) is: 0/14001070
[2019-08-13 13:45:24] [INFO] node “JCFW1COM” (ID: 3) has same LSN as current candidate “HWFBSB6” (ID: 2)
[2019-08-13 13:45:24] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds
[2019-08-13 13:45:24] [NOTICE] promotion candidate is “HWFBSB6” (ID: 2)
[2019-08-13 13:45:24] [NOTICE] this node is the winner, will now promote itself and inform other nodes
[2019-08-13 13:45:24] [INFO] promote_command is:
“/pgsql/pg113/bin/repmgr standby promote -f /home/postgres/repmgr.conf --log-to-file”
[2019-08-13 13:45:24] [WARNING] the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
[2019-08-13 13:45:24] [NOTICE] redirecting logging output to “/home/postgres/repmgrd.log”

[2019-08-13 13:45:26] [WARNING] 1 sibling nodes found, but option “–siblings-follow” not specified
[2019-08-13 13:45:26] [DETAIL] these nodes will remain attached to the current primary:
JCFW1COM (node ID: 3)
[2019-08-13 13:45:26] [NOTICE] promoting standby to primary
[2019-08-13 13:45:26] [DETAIL] promoting server “HWFBSB6” (ID: 2) using “/pgsql/pg113/bin/pg_ctl -w -D ‘/pgdata/pg113’ promote”
[2019-08-13 13:45:26] [NOTICE] waiting up to 60 seconds (parameter “promote_check_timeout”) for promotion to complete
[2019-08-13 13:45:26] [NOTICE] STANDBY PROMOTE successful
[2019-08-13 13:45:26] [DETAIL] server “HWFBSB6” (ID: 2) was successfully promoted to primary
[2019-08-13 13:45:26] [INFO] 1 followers to notify
[2019-08-13 13:45:26] [NOTICE] notifying node “JCFW1COM” (ID: 3) to follow node 2
INFO: node 3 received notification to follow node 2
[2019-08-13 13:45:26] [INFO] switching to primary monitoring mode
[2019-08-13 13:45:26] [NOTICE] monitoring cluster primary “HWFBSB6” (ID: 2)
[2019-08-13 13:45:36] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state
[2019-08-13 13:45:46] [INFO] monitoring primary node “HWFBSB6” (ID: 2) in normal state

备库2 repmgrd日志输出：
[postgres@JCFW1COM ~]$ tail -f repmgrd.log
[2019-08-13 13:46:39] [INFO] checking state of node 1, 10 of 10 attempts
[2019-08-13 13:46:41] [WARNING] unable to ping “user=repmgr dbname=repmgr host=HWFBSB5 connect_timeout=2 fallback_application_name=repmgr”
[2019-08-13 13:46:41] [DETAIL] PQping() returned “PQPING_NO_RESPONSE”
[2019-08-13 13:46:41] [WARNING] unable to reconnect to node 1 after 10 attempts
[2019-08-13 13:46:41] [INFO] 1 active sibling nodes registered
[2019-08-13 13:46:41] [INFO] primary and this node have the same location (“default”)
[2019-08-13 13:46:41] [INFO] local node’s last receive lsn: 0/14001070
[2019-08-13 13:46:41] [INFO] checking state of sibling node “HWFBSB6” (ID: 2)
[2019-08-13 13:46:41] [WARNING] node “HWFBSB6” (ID: 2) is not in recovery
[2019-08-13 13:46:41] [INFO] local node 3 can attach to follow target node 2
[2019-08-13 13:46:41] [DETAIL] local node’s recovery point: 0/14001070; follow target node’s fork point: 0/14001070
[2019-08-13 13:46:41] [INFO] follower node intending to follow new primary 2
[2019-08-13 13:46:41] [WARNING] the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
[2019-08-13 13:46:41] [NOTICE] redirecting logging output to “/home/postgres/repmgrd.log”

[2019-08-13 13:46:41] [INFO] local node 3 can attach to follow target node 2
[2019-08-13 13:46:41] [DETAIL] local node’s recovery point: 0/14001070; follow target node’s fork point: 0/14001070
[2019-08-13 13:46:41] [NOTICE] setting node 3’s upstream to node 2
[2019-08-13 13:46:41] [NOTICE] stopping server using “/pgsql/pg113/bin/pg_ctl -D ‘/pgdata/pg113’ -w -m fast stop”
[2019-08-13 13:46:41] [NOTICE] starting server using “/pgsql/pg113/bin/pg_ctl -w -D ‘/pgdata/pg113’ start”
[2019-08-13 13:46:42] [NOTICE] STANDBY FOLLOW successful
[2019-08-13 13:46:42] [DETAIL] standby attached to upstream node “HWFBSB6” (ID: 2)
INFO: set_repmgrd_pid(): provided pidfile is /home/postgres/repmgrd.pid
[2019-08-13 13:46:42] [NOTICE] node 3 now following new upstream node 2
[2019-08-13 13:46:42] [INFO] resuming standby monitoring mode
[2019-08-13 13:46:42] [DETAIL] following new primary “HWFBSB6” (ID: 2)
[2019-08-13 13:46:52] [INFO] node “JCFW1COM” (ID: 3) monitoring upstream node “HWFBSB6” (ID: 2) in normal state
[2019-08-13 13:46:52] [DETAIL] last monitoring statistics update was 5 seconds ago

WARNING: following issues were detected

unable to connect to node “HWFBSB5” (ID: 1)

重新把原主库rejoin变为备库（略）
3.6 利用repmgr工具测试过程中遇到的问题
3.6.1：测试switch over时预执行命令报错

postgres@HWFBSB6:~> repmgr -f repmgr.conf standby switchover --siblings-follow --dry-run
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
NOTICE: checking switchover on node “HWFBSB6” (ID: 2) in --dry-run mode
INFO: SSH connection to host “HWFBSB5” succeeded
ERROR: unable to execute “repmgr” on “HWFBSB5”
HINT: check “pg_bindir” is set to the correct path in “repmgr.conf”; current value: (not set)

根据提示在主备库两个节点repmgr.conf配置文件添加pg_bindir参数，然后重启库使参数生效
postgres@HWFBSB6:~> cat repmgr.conf
cluster=test
node_id=2
node_name=HWFBSB6
conninfo=‘host=HWFBSB6 user=repmgr dbname=repmgr’
data_directory=/pgdata/pg113
pg_bindir=/pgsql/pg113/bin/

3.6.2：rejoin节点时执行命令报错

postgres@HWFBSB5:~> repmgr -f repmgr.conf node rejoin -d ‘host=197.0.1.105 user=repmgr dbname=repmgr’ --force-rewind --dry-run --verbose
NOTICE: using provided configuration file “repmgr.conf”
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 6719031362243540048
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server’s timeline 3 forked off current database system timeline 2 before current recovery point 0/A000028
ERROR: --force-rewind specified but pg_rewind cannot be used
DETAIL: “wal_log_hints” is set to “off” and data checksums are disabled

根据报错提示需要在新主库的postgressql.conf配置文件中设置wal_log_hints参数

postgres@HWFBSB6:~> vi /pgdata/pg113/postgresql.conf
#full_page_writes = on # recover from partial page writes
#wal_compression = off # enable compression of full-page writes
wal_log_hints = on # also do full page writes of non-critical updates

需要重启新主库使参数生效备注：建议新建库时直接设置该参数为on
4 附录二（repmgr 常用操作命令）
利用repmgr克隆备库：
repmgr -h 197.0.1.104 -U repmgr -d repmgr -D /pgdata/pg113 -f /home/postgres/repmgr.conf standby clone
利用repmgr注册主备库：
repmgr -f /home/postgres/repmgr.conf primary register
repmgr -f /home/postgres/repmgr.conf standby register
利用repmgr switch over数据库：
repmgr -f /home/postgres/repmgr.conf standby switchover --siblings-follow --dry-run
repmgr -f/home/postgres/repmgr.conf standby switchover
利用repmgr promote备库：
repmgr -f /home/postgres/repmgr.conf standby promote
利用rpmgr rejoin备用数据库：
repmgr -f /home/postgres/repmgr.conf node rejoin -d ‘host=197.0.1.105 user=repmgr dbname=repmgr’ --force-rewind --dry-run –verbose
repmgr -f /home/postgres/repmgr.conf node rejoin -d ‘host=197.0.1.105 user=repmgr dbname=repmgr’ --force-rewind –verbose

repmgr常用显示主备库状态命令：
repmgr -f /home/postgres/repmgr.conf cluster show
repmgr -f /home/postgres/repmgr.conf cluster event
repmgr -f /home/postgres/repmgr.conf cluster crosscheck

写在最后：
该测试并未测试安装观察者节点和连接池的情况，有兴趣的朋友可一起研究学习。

postgresql

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

repmgr测试

- Shared Library Preloading -

- Shared Library Preloading -

评论