基于Linux的repmgr搭建

原创 Digital Observer 2024-11-27

559

作者：Digital Observer（施嘉伟）
Oracle ACE Pro: Database
PostgreSQL ACE Partner
11年数据库行业经验，现主要从事数据库服务工作
拥有Oracle OCM、DB2 10.1 Fundamentals、MySQL 8.0 OCP、WebLogic 12c OCA、KCP、PCTP、PCSD、PGCM、OCI、PolarDB技术专家、达梦师资认证、数据安全咨询高级等认证
ITPUB认证专家、PolarDB开源社区技术顾问、HaloDB技术顾问、TiDB社区技术布道师、青学会MOP技术社区专家顾问、国内某高校企业实践指导教师
公众号：Digital Observer；CSDN：施嘉伟；ITPUB：sjw1933；墨天轮：Digital Observer；PGFans：施嘉伟。

第一部分说明

repmgr是一个开源工具套件，用于管理 PostgreSQL 服务器集群中的复制和故障转移。它通过设置备用服务器、监控复制和执行管理任务（例如故障转移或手动切换操作）的工具增强了 PostgreSQL 的内置热备用功能。
PostgreSQL在9.0后引入了流复制架构，并且支持hot standby特性，并且在往后的几个版本中不断完善和增强流复制架构，repmgr为PostgreSQL 的流复制机制提供了高级支持，因为它们是在 9.0 中引入的。当前的repmgr系列， repmgr 5，支持从 PostgreSQL 9.3 引入的复制功能的最新发展，例如级联复制、时间线切换和通过复制协议进行的基本备份
repmgr作为一个开源工具，旨在用于灵活、便捷地管理PostgreSQL集群。

第二部分安装repmgr

2.1 yum在线安装

2.1.1 安装对应PostgreSQL版本的存储库

从列表中找到PostgreSQL对应版本的存储库RPM：Https://dl.2ndquadrant.com/ 。安装发行版和PostgreSQL版本的存储库定义（以postgresql12为例）

# curl https://dl.2ndquadrant.com/default/release/get/12/rpm | sudo bash

2.1.2 验证存储库安装

执行如下：

# sudo yum repolist

输出应该包含两个条目，如下所示：

2ndquadrant-dl-default-release-pg11/7/x86_64         2ndQuadrant packages (PG12) for 7 - x86_64               18
2ndquadrant-dl-default-release-pg11-debug/7/x86_64   2ndQuadrant packages (PG12) for 7 - x86_64 - Debug        8

2.1.3 yum安装repmgr

使用yum在线安装

# sudo yum install repmgr12

若要安装特定的包版本，请执行yum --showduplicates list关于所涉一揽子方案：

然后用连字符将适当的版本号附加到包名，例如：

# yum install repmgr12-5.2.0-1.rhel7

2.2 源码安装

源码包下载地址：https://repmgr.org/download/

2.2.1 解压源码包

解压下载的源码包，将解压目录修改属组

# tar -xzvf repmgr-5.3.1.tar.gz
# chown -R postrges:postgres repmgr-5.3.1/

2.2.2 编译安装

切换为postgres用户安装，查询pg_config的执行位置，选择对应PostgreSQL版本pg_config所在的bin目录

# su – postgres
$ which pg_config
/software/pgsql13/bin/pg_config

进入repmgr软件包目录，指定PostgreSQL软件位置进行编译安装

$ ./configure --prefix=/software/pgsql13/
$ make && make install

查看是否安装成功

$ repmgr –version
repmgr 5.3.1

2.3 设置基本复制群集的先决条件

必须在这两台服务器上安装PostgreSQL和repmgr软件，以及需要两个服务器之间的无密码SSH连接。

2.3.1配置postgres用户互信

主端服务器上生成秘钥

$ ssh-keygen -t rsa

将秘钥拷贝到远程机器

$ ssh-copy-id -i .ssh/id_rsa.pub postgres@node2

验证是否授权完成，不提示密码，直接返回日期说明配置正确

$ ssh node2 date

备端服务器上生成秘钥到用户主目录下的.ssh文件夹下

$ ssh-keygen -t rsa

将秘钥拷贝到远程机器

$ ssh-copy-id -i .ssh/id_rsa.pub postgres@node1

验证是否授权完成：不提示密码，直接返回日期说明配置正确

$ ssh node1 date

2.4 PostgreSQL配置

主库编辑配置文件postgresql.conf，并重启数据库

$ vi postgresql.conf

listen_addresses = '*'
wal_log_hints = on
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'replica'
hot_standby = on
archive_mode = on
archive_command = 'test ! -f /postgres/product/archivedir/%f && cp %p /postgres/product/archivedir/%f'     #归档路径根据具体情况修改
shared_preload_libraries = 'repmgr'

2.5 创建repmgr用户和数据库

创建repmgr流复制用户、数据库以及repmgr扩展，并赋予用户superuser权限

$ psql -d postgres -U postgres
postgres# create user repmgr replication password 'repmgrforrepl';
postgres# alter user repmgr superuser;
postgres# create database repmgr owner repmgr;
postgres# \c repmgr repmgr
repmgr# ALTER USER repmgr SET search_path TO repmgr, "$user", public;
repmgr# alter user repmgr superuser ;

进入该数据库创建repmgr模式，将模式添加到search path中

repmgr# create schema repmgr ;
repmgr# ALTER USER repmgr SET search_path TO repmgr, "$user", public;

创建repmgr扩展

$ psql -d repmgr -U repmgr
repmgr# create extension repmgr;

2.6 配置pg_hba.conf的身份验证

配置pg_hba.conf白名单文件，允许repmgr有连接访问和复制的权限。

local   replication   repmgr                                trust
host    replication   repmgr      127.0.0.1/32            trust
host    replication   repmgr      192.168.22.0/24         md5

local   repmgr         repmgr                                trust
host    repmgr         repmgr      127.0.0.1/32            trust
host    repmgr         repmgr      192.168.22.0/24         md5

2.7 配置本地密码文件

在各节点postgres家目录下创建密码文件

$ vi ~/.pgpasss

192.168.1.4:5432:repmgr:repmgr:repmgrforrepl
192.168.1.5:5432:repmgr:repmgr:repmgrforrepl
192.168.1.5:5435:repmgr:repmgr:repmgrforrepl
192.168.1.4:5432:replication:repmgr:repmgrforrepl
192.168.1.5:5432:replication:repmgr:repmgrforrepl
192.168.1.5:5435:replication:repmgr:repmgrforrepl

2.8 配置repmgr

在/postgres/app/repmgr_config目录（目录可自定义创建）下编辑repmgr.conf，添加以下：

node_id=1
node_name='node1'
conninfo='host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2'
data_directory='/postgres/product/data'
pg_bindir='/postgres/app/bin'
failover=automatic
promote_command='/postgres/app/bin/repmgr standby promote -f /postgres/app/repmgr_config/repmgr.conf --log-to-file'
follow_command='/postgres/app/bin/repmgr standby follow -f /postgres/app/repmgr_config/repmgr.conf --log-to-file --upstream-node-i
d=%n'
log_file='/postgres/app/repmgr_log/repmgr.log'

第三部分主服务安装配置

3.1注册主服务器

若要支持复制群集，必须将主节点注册到repmgr。这将安装repmgr扩展和元数据对象，并为主服务器添加元数据记录

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf standby register
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (id: 1) registered

验证集群状态

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                            
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 11       | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2

记录repmgr的元数据表

repmgr=# SELECT * FROM repmgr.nodes;
-[ RECORD 1 ]----+----------------------------------------------------------------------------------------------
node_id          | 1
upstream_node_id | 
active           | t
node_name        | node1
type             | primary
location         | default
priority         | 100
conninfo         | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /postgres/app/repmgr_config/repmgr.conf

3.2克隆备用服务器

在node2节点上创建一个备用服务器上的repmgr.conf文件，添加以下内容：

$ vi /postgres/app/repmgr_config/repmgr.conf

node_id=2
node_name='node2'
conninfo='host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2'
data_directory='/postgres/product/data'
pg_bindir='/postgres/app/bin'
failover=automatic
promote_command='/postgres/app/bin/repmgr standby promote -f /postgres/app/repmgr_config/repmgr.conf --log-to-file'
follow_command='/postgres/app/bin/repmgr standby follow -f /postgres/app/repmgr_config/repmgr.conf --log-to-file --upstream-node-i
d=%n'
log_file='/postgres/app/repmgr_log/repmgr.log'

使用如下命令查看克隆是否有问题

$ repmgr -h 192.168.1.4 -U repmgr -d repmgr -f /postgres/app/repmgr_config/repmgr.conf standby clone --dry-run

NOTICE: destination directory "/software/pgsql13/datarepl " provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.1.1 user=repmgr dbname=repmgr
DETAIL: current installation size is 31 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: all prerequisites for "standby clone" are met

若没有问题去掉调试模式，直接执行

$ repmgr -h 192.168.22.128 -U repmgr -d repmgr -p 5432 -f /postgres/app/repmgr_config/repmgr.conf

NOTICE: destination directory "/postgres/product/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.1.4 user=repmgr dbname=repmgr
DETAIL: current installation size is 31 MB
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: creating directory "/pgdata/dataano"...
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  pg_basebackup -l "repmgr base backup"  -D /postgres/product/data -h 192.168.1.4 -p 5432 -U repmgr -p 5432 -X stream 
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /postgres/product/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"

启动node2节点服务

$ pg_ctl -D /postgres/product/data start

注册node2 standby角色信息

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf standby register

主节点查看流复制状态正常，成功搭建

repmgr=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid              | 24003
usesysid         | 16384
usename          | repmgr
application_name | node2
client_addr      | 192.168.1.5
client_hostname  | 
client_port      | 13360
backend_start    | 2022-03-10 16:06:17.005646+08
backend_xmin     | 
state            | streaming
sent_lsn         | 0/36002C78
write_lsn        | 0/36002C78
flush_lsn        | 0/36002C78
replay_lsn       | 0/36002C78
write_lag        | 00:00:00.000339
flush_lag        | 00:00:00.002684
replay_lag       | 00:00:00.002753
sync_priority    | 0
sync_state       | async
reply_time       | 2022-03-10 16:29:32.898743+08

3.3配置自动故障转移

创建一个新节点witness（node3），建议部署在另一个单独的服务器上，本文档将此节点安装在node2节点上。
在node3节点的repmgr.conf中添加以下参数

node_id=3
node_name='node3'
conninfo='host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2'
data_directory='/postgres/app/witness/data'
pg_bindir='/postgres/app/bin'
failover=automatic
promote_command='/postgres/app/bin/repmgr standby promote -f /postgres/app/witness/conf/repmgr.conf --log-to-file'
follow_command='/postgres/app/bin/repmgr standby follow -f /postgres/app/witness/conf/repmgr.conf --log-to-file --upstream-node-id
=%n'
log_file='/postgres/app/repmgr_log/repmgr_witness.log'

创建一个新的PostgreSQL实例，将参数文件设置和白名单访问设置同node1、node2节点。（参数配置步骤略）

$ initdb -D /postgres/app/witness/data

将witness节点注册为witness角色

$ repmgr -h 192.168.1.4 -U repmgr -d repmgr -p5432 -f /postgres/app/witness/conf/repmgr.conf witness register

在node1上执行查看各节点状态

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                            
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 11       | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 11       | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2
 3  | node3 | witness | * running | node2    | default  | 0        | n/a      | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2

各节点启动repmgrd程序

node1:
$ repmgrd -f /postgres/app/repmgr_config/repmgr.conf --pid-file /tmp/repmgrd.pid

node2：
$ repmgrd -f /postgres/app/repmgr_config/repmgr.conf --pid-file /tmp/repmgrd.pid

node3（witness）:
$ repmgrd -f /postgres/app/witness/conf/repmgr.conf --pid-file /tmp/repmgrd_witness.pid

如果需要终止repmgrd程序，使用以下命令

$ kill `cat /tmp/repmgrd.pid`

3.3.1测试自动故障转移

node1上模拟测试关闭主库

$ pg_ctl stop

node2节点打开repmgr日志信息显示，输出如下则成功晋升为主节点

$ tail -30f /postgres/app/repmgr_log/repmgr.log
[2022-03-10 15:33:43] [NOTICE] promoting standby to primary
[2022-03-10 15:33:43] [DETAIL] promoting server "node2" (ID: 2) using pg_promote()
[2022-03-10 15:33:43] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2022-03-10 15:33:44] [NOTICE] STANDBY PROMOTE successful
[2022-03-10 15:33:44] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2022-03-10 15:33:44] [INFO] checking state of node 2, 1 of 6 attempts
[2022-03-10 15:33:44] [NOTICE] node 2 has recovered, reconnecting
[2022-03-10 15:33:44] [INFO] connection to node 2 succeeded
[2022-03-10 15:33:44] [INFO] original connection is still available
[2022-03-10 15:33:44] [INFO] 0 followers to notify
[2022-03-10 15:33:44] [INFO] switching to primary monitoring mode
[2022-03-10 15:33:44] [NOTICE] monitoring cluster primary "node2" (ID: 2)
[2022-03-10 15:33:44] [INFO] child node "node3" (ID: 3) is not yet attached
[2022-03-10 15:34:44] [NOTICE] new witness "node3" (ID: 3) has connected

查看备库是否晋升为主（f为主）

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 f
(1 row)

node2上查看各节点状态，此时node2已经成为primary

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show
ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                            
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 12       | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2
 3  | node3 | witness | * running | node2    | default  | 0        | n/a      | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2

3.3.2将原主库初始化为备库

确保node1原主库已经被关闭，并将其初始化为备库

$ repmgr -h 192.168.1.5 -U repmgr -d repmgr -f /postgres/app/repmgr_config/repmgr.conf standby clone -F

启动node1节点

$ pg_ctl start

将node1强制重新注册为standby

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf standby register -F

查看集群状态

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show
ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                            
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------------------------------
 1  | node1 | standby |  running  | node2    | default  | 100      | 12         | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 12       | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2
 3  | node3 | witness | * running | node2    | default  | 0        | n/a      | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2

3.4手动切换主备节点

node1节点强制提升为主节点

$ repmgr standby switchover -f /postgres/app/repmgr_config/repmgr.conf --siblings-follow --always-promote
DETAIL: promoting server "node1" (ID: 1) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node1" (ID: 1) was successfully promoted to primary
ERROR: new primary diverges from former primary and --force-rewind not provided
HINT: the former primary will need to be restored manually, or use "repmgr node rejoin"

查看各节点状态

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show 
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                            
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 13        | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2
 2  | node2 | primary | - failed  | ?        | default  | 100      |          | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2
 3  | node3 | witness | * running | ? node2  | default  | 0        | n/a      | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node2" (ID: 2)
  - unable to connect to node "node3" (ID: 3)'s upstream node "node2" (ID: 2)

node2节点进行初始化克隆

$ repmgr -h 192.168.1.4 -U repmgr -d repmgr -f /postgres/app/repmgr_config/repmgr.conf standby clone -F

启动node2节点

$ pg_ctl start

重新注册为standby角色

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf standby register -F