暂无图片
暂无图片
3
暂无图片
暂无图片
暂无图片

基于Linux的repmgr搭建

原创 Digital Observer 2024-11-27
559

作者:Digital Observer(施嘉伟)
Oracle ACE Pro: Database
PostgreSQL ACE Partner
11年数据库行业经验,现主要从事数据库服务工作
拥有Oracle OCM、DB2 10.1 Fundamentals、MySQL 8.0 OCP、WebLogic 12c OCA、KCP、PCTP、PCSD、PGCM、OCI、PolarDB技术专家、达梦师资认证、数据安全咨询高级等认证
ITPUB认证专家、PolarDB开源社区技术顾问、HaloDB技术顾问、TiDB社区技术布道师、青学会MOP技术社区专家顾问、国内某高校企业实践指导教师
公众号:Digital Observer;CSDN:施嘉伟;ITPUB:sjw1933;墨天轮:Digital Observer;PGFans:施嘉伟。

第一部分 说明

repmgr是一个开源工具套件,用于管理 PostgreSQL 服务器集群中的复制和故障转移。它通过设置备用服务器、监控复制和执行管理任务(例如故障转移或手动切换操作)的工具增强了 PostgreSQL 的内置热备用功能。
PostgreSQL在9.0后引入了流复制架构,并且支持hot standby特性,并且在往后的几个版本中不断完善和增强流复制架构,repmgr为PostgreSQL 的流复制机制提供了高级支持,因为它们是在 9.0 中引入的。当前的repmgr系列, repmgr 5,支持从 PostgreSQL 9.3 引入的复制功能的最新发展,例如级联复制、时间线切换和通过复制协议进行的基本备份
repmgr作为一个开源工具,旨在用于灵活、便捷地管理PostgreSQL集群。

第二部分 安装repmgr

2.1 yum在线安装

2.1.1 安装对应PostgreSQL版本的存储库

从列表中找到PostgreSQL对应版本的存储库RPM:Https://dl.2ndquadrant.com/ 。安装发行版和PostgreSQL版本的存储库定义(以postgresql12为例)

# curl https://dl.2ndquadrant.com/default/release/get/12/rpm | sudo bash

2.1.2 验证存储库安装

执行如下:

# sudo yum repolist

输出应该包含两个条目,如下所示:

2ndquadrant-dl-default-release-pg11/7/x86_64 2ndQuadrant packages (PG12) for 7 - x86_64 18 2ndquadrant-dl-default-release-pg11-debug/7/x86_64 2ndQuadrant packages (PG12) for 7 - x86_64 - Debug 8

2.1.3 yum安装repmgr

使用yum在线安装

# sudo yum install repmgr12

若要安装特定的包版本,请执行yum --showduplicates list关于所涉一揽子方案:

然后用连字符将适当的版本号附加到包名,例如:

# yum install repmgr12-5.2.0-1.rhel7

2.2 源码安装

源码包下载地址:https://repmgr.org/download/

2.2.1 解压源码包

解压下载的源码包,将解压目录修改属组

# tar -xzvf repmgr-5.3.1.tar.gz # chown -R postrges:postgres repmgr-5.3.1/

2.2.2 编译安装

切换为postgres用户安装,查询pg_config的执行位置,选择对应PostgreSQL版本pg_config所在的bin目录

# su – postgres $ which pg_config /software/pgsql13/bin/pg_config

进入repmgr软件包目录,指定PostgreSQL软件位置进行编译安装

$ ./configure --prefix=/software/pgsql13/ $ make && make install

查看是否安装成功

$ repmgr –version repmgr 5.3.1

2.3 设置基本复制群集的先决条件

必须在这两台服务器上安装PostgreSQL和repmgr软件,以及需要两个服务器之间的无密码SSH连接。

2.3.1配置postgres用户互信

主端服务器上生成秘钥

$ ssh-keygen -t rsa

将秘钥拷贝到远程机器

$ ssh-copy-id -i .ssh/id_rsa.pub postgres@node2

验证是否授权完成,不提示密码,直接返回日期说明配置正确

$ ssh node2 date

备端服务器上生成秘钥到用户主目录下的.ssh文件夹下

$ ssh-keygen -t rsa

将秘钥拷贝到远程机器

$ ssh-copy-id -i .ssh/id_rsa.pub postgres@node1

验证是否授权完成:不提示密码,直接返回日期说明配置正确

$ ssh node1 date

2.4 PostgreSQL配置

主库编辑配置文件postgresql.conf,并重启数据库

$ vi postgresql.conf listen_addresses = '*' wal_log_hints = on max_wal_senders = 10 max_replication_slots = 10 wal_level = 'replica' hot_standby = on archive_mode = on archive_command = 'test ! -f /postgres/product/archivedir/%f && cp %p /postgres/product/archivedir/%f' #归档路径根据具体情况修改 shared_preload_libraries = 'repmgr'

2.5 创建repmgr用户和数据库

创建repmgr流复制用户、数据库以及repmgr扩展,并赋予用户superuser权限

$ psql -d postgres -U postgres postgres# create user repmgr replication password 'repmgrforrepl'; postgres# alter user repmgr superuser; postgres# create database repmgr owner repmgr; postgres# \c repmgr repmgr repmgr# ALTER USER repmgr SET search_path TO repmgr, "$user", public; repmgr# alter user repmgr superuser ;

进入该数据库创建repmgr模式,将模式添加到search path中

repmgr# create schema repmgr ; repmgr# ALTER USER repmgr SET search_path TO repmgr, "$user", public;

创建repmgr扩展

$ psql -d repmgr -U repmgr repmgr# create extension repmgr;

2.6 配置pg_hba.conf的身份验证

配置pg_hba.conf白名单文件,允许repmgr有连接访问和复制的权限。

local replication repmgr trust host replication repmgr 127.0.0.1/32 trust host replication repmgr 192.168.22.0/24 md5 local repmgr repmgr trust host repmgr repmgr 127.0.0.1/32 trust host repmgr repmgr 192.168.22.0/24 md5

2.7 配置本地密码文件

在各节点postgres家目录下创建密码文件

$ vi ~/.pgpasss 192.168.1.4:5432:repmgr:repmgr:repmgrforrepl 192.168.1.5:5432:repmgr:repmgr:repmgrforrepl 192.168.1.5:5435:repmgr:repmgr:repmgrforrepl 192.168.1.4:5432:replication:repmgr:repmgrforrepl 192.168.1.5:5432:replication:repmgr:repmgrforrepl 192.168.1.5:5435:replication:repmgr:repmgrforrepl

2.8 配置repmgr

在/postgres/app/repmgr_config目录(目录可自定义创建)下编辑repmgr.conf,添加以下:

node_id=1 node_name='node1' conninfo='host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2' data_directory='/postgres/product/data' pg_bindir='/postgres/app/bin' failover=automatic promote_command='/postgres/app/bin/repmgr standby promote -f /postgres/app/repmgr_config/repmgr.conf --log-to-file' follow_command='/postgres/app/bin/repmgr standby follow -f /postgres/app/repmgr_config/repmgr.conf --log-to-file --upstream-node-i d=%n' log_file='/postgres/app/repmgr_log/repmgr.log'

第三部分 主服务安装配置

3.1注册主服务器

若要支持复制群集,必须将主节点注册到repmgr。这将安装repmgr扩展和元数据对象,并为主服务器添加元数据记录

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf standby register INFO: connecting to primary database... NOTICE: attempting to install extension "repmgr" NOTICE: "repmgr" extension successfully installed NOTICE: primary node record (id: 1) registered

验证集群状态

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------- 1 | node1 | primary | * running | | default | 100 | 11 | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2

记录repmgr的元数据表

repmgr=# SELECT * FROM repmgr.nodes; -[ RECORD 1 ]----+---------------------------------------------------------------------------------------------- node_id | 1 upstream_node_id | active | t node_name | node1 type | primary location | default priority | 100 conninfo | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 repluser | repmgr slot_name | config_file | /postgres/app/repmgr_config/repmgr.conf

3.2克隆备用服务器

在node2节点上创建一个备用服务器上的repmgr.conf文件,添加以下内容:

$ vi /postgres/app/repmgr_config/repmgr.conf node_id=2 node_name='node2' conninfo='host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2' data_directory='/postgres/product/data' pg_bindir='/postgres/app/bin' failover=automatic promote_command='/postgres/app/bin/repmgr standby promote -f /postgres/app/repmgr_config/repmgr.conf --log-to-file' follow_command='/postgres/app/bin/repmgr standby follow -f /postgres/app/repmgr_config/repmgr.conf --log-to-file --upstream-node-i d=%n' log_file='/postgres/app/repmgr_log/repmgr.log'

使用如下命令查看克隆是否有问题

$ repmgr -h 192.168.1.4 -U repmgr -d repmgr -f /postgres/app/repmgr_config/repmgr.conf standby clone --dry-run NOTICE: destination directory "/software/pgsql13/datarepl " provided INFO: connecting to source node DETAIL: connection string is: host=192.168.1.1 user=repmgr dbname=repmgr DETAIL: current installation size is 31 MB INFO: "repmgr" extension is installed in database "repmgr" INFO: replication slot usage not requested; no replication slot will be set up for this standby INFO: parameter "max_wal_senders" set to 10 NOTICE: checking for available walsenders on the source node (2 required) INFO: sufficient walsenders available on the source node DETAIL: 2 required, 10 available NOTICE: checking replication connections can be made to the source server (2 required) INFO: required number of replication connections could be made to the source server DETAIL: 2 replication connections required NOTICE: standby will attach to upstream node 1 HINT: consider using the -c/--fast-checkpoint option INFO: all prerequisites for "standby clone" are met

若没有问题去掉调试模式,直接执行

$ repmgr -h 192.168.22.128 -U repmgr -d repmgr -p 5432 -f /postgres/app/repmgr_config/repmgr.conf NOTICE: destination directory "/postgres/product/data" provided INFO: connecting to source node DETAIL: connection string is: host=192.168.1.4 user=repmgr dbname=repmgr DETAIL: current installation size is 31 MB INFO: replication slot usage not requested; no replication slot will be set up for this standby NOTICE: checking for available walsenders on the source node (2 required) NOTICE: checking replication connections can be made to the source server (2 required) INFO: creating directory "/pgdata/dataano"... NOTICE: starting backup (using pg_basebackup)... HINT: this may take some time; consider using the -c/--fast-checkpoint option INFO: executing: pg_basebackup -l "repmgr base backup" -D /postgres/product/data -h 192.168.1.4 -p 5432 -U repmgr -p 5432 -X stream NOTICE: standby clone (using pg_basebackup) complete NOTICE: you can now start your PostgreSQL server HINT: for example: pg_ctl -D /postgres/product/data start HINT: after starting the server, you need to register this standby with "repmgr standby register"

启动node2节点服务

$ pg_ctl -D /postgres/product/data start

注册node2 standby角色信息

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf standby register

主节点查看流复制状态正常,成功搭建

repmgr=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 24003 usesysid | 16384 usename | repmgr application_name | node2 client_addr | 192.168.1.5 client_hostname | client_port | 13360 backend_start | 2022-03-10 16:06:17.005646+08 backend_xmin | state | streaming sent_lsn | 0/36002C78 write_lsn | 0/36002C78 flush_lsn | 0/36002C78 replay_lsn | 0/36002C78 write_lag | 00:00:00.000339 flush_lag | 00:00:00.002684 replay_lag | 00:00:00.002753 sync_priority | 0 sync_state | async reply_time | 2022-03-10 16:29:32.898743+08

3.3配置自动故障转移

创建一个新节点witness(node3),建议部署在另一个单独的服务器上,本文档将此节点安装在node2节点上。
在node3节点的repmgr.conf中添加以下参数

node_id=3 node_name='node3' conninfo='host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2' data_directory='/postgres/app/witness/data' pg_bindir='/postgres/app/bin' failover=automatic promote_command='/postgres/app/bin/repmgr standby promote -f /postgres/app/witness/conf/repmgr.conf --log-to-file' follow_command='/postgres/app/bin/repmgr standby follow -f /postgres/app/witness/conf/repmgr.conf --log-to-file --upstream-node-id =%n' log_file='/postgres/app/repmgr_log/repmgr_witness.log'

创建一个新的PostgreSQL实例,将参数文件设置和白名单访问设置同node1、node2节点。(参数配置步骤略)

$ initdb -D /postgres/app/witness/data

将witness节点注册为witness角色

$ repmgr -h 192.168.1.4 -U repmgr -d repmgr -p5432 -f /postgres/app/witness/conf/repmgr.conf witness register

在node1上执行查看各节点状态

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------- 1 | node1 | primary | * running | | default | 100 | 11 | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 2 | node2 | standby | running | node1 | default | 100 | 11 | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 3 | node3 | witness | * running | node2 | default | 0 | n/a | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2

各节点启动repmgrd程序

node1: $ repmgrd -f /postgres/app/repmgr_config/repmgr.conf --pid-file /tmp/repmgrd.pid node2: $ repmgrd -f /postgres/app/repmgr_config/repmgr.conf --pid-file /tmp/repmgrd.pid node3(witness): $ repmgrd -f /postgres/app/witness/conf/repmgr.conf --pid-file /tmp/repmgrd_witness.pid

如果需要终止repmgrd程序,使用以下命令

$ kill `cat /tmp/repmgrd.pid`

3.3.1测试自动故障转移

node1上模拟测试关闭主库

$ pg_ctl stop

node2节点打开repmgr日志信息显示,输出如下则成功晋升为主节点

$ tail -30f /postgres/app/repmgr_log/repmgr.log [2022-03-10 15:33:43] [NOTICE] promoting standby to primary [2022-03-10 15:33:43] [DETAIL] promoting server "node2" (ID: 2) using pg_promote() [2022-03-10 15:33:43] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete [2022-03-10 15:33:44] [NOTICE] STANDBY PROMOTE successful [2022-03-10 15:33:44] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary [2022-03-10 15:33:44] [INFO] checking state of node 2, 1 of 6 attempts [2022-03-10 15:33:44] [NOTICE] node 2 has recovered, reconnecting [2022-03-10 15:33:44] [INFO] connection to node 2 succeeded [2022-03-10 15:33:44] [INFO] original connection is still available [2022-03-10 15:33:44] [INFO] 0 followers to notify [2022-03-10 15:33:44] [INFO] switching to primary monitoring mode [2022-03-10 15:33:44] [NOTICE] monitoring cluster primary "node2" (ID: 2) [2022-03-10 15:33:44] [INFO] child node "node3" (ID: 3) is not yet attached [2022-03-10 15:34:44] [NOTICE] new witness "node3" (ID: 3) has connected

查看备库是否晋升为主(f为主)

postgres=# select pg_is_in_recovery(); pg_is_in_recovery ------------------- f (1 row)

node2上查看各节点状态,此时node2已经成为primary

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------- 1 | node1 | primary | - failed | ? | default | 100 | | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 12 | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 3 | node3 | witness | * running | node2 | default | 0 | n/a | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2

3.3.2将原主库初始化为备库

确保node1原主库已经被关闭,并将其初始化为备库

$ repmgr -h 192.168.1.5 -U repmgr -d repmgr -f /postgres/app/repmgr_config/repmgr.conf standby clone -F

启动node1节点

$ pg_ctl start

将node1强制重新注册为standby

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf standby register -F

查看集群状态

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------- 1 | node1 | standby | running | node2 | default | 100 | 12 | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 12 | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 3 | node3 | witness | * running | node2 | default | 0 | n/a | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2

3.4手动切换主备节点

node1节点强制提升为主节点

$ repmgr standby switchover -f /postgres/app/repmgr_config/repmgr.conf --siblings-follow --always-promote DETAIL: promoting server "node1" (ID: 1) using pg_promote() NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete NOTICE: STANDBY PROMOTE successful DETAIL: server "node1" (ID: 1) was successfully promoted to primary ERROR: new primary diverges from former primary and --force-rewind not provided HINT: the former primary will need to be restored manually, or use "repmgr node rejoin"

查看各节点状态

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------- 1 | node1 | primary | * running | | default | 100 | 13 | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 2 | node2 | primary | - failed | ? | default | 100 | | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 3 | node3 | witness | * running | ? node2 | default | 0 | n/a | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2 WARNING: following issues were detected - unable to connect to node "node2" (ID: 2) - unable to connect to node "node3" (ID: 3)'s upstream node "node2" (ID: 2)

node2节点进行初始化克隆

$ repmgr -h 192.168.1.4 -U repmgr -d repmgr -f /postgres/app/repmgr_config/repmgr.conf standby clone -F

启动node2节点

$ pg_ctl start

重新注册为standby角色

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf standby register -F

查看各节点状态

$ repmgr -f /postgres/app/repmgr_config/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------- 1 | node1 | primary | * running | | default | 100 | 13 | host=192.168.1.4 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 2 | node2 | standby | running | node1 | default | 100 | 13 | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5432 connect_timeout=2 3 | node3 | witness | * running | node1 | default | 0 | n/a | host=192.168.1.5 user=repmgr dbname=repmgr password=repmgrforrepl port=5435 connect_timeout=2

3.5集群维护

如果需要对PostgreSQL环境进行维护,例如配置修改、架构切换等,建议关闭各节点repmgrd自动故障转移进程。找到对应进程的pid文件或者pid执行关闭命令:

$ kill `cat /tmp/repmgrd.pid`

hhh6.jpg

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

文章被以下合辑收录

评论