Improvements to pg_rewind, the Streaming Replication Resynchronization Tool

Original article by 多米爸比, 2022-11-05

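This week I ran into a case where, after a standby had been promoted, using pg_rewind to resynchronize the old master and re-establish the primary/standby relationship failed. This post takes a systematic look at what pg_rewind can do.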

1. Server-side requirements for pg_rewind

The cluster must either have data checksums enabled or have the wal_log_hints parameter set to on. The latter is the usual choice.

wal_log_hints=on
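
Before attempting a rewind, it can help to confirm that at least one of the two prerequisites holds. The following is a minimal sketch rather than part of pg_rewind itself: the function name rewind_prereq_ok is hypothetical, and its two arguments are assumed to be the values returned by SHOW data_checksums and SHOW wal_log_hints.

```shell
# Hypothetical helper: succeeds when either prerequisite for pg_rewind is met.
# $1 = output of "SHOW data_checksums", $2 = output of "SHOW wal_log_hints".
rewind_prereq_ok() {
    [ "$1" = "on" ] || [ "$2" = "on" ]
}

# Example use against a running server (port 5432 is an assumption):
#   checksums=$(psql -p 5432 -d postgres -Atc 'SHOW data_checksums')
#   hints=$(psql -p 5432 -d postgres -Atc 'SHOW wal_log_hints')
#   rewind_prereq_ok "$checksums" "$hints" || echo "pg_rewind prerequisites not met"
```

Keeping the check separate from the psql calls means it can be reused with values collected some other way, for example from a monitoring system.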

2. Privilege requirements for pg_rewind

pg_rewind only needs permission to execute the following four system functions for reading files:

pg_ls_dir()
pg_read_file()
pg_read_binary_file()
pg_stat_file()

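Before PostgreSQL 11, only superusers could execute these four functions. Starting with PostgreSQL 11, pg_rewind no longer depends on a superuser: a regular role that has been granted EXECUTE on these functions is enough.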
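
As a sketch of what granting these privileges might look like, the helper below prints GRANT statements modeled on the example in the official pg_rewind documentation. Note that, as far as I can tell, that example grants two overloads of pg_read_binary_file rather than pg_read_file. The function name rewind_grants_sql and the role name rewind_user are hypothetical, and the role name is assumed to be a simple identifier that needs no quoting.

```shell
# Hypothetical helper: print the GRANT statements for a dedicated rewind role.
# $1 = role name (assumed to be a plain identifier that needs no quoting).
rewind_grants_sql() {
    role=$1
    cat <<EOF
GRANT EXECUTE ON FUNCTION pg_catalog.pg_ls_dir(text, boolean, boolean) TO $role;
GRANT EXECUTE ON FUNCTION pg_catalog.pg_stat_file(text, boolean) TO $role;
GRANT EXECUTE ON FUNCTION pg_catalog.pg_read_binary_file(text) TO $role;
GRANT EXECUTE ON FUNCTION pg_catalog.pg_read_binary_file(text, bigint, bigint, boolean) TO $role;
EOF
}

# Example use on the source server (port and role name are assumptions):
#   psql -p 5432 -d postgres -c 'CREATE USER rewind_user LOGIN'
#   rewind_grants_sql rewind_user | psql -p 5432 -d postgres
```

Printing the SQL instead of executing it directly makes it easy to review the statements before they are applied.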

3. pg_rewind improvements for streaming replication

-R / --write-recovery-conf

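With this option, pg_rewind creates the recovery configuration needed for streaming replication: it creates standby.signal and appends the connection string given in --source-server to postgresql.auto.conf as primary_conninfo. This makes it quick to turn the former primary into a standby.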
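
A possible way to wrap this option is sketched below. The wrapper name rewind_as_standby and the DRY_RUN variable are hypothetical conveniences, not pg_rewind features; only -D, --source-server and --write-recovery-conf are real pg_rewind options.

```shell
# Hypothetical wrapper: rewind a former primary and prepare it as a standby.
# $1 = target data directory, $2 = libpq connection string of the new primary.
# With DRY_RUN=1 the command is only printed, not executed.
rewind_as_standby() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'pg_rewind -D %s --source-server=%s --write-recovery-conf\n' "$1" "'$2'"
    else
        pg_rewind -D "$1" --source-server="$2" --write-recovery-conf
    fi
}

# Example use (path and connection string are assumptions):
#   DRY_RUN=1 rewind_as_standby /var/lib/pgsql/data 'host=node1 port=5432 user=postgres'
```

The dry-run mode is there so the exact command line can be checked before touching the target data directory.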

-c / --restore-target-wal

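When pg_rewind runs, WAL files that it needs may, for one reason or another, no longer exist in the pg_wal directory of the target cluster. In that case it fails with an error like this: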

$ pg_rewind -D /var/lib/pgsql/data --source-server='host=node1 dbname=postgres user=postgres port=5432'
pg_rewind: servers diverged at WAL location 0/30005C8 on timeline 1
(... snip log output from postgres starting up in single user mode ...)
pg_rewind: error: could not open file "/var/lib/pgsql/data/pg_wal/000000010000000000000002": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 0/2000100

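In this situation, pg_rewind can fetch the required WAL files with the command configured in restore_command, provided the --restore-target-wal option is given: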

$ pg_rewind -D /var/lib/pgsql/data --source-server='host=node1 dbname=postgres user=postgres port=5432' --restore-target-wal
pg_rewind: servers diverged at WAL location 0/30005C8 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/2000060 on timeline 1
pg_rewind: Done!
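
The --restore-target-wal run above relies on restore_command being set in the target cluster's configuration. As a rough sketch, the helper below appends such a setting to postgresql.conf. The function name set_restore_command and the archive directory are hypothetical; the cp-based command follows the form used in the PostgreSQL documentation and should be replaced by whatever your archiving setup actually uses.

```shell
# Hypothetical helper: append a restore_command to a cluster's postgresql.conf.
# $1 = data directory of the target cluster, $2 = WAL archive directory.
# %f is the WAL file name and %p the destination path, filled in by PostgreSQL.
set_restore_command() {
    printf "restore_command = 'cp %s/%%f %%p'\n" "$2" >> "$1/postgresql.conf"
}

# Example use (both paths are assumptions):
#   set_restore_command /var/lib/pgsql/data /mnt/server/archivedir
```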

Automatic crash recovery

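pg_rewind can only operate on a PostgreSQL instance that was shut down cleanly; otherwise it cannot correctly determine which changes need to be replayed. When pg_rewind detects that the target instance was not shut down cleanly, it automatically starts it in single-user mode to perform crash recovery.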

A simplified run with automatic crash recovery looks like this:

$ pg_rewind -D /var/lib/pgsql/data --source-server='host=node1 dbname=postgres user=postgres port=5432'
pg_rewind: executing "/usr/bin/pgsql/postgres" for target server to complete crash recovery
...
pg_rewind: servers diverged at WAL location 0/30005D0 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/2000060 on timeline 1
pg_rewind: Done!
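
To see ahead of time whether pg_rewind would need to run crash recovery, one option is to inspect the control file with pg_controldata. The sketch below assumes untranslated (English) pg_controldata output; the function name cluster_state is hypothetical.

```shell
# Hypothetical helper: extract the cluster state from pg_controldata output
# read on standard input (assumes English-language output).
cluster_state() {
    sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Example use (the data directory path is an assumption):
#   state=$(pg_controldata /var/lib/pgsql/data | cluster_state)
#   case "$state" in
#       "shut down"|"shut down in recovery") echo "clean shutdown" ;;
#       *) echo "not cleanly shut down: $state" ;;
#   esac
```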

Using a standby as the source

Starting with PostgreSQL 14, pg_rewind accepts a standby as the --source-server. The following demonstrates this in a local environment.

First, initialize a cluster:

$ /opt/pg14/bin/initdb -D data1401

Set the following parameters:

port=1401
listen_addresses = '0.0.0.0'
wal_log_hints = on
wal_keep_size=100

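wal_keep_size is set to a fairly large value (100, which is taken as megabytes when no unit is given) so that enough old WAL files are retained for the standbys, which improves the chances that pg_rewind can resynchronize the data.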

Next, start data1401:

$ /opt/pg14/bin/pg_ctl start -D data1401

Then build two standbys:

$ /opt/pg14/bin/pg_basebackup -D data1402 -h 127.0.0.1 -p1401
$ /opt/pg14/bin/pg_basebackup -D data1403 -h 127.0.0.1 -p1401

Change the port and primary_conninfo on each standby. For data1402:

port=1402
primary_conninfo = 'host=127.0.0.1 port=1401 user=postgres application_name=1402'

For data1403:

port=1403
primary_conninfo = 'host=127.0.0.1 port=1401 user=postgres application_name=1403'

Then create the standby signal files and start both standbys:

$ touch data1402/standby.signal
$ /opt/pg14/bin/pg_ctl start -D data1402

$ touch data1403/standby.signal
$ /opt/pg14/bin/pg_ctl start -D data1403

The primary/standby setup is now complete. The status seen from port 1401 is:

$ /opt/pg14/bin/psql -p1401

postgres=# select usename,application_name,client_addr,client_port,state,sync_state from pg_stat_replication;
 usename  | application_name | client_addr | client_port |   state   | sync_state 
----------+------------------+-------------+-------------+-----------+------------
 postgres | 1402             | 127.0.0.1   |       34378 | streaming | async
 postgres | 1403             | 127.0.0.1   |       34380 | streaming | async
(2 rows)

Next, simulate a dual-primary (split-brain) situation by promoting 1402 without shutting down 1401:

$ /opt/pg14/bin/psql -p1402
select pg_promote();

Now that 1402 is the new primary, recover the old primary 1401 first. Shut down 1401:

$ /opt/pg14/bin/pg_ctl stop -D data1401

Then use pg_rewind to resynchronize 1401 incrementally:

$ /opt/pg14/bin/pg_rewind -D data1401 --source-server="host=127.0.0.1 port=1402 user=postgres" 
pg_rewind: servers diverged at WAL location 0/4000060 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/3000060 on timeline 1
pg_rewind: Done!

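pg_rewind also copies configuration files from the source cluster, so check settings such as port after each rewind. Then set primary_conninfo on 1401 as follows: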

primary_conninfo = 'host=127.0.0.1 port=1402 user=postgres application_name=1401'

Next, create the standby signal file and start the server:

$ touch data1401/standby.signal
$ /opt/pg14/bin/pg_ctl start -D data1401

Check the status from the new primary 1402 again:

$ /opt/pg14/bin/psql -p1402
postgres=# select usename,application_name,client_addr,client_port,state,sync_state from pg_stat_replication;
 usename  | application_name | client_addr | client_port |   state   | sync_state 
----------+------------------+-------------+-------------+-----------+------------
 postgres | 1401             | 127.0.0.1   |       40976 | streaming | async
(1 row)

1402 and 1401 have now established a primary/standby relationship.

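At this point the 1403 node is down because its streaming replication connection failed. To recover 1403, the new primary 1402 can serve as the source, but so can the standby 1401, which reduces the load on the primary.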

Running pg_rewind with the standby 1401 as the source looks like this:

$ /opt/pg14/bin/pg_rewind -D data1403 --source-server="host=127.0.0.1 port=1401 user=postgres" 
pg_rewind: servers diverged at WAL location 0/4000060 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/3000060 on timeline 1
pg_rewind: Done!

1403 does not need a new standby signal file because it was already a standby, so the server can be started directly:

$ /opt/pg14/bin/pg_ctl start -D data1403

Finally, check the replication status from 1402 once more:

$ /opt/pg14/bin/psql -p1402
postgres=# select usename,application_name,client_addr,client_port,state,sync_state from pg_stat_replication;
 usename  | application_name | client_addr | client_port |   state   | sync_state 
----------+------------------+-------------+-------------+-----------+------------
 postgres | 1401             | 127.0.0.1   |       40976 | streaming | async
 postgres | 1403             | 127.0.0.1   |       41000 | streaming | async
(2 rows)

Streaming replication from 1402 to both standbys, 1401 and 1403, is working normally.

On versions before PostgreSQL 14, this operation fails with the following error:

pg_rewind: fatal: source server must not be in recovery mode
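
To avoid hitting this error, a script could check the pg_rewind version before choosing a standby as the source. The sketch below is one way to do that under the assumption that pg_rewind --version prints a line of the form "pg_rewind (PostgreSQL) 14.5"; both function names are hypothetical.

```shell
# Hypothetical helper: print the major version from a "pg_rewind --version"
# line read on standard input, e.g. "pg_rewind (PostgreSQL) 14.5".
pg_rewind_major() {
    sed -n 's/^pg_rewind (PostgreSQL) \([0-9][0-9]*\).*/\1/p'
}

# Hypothetical check: succeeds when that major version is 14 or newer.
can_rewind_from_standby() {
    [ "$(pg_rewind_major)" -ge 14 ] 2>/dev/null
}

# Example use (the binary path matches the demo above):
#   /opt/pg14/bin/pg_rewind --version | can_rewind_from_standby \
#       || echo "use the primary as --source-server instead"
```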

--config-file

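Some Linux distributions place postgresql.conf outside PGDATA, and some HA software manages configuration in a similar way. PostgreSQL 15 added a useful --config-file option to pg_rewind that specifies the location of postgresql.conf, which makes operations in such primary/standby environments more reliable.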
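
As an illustration of when this option matters, the sketch below decides whether a configuration file lives outside the data directory. The function name config_outside_pgdata is hypothetical, and the path passed as the first argument is assumed to come from SHOW config_file or from knowledge of the distribution's layout.

```shell
# Hypothetical helper: succeeds when the configuration file is outside PGDATA.
# $1 = path of postgresql.conf, $2 = data directory (trailing slash allowed).
config_outside_pgdata() {
    case "$1" in
        "${2%/}"/*) return 1 ;;   # file is under the data directory
        *) return 0 ;;            # file is somewhere else
    esac
}

# Example use with PostgreSQL 15 or later (all paths and $src are assumptions):
#   conf=/etc/postgresql/15/main/postgresql.conf
#   pgdata=/var/lib/postgresql/15/main
#   if config_outside_pgdata "$conf" "$pgdata"; then
#       pg_rewind -D "$pgdata" --source-server="$src" --config-file="$conf"
#   fi
```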


Last modified: 2022-11-08 09:45:58
