

"Plucking chrysanthemums by the eastern fence,
at ease I catch sight of the southern mountains."
— Tao Yuanming, Jin dynasty




"作者: 悠然 | 首发公众号: dba悠然(orasky666)"
<温馨提示:以下内容仅代表个人观点>






PART1: Foreword
Database high availability (HA) is one of the key safeguards of business continuity. Most domestic "Xinchuang" databases ship with fairly mature HA solutions, whereas the HA options for vanilla PostgreSQL vary widely in maturity, which makes choosing a stable, reliable HA solution all the more important.


Consider the following failure scenarios:
1) The primary node goes down and cannot be restarted;
2) The cluster state becomes abnormal because a standby hits a network-jitter timeout, a WAL gap, recovery after an instance restart, an accidental switchover, and so on.
Any seasoned DBA has likely stepped into one of these pits. To borrow a line from the industry: "Operations is like walking a tightrope over a cliff; one slip and there is no way back." Let's explore the correct "posture" for avoiding these pits, using PAF (pg_auto_failover) as the example.
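Throughout the walkthrough below, cluster health is read with pg_autoctl from the monitor side. A minimal sketch of the two read-only commands used for that (the --pgdata path is this article's lab layout, and the --count option for show events is assumed available in your pg_auto_failover version; adjust to your environment or set PGDATA instead):

# Current vs. assigned FSM state of every node in the formation
pg_autoctl show state --pgdata /data/pg_data

# Recent monitor-driven state transitions, newest last
pg_autoctl show events --pgdata /data/pg_data --count 20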

PART2: The Main Topic
Initial cluster state, as seen from the monitor:

[postgres@monitor ~]$ pg_autoctl show state
  Name |  Node |     Host:Port      |    LSN    | Reachable |  Current State  | Assigned State
-------+-------+--------------------+-----------+-----------+-----------------+----------------
node_1 |     1 | 172.20.70.204:5432 | 0/706AA30 |       yes |         primary |        primary
node_8 |     8 | 172.20.70.205:5432 | 0/706AA30 |       yes |       secondary |      secondary

-- node1 (node_1) is the primary
-- node2 (node_8) is the secondary
After the primary host goes down, the monitor demotes node_1 and moves node_8 to wait_primary:

[postgres@monitor ~]$ pg_autoctl show state
  Name |  Node |     Host:Port      |    LSN    | Reachable |  Current State  | Assigned State
-------+-------+--------------------+-----------+-----------+-----------------+----------------
node_1 |     1 | 172.20.70.204:5432 | 0/706AA68 |        no |         primary |        demoted
node_8 |     8 | 172.20.70.205:5432 | 0/706C4F0 |       yes |    wait_primary |   wait_primary

-- the old primary is assigned the demoted state
-- the standby is in wait_primary
The monitor log shows the transitions it drives; with no healthy standby candidate left, node_8 stays in wait_primary while the monitor keeps trying to pull the unreachable node_1 through catchingup:

[postgres@monitor pg_data]$
09:18:21 51157 INFO  Node 1 (172.20.70.204:5432) is marked as unhealthy by the monitor
09:18:21 51157 INFO  Setting goal state of node 1 "node_1" (172.20.70.204:5432) to catchingup after it became unhealthy.
09:18:21 51157 INFO  New state for node 1 "node_1" (172.20.70.204:5432): secondary ➜ catchingup
09:18:21 51157 INFO  Setting goal state of node 8 "node_8" (172.20.70.205:5432) to wait_primary because none of the 1 standby candidate nodes are healthy at the moment.
09:18:21 51157 INFO  New state for node 8 "node_8" (172.20.70.205:5432): primary ➜ wait_primary
09:18:21 51157 INFO  New state is reported by node 8 "node_8" (172.20.70.205:5432): "wait_primary"
09:18:21 51157 INFO  New state for node 8 "node_8" (172.20.70.205:5432): wait_primary ➜ wait_primary

pg_autoctl show state
  Name |  Node |     Host:Port      |    LSN    | Reachable |  Current State  | Assigned State
-------+-------+--------------------+-----------+-----------+-----------------+----------------
node_1 |     1 | 172.20.70.204:5432 | 0/B000148 |        no |       secondary |     catchingup
node_8 |     8 | 172.20.70.205:5432 | 0/B000148 |       yes |    wait_primary |   wait_primary
On node_8, Postgres is out of recovery and accepting writes:

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
(1 row)

postgres=# select name,setting from pg_settings where name='transaction_read_only';
         name          | setting
-----------------------+---------
 transaction_read_only | off
(1 row)
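With the survivor writable, it is also worth checking what it still knows about replication; a quick sketch using standard PostgreSQL catalog views, run on node_8:

-- No rows here means no standby is attached any more
select application_name, state, sync_state from pg_stat_replication;

-- Replication slots left behind by the dead peer retain WAL on disk;
-- review them before deciding what to clean up
select slot_name, active, restart_lsn from pg_replication_slots;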
However, a manual `pg_autoctl perform failover` fails, because the monitor can no longer find a primary in the group:

[postgres@db2 log]$ pg_autoctl perform failover
09:20:07 27761 INFO  Targetting group 0 in formation "default"
09:20:07 27761 ERROR Monitor ERROR:  couldn't find the primary node in formation "default", group 0
09:20:07 27761 ERROR SQL query: SELECT pgautofailover.perform_failover($1, $2)
09:20:07 27761 ERROR SQL params: 'default', '0'
09:20:07 27761 ERROR Failed to perform failover for formation default and group 0
09:20:07 27761 FATAL Failed to perform failover/switchover, see above for details
`pg_autoctl perform promotion` is rejected as well, since node_8 is in wait_primary rather than secondary:

[postgres@db2 log]$ pg_autoctl perform promotion
09:21:07 27887 ERROR Monitor ERROR:  cannot perform promotion: node node_8 in formation default has reported state "wait_primary" and is assigned state "wait_primary", promotion can only be performed when in state "secondary".
09:21:07 27887 ERROR SQL query: SELECT pgautofailover.perform_promotion($1, $2)
09:21:07 27887 ERROR SQL params: 'default', 'node_8'
09:21:07 27887 ERROR Failed to perform failover for node node_8 in formation default
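Both errors follow from pg_auto_failover's state machine: a failover needs a primary plus a healthy secondary, and a promotion only applies to a node in state "secondary"; node_8 is stuck in wait_primary because its only peer is gone. What the monitor believes can also be inspected directly on the monitor node; a sketch, assuming access to the monitor database (the pgautofailover schema belongs to the monitor extension, and its exact columns vary across versions):

-- On the monitor host: one row per registered node,
-- including reported and goal states
psql -d pg_auto_failover -c 'select * from pgautofailover.node;'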
Since the dead node_1 cannot be brought back, drop it from the formation on the monitor; with no peer left, node_8 converges to the standalone single state:

[postgres@monitor pg_data]$ pg_autoctl drop node --hostname=172.20.70.204 --pgdata=/data/pg_data/ --pgport=5432
09:24:31 51157 INFO  Removing node 1 "node_1" (172.20.70.204:5432) from formation "default" and group 0
09:24:31 51157 INFO  Setting number_sync_standbys to 0 for formation "default" now that we have -1 standby nodes set with replication-quorum.
09:24:31 51157 INFO  Setting goal state of node 8 "node_8" (172.20.70.205:5432) to single as there is no other node.
09:24:31 51157 INFO  New state for node 8 "node_8" (172.20.70.205:5432): wait_primary ➜ single
09:24:31 51157 INFO  New state is reported by node 8 "node_8" (172.20.70.205:5432): "single"
09:24:31 51157 INFO  New state for node 8 "node_8" (172.20.70.205:5432): single ➜ single
[postgres@monitor pg_data]$ pg_autoctl show state
  Name |  Node |     Host:Port      |    LSN    | Reachable |  Current State  | Assigned State
-------+-------+--------------------+-----------+-----------+-----------------+----------------
node_8 |     8 | 172.20.70.205:5432 | 0/B000148 |       yes |          single |         single
At this point a promotion is a no-op, since node_8 is already the primary of a one-node formation:

[postgres@monitor pg_data]$ pg_autoctl perform promotion --name node_8
09:28:03 63653 WARN  NOTICE:  cannot perform promotion: node node_8 in formation default is already a primary.
Now suppose the old primary host (db1) comes back, and Postgres is started there by hand, outside of pg_autoctl:

[postgres@db1 ~]$ pg_ctl start -D /data/pg_data/
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....
2024-05-15 09:32:51.902 CST [1466] LOG:  starting PostgreSQL 12.16 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2024-05-15 09:32:51.903 CST [1466] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-05-15 09:32:51.903 CST [1466] LOG:  listening on IPv6 address "::", port 5432
2024-05-15 09:32:51.905 CST [1466] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-05-15 09:32:51.906 CST [1466] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-05-15 09:32:51.966 CST [1466] LOG:  redirecting log output to logging collector process
2024-05-15 09:32:51.966 CST [1466] HINT:  Future log output will appear in directory "log".
 done
server started

[postgres@db1 ~]$ psql
psql (12.16)
Type "help" for help.
It comes up in recovery, read-only:

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

postgres=# select name,setting from pg_settings where name='transaction_read_only';
         name          | setting
-----------------------+---------
 transaction_read_only | on
(1 row)
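Whether a restarted node comes up in recovery is decided by the signal files left in its data directory, so checking them before any manual start avoids surprises. A small sketch (paths as in this article's lab):

# standby.signal (or recovery.signal) present => the node starts in recovery
ls -l /data/pg_data/standby.signal /data/pg_data/recovery.signal 2>/dev/null

# pg_controldata also records the state the cluster was last in
/usr/pgsql-12/bin/pg_controldata /data/pg_data | grep -i 'cluster state'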
-- Manually promoting the old primary node_1 now creates a split-brain
[postgres@db1 ~]$ pg_ctl promote
waiting for server to promote.... done
server promoted

-- Check whether it is still in recovery (it is not)
postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
(1 row)
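Both 204 and 205 now answer pg_is_in_recovery() = f: a textbook split-brain, with two nodes accepting writes independently. A quick way to see the divergence is to compare timelines and write positions on each host; a sketch:

# Each promotion bumps the timeline, so two independently writable
# nodes usually end up on different TimeLineIDs
/usr/pgsql-12/bin/pg_controldata /data/pg_data | grep -i timeline

# Current write position on a writable node
psql -Atc 'select pg_current_wal_insert_lsn();'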
To reuse node_1, first detach it cleanly from its pg_auto_failover leftovers:

a) Comment out the auto-generated include directives in postgresql.conf:

#include 'postgresql-auto-failover-standby.conf' # Auto-generated by pg_auto_failover, do not remove
#include 'postgresql-auto-failover.conf' # Auto-generated by pg_auto_failover, do not remove

b) Rename the auto-generated configuration files out of the way:

mv postgresql-auto-failover.conf postgresql-auto-failover.conf.bak
mv postgresql-auto-failover-standby.conf postgresql-auto-failover-standby.conf.bak

c) Comment out the standby-related parameters:

#hot_standby=off
#synchronous_standby_names=''

d) Restart for the changes to take effect:

pg_ctl restart -m fast -D /data/pg_data
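For repeatability, steps a), b) and d) can be scripted; a minimal sketch, assuming the layout above (the parameter edits from step c) are left manual here, and postgresql.conf should be backed up before running anything like this):

# Detach a node from pg_auto_failover management (illustrative)
PGDATA=/data/pg_data
cd "$PGDATA"

# a) comment out the auto-generated include directives (keeps an .orig copy)
sed -i.orig "s/^include 'postgresql-auto-failover/#&/" postgresql.conf

# b) park the generated files so nothing re-reads them
mv postgresql-auto-failover.conf         postgresql-auto-failover.conf.bak
mv postgresql-auto-failover-standby.conf postgresql-auto-failover-standby.conf.bak

# d) restart and confirm the node is standalone and writable
/usr/pgsql-12/bin/pg_ctl restart -m fast -D "$PGDATA"
psql -Atc 'select pg_is_in_recovery();'   # expect: f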
After the restart, node_1 is writable, and test data can be inserted:

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
(1 row)

postgres=# select name,setting from pg_settings where name='transaction_read_only';
         name          | setting
-----------------------+---------
 transaction_read_only | off
(1 row)

postgres=# create table t(id int);
CREATE TABLE
postgres=# insert into t select generate_series(1,2);
INSERT 0 2
Meanwhile, the monitor still knows only about node_8:

[postgres@monitor pg_data]$ pg_autoctl show state
  Name |  Node |     Host:Port      |    LSN    | Reachable |  Current State  | Assigned State
-------+-------+--------------------+-----------+-----------+-----------------+----------------
node_8 |     8 | 172.20.70.205:5432 | 0/B000148 |       yes |          single |         single
-- Remove the old pg_autoctl state files first; otherwise rejoining the cluster fails with:
-- "10:01:16 47610 FATAL The state file "/home/postgres/.local/share/pg_autoctl/data/pg_data/pg_autoctl.state" exists and there's no init in progress"
rm -Rf /home/postgres/.local

-- Register node_1 as a new standby of node_8 (the new cluster's primary)
sudo -u postgres /usr/pgsql-12/bin/pg_autoctl create postgres --pgdata /data/pg_data/ --auth trust --ssl-self-signed \
  --username postgres --dbname postgres --hostname 172.20.70.204 --pgctl /usr/pgsql-12/bin/pg_ctl \
  --monitor 'postgres://autoctl_node@172.20.70.203:5432/pg_auto_failover?sslmode=require' --replication-quorum false

-- Keeper log on node_1:
10:04:46 47615 INFO  Using default --ssl-mode "require"
10:04:46 47615 INFO  Using --ssl-self-signed: pg_autoctl will create self-signed certificates, allowing for encrypted network traffic
10:04:46 47615 WARN  Self-signed certificates provide protection against eavesdropping; this setup does NOT protect against Man-In-The-Middle attacks nor Impersonation attacks.
10:04:46 47615 WARN  See https://www.postgresql.org/docs/current/libpq-ssl.html for details
10:04:46 47615 INFO  Started pg_autoctl postgres service with pid 47618
10:04:46 47618 INFO  /usr/pgsql-12/bin/pg_autoctl do service postgres --pgdata /data/pg_data/ -v
10:04:46 47615 INFO  Started pg_autoctl node-init service with pid 47619
10:04:46 47619 INFO  Registered node 9 (172.20.70.204:5432) with name "node_1" in formation "default", group 0, state "wait_standby"
10:04:46 47619 INFO  Writing keeper state file at "/home/postgres/.local/share/pg_autoctl/data/pg_data/pg_autoctl.state"
10:04:46 47619 INFO  Writing keeper init state file at "/home/postgres/.local/share/pg_autoctl/data/pg_data/pg_autoctl.init"
10:04:46 47619 INFO  Successfully registered as "wait_standby" to the monitor.
10:04:46 47619 INFO  FSM transition from "init" to "wait_standby": Start following a primary
10:04:46 47619 INFO  Transition complete: current state is now "wait_standby"
10:04:46 47619 INFO  New state for node 8 "node_8" (172.20.70.205:5432): single ➜ wait_primary
10:04:46 47619 INFO  New state for node 8 "node_8" (172.20.70.205:5432): wait_primary ➜ wait_primary
10:04:46 47619 INFO  Still waiting for the monitor to drive us to state "catchingup"
10:04:46 47619 WARN  Please make sure that the primary node is currently running `pg_autoctl run` and contacting the monitor.
10:04:46 47619 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
10:04:46 47619 INFO  Initialising PostgreSQL as a hot standby
10:04:46 47619 INFO  /usr/pgsql-12/bin/pg_basebackup -w -d application_name=pgautofailover_standby_9 host=172.20.70.205 port=5432 user=pgautofailover_replicator sslmode=require --pgdata /data/backup/node_1 -U pgautofailover_replicator --verbose --progress --max-rate 100M --wal-method=stream --slot pgautofailover_standby_9
10:04:46 47619 INFO  pg_basebackup: initiating base backup, waiting for checkpoint to complete
10:04:46 47619 INFO  pg_basebackup: checkpoint completed
10:04:46 47619 INFO  pg_basebackup: write-ahead log start point: 0/C000028 on timeline 12
10:04:46 47619 INFO  pg_basebackup: starting background WAL receiver
10:04:46 47619 INFO      0/33603 kB (0%), 0/1 tablespace (/data/backup/node_1/backup_label )
10:04:46 47619 INFO  33612/33612 kB (100%), 0/1 tablespace (.../backup/node_1/global/pg_control)
10:04:46 47619 INFO  33612/33612 kB (100%), 1/1 tablespace
10:04:46 47619 INFO  pg_basebackup: write-ahead log end point: 0/C000100
10:04:46 47619 INFO  pg_basebackup: waiting for background process to finish streaming ...
10:04:46 47619 INFO  pg_basebackup: syncing data to disk ...
10:04:46 47619 INFO  pg_basebackup: base backup completed
10:04:46 47619 INFO  Creating the standby signal file at "/data/pg_data/standby.signal", and replication setup at "/data/pg_data/postgresql-auto-failover-standby.conf"
10:04:46 47619 INFO  Contents of "/data/pg_data/postgresql-auto-failover-standby.conf" have changed, overwriting
10:04:46 47619 INFO  /bin/openssl req -new -x509 -days 365 -nodes -text -out /data/pg_data/server.crt -keyout /data/pg_data/server.key -subj "/CN=172.20.70.204"
10:04:47 47627 INFO  /usr/pgsql-12/bin/postgres -D /data/pg_data -p 5432 -h *
10:04:47 47618 INFO  Postgres is now serving PGDATA "/data/pg_data" on port 5432 with pid 47627
10:04:47 47619 INFO  PostgreSQL started on port 5432
10:04:47 47619 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes.
10:04:47 47619 INFO  Ensuring HBA rules for node 8 "node_8" (172.20.70.205:5432)
10:04:47 47619 INFO  Transition complete: current state is now "catchingup"
10:04:47 47619 INFO  keeper has been successfully initialized.
10:04:47 47615 WARN  pg_autoctl service node-init exited with exit status 0
10:04:47 47618 INFO  Postgres controller service received signal SIGTERM, terminating
10:04:47 47618 INFO  Stopping pg_autoctl postgres service
10:04:47 47618 INFO  /usr/pgsql-12/bin/pg_ctl --pgdata /data/pg_data --wait stop --mode fast
10:04:47 47615 INFO  Stop pg_autoctl

-- The monitor records the new node joining the cluster:
[postgres@monitor pg_data]$
10:04:46 51157 INFO  Registering node 9 "node_1" (172.20.70.204:5432) to formation "default" with replication quorum false and candidate priority 50 [50]
10:04:46 51157 INFO  New state is reported by node 9 "node_1" (172.20.70.204:5432): "wait_standby"
10:04:46 51157 INFO  New state for node 9 "node_1" (172.20.70.204:5432): wait_standby ➜ wait_standby
10:04:46 51157 INFO  Setting goal state of node 8 "node_8" (172.20.70.205:5432) to wait_primary after node 9 "node_1" (172.20.70.204:5432) joined.
10:04:46 51157 INFO  New state for node 8 "node_8" (172.20.70.205:5432): single ➜ wait_primary
10:04:46 51157 INFO  New state is reported by node 8 "node_8" (172.20.70.205:5432): "wait_primary"
10:04:46 51157 INFO  New state for node 8 "node_8" (172.20.70.205:5432): wait_primary ➜ wait_primary
10:04:46 51157 INFO  Setting goal state of node 9 "node_1" (172.20.70.204:5432) to catchingup after node 8 "node_8" (172.20.70.205:5432) converged to wait_primary.
10:04:46 51157 INFO  New state for node 9 "node_1" (172.20.70.204:5432): wait_standby ➜ catchingup
10:04:47 51157 INFO  New state is reported by node 9 "node_1" (172.20.70.204:5432): "catchingup"
10:04:47 51157 INFO  New state for node 9 "node_1" (172.20.70.204:5432): catchingup ➜ catchingup
Finally, start the keeper on node_1:

/usr/pgsql-12/bin/pg_autoctl run &

-- The monitor shows the new standby converging: once node_1 catches up it becomes
-- secondary, and node_8 moves from wait_primary to primary:
[postgres@monitor pg_data]$
10:06:23 51157 INFO  Node 9 (172.20.70.204:5432) is marked as unhealthy by the monitor
10:06:23 51157 INFO  Node 9 (172.20.70.204:5432) is marked as healthy by the monitor
10:06:23 51157 INFO  Setting goal state of node 9 "node_1" (172.20.70.204:5432) to secondary after it caught up.
10:06:23 51157 INFO  New state for node 9 "node_1" (172.20.70.204:5432): catchingup ➜ secondary
10:06:23 51157 INFO  New state is reported by node 9 "node_1" (172.20.70.204:5432): "secondary"
10:06:23 51157 INFO  New state for node 9 "node_1" (172.20.70.204:5432): secondary ➜ secondary
10:06:23 51157 INFO  Setting goal state of node 8 "node_8" (172.20.70.205:5432) to primary now that at least one secondary candidate node is healthy.
10:06:23 51157 INFO  New state for node 8 "node_8" (172.20.70.205:5432): wait_primary ➜ primary
10:06:23 51157 INFO  New state is reported by node 8 "node_8" (172.20.70.205:5432): "primary"
10:06:23 51157 INFO  New state for node 8 "node_8" (172.20.70.205:5432): primary ➜ primary
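Starting the keeper with a trailing & is fine for a demo; for anything long-lived, pg_auto_failover can emit a systemd unit so the keeper restarts with the host. A sketch following the pattern from the pg_auto_failover documentation (unit name and paths are illustrative):

# Generate the unit file from pg_autoctl itself, then install and enable it
pg_autoctl -q show systemd --pgdata /data/pg_data > /tmp/pgautofailover.service
sudo mv /tmp/pgautofailover.service /etc/systemd/system/pgautofailover.service
sudo systemctl daemon-reload
sudo systemctl enable --now pgautofailover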
[postgres@monitor pg_data]$ pg_autoctl show state
  Name |  Node |     Host:Port      |    LSN    | Reachable |  Current State  | Assigned State
-------+-------+--------------------+-----------+-----------+-----------------+----------------
node_8 |     8 | 172.20.70.205:5432 | 0/D000060 |       yes |         primary |        primary
node_1 |     9 | 172.20.70.204:5432 | 0/D000060 |       yes |       secondary |      secondary

-- the old standby node8 is now the primary of the new cluster
-- the old primary node1 (with a re-initialized data directory) has rejoined as the secondary
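As a last sanity check on the rebuilt node_1: it should be in recovery and streaming from node_8, and the table t created during the split-brain detour should be gone, since the data directory was re-cloned from node_8, which never had it. A sketch using standard views, run on node_1:

select pg_is_in_recovery();                 -- expect: t
select status, sender_host, sender_port
  from pg_stat_wal_receiver;                -- expect: streaming from 172.20.70.205
select to_regclass('public.t');             -- expect: empty (table no longer exists)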

PART3: Closing Words
Database high availability is key to ensuring business continuity, and just as indispensable are a stable, reliable HA solution and mastery of the correct maintenance "posture". If this article helped or inspired you, feel free to add me as a friend, share, and bookmark; here's to traveling the road together in 2024!









