This walkthrough assumes you already have a pgpool-II based high-availability cluster that can fail over between primary and standby. Here we simulate outages in a number of scenarios.
1. Preparation
The experiments use a two-node high-availability cluster (one primary, one standby) running PostgreSQL 12.1. The mounted /pgdata volume holds the software and the database data; /pg_exp is only used for data migration and is not needed in these experiments.
| Node | Hostname | IP | OS | Mounts |
|---|---|---|---|---|
| Primary | postgresql1 | 192.168.136.131 | CentOS 7 | /pgdata: 8 GB, /pg_exp: 2 GB |
| Standby | postgresql2 | 192.168.136.132 | CentOS 7 | /pgdata: 8 GB, /pg_exp: 2 GB |

Virtual IP (VIP): 192.168.136.133
2. Simulating failure of the primary pgpool-II node
Key steps in the failover:
- Graceful shutdown: release the VIP > node shuts down normally > the other nodes detect the lost connection > quorum elects a new leader > the leader takes over the VIP > service continues as normal.
- Crash: the other nodes detect the lost connection and that the node's IP is unreachable > quorum elects a new leader > the leader takes over the VIP > service continues as normal.
Before running the experiments, the following parameters must be configured in pgpool.conf and the service restarted.
# Set the following watchdog options
failover_when_quorum_exists = on
# Only perform backend node failover
# when the watchdog cluster holds the quorum
# (change requires restart)
failover_require_consensus = on
# Perform failover when majority of Pgpool-II nodes
# aggrees on the backend node status change
# (change requires restart)
allow_multiple_failover_requests_from_node = off
# A Pgpool-II node can cast multiple votes
# for building the consensus on failover
# (change requires restart)
enable_consensus_with_half_votes = on
# apply majority rule for consensus and quorum computation
# at 50% of votes in a cluster with even number of nodes.
# when enabled the existence of quorum and consensus
# on failover is resolved after receiving half of the
# total votes in the cluster, otherwise both these
# decisions require at least one more vote than
# half of the total votes.
# (change requires restart)
- The pgpool-II manual explains what these options do:
- failover_when_quorum_exists: when enabled, pgpool considers the watchdog quorum before degrading a backend node or performing failover. The quorum is computed from the number of alive watchdog nodes: it exists only when more than half of the nodes are alive. Requiring a quorum makes pgpool's failure detection more robust. This option must not be enabled in native replication mode.
- failover_require_consensus: when enabled, pgpool performs failover only when the quorum exists and enough nodes agree. For example, in a three-node watchdog cluster, failover of a backend node proceeds only when at least two nodes vote for failing over that same node. When disabled, failover is executed even without such agreement between nodes.
- allow_multiple_failover_requests_from_node: this parameter works together with failover_require_consensus. When enabled, a single pgpool node may cast multiple votes for a failover. The benefit is that persistent errors not visible from other nodes can still be acted on: if a failure found by one health check does not gather enough votes, the next health check for the same failure adds more votes until consensus is reached. This option only takes effect when both options above are enabled.
- enable_consensus_with_half_votes: this parameter controls how pgpool applies the majority rule when computing the quorum and resolving consensus on failover. When enabled, only half of the total configured votes in the cluster is required; otherwise, at least one vote more than half is needed. The parameter only matters when the watchdog cluster has an even number of nodes; an odd-sized cluster is unaffected. (In this experiment it must be enabled, because the cluster has two nodes.) With this parameter enabled in a two-node watchdog cluster, a single alive node is enough for the quorum to exist; with it off, both nodes must be alive. (Note that enabling it can lead to split-brain, so an odd-sized cluster is recommended in production.)
- When a watchdog (pgpool-II) node itself fails, and these options are enabled, the remaining nodes hold an election: they decide which node becomes the new leader, which then brings up the virtual IP, and so on. The election has its own voting mechanism, plus options that can override the result.
- Without this configuration, the VIP may not move after the primary pgpool-II node shuts down, leaving the cluster temporarily unreachable.
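The voting arithmetic described above can be modeled with a small function. This is an illustration based on my reading of the manual, not pgpool-II's actual source: it shows why a two-node cluster needs enable_consensus_with_half_votes turned on.

```shell
# Model of the watchdog quorum rule (an illustration based on the manual,
# not pgpool-II source code).
# Usage: quorum_exists TOTAL_NODES ALIVE_NODES HALF_VOTES(on|off)
quorum_exists() {
  total=$1; alive=$2; half=$3
  if [ "$half" = on ] && [ $((total % 2)) -eq 0 ]; then
    # even-sized cluster with enable_consensus_with_half_votes = on:
    # exactly half of the votes is enough
    [ $((alive * 2)) -ge "$total" ]
  else
    # otherwise: strictly more than half of the votes is required
    [ $((alive * 2)) -gt "$total" ]
  fi
}

quorum_exists 2 1 on  && echo "2 nodes, 1 alive, half_votes=on: quorum exists"
quorum_exists 2 1 off || echo "2 nodes, 1 alive, half_votes=off: no quorum"
quorum_exists 3 1 on  || echo "3 nodes, 1 alive: no quorum (odd size ignores the option)"
```

With the option off, losing either node of a two-node cluster also loses the quorum, so no failover can be agreed on; with it on, the single survivor is enough, at the cost of a possible split-brain.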
# At this point the VIP alias exists on the primary node's NIC, but not on the standby's.
[postgres@postgresql1 pgpool-II]$ ifconfig
...
ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.136.133 netmask 255.255.255.0 broadcast 0.0.0.0
ether 00:0c:29:b3:b6:f0 txqueuelen 1000 (Ethernet)
...
# Stop pgpool on the primary pgpool-II node
[postgres@postgresql1 pgpool-II]$ /pgdata/app/PostgreSQL/pgpool-II/bin/pgpool -m fast stop
# After the former standby takes over, access the cluster
# pgpool.log shows the stopped node releasing the VIP and the other node taking it over
# On the standby, the VIP alias interface has appeared: the primary's virtual IP has moved over
[postgres@postgresql2 pgpool-II]$ ifconfig
...
ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.136.133 netmask 255.255.255.0 broadcast 0.0.0.0
ether 00:0c:29:b3:b6:f0 txqueuelen 1000 (Ethernet)
...
# Connect through the virtual IP
[postgres@postgresql1 pgpool-II]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_de
lay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+---------------
----+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | up | 0.500000 | primary | 0 | true | 0
| | | 2020-10-19 13:43:35
1 | 192.168.136.132 | 5432 | up | 0.500000 | standby | 0 | false | 0
| | | 2020-10-19 13:43:35
(2 rows)
postgres=# \q
# Access succeeded: after pgpool on the primary node went down, the standby's pgpool took over the VIP and the cluster service without interrupting client access.
# Restore pgpool on the original primary node
[postgres@postgresql1 pgpool-II]$ /pgdata/app/PostgreSQL/pgpool-II/bin/pgpool -n -D > /pgdata/app/PostgreSQL/pgpool-II/log/pgpool.log 2>&1 &
[1] 14948
# The node's log shows it has rejoined the watchdog cluster as a standby node
This experiment shows that after the primary pgpool-II node goes down, the only recovery step needed is to start the pgpool service again.
3. Simulating failure of the standby pgpool-II node
After the previous experiment, the standby pgpool-II node is now postgresql1, while postgresql2 has become the cluster leader.
# Stop the standby pgpool node
[postgres@postgresql1 pgpool-II]$ /pgdata/app/PostgreSQL/pgpool-II/bin/pgpool -m fast stop
2020-10-20 09:04:41: pid 15859: LOG: stop request sent to pgpool. waiting for termination...
.done.
[1]+  Done  /pgdata/app/PostgreSQL/pgpool-II/bin/pgpool -n -D > /pgdata/app/PostgreSQL/pgpool-II/log/pgpool.log 2>&1
# The primary pgpool node is still the leader and keeps the VIP interface
[postgres@postgresql2 pgpool-II]$ ifconfig
...
ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.136.133 netmask 255.255.255.0 broadcast 0.0.0.0
ether 00:0c:29:ab:86:be txqueuelen 1000 (Ethernet)
...
# The cluster is still accessible
[postgres@postgresql1 pgpool-II]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_
delay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------
------+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | up | 0.500000 | primary | 0 | true | 0
| | | 2020-10-20 08:34:34
1 | 192.168.136.132 | 5432 | up | 0.500000 | standby | 0 | false | 0
| | | 2020-10-20 08:34:34
(2 rows)
postgres=# \q
# Restore the standby pgpool node
[postgres@postgresql1 pgpool-II]$ /pgdata/app/PostgreSQL/pgpool-II/bin/pgpool -n -D > /pgdata/app/PostgreSQL/pgpool-II/log/pgpool.log 2>&1 &
[1] 15992
This experiment shows that after the standby pgpool-II node goes down, the only recovery step needed is to start the pgpool service again.
4. Simulating failure of the primary database
After the experiments above, postgresql1 is the standby node in pgpool, yet the primary database runs on it: the database primary and the pgpool leader need not be on the same host.
# Simulate primary database failure
[postgres@postgresql1 pgpool-II]$ pg_ctl stop -m fast
waiting for server to shut down..... done
server stopped
[postgres@postgresql1 pgpool-II]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_
delay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------
------+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | down | 0.500000 | standby | 0 | false | 0
| | | 2020-10-20 09:16:32
1 | 192.168.136.132 | 5432 | up | 0.500000 | primary | 0 | true | 0
| | | 2020-10-20 09:16:32
(2 rows)
postgres=# \q
# After the primary database went down, the standby took over and became the primary (its state changed from standby to primary)
# Recover the original primary node and start its database
[postgres@postgresql1 12.1]$ mv data/ databak/
[postgres@postgresql1 12.1]$ pg_basebackup -D $PGDATA -Fp -Xs -v -P -R -h 192.168.136.132 -p 5432 -U replica
Password:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/E000028 on timeline 5
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_15467"
24720/24720 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/E000138
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: base backup completed
[postgres@postgresql1 12.1]$ pg_ctl start -D $PGDATA
waiting for server to start....2020-10-20 09:21:17.037 CST [16489] LOG: starting PostgreSQL 12.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-10-20 09:21:17.038 CST [16489] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-10-20 09:21:17.038 CST [16489] LOG: listening on IPv6 address "::", port 5432
2020-10-20 09:21:17.040 CST [16489] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
.2020-10-20 09:21:18.095 CST [16489] LOG: redirecting log output to logging collector process
2020-10-20 09:21:18.095 CST [16489] HINT: Future log output will appear in directory "log".
done
server started
# Check the node's state
[postgres@postgresql1 12.1]$ pg_controldata | grep 'Database cluster state'
Database cluster state: in archive recovery
# Re-attach the node to the cluster
[postgres@postgresql1 12.1]$ cd ../pgpool-II/bin/
[postgres@postgresql1 bin]$ ./pcp_attach_node -d -U postgres -n 0
Password:
DEBUG: recv: tos="m", len=8
DEBUG: recv: tos="r", len=21
DEBUG: send: tos="C", len=6
DEBUG: recv: tos="c", len=20
pcp_attach_node -- Command Successful
DEBUG: send: tos="X", len=4
[postgres@postgresql1 bin]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_
delay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------
------+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | up | 0.500000 | standby | 0 | true | 0
| | | 2020-10-20 09:23:08
1 | 192.168.136.132 | 5432 | up | 0.500000 | primary | 0 | false | 0
| | | 2020-10-20 09:16:32
(2 rows)
postgres=# \q
# The original primary database is back up as the standby; the original standby has become the primary.
This experiment shows that after the primary database goes down, recovery requires: clearing the node's data directory > re-syncing the data with pg_basebackup > starting the database service > re-attaching the database to the cluster.
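The recovery sequence can be collected into a small script. This is a dry-run sketch: it only prints the commands instead of executing them, and the paths, hosts, replication user, and node id match this walkthrough's setup and must be adjusted for any other environment.

```shell
# Dry-run sketch of recovering a failed former primary as a new standby.
# Paths, hosts, and the replica user are the ones used in this walkthrough.
PGDATA=/pgdata/app/PostgreSQL/12.1/data
NEW_PRIMARY=192.168.136.132
PCP_BIN=/pgdata/app/PostgreSQL/pgpool-II/bin

STEPS=""
step() { STEPS="$STEPS$*
"; echo "+ $*"; }                                 # record and print each step

step mv "$PGDATA" "${PGDATA}bak"                  # set the old data dir aside
step pg_basebackup -D "$PGDATA" -Fp -Xs -v -P -R \
     -h "$NEW_PRIMARY" -p 5432 -U replica         # clone from the new primary
step pg_ctl start -D "$PGDATA"                    # start it as a standby
step "$PCP_BIN/pcp_attach_node" -U postgres -n 0  # re-attach node 0 in pgpool
```

To actually run the procedure, replace the `step` wrapper with direct execution once the values are confirmed for your cluster.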
5. Simulating failure of the standby database
After the previous experiment, the primary and standby databases have swapped roles, and so have the pgpool nodes: postgresql2 now hosts the primary database and the pgpool leader, while postgresql1 hosts the standby database and the standby pgpool node.
# Simulate standby database failure
[postgres@postgresql1 bin]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
[postgres@postgresql1 bin]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_
delay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------
------+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | down | 0.500000 | standby | 0 | false | 0
| | | 2020-10-20 09:26:42
1 | 192.168.136.132 | 5432 | up | 0.500000 | primary | 0 | true | 0
| | | 2020-10-20 09:16:32
(2 rows)
postgres=# \q
# The primary is running; the standby shows as down
# Simulate the application inserting data at this point
[postgres@postgresql1 bin]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# \d
List of relations
Schema | Name | Type | Owner
--------+---------+-------+----------
public | test_ms | table | postgres
(1 row)
postgres=# table test_ms;
id | name
----+----------
26 | a
27 | b
postgres=# insert into test_ms values(30,'can?');
INSERT 0 1
postgres=# table test_ms;
id | name
----+----------
26 | a
27 | b
30 | can?
postgres=# \q
# Start the standby database and check its state in the cluster
[postgres@postgresql1 bin]$ pg_ctl start -D $PGDATA
waiting for server to start....2020-10-20 09:32:46.632 CST [16616] LOG: starting PostgreSQL 12.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-10-20 09:32:46.632 CST [16616] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-10-20 09:32:46.633 CST [16616] LOG: listening on IPv6 address "::", port 5432
2020-10-20 09:32:46.634 CST [16616] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-10-20 09:32:46.734 CST [16616] LOG: redirecting log output to logging collector process
2020-10-20 09:32:46.734 CST [16616] HINT: Future log output will appear in directory "log".
done
server started
[postgres@postgresql1 bin]$ pg_controldata | grep 'Database cluster state'
Database cluster state: in archive recovery
# Re-attach the database to the cluster (required; otherwise the node still shows as down when accessed through the VIP)
[postgres@postgresql1 bin]$ ./pcp_attach_node -d -U postgres -n 0
Password:
DEBUG: recv: tos="m", len=8
DEBUG: recv: tos="r", len=21
DEBUG: send: tos="C", len=6
DEBUG: recv: tos="c", len=20
pcp_attach_node -- Command Successful
DEBUG: send: tos="X", len=4
[postgres@postgresql1 bin]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_
delay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------
------+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | up | 0.500000 | standby | 0 | false | 0
| | | 2020-10-20 09:34:10
1 | 192.168.136.132 | 5432 | up | 0.500000 | primary | 3 | true | 0
| | | 2020-10-20 09:16:32
(2 rows)
postgres=# \q
# 从数据库的数据保持一致
[postgres@postgresql1 bin]$ psql
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# table test_ms;
id | name
----+----------
26 | a
27 | b
30 | can?
postgres=# \q
This experiment shows that even if new data was written to the cluster while the standby database was down, the data is consistent again once the standby recovers; there is no need to copy the data directory over from the primary again.
After a standby database failure, recovery requires only: starting the database service > re-attaching the database to the cluster.
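Compared with a primary failure, the standby's recovery skips the base backup entirely, because streaming replication catches the node up automatically. A dry-run sketch (commands printed, not executed; paths and the node id are from this walkthrough):

```shell
# Dry-run sketch of bringing a crashed standby database back.
# No pg_basebackup needed: streaming replication catches it up.
PGDATA=/pgdata/app/PostgreSQL/12.1/data
PCP_BIN=/pgdata/app/PostgreSQL/pgpool-II/bin

STEPS=""
step() { STEPS="$STEPS$*
"; echo "+ $*"; }                                 # record and print each step

step pg_ctl start -D "$PGDATA"                    # restart the standby
step "$PCP_BIN/pcp_attach_node" -U postgres -n 0  # re-attach it in pgpool
```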
6. Simulating failure of the primary server
Currently postgresql2 hosts the primary database and the pgpool leader; postgresql1 hosts the standby database and the standby pgpool node.
# Simulate primary server failure
[postgres@postgresql2 pgpool-II]$ exit
logout
[root@postgresql2 ~]# reboot
# The watchdog cluster elects a new leader, which brings up the virtual IP interface
[postgres@postgresql1 ~]$ ifconfig
...
ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.136.133 netmask 255.255.255.0 broadcast 0.0.0.0
ether 00:0c:29:b3:b6:f0 txqueuelen 1000 (Ethernet)
...
# Cluster status
[postgres@postgresql1 ~]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_
delay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------
------+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | up | 0.500000 | primary | 0 | true | 0
| | | 2020-10-20 09:56:48
1 | 192.168.136.132 | 5432 | down | 0.500000 | standby | 0 | false | 0
| | | 2020-10-20 09:55:25
(2 rows)
# postgresql1 has become the primary; postgresql2 shows as down
# Insert data into the cluster at this point
postgres=# table test_ms;
id | name
----+----------
26 | a
27 | b
30 | can?
postgres=# insert into test_ms values(35,'in');
INSERT 0 1
postgres=# table test_ms;
id | name
----+----------
26 | a
27 | b
30 | can?
35 | in
postgres=# \q
# Start pgpool on postgresql2
[postgres@postgresql2 pgpool-II]$ /pgdata/app/PostgreSQL/pgpool-II/bin/pgpool -n -D > /pgdata/app/PostgreSQL/pgpool-II/log/pgpool.log 2>&1 &
[1] 1325
# Start the database directly, without taking a base backup first, and check the node's state
[postgres@postgresql2 pgpool-II]$ pg_ctl start -D $PGDATA
waiting for server to start....2020-10-20 10:08:42.799 CST [1302] LOG: starting PostgreSQL 12.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-10-20 10:08:42.801 CST [1302] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-10-20 10:08:42.801 CST [1302] LOG: listening on IPv6 address "::", port 5432
2020-10-20 10:08:42.803 CST [1302] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-10-20 10:08:43.454 CST [1302] LOG: redirecting log output to logging collector process
2020-10-20 10:08:43.454 CST [1302] HINT: Future log output will appear in directory "log".
done
server started
[postgres@postgresql2 pgpool-II]$ pg_controldata | grep 'Database cluster state'
Database cluster state: in production
# The node is 'in production' rather than 'in archive recovery',
# so taking a base backup here (with pg_basebackup) is necessary
# Stop the database
[postgres@postgresql2 pgpool-II]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
# Set the data directory aside
[postgres@postgresql2 12.1]$ mv data/ databak/
[postgres@postgresql2 12.1]$ mkdir data
[postgres@postgresql2 12.1]$ chmod 0750 data
[postgres@postgresql2 12.1]$ pg_basebackup -D $PGDATA -F p -X s -v -P -R -h 192.168.136.131 -p 5432 -U replica
Password:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/12000028 on timeline 6
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_17852"
24725/24725 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/12000100
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: base backup completed
# Start the database and check the node's state in the cluster
[postgres@postgresql2 12.1]$ pg_ctl start -D $PGDATA
waiting for server to start....2020-10-20 10:19:04.705 CST [1528] LOG: starting PostgreSQL 12.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-10-20 10:19:04.705 CST [1528] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-10-20 10:19:04.706 CST [1528] LOG: listening on IPv6 address "::", port 5432
2020-10-20 10:19:04.706 CST [1528] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
.2020-10-20 10:19:05.908 CST [1528] LOG: redirecting log output to logging collector process
2020-10-20 10:19:05.908 CST [1528] HINT: Future log output will appear in directory "log".
done
server started
[postgres@postgresql2 12.1]$ pg_controldata | grep 'Database cluster state'
Database cluster state: in archive recovery
# The database is now in archive recovery
# Attach the node to the cluster
[postgres@postgresql2 12.1]$ cd ../pgpool-II/bin/
[postgres@postgresql2 bin]$ ./pcp_attach_node -d -U postgres -n 1
Password:
DEBUG: recv: tos="m", len=8
DEBUG: recv: tos="r", len=21
DEBUG: send: tos="C", len=6
DEBUG: recv: tos="c", len=20
pcp_attach_node -- Command Successful
DEBUG: send: tos="X", len=4
[postgres@postgresql2 bin]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_
delay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------
------+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | up | 0.500000 | primary | 3 | false | 0
| | | 2020-10-20 09:56:48
1 | 192.168.136.132 | 5432 | up | 0.500000 | standby | 0 | true | 0
| | | 2020-10-20 10:20:34
(2 rows)
postgres=# \q
# Check the standby's data
[postgres@postgresql2 bin]$ psql
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# table test_ms;
id | name
----+----------
26 | a
27 | b
30 | can?
35 | in
postgres=# \q
This experiment shows that after the primary server goes down, recovery requires: starting the pgpool service > clearing the standby node's data directory > re-syncing the data with pg_basebackup > starting the database service > re-attaching the database to the cluster.
7. Simulating failure of the standby server
Currently postgresql1 hosts the primary database and the pgpool leader; postgresql2 hosts the standby database and the standby pgpool node.
# Simulate standby server failure
[postgres@postgresql2 bin]$ exit
logout
[root@postgresql2 ~]# reboot
# The primary pgpool node is still the leader and keeps the VIP interface
[postgres@postgresql1 ~]$ ifconfig
...
ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.136.133 netmask 255.255.255.0 broadcast 0.0.0.0
ether 00:0c:29:b3:b6:f0 txqueuelen 1000 (Ethernet)
...
# Cluster status
[postgres@postgresql1 ~]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_
delay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------
------+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | up | 0.500000 | primary | 3 | true | 0
| | | 2020-10-20 09:56:48
1 | 192.168.136.132 | 5432 | down | 0.500000 | standby | 0 | false | 0
| | | 2020-10-20 10:43:26
(2 rows)
# postgresql1 is still the primary; postgresql2 is the standby but shows as down
# Insert data into the cluster at this point
postgres=# table test_ms;
id | name
----+----------
26 | a
27 | b
30 | can?
35 | in
postgres=# insert into test_ms values(40,'hello');
INSERT 0 1
postgres=# table test_ms;
id | name
----+----------
26 | a
27 | b
30 | can?
35 | in
40 | hello
postgres=# \q
# Start pgpool on postgresql2
[postgres@postgresql2 ~]$ /pgdata/app/PostgreSQL/pgpool-II/bin/pgpool -n -D > /pgdata/app/PostgreSQL/pgpool-II/log/pgpool.log 2>&1 &
[1] 1290
# Start the database directly, without a base backup, and check the node's state
[postgres@postgresql2 ~]$ pg_ctl start -D $PGDATA
waiting for server to start....2020-10-20 11:01:38.470 CST [1443] LOG: starting PostgreSQL 12.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-10-20 11:01:38.471 CST [1443] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-10-20 11:01:38.471 CST [1443] LOG: listening on IPv6 address "::", port 5432
2020-10-20 11:01:38.472 CST [1443] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-10-20 11:01:38.576 CST [1443] LOG: redirecting log output to logging collector process
2020-10-20 11:01:38.576 CST [1443] HINT: Future log output will appear in directory "log".
done
server started
[postgres@postgresql2 ~]$ pg_controldata | grep 'Database cluster state'
Database cluster state: in archive recovery
# The database is now in archive recovery
# Attach the node to the cluster
[postgres@postgresql2 ~]$ cd /pgdata/app/PostgreSQL/pgpool-II/bin/
[postgres@postgresql2 bin]$ ./pcp_attach_node -d -U postgres -n 1
Password:
DEBUG: recv: tos="m", len=8
DEBUG: recv: tos="r", len=21
DEBUG: send: tos="C", len=6
DEBUG: recv: tos="c", len=20
pcp_attach_node -- Command Successful
DEBUG: send: tos="X", len=4
[postgres@postgresql2 bin]$ psql -h 192.168.136.133 -p 9999
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_
delay | replication_state | replication_sync_state | last_status_change
---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------
------+-------------------+------------------------+---------------------
0 | 192.168.136.131 | 5432 | up | 0.500000 | primary | 5 | true | 0
| | | 2020-10-20 09:56:48
1 | 192.168.136.132 | 5432 | up | 0.500000 | standby | 0 | false | 0
| | | 2020-10-20 11:03:08
(2 rows)
postgres=# \q
# Check the standby's data
[postgres@postgresql2 bin]$ psql
Password for user postgres:
psql (12.1)
Type "help" for help.
postgres=# table test_ms;
id | name
----+----------
26 | a
27 | b
30 | can?
35 | in
40 | hello
postgres=# \q
A standby server failure is essentially the same as a standby database failure: no data re-sync with pg_basebackup is needed.
This experiment shows that after the standby server goes down, recovery requires: starting the pgpool service > starting the database service > re-attaching the database to the cluster.
Summary
- At the PostgreSQL level, failover runs the failover script, which ultimately executes pg_ctl promote;
- A failure at the pgpool-II level, and the subsequent detection and VIP switchover, does not affect the running state of the underlying PostgreSQL databases;
- Whenever the primary database itself goes down, recovery always involves: clearing the standby node's data directory > re-syncing the data with pg_basebackup.




