
PostgreSQL High-Availability Cluster with Patroni (Part 1)

Original article by 贺晓群, 2021-06-24

Server list:

Node name   IP              OS          Installed software                             Notes
pg_node1    192.168.210.15  CentOS 7.6  PostgreSQL 13.3 / patroni 2.0.2 / etcd 3.5.0   initial primary
pg_node2    192.168.210.81  CentOS 7.6  PostgreSQL 13.3 / patroni 2.0.2 / etcd 3.5.0   initial standby
pg_node3    192.168.210.33  CentOS 7.6  PostgreSQL 13.3 / patroni 2.0.2 / etcd 3.5.0   initial standby

VIP that floats to the primary node: 192.168.210.66

Host configuration (all nodes)

Set the hostname

Set each host's name according to the table above.

hostnamectl set-hostname "pg_node1"
# adjust accordingly on pg_node2 and pg_node3

Disable SELinux

sed -i 's/SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
# takes effect after the reboot performed below

Configure the firewall

The firewall must open the PostgreSQL, etcd and patroni ports:

  • postgres: 5432
  • patroni: 8008
  • etcd: 2379/2380

firewall-cmd --add-port=5432/tcp --permanent
firewall-cmd --add-port=8008/tcp --permanent
firewall-cmd --add-port=2379/tcp --permanent
firewall-cmd --add-port=2380/tcp --permanent
firewall-cmd --reload
firewall-cmd --list-all

Set the host time zone

timedatectl set-timezone Asia/Shanghai

Configure time synchronization

yum -y install chrony
sed -i '/^server/d' /etc/chrony.conf
echo 'server s1a.time.edu.cn iburst' >> /etc/chrony.conf
systemctl start chronyd
systemctl enable chronyd

Install required packages

yum -y install gcc epel-release wget readline* zlib* bzip2 gcc-c++ openssl-devel python-pip python-psycopg2 python-devel lrzsz jq

Reboot the servers

reboot

Create the installation user

groupadd -g 5432 postgres
useradd -u 5432 -g postgres postgres
echo 'Test123456' | passwd -f --stdin postgres

Install etcd

wget https://github.com/coreos/etcd/releases/download/v3.5.0/etcd-v3.5.0-linux-amd64.tar.gz
tar -zxvf etcd-v3.5.0-linux-amd64.tar.gz -C /opt/
cd /opt
mv etcd-v3.5.0-linux-amd64 etcd-v3.5.0
mkdir /etc/etcd
chown -R postgres:postgres /opt/etcd-v3.5.0 /etc/etcd
su - postgres

# etcd configuration for pg_node1
cat >> /etc/etcd/conf.yml <<EOF
name: etcd-1
data-dir: /opt/etcd-v3.5.0/data
listen-client-urls: http://192.168.210.15:2379,http://127.0.0.1:2379
advertise-client-urls: http://192.168.210.15:2379,http://127.0.0.1:2379
listen-peer-urls: http://192.168.210.15:2380
initial-advertise-peer-urls: http://192.168.210.15:2380
initial-cluster: etcd-1=http://192.168.210.15:2380,etcd-2=http://192.168.210.81:2380,etcd-3=http://192.168.210.33:2380
initial-cluster-token: etcd-cluster-token
initial-cluster-state: new
EOF

# etcd configuration for pg_node2
cat >> /etc/etcd/conf.yml <<EOF
name: etcd-2
data-dir: /opt/etcd-v3.5.0/data
listen-client-urls: http://192.168.210.81:2379,http://127.0.0.1:2379
advertise-client-urls: http://192.168.210.81:2379,http://127.0.0.1:2379
listen-peer-urls: http://192.168.210.81:2380
initial-advertise-peer-urls: http://192.168.210.81:2380
initial-cluster: etcd-1=http://192.168.210.15:2380,etcd-2=http://192.168.210.81:2380,etcd-3=http://192.168.210.33:2380
initial-cluster-token: etcd-cluster-token
initial-cluster-state: new
EOF

# etcd configuration for pg_node3
cat >> /etc/etcd/conf.yml <<EOF
name: etcd-3
data-dir: /opt/etcd-v3.5.0/data
listen-client-urls: http://192.168.210.33:2379,http://127.0.0.1:2379
advertise-client-urls: http://192.168.210.33:2379,http://127.0.0.1:2379
listen-peer-urls: http://192.168.210.33:2380
initial-advertise-peer-urls: http://192.168.210.33:2380
initial-cluster: etcd-1=http://192.168.210.15:2380,etcd-2=http://192.168.210.81:2380,etcd-3=http://192.168.210.33:2380
initial-cluster-token: etcd-cluster-token
initial-cluster-state: new
EOF

# environment variables
echo 'export ETCDCTL_API=3' >> /etc/profile
echo 'export PATRONICTL_CONFIG_FILE=/etc/patroni/patroni.yml' >> /etc/profile
echo 'PATH=/opt/PostgreSQL/13/bin:/opt/etcd-v3.5.0:$PATH' >> /etc/profile
source /etc/profile

# systemd unit file
su - root
cat >> /usr/lib/systemd/system/etcd.service <<EOF
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
WorkingDirectory=/opt/etcd-v3.5.0/
User=postgres
ExecStart=/opt/etcd-v3.5.0/etcd --config-file=/etc/etcd/conf.yml
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable etcd
systemctl start etcd

# check cluster member status
[root@pg_node1 data]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        15 |       4863 |               4863 |        |
| 192.168.210.81:2379 |  ff5595d67d21105 |   3.5.0 |  1.0 MB |      true |      false |        15 |       4863 |               4863 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |     false |      false |        15 |       4863 |               4863 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node1 data]# etcdctl endpoint health --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+--------+------------+-------+
|      ENDPOINT       | HEALTH |    TOOK    | ERROR |
+---------------------+--------+------------+-------+
| 192.168.210.81:2379 |   true | 7.232213ms |       |
| 192.168.210.33:2379 |   true | 7.497868ms |       |
| 192.168.210.15:2379 |   true | 7.464251ms |       |
+---------------------+--------+------------+-------+
[root@pg_node1 data]# etcdctl member list -w table
+------------------+---------+--------+----------------------------+--------------------------------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS         |                   CLIENT ADDRS                   | IS LEARNER |
+------------------+---------+--------+----------------------------+--------------------------------------------------+------------+
|  ff5595d67d21105 | started | etcd-2 | http://192.168.210.81:2380 | http://127.0.0.1:2379,http://192.168.210.81:2379 |      false |
| 2baa1a77ec379977 | started | etcd-1 | http://192.168.210.15:2380 | http://127.0.0.1:2379,http://192.168.210.15:2379 |      false |
| b5d9c4826815356e | started | etcd-3 | http://192.168.210.33:2380 | http://127.0.0.1:2379,http://192.168.210.33:2379 |      false |
+------------------+---------+--------+----------------------------+--------------------------------------------------+------------+
# see etcdctl --help for all options
etcd configuration options:

# etcd member name, free-form
name: etcd-1
# directory where etcd stores its data
data-dir: /opt/etcd-v3.5.0/data
# URLs to listen on for client traffic
listen-client-urls: http://192.168.210.15:2379,http://127.0.0.1:2379
# client URLs advertised to clients; TCP port 2379 serves client requests
advertise-client-urls: http://192.168.210.15:2379,http://127.0.0.1:2379
# URLs to listen on for traffic from the other members
listen-peer-urls: http://192.168.210.15:2380
# peer URLs advertised to the rest of the cluster; port 2380 is used for cluster communication
initial-advertise-peer-urls: http://192.168.210.15:2380
# all members of the cluster
initial-cluster: etcd-1=http://192.168.210.15:2380,etcd-2=http://192.168.210.81:2380,etcd-3=http://192.168.210.33:2380
# token uniquely identifying the cluster
initial-cluster-token: etcd-cluster-token
# cluster state: "new" when creating a cluster, "existing" when joining one
initial-cluster-state: new
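Since the three conf.yml files above differ only in the member name and the node IP, they can be generated from one template instead of being written out three times. A minimal sketch (the IPs, paths and token are the ones used in this article; `gen_etcd_conf` is a hypothetical helper name):

```shell
#!/bin/bash
# Emit an etcd conf.yml for one member of the three-node cluster.
# Usage: gen_etcd_conf <index 1-3> <node-ip>
CLUSTER="etcd-1=http://192.168.210.15:2380,etcd-2=http://192.168.210.81:2380,etcd-3=http://192.168.210.33:2380"

gen_etcd_conf() {
  local idx=$1 ip=$2
  cat <<EOF
name: etcd-${idx}
data-dir: /opt/etcd-v3.5.0/data
listen-client-urls: http://${ip}:2379,http://127.0.0.1:2379
advertise-client-urls: http://${ip}:2379,http://127.0.0.1:2379
listen-peer-urls: http://${ip}:2380
initial-advertise-peer-urls: http://${ip}:2380
initial-cluster: ${CLUSTER}
initial-cluster-token: etcd-cluster-token
initial-cluster-state: new
EOF
}

# Example: print the pg_node2 config (redirect to /etc/etcd/conf.yml on that node)
gen_etcd_conf 2 192.168.210.81
```

On each node you would redirect the output for that node's index and IP, which keeps the shared fields (cluster membership, token, data dir) guaranteed identical.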

Install PostgreSQL:

# download the source tarball
wget https://ftp.postgresql.org/pub/source/v13.3/postgresql-13.3.tar.bz2
# unpack it
tar xjvf postgresql-13.3.tar.bz2
# create the data directory
mkdir -p -m 700 /opt/PostgreSQL/13/data
chown -R postgres:postgres /opt/PostgreSQL/13/data
# configure, build and install
cd postgresql-13.3
./configure --prefix=/opt/PostgreSQL/13 --with-pgport=5432 --with-python --with-openssl
gmake -j 8 world
make install-world

Install Patroni:

curl https://bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py
python get-pip.py
pip install --upgrade pip
pip install --upgrade setuptools
pip install --ignore-installed psycopg2
pip install psycopg2-binary
# verify the driver
python -c "import psycopg2; print(psycopg2.__version__)"
pip install patroni[etcd]

Grant sudo privileges:

cat >> /etc/sudoers <<EOF
postgres ALL=(root) NOPASSWD: ALL
EOF

Create the patroni service:

cat >> /usr/lib/systemd/system/patroni.service <<EOF
[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target

[Service]
Type=simple
User=postgres
Group=postgres
EnvironmentFile=-/etc/patroni/patroni_env.conf
# use the watchdog to supervise the service
ExecStartPre=-/usr/bin/sudo /sbin/modprobe softdog
# the service runs as the postgres user, so sudo is needed here
ExecStartPre=-/usr/bin/sudo /bin/chown postgres /dev/watchdog
# adjust the path to the patroni binary if necessary
ExecStart=/usr/bin/patroni /etc/patroni/patroni.yml
ExecReload=/bin/kill -s HUP \$MAINPID
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target
EOF

# reload systemd
systemctl daemon-reload

Install watchdog

The watchdog guards against split brain. If a failure on the Leader node keeps the patroni process from refreshing the watchdog in time, the node is rebooted 5 seconds before the Leader key expires. If the reboot completes within those 5 seconds, the node still has a chance to reacquire the Leader lock; otherwise the Leader key expires and the standbys elect a new Leader. Patroni tries to activate the watchdog before promoting PostgreSQL to primary; if activation fails and the watchdog mode is "required", the node refuses to become primary. When deciding whether to take part in a leader election, Patroni also checks that the watchdog configuration allows it to become the leader. After PostgreSQL is demoted (for example by a manual failover), Patroni disables the watchdog again. The watchdog is also disabled while Patroni is paused, and when the Patroni service is stopped cleanly.

# install the package (a built-in Linux facility)
yum install -y watchdog
# create the watchdog character device
modprobe softdog
# adjust permissions on /dev/watchdog
chmod 666 /dev/watchdog
# start the watchdog service
systemctl start watchdog
systemctl enable watchdog
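The interplay of the timing parameters can be checked with simple arithmetic. A sketch using the values configured later in this article (ttl=30, loop_wait=10, safety_margin=5):

```shell
#!/bin/bash
# With ttl=30 and safety_margin=5, the watchdog timer is armed for
# ttl - safety_margin = 25 seconds. Patroni refreshes the leader key
# and the watchdog every loop_wait = 10 seconds, so up to two refresh
# cycles can be missed before the node self-resets.
ttl=30
loop_wait=10
safety_margin=5

watchdog_timeout=$((ttl - safety_margin))
missed_cycles=$((watchdog_timeout / loop_wait))

echo "watchdog timeout: ${watchdog_timeout}s"
echo "refresh cycles that fit before a reset: ${missed_cycles}"
```

In other words, a single delayed refresh does not reboot the node; only a sustained hang on the Leader does, and it does so early enough to beat the Leader key expiry.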

Configure patroni

su - postgres
sudo mkdir -p /etc/patroni
sudo chown -R postgres:postgres /etc/patroni

# pg_node1
# the config file path was already set in the systemd unit above
cat >> /etc/patroni/patroni.yml <<EOF
scope: twpg          # also written into PostgreSQL's cluster_name parameter
namespace: /service/ # key prefix in etcd, e.g. /service/twpg/
name: pg1            # patroni member name, different on every node

restapi:
  listen: 0.0.0.0:8008                 # keep the default, listen on port 8008 on all interfaces
  connect_address: 192.168.210.15:8008 # address for local API communication

etcd3: # etcd v3 is recommended here; with the default etcd v2 the keys patroni writes are not visible to an etcd v3 client
  hosts: 192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379 # etcd addresses; for a single-node etcd, replace "hosts" with "host"

log:
  dir: /etc/patroni
  file_size: 50000000
  file_num: 10
  dateformat: '%Y-%m-%d %H:%M:%S'
  loggers:
    patroni.postmaster: WARNING
    #etcd.client: DEBUG
    #urllib3: DEBUG

# bootstrap section: when the cluster is first initialized, patroni writes this to etcd under /namespace/scope/config
bootstrap:
  dcs:
    ttl: 30                          # expiry of the leader key, i.e. the failover time after a primary failure
    loop_wait: 10                    # interval between leader-key refreshes
    retry_timeout: 10                # timeout (seconds) for etcd and PostgreSQL operation retries; any outage shorter than this does not demote the leader, so e.g. a brief network hiccup does not trigger a failover
    maximum_lag_on_failover: 1048576 # a replica lagging more than this many bytes behind the primary does not take part in a new leader election
    master_start_timeout: 300        # time the primary is allowed to recover from a failure before a failover is triggered
    synchronous_mode: false          # asynchronous replication
    postgresql:                      # PostgreSQL feature and parameter settings, not described in detail here
      use_pg_rewind: true
      use_slots: true
      parameters:
        listen_addresses: "0.0.0.0"
        port: 5432
        wal_level: replica
        hot_standby: "on"
        wal_keep_segments: 256
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        logging_collector: "on"
        #archive_mode: "on"
        #archive_timeout: 1800s
        #archive_command: test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f
      #recovery_conf:
        #restore_command: cp /mnt/server/archivedir/%f %p
  initdb:
    - encoding: UTF8
    - locale: C
    - lc-ctype: zh_CN.UTF-8
    - data-checksums
  pg_hba: # replication user and remote-connection authentication settings
    - host replication repuser 192.168.210.0/24 md5
    - host all all 192.168.0.0/16 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.210.15:5432 # connection settings for the PostgreSQL service; 127.0.0.1 cannot be used here, because pg_basebackup must connect to the primary remotely for the online copy
  data_dir: /opt/PostgreSQL/13/data    # \$PGDATA
  bin_dir: /opt/PostgreSQL/13/bin      # \$PGHOME/bin
  authentication:
    replication:
      username: repuser
      password: "Test@123456"
    superuser:
      username: twsm
      password: "Test@123456"
    rewind:
      username: twsm
      password: "Test@123456"
  basebackup:
    #max-rate: 100M
    checkpoint: fast
  callbacks: # this setup uses callbacks instead of haproxy+keepalived for VIP switching and load balancing: it is faster, uses fewer resources and is simpler to operate; the script follows below
    on_start: /bin/bash /etc/patroni/patroni_callback.sh       # fired when the patroni service starts
    on_stop: /bin/bash /etc/patroni/patroni_callback.sh        # fired when the patroni service stops
    on_role_change: /bin/bash /etc/patroni/patroni_callback.sh # fired when the node's role changes

watchdog:               # use the Linux software watchdog to guard patroni's liveness
  mode: automatic       # allowed values: off, automatic, required
  device: /dev/watchdog # watchdog device; /dev/watchdog and /dev/watchdog0 are equivalent, though compatibility may differ
  safety_margin: 5      # how long before the Leader key expires the watchdog fires if patroni has not refreshed it; with this configuration (ttl=30, loop_wait=10, safety_margin=5) the patroni process refreshes the Leader key and the watchdog every 10 seconds (loop_wait)

tags: # if the cluster spans data centers, a remote node can be excluded from elections, from load balancing, or from acting as a synchronous standby
  nofailover: false     # typically used at a remote site to forbid automatic promotion
  noloadbalance: false  # typically used at a remote site to exclude the node from load balancing
  clonefrom: false
  nosync: false         # whether the node is excluded from acting as a synchronous standby
EOF

# pg_node2: change
name: pg2
restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.210.81:8008
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.210.81:5432

# pg_node3: change
name: pg3
restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.210.33:8008
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.210.33:5432
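Because the pg_node2/pg_node3 files differ from pg_node1's only in the member name and the node IP in the two connect_address lines, they can be derived mechanically rather than edited by hand. A minimal sed sketch (`derive_node_conf` is a hypothetical helper; the port-suffixed patterns are deliberate, so that the pg_node1 IP in the etcd3 hosts list and pg_hba is left alone):

```shell
#!/bin/bash
# Derive a standby node's patroni.yml from pg_node1's by swapping the
# member name and the node IP. Only :8008 and :5432 connect addresses
# are rewritten; other occurrences of 192.168.210.15 stay untouched.
derive_node_conf() {
  local new_name=$1 new_ip=$2
  sed -e "s/^name: pg1/name: ${new_name}/" \
      -e "s/192\.168\.210\.15:8008/${new_ip}:8008/" \
      -e "s/192\.168\.210\.15:5432/${new_ip}:5432/"
}

# Example on a small excerpt of the config (in practice:
# derive_node_conf pg2 192.168.210.81 < patroni.yml > patroni-pg2.yml)
printf 'name: pg1\n  connect_address: 192.168.210.15:8008\n  connect_address: 192.168.210.15:5432\n' \
  | derive_node_conf pg2 192.168.210.81
```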

Create the patroni_callback script

# The script reads three arguments at the top. Nothing in patroni.yml passes
# them explicitly; testing shows the patroni service supplies them by default:
# $1 - action: the event patroni fired, stop/start/on_role_change/restart/reload
# $2 - role:   the current node's role, master/{slave|replica}
# $3 - scope:  the scope it applies to, here the twpg cluster
# The script comes from another blog; the logic is simple enough to reuse directly:
# when the node's role is primary, bind the VIP with ip addr;
# when the node's role is standby, unbind the VIP with ip addr.
cat >> /etc/patroni/patroni_callback.sh <<EOF
#!/bin/bash
readonly action=\$1
readonly role=\$2
readonly scope=\$3
vip=192.168.210.66
dev=eth0

function usage() {
    echo "Usage: \$0 <on_start|on_stop|on_role_change> <role> <scope>"
    exit 1
}

echo "this is patroni callback \$action \$role \$scope"

case \$action in
    on_stop)
        sudo ip addr del \${vip}/24 dev \$dev 2>/dev/null
        ;;
    on_start)
        ;;
    on_role_change)
        if [[ \$role == 'master' ]]; then
            # bind the VIP
            sudo ip addr add \${vip}/24 dev \$dev 2>/dev/null
            # announce the VIP and suppress conflicting IPs
            sudo arping -q -A -c 1 -I \$dev \$vip
        else
            sudo ip addr del \${vip}/24 dev \$dev 2>/dev/null
        fi
        ;;
    *)
        usage
        ;;
esac
EOF
chmod u+x /etc/patroni/patroni_callback.sh
chown postgres:postgres /etc/patroni/patroni_callback.sh

Start the patroni service

# the first patroni start initializes the database and creates the standbys automatically
systemctl enable patroni
systemctl start patroni

[root@pg_node1 data]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member | Host                | Role    | State   | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
| pg1    | 192.168.210.15:5432 | Leader  | running |  3 |           |
| pg2    | 192.168.210.81:5432 | Replica | running |  3 |       0.0 |
| pg3    | 192.168.210.33:5432 | Replica | running |  3 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
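On hosts without jq, the node's role can still be pulled out of the REST API response with standard tools. A sketch over a sample response (the JSON is abbreviated from the curl output shown in the next section; in practice `response` would come from `curl -s http://<node>:8008/patroni`):

```shell
#!/bin/bash
# Extract the "role" field from a patroni REST API response using sed only.
response='{"state": "running", "role": "master", "server_version": 130003}'

role=$(printf '%s' "$response" | sed -n 's/.*"role": *"\([^"]*\)".*/\1/p')
echo "role: $role"
```

This prints `role: master` for the sample above; the same pattern works for `"state"` or any other flat string field, which is handy in monitoring scripts that must run on minimal hosts.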

Operations

# list all keys
[root@pg_node1 data]# etcdctl --endpoints='192.168.210.15:2379' get / --prefix --keys-only
/service/twpg/config
/service/twpg/history
/service/twpg/initialize
/service/twpg/leader
/service/twpg/members/pg1
/service/twpg/members/pg2
/service/twpg/members/pg3
/service/twpg/optime/leader

# read a specific key
[root@pg_node1 data]# etcdctl --endpoints='192.168.210.15:2379' get /service/twpg/config

# query patroni through the REST API (primary node)
[root@pg_node1 data]# curl -s http://192.168.210.15:8008/patroni | jq
{
  "database_system_identifier": "6976142033405049133",
  "postmaster_start_time": "2021-06-21 15:33:22.073 CST",
  "timeline": 3,
  "cluster_unlocked": false,
  "patroni": {
    "scope": "twpg",
    "version": "2.0.2"
  },
  "replication": [
    {
      "sync_state": "async",
      "sync_priority": 0,
      "client_addr": "192.168.210.81",
      "state": "streaming",
      "application_name": "pg2",
      "usename": "repuser"
    },
    {
      "sync_state": "async",
      "sync_priority": 0,
      "client_addr": "192.168.210.33",
      "state": "streaming",
      "application_name": "pg3",
      "usename": "repuser"
    }
  ],
  "state": "running",
  "role": "master",
  "xlog": {
    "location": 83886408
  },
  "server_version": 130003
}

# query patroni through the REST API (standby node)
[root@pg_node1 data]# curl -s http://192.168.210.81:8008/patroni | jq
{
  "database_system_identifier": "6976142033405049133",
  "postmaster_start_time": "2021-06-21 15:37:42.746 CST",
  "timeline": 3,
  "cluster_unlocked": false,
  "patroni": {
    "scope": "twpg",
    "version": "2.0.2"
  },
  "state": "running",
  "role": "replica",
  "xlog": {
    "received_location": 83886408,
    "replayed_timestamp": null,
    "paused": false,
    "replayed_location": 83886408
  },
  "server_version": 130003
}

# simulate an etcd node failure: stop etcd on the LEADER node
[root@pg_node1 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        15 |       4866 |               4866 |        |
| 192.168.210.81:2379 |  ff5595d67d21105 |   3.5.0 |  1.0 MB |      true |      false |        15 |       4866 |               4866 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |     false |      false |        15 |       4866 |               4866 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node2 ~]# systemctl stop etcd
[root@pg_node2 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23T09:20:00.826+0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003fea80/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
Failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4867 |               4867 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |      true |      false |        16 |       4867 |               4867 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node1 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23T09:16:37.350+0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00037c700/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
Failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4867 |               4867 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |      true |      false |        16 |       4867 |               4867 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node3 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23T09:20:39.338+0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00013a380/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
Failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4867 |               4867 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |      true |      false |        16 |       4867 |               4867 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
# The output above shows that etcd has elected a new LEADER and pg_node2 refuses connections.
# Now start etcd on pg_node2 again; the node rejoins the cluster.
[root@pg_node2 ~]# systemctl start etcd
[root@pg_node2 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4868 |               4868 |        |
| 192.168.210.81:2379 |  ff5595d67d21105 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4868 |               4868 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |      true |      false |        16 |       4868 |               4868 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node1 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4868 |               4868 |        |
| 192.168.210.81:2379 |  ff5595d67d21105 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4868 |               4868 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |      true |      false |        16 |       4868 |               4868 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node3 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4868 |               4868 |        |
| 192.168.210.81:2379 |  ff5595d67d21105 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4868 |               4868 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |      true |      false |        16 |       4868 |               4868 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

# simulate etcd failures: stop any two of the three etcd nodes
[root@pg_node2 ~]# systemctl stop etcd
[root@pg_node2 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23T09:20:00.826+0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003fea80/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
Failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        16 |       4867 |               4867 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |      true |      false |        16 |       4867 |               4867 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node3 ~]# systemctl stop etcd
[root@pg_node3 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23T09:27:22.438+0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000150000/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
Failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
{"level":"warn","ts":"2021-06-23T09:27:27.439+0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000150000/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.210.33:2379: connect: connection refused\""}
Failed to get the status of endpoint 192.168.210.33:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |        ERRORS         |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |     false |      false |        17 |       4869 |               4869 | etcdserver: no leader |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
# The output above shows that with any two of the three nodes stopped, the etcd cluster is no longer available.
# Once they are started again the cluster recovers on its own; it is very robust.
[root@pg_node3 ~]# systemctl start etcd
[root@pg_node3 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23T09:28:31.642+0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000150a80/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
Failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |      true |      false |        18 |       4883 |               4883 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |     false |      false |        18 |       4883 |               4883 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node2 ~]# systemctl start etcd
[root@pg_node2 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |      true |      false |        18 |       4888 |               4888 |        |
| 192.168.210.81:2379 |  ff5595d67d21105 |   3.5.0 |  1.0 MB |     false |      false |        18 |       4888 |               4888 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |     false |      false |        18 |       4888 |               4888 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node1 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 |   3.5.0 |  1.0 MB |      true |      false |        19 |       4905 |               4905 |        |
| 192.168.210.81:2379 |  ff5595d67d21105 |   3.5.0 |  1.0 MB |     false |      false |        19 |       4905 |               4905 |        |
| 192.168.210.33:2379 | b5d9c4826815356e |   3.5.0 |  1.0 MB |     false |      false |        19 |       4905 |               4905 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

# simulate a PostgreSQL failure
# stop the database service on the primary
[postgres@pg_node1 ~]$ pg_ctl stop -D /opt/PostgreSQL/13/data
waiting for server to shut down.... done
server stopped
[postgres@pg_node1 ~]$ ps -ef|grep postgres
postgres  4447  4367  0 09:37 pts/0    00:00:00 grep --color=auto postgres
[postgres@pg_node1 ~]$ ps -ef|grep postgres
postgres  4451  4367  0 09:37 pts/0    00:00:00 grep --color=auto postgres
[postgres@pg_node1 ~]$ ps -ef|grep postgres
postgres  4453  4367  0 09:37 pts/0    00:00:00 grep --color=auto postgres
[postgres@pg_node1 ~]$ ps -ef|grep postgres
postgres  4459     1  1 09:37 ?        00:00:00 /opt/PostgreSQL/13/bin/postgres -D /opt/PostgreSQL/13/data --config-file=/opt/PostgreSQL/13/data/postgresql.conf --listen_addresses=0.0.0.0 --max_worker_processes=8 --max_prepared_transactions=0 --wal_level=replica --track_commit_timestamp=off --max_locks_per_transaction=64 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=twpg --wal_log_hints=on --max_wal_senders=10
postgres  4462  4459  0 09:37 ?        00:00:00 postgres: twpg: checkpointer
postgres  4463  4459  0 09:37 ?        00:00:00 postgres: twpg: background writer
postgres  4464  4459  0 09:37 ?        00:00:00 postgres: twpg: stats collector
postgres  4472  4459  0 09:37 ?        00:00:00 postgres: twpg: twsm postgres 127.0.0.1(40298) idle
postgres  4487  4459  0 09:37 ?        00:00:00 postgres: twpg: walwriter
postgres  4488  4459  0 09:37 ?        00:00:00 postgres: twpg: autovacuum launcher
postgres  4489  4459  0 09:37 ?        00:00:00 postgres: twpg: logical replication launcher
postgres  4490  4459  0 09:37 ?        00:00:00 postgres: twpg: walsender repuser 192.168.210.33(52430) streaming 0/5000668
postgres  4491  4459  0 09:37 ?        00:00:00 postgres: twpg: walsender repuser 192.168.210.81(62184) streaming 0/5000668

# patroni restarted the PostgreSQL service automatically; no failover occurred
[root@pg_node1 log]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member | Host                | Role    | State   | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
| pg1    | 192.168.210.15:5432 | Leader  | running |  6 |           |
| pg2    | 192.168.210.81:5432 | Replica | running |  6 |       0.0 |
| pg3    | 192.168.210.33:5432 | Replica | running |  6 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+

# patroni log from right after PG was stopped: patroni detects that the service is down and tries to start it again
Jun 23 09:37:45 pg_node1 patroni: 2021-06-23 09:37:45,395 INFO: no action. i am the leader with the lock
Jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.040 CST [4012] LOG: received fast shutdown request
Jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.047 CST [4012] LOG: aborting any active transactions
Jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.047 CST [4056] FATAL: terminating connection due to administrator command
Jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.048 CST [4012] LOG: background worker "logical replication launcher" (PID 4258) exited with exit code 1
Jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.049 CST [4015] LOG: shutting down
Jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.122 CST [4012] LOG: database system is shut down
Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,389 WARNING: Postgresql is not running.
Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,390 INFO: Lock owner: pg1; I am pg1
Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,404 INFO: pg_controldata:
Jun 23 09:37:55 pg_node1 patroni: Database system identifier: 6976142033405049133
Jun 23 09:37:55 pg_node1 patroni: pg_control last modified: Wed Jun 23 09:37:47 2021
Jun 23 09:37:55 pg_node1 patroni: Blocks per segment of large relation: 131072
Jun 23 09:37:55 pg_node1 patroni: Size of a large-object chunk: 2048
Jun 23 09:37:55 pg_node1 patroni: WAL block size: 8192
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's oldestActiveXID: 0
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's TimeLineID: 5
Jun 23 09:37:55 pg_node1 patroni: Bytes per WAL segment: 16777216
Jun 23 09:37:55 pg_node1 patroni: Fake LSN counter for unlogged rels: 0/3E8
Jun 23 09:37:55 pg_node1 patroni: max_connections setting: 100
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint location: 0/5000510
Jun 23 09:37:55 pg_node1 patroni: Float8 argument passing: by value
Jun 23 09:37:55 pg_node1 patroni: Minimum recovery ending location: 0/0
Jun 23 09:37:55 pg_node1 patroni: track_commit_timestamp setting: off
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's newestCommitTsXid: 0
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's NextMultiXactId: 1
Jun 23 09:37:55 pg_node1 patroni: Maximum size of a TOAST chunk: 1996
Jun 23 09:37:55 pg_node1 patroni: Maximum data alignment: 8
Jun 23 09:37:55 pg_node1 patroni: Date/time type storage: 64-bit integers
Jun 23 09:37:55 pg_node1 patroni: Database block size: 8192
Jun 23 09:37:55 pg_node1 patroni: Data page checksum version: 1
Jun 23 09:37:55 pg_node1 patroni: Time of latest checkpoint: Wed Jun 23 09:37:47 2021
Jun 23 09:37:55 pg_node1 patroni: wal_log_hints setting: on
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's full_page_writes: on
Jun 23 09:37:55 pg_node1 patroni: End-of-backup record required: no
Jun 23 09:37:55 pg_node1 patroni: max_prepared_xacts setting: 0
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's NextMultiOffset: 0
Jun 23 09:37:55 pg_node1 patroni: Backup start location: 0/0
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's oldestMultiXid: 1
Jun 23 09:37:55 pg_node1 patroni: Mock authentication nonce: 020ac2d0808d3cf471a8d90e23263e8a317b31c5f7e494fc3fba551683e5c39b
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's NextOID: 16385
Jun 23 09:37:55 pg_node1 patroni: Maximum columns in an index: 32
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's oldestXID: 479
Jun 23 09:37:55 pg_node1 patroni: Catalog version number: 202007201
Jun 23 09:37:55 pg_node1 patroni: max_worker_processes setting: 8
Jun 23 09:37:55 pg_node1 patroni: Maximum length of identifiers: 64
Jun 23 09:37:55 pg_node1 patroni: Min recovery ending loc's timeline: 0
Jun 23 09:37:55 pg_node1 patroni: max_locks_per_xact setting: 64
Jun 23 09:37:55 pg_node1 patroni: max_wal_senders setting: 10
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's NextXID: 0:488
Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's REDO location: 0/5000510
Jun 23 09:37:55 pg_node1 patroni: Backup end location: 0/0
Jun 23 09:37:55 pg_node1 patroni: Database 
cluster state: shut down Jun 23 09:37:55 pg_node1 patroni: pg_control version number: 1300 Jun 23 09:37:55 pg_node1 patroni: wal_level setting: replica Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's REDO WAL file: 000000050000000000000005 Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's oldestCommitTsXid: 0 Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's oldestXID's DB: 1 Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's oldestMulti's DB: 1 Jun 23 09:37:55 pg_node1 patroni: Latest checkpoint's PrevTimeLineID: 5 Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,406 INFO: Lock owner: pg1; I am pg1 Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,406 INFO: Lock owner: pg1; I am pg1 Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,418 INFO: starting as readonly because i had the session lock Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,420 INFO: closed patroni connection to the postgresql cluster Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,452 INFO: postmaster pid=4459 Jun 23 09:37:55 pg_node1 patroni: localhost:5432 - no response Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.485 CST [4459] LOG: starting PostgreSQL 13.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.485 CST [4459] LOG: listening on IPv4 address "0.0.0.0", port 5432 Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.508 CST [4459] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.524 CST [4461] LOG: database system was shut down at 2021-06-23 09:37:47 CST Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.525 CST [4461] WARNING: specified neither primary_conninfo nor restore_command Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.525 CST [4461] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there. 
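As an aside: before deciding how to start the instance, patroni logs the complete `pg_controldata` output, and the `Database cluster state` field is what tells it the previous shutdown was clean. Output in this `field: value` shape is easy to consume from a script; a minimal sketch (the sample lines are copied from the log above, the parser itself is my own illustration):

```python
def parse_controldata(text: str) -> dict:
    """Parse pg_controldata-style 'field: value' lines into a dict."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            # split on the FIRST colon only, so values like '0:488' survive intact
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

# Sample lines taken from the patroni log above
sample = """Database cluster state: shut down
Latest checkpoint's TimeLineID: 5
Latest checkpoint location: 0/5000510"""

state = parse_controldata(sample)
print(state["Database cluster state"])  # -> shut down
```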
Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.525 CST [4461] LOG:  entering standby mode
Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.533 CST [4461] LOG:  consistent recovery state reached at 0/5000588
Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.534 CST [4461] LOG:  invalid record length at 0/5000588: wanted 24, got 0
Jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.535 CST [4459] LOG:  database system is ready to accept read only connections
Jun 23 09:37:56 pg_node1 patroni: localhost:5432 - accepting connections
Jun 23 09:37:56 pg_node1 patroni: localhost:5432 - accepting connections
Jun 23 09:37:56 pg_node1 patroni: this is patroni callback on_role_change replica twpg
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,544 INFO: Lock owner: pg1; I am pg1
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,545 INFO: establishing a new patroni connection to the postgres cluster
Jun 23 09:37:56 pg_node1 systemd: Started Session c31 of user root.
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,570 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,584 INFO: promoted self to leader because i had the session lock
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,586 INFO: Lock owner: pg1; I am pg1
Jun 23 09:37:56 pg_node1 patroni: server promoting
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.593 CST [4461] LOG:  received promote request
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.593 CST [4461] LOG:  redo is not required
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,593 INFO: cleared rewind state after becoming the leader
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.599 CST [4461] LOG:  selected new timeline ID: 6
Jun 23 09:37:56 pg_node1 patroni: this is patroni callback on_role_change master twpg
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,618 INFO: updated leader lock during promote
Jun 23 09:37:56 pg_node1 systemd: Started Session c32 of user root.
Jun 23 09:37:56 pg_node1 systemd: Started Session c33 of user root.
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.880 CST [4461] LOG:  archive recovery complete
Jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.911 CST [4459] LOG:  database system is ready to accept connections
Jun 23 09:37:57 pg_node1 patroni: 2021-06-23 09:37:57,638 INFO: Lock owner: pg1; I am pg1
Jun 23 09:37:57 pg_node1 patroni: 2021-06-23 09:37:57,724 INFO: no action. i am the leader with the lock

# Simulate a PostgreSQL database failure
# Stop the PostgreSQL service on a standby
[postgres@pg_node2 ~]$ pg_ctl stop -D /opt/PostgreSQL/13/data
waiting for server to shut down.... done
server stopped
[postgres@pg_node2 ~]$ ps -ef|grep postgres
postgres  26277  26180  0 09:51 pts/0    00:00:00 grep --color=auto postgres
[postgres@pg_node2 ~]$ ps -ef|grep postgres
postgres  26284      1  2 09:51 ?        00:00:00 /opt/PostgreSQL/13/bin/postgres -D /opt/PostgreSQL/13/data --config-file=/opt/PostgreSQL/13/data/postgresql.conf --listen_addresses=0.0.0.0 --max_worker_processes=8 --max_prepared_transactions=0 --wal_level=replica --track_commit_timestamp=off --max_locks_per_transaction=64 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=twpg --wal_log_hints=on --max_wal_senders=10
postgres  26286  26284  0 09:51 ?        00:00:00 postgres: twpg: startup recovering 000000060000000000000005
postgres  26287  26284  0 09:51 ?        00:00:00 postgres: twpg: checkpointer
postgres  26288  26284  0 09:51 ?        00:00:00 postgres: twpg: background writer
postgres  26289  26284  0 09:51 ?        00:00:00 postgres: twpg: stats collector
postgres  26290  26284  0 09:51 ?        00:00:00 postgres: twpg: walreceiver
postgres  26293  26180  0 09:51 pts/0    00:00:00 grep --color=auto postgres

# Patroni log from right after the PG service was stopped: patroni detects that PostgreSQL is down and starts it again
Jun 23 09:51:29 pg_node2 patroni: 2021-06-23 09:51:29,043 INFO: does not have lock
Jun 23 09:51:29 pg_node2 patroni: 2021-06-23 09:51:29,045 INFO: no action. i am a secondary and i am following a leader
Jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.498 CST [26228] LOG:  received fast shutdown request
Jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.505 CST [26228] LOG:  aborting any active transactions
Jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.505 CST [26234] FATAL:  terminating walreceiver process due to administrator command
Jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.505 CST [26241] FATAL:  terminating connection due to administrator command
Jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.507 CST [26231] LOG:  shutting down
Jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.522 CST [26228] LOG:  database system is shut down
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,042 WARNING: Postgresql is not running.
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,042 INFO: Lock owner: pg1; I am pg2
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,053 INFO: pg_controldata:
Jun 23 09:51:39 pg_node2 patroni: Database system identifier: 6976142033405049133
Jun 23 09:51:39 pg_node2 patroni: pg_control last modified: Wed Jun 23 09:51:37 2021
Jun 23 09:51:39 pg_node2 patroni: Blocks per segment of large relation: 131072
Jun 23 09:51:39 pg_node2 patroni: Size of a large-object chunk: 2048
Jun 23 09:51:39 pg_node2 patroni: WAL block size: 8192
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's oldestActiveXID: 488
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's TimeLineID: 6
Jun 23 09:51:39 pg_node2 patroni: Bytes per WAL segment: 16777216
Jun 23 09:51:39 pg_node2 patroni: Fake LSN counter for unlogged rels: 0/3E8
Jun 23 09:51:39 pg_node2 patroni: max_connections setting: 100
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint location: 0/50006A0
Jun 23 09:51:39 pg_node2 patroni: Float8 argument passing: by value
Jun 23 09:51:39 pg_node2 patroni: Minimum recovery ending location: 0/5000750
Jun 23 09:51:39 pg_node2 patroni: track_commit_timestamp setting: off
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's newestCommitTsXid: 0
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's NextMultiXactId: 1
Jun 23 09:51:39 pg_node2 patroni: Maximum size of a TOAST chunk: 1996
Jun 23 09:51:39 pg_node2 patroni: Maximum data alignment: 8
Jun 23 09:51:39 pg_node2 patroni: Date/time type storage: 64-bit integers
Jun 23 09:51:39 pg_node2 patroni: Database block size: 8192
Jun 23 09:51:39 pg_node2 patroni: Data page checksum version: 1
Jun 23 09:51:39 pg_node2 patroni: Time of latest checkpoint: Wed Jun 23 09:37:57 2021
Jun 23 09:51:39 pg_node2 patroni: wal_log_hints setting: on
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's full_page_writes: on
Jun 23 09:51:39 pg_node2 patroni: End-of-backup record required: no
Jun 23 09:51:39 pg_node2 patroni: max_prepared_xacts setting: 0
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's NextMultiOffset: 0
Jun 23 09:51:39 pg_node2 patroni: Backup start location: 0/0
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's oldestMultiXid: 1
Jun 23 09:51:39 pg_node2 patroni: Mock authentication nonce: 020ac2d0808d3cf471a8d90e23263e8a317b31c5f7e494fc3fba551683e5c39b
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's NextOID: 16385
Jun 23 09:51:39 pg_node2 patroni: Maximum columns in an index: 32
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's oldestXID: 479
Jun 23 09:51:39 pg_node2 patroni: Catalog version number: 202007201
Jun 23 09:51:39 pg_node2 patroni: max_worker_processes setting: 8
Jun 23 09:51:39 pg_node2 patroni: Maximum length of identifiers: 64
Jun 23 09:51:39 pg_node2 patroni: Min recovery ending loc's timeline: 6
Jun 23 09:51:39 pg_node2 patroni: max_locks_per_xact setting: 64
Jun 23 09:51:39 pg_node2 patroni: max_wal_senders setting: 10
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's NextXID: 0:488
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's REDO location: 0/5000668
Jun 23 09:51:39 pg_node2 patroni: Backup end location: 0/0
Jun 23 09:51:39 pg_node2 patroni: Database cluster state: shut down in recovery
Jun 23 09:51:39 pg_node2 patroni: pg_control version number: 1300
Jun 23 09:51:39 pg_node2 patroni: wal_level setting: replica
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's REDO WAL file: 000000060000000000000005
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's oldestCommitTsXid: 0
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's oldestXID's DB: 1
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's oldestMulti's DB: 1
Jun 23 09:51:39 pg_node2 patroni: Latest checkpoint's PrevTimeLineID: 6
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,055 INFO: Lock owner: pg1; I am pg2
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,081 INFO: Local timeline=6 lsn=0/5000750
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,088 INFO: master_timeline=6
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,089 INFO: Lock owner: pg1; I am pg2
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,105 INFO: starting as a secondary
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,107 INFO: closed patroni connection to the postgresql cluster
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,130 INFO: postmaster pid=26284
Jun 23 09:51:39 pg_node2 patroni: localhost:5432 - no response
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.172 CST [26284] LOG:  starting PostgreSQL 13.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.172 CST [26284] LOG:  listening on IPv4 address "0.0.0.0", port 5432
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.186 CST [26284] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.203 CST [26286] LOG:  database system was shut down in recovery at 2021-06-23 09:51:37 CST
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.203 CST [26286] LOG:  entering standby mode
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.212 CST [26286] LOG:  redo starts at 0/5000668
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.212 CST [26286] LOG:  consistent recovery state reached at 0/5000750
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.212 CST [26286] LOG:  invalid record length at 0/5000750: wanted 24, got 0
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.213 CST [26284] LOG:  database system is ready to accept read only connections
Jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.229 CST [26290] LOG:  started streaming WAL from primary at 0/5000000 on timeline 6
Jun 23 09:51:40 pg_node2 patroni: localhost:5432 - accepting connections
Jun 23 09:51:40 pg_node2 patroni: localhost:5432 - accepting connections
Jun 23 09:51:40 pg_node2 patroni: this is patroni callback on_start replica twpg
Jun 23 09:51:40 pg_node2 patroni: 2021-06-23 09:51:40,229 INFO: Lock owner: pg1; I am pg2
Jun 23 09:51:40 pg_node2 patroni: 2021-06-23 09:51:40,229 INFO: does not have lock
Jun 23 09:51:40 pg_node2 patroni: 2021-06-23 09:51:40,229 INFO: establishing a new patroni connection to the postgres cluster
Jun 23 09:51:40 pg_node2 patroni: 2021-06-23 09:51:40,261 INFO: no action. i am a secondary and i am following a leader
Jun 23 09:51:50 pg_node2 patroni: 2021-06-23 09:51:50,228 INFO: Lock owner: pg1; I am pg2
Jun 23 09:51:50 pg_node2 patroni: 2021-06-23 09:51:50,228 INFO: does not have lock
Jun 23 09:51:50 pg_node2 patroni: 2021-06-23 09:51:50,238 INFO: no action. i am a secondary and i am following a leader

# Simulate a patroni failure
# Stop the patroni service on the node hosting the PG primary
[root@pg_node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 |  Leader | running |  6 |           |
|  pg2   | 192.168.210.81:5432 | Replica | running |  6 |       0.0 |
|  pg3   | 192.168.210.33:5432 | Replica | running |  6 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node1 ~]# ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.210.15/24 brd 192.168.210.255 scope global noprefixroute dynamic eth0\       valid_lft 63429sec preferred_lft 63429sec
2: eth0    inet 192.168.210.66/24 scope global secondary eth0\       valid_lft forever preferred_lft forever
[root@pg_node1 ~]# systemctl stop patroni
[root@pg_node1 ~]# ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.210.15/24 brd 192.168.210.255 scope global noprefixroute dynamic eth0\       valid_lft 63357sec preferred_lft 63357sec
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 |  Leader | running |  6 |           |
|  pg2   | 192.168.210.81:5432 | Replica | running |  6 |       0.0 |
|  pg3   | 192.168.210.33:5432 | Replica | running |  6 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 | Replica | stopped |    |   unknown |
|  pg2   | 192.168.210.81:5432 | Replica | running |  7 |       0.0 |
|  pg3   | 192.168.210.33:5432 |  Leader | running |  7 |           |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg2   | 192.168.210.81:5432 | Replica | running |  7 |       0.0 |
|  pg3   | 192.168.210.33:5432 |  Leader | running |  7 |           |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node3 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 |  Leader | running |  6 |           |
|  pg2   | 192.168.210.81:5432 | Replica | running |  6 |       0.0 |
|  pg3   | 192.168.210.33:5432 | Replica | running |  6 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node3 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg2   | 192.168.210.81:5432 | Replica | running |  7 |       0.0 |
|  pg3   | 192.168.210.33:5432 |  Leader | running |  7 |           |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node3 ~]# ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.210.33/24 brd 192.168.210.255 scope global noprefixroute dynamic eth0\       valid_lft 69980sec preferred_lft 69980sec
2: eth0    inet 192.168.210.66/24 scope global secondary eth0\       valid_lft forever preferred_lft forever

# Node 3 log: its database is promoted to primary
Jun 23 09:59:07 pg_node3 patroni: 2021-06-23 09:59:07,841 INFO: Lock owner: pg1; I am pg3
Jun 23 09:59:07 pg_node3 patroni: 2021-06-23 09:59:07,841 INFO: does not have lock
Jun 23 09:59:07 pg_node3 patroni: 2021-06-23 09:59:07,849 INFO: no action. i am a secondary and i am following a leader
Jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.171 CST [25279] LOG:  replication terminated by primary server
Jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.171 CST [25279] DETAIL:  End of WAL reached on timeline 6 at 0/50007C8.
Jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.171 CST [25279] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
Jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.172 CST [32443] LOG:  invalid record length at 0/50007C8: wanted 24, got 0
Jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.179 CST [26335] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
Jun 23 09:59:12 pg_node3 patroni: Is the server running on host "192.168.210.15" and accepting
Jun 23 09:59:12 pg_node3 patroni: TCP/IP connections on port 5432?
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,151 WARNING: Request failed to pg1: GET http://192.168.210.15:8008/patroni (HTTPConnectionPool(host=u'192.168.210.15', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', error(104, 'Connection reset by peer'))))
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,153 INFO: Got response from pg2 http://192.168.210.81:8008/patroni: {"database_system_identifier": "6976142033405049133", "postmaster_start_time": "2021-06-23 09:51:39.194 CST", "timeline": 6, "cluster_unlocked": false, "patroni": {"scope": "twpg", "version": "2.0.2"}, "state": "running", "role": "replica", "xlog": {"received_location": 83888072, "replayed_timestamp": null, "paused": false, "replayed_location": 83888072}, "server_version": 130003}
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,247 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,264 INFO: promoted self to leader by acquiring session lock
Jun 23 09:59:13 pg_node3 patroni: server promoting
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,273 INFO: Lock owner: pg3; I am pg3
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.276 CST [32443] LOG:  received promote request
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.276 CST [32443] LOG:  redo done at 0/5000750
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,277 INFO: cleared rewind state after becoming the leader
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.289 CST [32443] LOG:  selected new timeline ID: 7
Jun 23 09:59:13 pg_node3 patroni: this is patroni callback on_role_change master twpg
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,298 INFO: updated leader lock during promote
Jun 23 09:59:13 pg_node3 systemd: Started Session c7 of user root.
Jun 23 09:59:13 pg_node3 systemd: Started Session c8 of user root.
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.585 CST [32443] LOG:  archive recovery complete
Jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.650 CST [32441] LOG:  database system is ready to accept connections
Jun 23 09:59:14 pg_node3 patroni: 2021-06-23 09:59:14,318 INFO: Lock owner: pg3; I am pg3
Jun 23 09:59:14 pg_node3 patroni: 2021-06-23 09:59:14,435 INFO: no action. i am the leader with the lock

# The corresponding patroni log on node 2 during the failover
Jun 23 09:59:10 pg_node2 patroni: 2021-06-23 09:59:10,228 INFO: Lock owner: pg1; I am pg2
Jun 23 09:59:10 pg_node2 patroni: 2021-06-23 09:59:10,228 INFO: does not have lock
Jun 23 09:59:10 pg_node2 patroni: 2021-06-23 09:59:10,237 INFO: no action. i am a secondary and i am following a leader
Jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.170 CST [26290] LOG:  replication terminated by primary server
Jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.170 CST [26290] DETAIL:  End of WAL reached on timeline 6 at 0/50007C8.
Jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.170 CST [26290] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
Jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.170 CST [26286] LOG:  invalid record length at 0/50007C8: wanted 24, got 0
Jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.177 CST [26671] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
Jun 23 09:59:12 pg_node2 patroni: Is the server running on host "192.168.210.15" and accepting
Jun 23 09:59:12 pg_node2 patroni: TCP/IP connections on port 5432?
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,174 WARNING: Request failed to pg1: GET http://192.168.210.15:8008/patroni (HTTPConnectionPool(host=u'192.168.210.15', port=8008): Max retries exceeded with url: /patroni (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1b110fb9d0>: Failed to establish a new connection: [Errno 111] Connection refused',)))
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,188 INFO: Got response from pg3 http://192.168.210.33:8008/patroni: {"database_system_identifier": "6976142033405049133", "postmaster_start_time": "2021-06-21 15:43:32.837 CST", "timeline": 6, "cluster_unlocked": true, "patroni": {"scope": "twpg", "version": "2.0.2"}, "state": "running", "role": "replica", "xlog": {"received_location": 83888072, "replayed_timestamp": null, "paused": false, "replayed_location": 83888072}, "server_version": 130003}
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,277 INFO: Could not take out TTL lock
Jun 23 09:59:13 pg_node2 patroni: server signaled
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13.291 CST [26284] LOG:  received SIGHUP, reloading configuration files
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13.292 CST [26284] LOG:  parameter "primary_conninfo" changed to "user=repuser passfile=/home/postgres/pgpass host=192.168.210.33 port=5432 sslmode=prefer application_name=pg2 gssencmode=prefer channel_binding=prefer"
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13.296 CST [26682] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
Jun 23 09:59:13 pg_node2 patroni: Is the server running on host "192.168.210.15" and accepting
Jun 23 09:59:13 pg_node2 patroni: TCP/IP connections on port 5432?
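Worth noting here: the `Got response from ...` lines show that patroni members probe each other over the REST API and exchange plain JSON, so an external monitoring script can evaluate the same fields. A minimal sketch, using the response body from the log above as sample input (the "promotable" rule at the end is my own illustration, not patroni's actual election logic):

```python
import json

# Response body as logged above from GET http://192.168.210.33:8008/patroni
body = '''{"database_system_identifier": "6976142033405049133",
 "postmaster_start_time": "2021-06-21 15:43:32.837 CST", "timeline": 6,
 "cluster_unlocked": true, "patroni": {"scope": "twpg", "version": "2.0.2"},
 "state": "running", "role": "replica",
 "xlog": {"received_location": 83888072, "replayed_timestamp": null,
          "paused": false, "replayed_location": 83888072},
 "server_version": 130003}'''

node = json.loads(body)
# cluster_unlocked=true means no member currently holds the leader lock --
# exactly the window in which pg2 and pg3 raced for the lock above
promotable = node["state"] == "running" and node["role"] == "replica" and node["cluster_unlocked"]
print(promotable)  # -> True
```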
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,330 INFO: following new leader after trying and failing to obtain lock
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,333 INFO: Lock owner: pg3; I am pg2
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,333 INFO: does not have lock
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,342 INFO: Local timeline=6 lsn=0/50007C8
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,349 INFO: master_timeline=6
Jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,354 INFO: no action. i am a secondary and i am following a leader
Jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,373 INFO: Lock owner: pg3; I am pg2
Jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,373 INFO: does not have lock
Jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,375 INFO: no action. i am a secondary and i am following a leader
Jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,491 INFO: Lock owner: pg3; I am pg2
Jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,492 INFO: does not have lock
Jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,500 INFO: no action. i am a secondary and i am following a leader
Jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.306 CST [26690] LOG:  fetching timeline history file for timeline 7 from primary server
Jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.320 CST [26690] LOG:  started streaming WAL from primary at 0/5000000 on timeline 6
Jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.320 CST [26690] LOG:  replication terminated by primary server
Jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.320 CST [26690] DETAIL:  End of WAL reached on timeline 6 at 0/50007C8.
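A side note on the positions in these logs: an LSN such as `0/50007C8` ("End of WAL reached on timeline 6 at 0/50007C8") and the integer `received_location` values in the REST responses are the same quantity in two notations: a WAL position is a 64-bit byte offset printed as two hex halves. The arithmetic behind patronictl's `Lag in MB` column can be sketched as:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL 'hi/lo' hex LSN to a byte position in the WAL stream."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

# 0/50007C8 from the log equals 83888072 -- the received_location in the REST response
print(lsn_to_int("0/50007C8"))  # -> 83888072

# lag in MB between the leader's LSN and a replica's (positions from the logs above)
lag_mb = (lsn_to_int("0/50007C8") - lsn_to_int("0/5000750")) / (1024 * 1024)
print(round(lag_mb, 6))  # -> 0.000114
```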
Jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.321 CST [26286] LOG:  new target timeline is 7
Jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.322 CST [26690] LOG:  restarted WAL streaming at 0/5000000 on timeline 7
Jun 23 09:59:24 pg_node2 patroni: 2021-06-23 09:59:24,492 INFO: Lock owner: pg3; I am pg2
Jun 23 09:59:24 pg_node2 patroni: 2021-06-23 09:59:24,492 INFO: does not have lock
Jun 23 09:59:24 pg_node2 patroni: 2021-06-23 09:59:24,513 INFO: no action. i am a secondary and i am following a leader

# As the output above shows, when the patroni service on the primary node becomes unavailable, a failover takes place (pg3 is elected primary and the VIP floats over to it)

# Simulate an unresponsive patroni process on the primary node
# Killing the patroni process causes the watchdog to reboot the server
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 | Replica | running | 10 |       0.0 |
|  pg2   | 192.168.210.81:5432 |  Leader | running | 10 |           |
|  pg3   | 192.168.210.33:5432 | Replica | running | 10 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
kill -9 `pgrep patroni`
[root@pg_node2 ~]# uptime
 14:26:01 up 1 min,  1 user,  load average: 1.35, 0.41, 0.14
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 | Replica | running | 10 |       0.0 |
|  pg2   | 192.168.210.81:5432 | Replica | running | 10 |       0.0 |
|  pg3   | 192.168.210.33:5432 |  Leader | running | 10 |           |
+--------+---------------------+---------+---------+----+-----------+

# Simulate a host failure (the host running the primary)
# Shut it down immediately: poweroff
[root@pg_node3 ~]# uptime
 14:38:38 up 4 days, 22:08,  3 users,  load average: 0.03, 0.09, 0.07
[root@pg_node3 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 | Replica | running | 10 |       0.0 |
|  pg2   | 192.168.210.81:5432 | Replica | running | 10 |       0.0 |
|  pg3   | 192.168.210.33:5432 |  Leader | running | 10 |           |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node3 ~]# poweroff
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 |  Leader | running | 11 |           |
|  pg2   | 192.168.210.81:5432 | Replica | running | 11 |       0.0 |
|  pg3   | 192.168.210.33:5432 | Replica | stopped |    |   unknown |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
2021-06-23 14:42:14,614 - ERROR - Request to server http://192.168.210.33:2379 failed: MaxRetryError("HTTPConnectionPool(host=u'192.168.210.33', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7feec6e0d9d0>, u'Connection to 192.168.210.33 timed out. (connect timeout=1.25)'))",)
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 |  Leader | running | 11 |           |
|  pg2   | 192.168.210.81:5432 | Replica | running | 11 |       0.0 |
|  pg3   | 192.168.210.33:5432 | Replica | stopped |    |   unknown |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 |  Leader | running | 11 |           |
|  pg2   | 192.168.210.81:5432 | Replica | running | 11 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+

# Start node pg3 again
[root@pg_node3 ~]# uptime
 14:46:14 up 0 min,  1 user,  load average: 0.71, 0.16, 0.05
[root@pg_node3 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 |  Leader | running | 11 |           |
|  pg2   | 192.168.210.81:5432 | Replica | running | 11 |       0.0 |
|  pg3   | 192.168.210.33:5432 | Replica | running | 10 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+

# Manual switchover
[root@pg_node1 ~]# patronictl switchover
Master [pg1]:
Candidate ['pg2', 'pg3'] []: ^CAborted!
[root@pg_node1 ~]# patronictl switchover
Master [pg1]:
Candidate ['pg2', 'pg3'] []: pg2
When should the switchover take place (e.g. 2021-06-23T16:12 )  [now]:
Current cluster topology
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 |  Leader | running | 11 |           |
|  pg2   | 192.168.210.81:5432 | Replica | running | 11 |       0.0 |
|  pg3   | 192.168.210.33:5432 | Replica | running | 11 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
Are you sure you want to switchover cluster twpg, demoting current master pg1? [y/N]: y
2021-06-23 15:13:26.97922 Successfully switched over to "pg2"
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 | Replica | stopped |    |   unknown |
|  pg2   | 192.168.210.81:5432 |  Leader | running | 11 |           |
|  pg3   | 192.168.210.33:5432 | Replica | running | 11 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node1 ~]# patronictl list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 | Replica | stopped |    |   unknown |
|  pg2   | 192.168.210.81:5432 |  Leader | running | 12 |           |
|  pg3   | 192.168.210.33:5432 | Replica | running | 11 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node1 ~]# patronictl list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member |         Host        |   Role  |  State  | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
|  pg1   | 192.168.210.15:5432 | Replica | running | 12 |       0.0 |
|  pg2   | 192.168.210.81:5432 |  Leader | running | 12 |           |
|  pg3   | 192.168.210.33:5432 | Replica | running | 12 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+ #日志信息如下: Jun 23 15:13:16 pg_node1 patroni: 2021-06-23 15:13:16,311 INFO: Lock owner: pg1; I am pg1 Jun 23 15:13:16 pg_node1 patroni: 2021-06-23 15:13:16,316 INFO: no action. i am the leader with the lock Jun 23 15:13:24 pg_node1 patroni: 2021-06-23 15:13:24,854 INFO: received switchover request with leader=pg1 candidate=pg2 scheduled_at=None Jun 23 15:13:24 pg_node1 patroni: 2021-06-23 15:13:24,879 INFO: Got response from pg2 http://192.168.210.81:8008/patroni: {"database_system_identifier": "6976142033405049133", "postmaster_start_time": "2021-06-23 14:25:40.252 CST", "timeline": 11, "cluster_unlocked": false, "patroni": {"scope": "twpg", "version": "2.0.2"}, "state": "running", "role": "replica", "xlog": {"received_location": 84186784, "replayed_timestamp": "2021-06-23 14:25:40.045 CST", "paused": false, "replayed_location": 84186784}, "server_version": 130003} Jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25,011 INFO: Lock owner: pg1; I am pg1 Jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25,031 INFO: Got response from pg2 http://192.168.210.81:8008/patroni: {"database_system_identifier": "6976142033405049133", "postmaster_start_time": "2021-06-23 14:25:40.252 CST", "timeline": 11, "cluster_unlocked": false, "patroni": {"scope": "twpg", "version": "2.0.2"}, "state": "running", "role": "replica", "xlog": {"received_location": 84186784, "replayed_timestamp": "2021-06-23 14:25:40.045 CST", "paused": false, "replayed_location": 84186784}, "server_version": 130003} Jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25,134 INFO: manual failover: demoting myself Jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.222 CST [4256] LOG: received fast shutdown request Jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.234 CST [4256] LOG: aborting any active transactions Jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.234 CST [4270] FATAL: terminating connection due to administrator 
command Jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.235 CST [4256] LOG: background worker "logical replication launcher" (PID 4522) exited with exit code 1 Jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.237 CST [4259] LOG: shutting down Jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.306 CST [4256] LOG: database system is shut down Jun 23 15:13:26 pg_node1 patroni: this is patroni callback on_stop master twpg Jun 23 15:13:26 pg_node1 systemd: Started Session c5 of user root. Jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,275 INFO: Leader key released Jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,277 INFO: Lock owner: None; I am pg1 Jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,277 INFO: not healthy enough for leader race Jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,277 INFO: manual failover: demote in progress Jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,278 INFO: Lock owner: None; I am pg1 Jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,279 INFO: not healthy enough for leader race Jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,279 INFO: manual failover: demote in progress Jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,296 INFO: Lock owner: pg2; I am pg1 Jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,297 INFO: manual failover: demote in progress Jun 23 15:13:27 pg_node1 patroni: 2021-06-23 15:13:27,416 INFO: Lock owner: pg2; I am pg1 Jun 23 15:13:27 pg_node1 patroni: 2021-06-23 15:13:27,417 INFO: manual failover: demote in progress Jun 23 15:13:27 pg_node1 patroni: 2021-06-23 15:13:27,532 INFO: Lock owner: pg2; I am pg1 Jun 23 15:13:27 pg_node1 patroni: 2021-06-23 15:13:27,532 INFO: manual failover: demote in progress Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,292 INFO: Local timeline=11 lsn=0/5049750 Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,300 INFO: master_timeline=12 Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,315 INFO: master: 
history=8#0110/5014B48#011no recovery target specified Jun 23 15:13:28 pg_node1 patroni: 9#0110/502CCE8#011no recovery target specified Jun 23 15:13:28 pg_node1 patroni: 10#0110/5048A98#011no recovery target specified Jun 23 15:13:28 pg_node1 patroni: 11#0110/50497C8#011no recovery target specified Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,317 INFO: closed patroni connection to the postgresql cluster Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,348 INFO: postmaster pid=4739 Jun 23 15:13:28 pg_node1 patroni: localhost:5432 - no response Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.386 CST [4739] LOG: starting PostgreSQL 13.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.386 CST [4739] LOG: listening on IPv4 address "0.0.0.0", port 5432 Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.401 CST [4739] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.417 CST [4741] LOG: database system was shut down at 2021-06-23 15:13:25 CST Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.418 CST [4741] LOG: entering standby mode Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.428 CST [4741] LOG: consistent recovery state reached at 0/50497C8 Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.428 CST [4741] LOG: invalid record length at 0/50497C8: wanted 24, got 0 Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.429 CST [4739] LOG: database system is ready to accept read only connections Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.447 CST [4745] LOG: fetching timeline history file for timeline 12 from primary server Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.460 CST [4745] LOG: started streaming WAL from primary at 0/5000000 on timeline 11 Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.463 CST [4745] LOG: replication terminated by primary server Jun 23 
15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.463 CST [4745] DETAIL: End of WAL reached on timeline 11 at 0/50497C8. Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.463 CST [4741] LOG: new target timeline is 12 Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.464 CST [4745] LOG: restarted WAL streaming at 0/5000000 on timeline 12 Jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.703 CST [4741] LOG: redo starts at 0/50497C8 Jun 23 15:13:29 pg_node1 patroni: localhost:5432 - accepting connections Jun 23 15:13:29 pg_node1 patroni: localhost:5432 - accepting connections Jun 23 15:13:29 pg_node1 patroni: this is patroni callback on_role_change replica twpg Jun 23 15:13:29 pg_node1 systemd: Started Session c6 of user root. Jun 23 15:13:37 pg_node1 patroni: 2021-06-23 15:13:37,534 INFO: Lock owner: pg2; I am pg1 Jun 23 15:13:37 pg_node1 patroni: 2021-06-23 15:13:37,534 INFO: does not have lock Jun 23 15:13:37 pg_node1 patroni: 2021-06-23 15:13:37,535 INFO: establishing a new patroni connection to the postgres cluster Jun 23 15:13:37 pg_node1 patroni: 2021-06-23 15:13:37,579 INFO: no action. i am a secondary and i am following a leader #手工failover [root@pg_node1 ~]# patronictl failover Candidate ['pg1', 'pg3'] []: pg1 Current cluster topology + Cluster: twpg (6976142033405049133) ---+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +--------+---------------------+---------+---------+----+-----------+ | pg1 | 192.168.210.15:5432 | Replica | running | 12 | 0.0 | | pg2 | 192.168.210.81:5432 | Leader | running | 12 | | | pg3 | 192.168.210.33:5432 | Replica | running | 12 | 0.0 | +--------+---------------------+---------+---------+----+-----------+ Are you sure you want to failover cluster twpg, demoting current master pg2? 
[y/N]: y 2021-06-23 15:17:33.25244 Successfully failed over to "pg1" + Cluster: twpg (6976142033405049133) ---+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +--------+---------------------+---------+---------+----+-----------+ | pg1 | 192.168.210.15:5432 | Leader | running | 12 | | | pg2 | 192.168.210.81:5432 | Replica | stopped | | unknown | | pg3 | 192.168.210.33:5432 | Replica | running | 12 | 0.0 | +--------+---------------------+---------+---------+----+-----------+ [root@pg_node1 ~]# patronictl list + Cluster: twpg (6976142033405049133) ---+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +--------+---------------------+---------+---------+----+-----------+ | pg1 | 192.168.210.15:5432 | Leader | running | 13 | | | pg2 | 192.168.210.81:5432 | Replica | stopped | | unknown | | pg3 | 192.168.210.33:5432 | Replica | running | 13 | 0.0 | +--------+---------------------+---------+---------+----+-----------+ [root@pg_node1 ~]# patronictl list + Cluster: twpg (6976142033405049133) ---+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +--------+---------------------+---------+---------+----+-----------+ | pg1 | 192.168.210.15:5432 | Leader | running | 13 | | | pg2 | 192.168.210.81:5432 | Replica | running | 13 | 0.0 | | pg3 | 192.168.210.33:5432 | Replica | running | 13 | 0.0 | +--------+---------------------+---------+---------+----+-----------+ #日志信息如下: Jun 23 15:17:27 pg_node1 patroni: 2021-06-23 15:17:27,609 INFO: Lock owner: pg2; I am pg1 Jun 23 15:17:27 pg_node1 patroni: 2021-06-23 15:17:27,609 INFO: does not have lock Jun 23 15:17:27 pg_node1 patroni: 2021-06-23 15:17:27,612 INFO: no action. i am a secondary and i am following a leader Jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.544 CST [4745] LOG: replication terminated by primary server Jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.544 CST [4745] DETAIL: End of WAL reached on timeline 12 at 0/504A428. 
Jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.544 CST [4745] FATAL: could not send end-of-streaming message to primary: no COPY in progress
Jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.545 CST [4741] LOG: invalid record length at 0/504A428: wanted 24, got 0
Jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.550 CST [4810] FATAL: could not connect to the primary server: could not connect to server: Connection refused
Jun 23 15:17:31 pg_node1 patroni: Is the server running on host "192.168.210.81" and accepting
Jun 23 15:17:31 pg_node1 patroni: TCP/IP connections on port 5432?
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,556 INFO: Cleaning up failover key after acquiring leader lock...
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,565 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds  #watchdog activated
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,577 INFO: promoted self to leader by acquiring session lock
Jun 23 15:17:32 pg_node1 patroni: server promoting  #database promoted
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,579 INFO: Lock owner: pg1; I am pg1
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.589 CST [4741] LOG: received promote request
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.589 CST [4741] LOG: redo done at 0/504A3B0
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,593 INFO: cleared rewind state after becoming the leader
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.594 CST [4741] LOG: selected new timeline ID: 13
Jun 23 15:17:32 pg_node1 patroni: this is patroni callback on_role_change master twpg  #callback runs the VIP failover script
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,611 INFO: updated leader lock during promote
Jun 23 15:17:32 pg_node1 systemd: Started Session c7 of user root.
Jun 23 15:17:32 pg_node1 systemd: Started Session c8 of user root.
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.837 CST [4741] LOG: archive recovery complete
Jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.869 CST [4739] LOG: database system is ready to accept connections
Jun 23 15:17:33 pg_node1 patroni: 2021-06-23 15:17:33,631 INFO: Lock owner: pg1; I am pg1
Jun 23 15:17:33 pg_node1 patroni: 2021-06-23 15:17:33,772 INFO: no action. i am the leader with the lock

#View the cluster parameters
[root@pg_node1 ~]# patronictl show-config
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    hot_standby: 'on'
    listen_addresses: 0.0.0.0
    max_replication_slots: 10
    max_wal_senders: 10
    port: 5432
    wal_keep_segments: 256
    wal_level: replica
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
synchronous_mode: false
ttl: 30

#Modify a PostgreSQL parameter
#For example, set shared_buffers to 1GB
[root@pg_node1 ~]# patronictl edit-config twpg
---
+++
@@ -4,6 +4,7 @@
 postgresql:
   parameters:
     hot_standby: 'on'
+    shared_buffers: '1GB'
     listen_addresses: 0.0.0.0
     max_replication_slots: 10
     max_wal_senders: 10

Apply these changes? [y/N]: y
Configuration changed
[root@pg_node1 ~]# patronictl restart twpg
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+-----------------+
| Member | Host                | Role    | State   | TL | Lag in MB | Pending restart |
+--------+---------------------+---------+---------+----+-----------+-----------------+
| pg1    | 192.168.210.15:5432 | Leader  | running | 13 |           |        *        |
| pg2    | 192.168.210.81:5432 | Replica | running | 13 |       0.0 |        *        |
| pg3    | 192.168.210.33:5432 | Replica | running | 13 |       0.0 |        *        |
+--------+---------------------+---------+---------+----+-----------+-----------------+
When should the restart take place (e.g. 2021-06-23T16:45) [now]:
Are you sure you want to restart members pg2, pg3, pg1? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member pg2
Success: restart on member pg3
Success: restart on member pg1
[root@pg_node2 ~]# psql -U twsm -d postgres -c 'show shared_buffers;'
 shared_buffers
----------------
 1GB
(1 row)

#Or modify via the REST API, e.g. set max_connections to 1000
curl -s -XPATCH -d '{"postgresql":{"parameters":{"max_connections":"1000"}}}' http://localhost:8008/config | jq .
#To reset or remove the setting
curl -s -XPATCH -d '{"postgresql":{"parameters":{"max_connections":null}}}' http://localhost:8008/config | jq .
#Changing max_connections requires a restart (Pending restart)
[root@pg_node1 ~]# patronictl list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+-----------------+
| Member | Host                | Role    | State   | TL | Lag in MB | Pending restart |
+--------+---------------------+---------+---------+----+-----------+-----------------+
| pg1    | 192.168.210.15:5432 | Replica | running | 14 |       0.0 |        *        |
| pg2    | 192.168.210.81:5432 | Replica | running | 14 |       0.0 |        *        |
| pg3    | 192.168.210.33:5432 | Leader  | running | 14 |           |        *        |
+--------+---------------------+---------+---------+----+-----------+-----------------+
#To unconditionally rewrite the entire dynamic configuration
curl -s -XPUT -d '{"maximum_lag_on_failover":1048576,"retry_timeout":10,"postgresql":{"use_slots":true,"use_pg_rewind":true,"parameters":{"hot_standby":"on","wal_log_hints":"on","wal_level":"hot_standby","unix_socket_directories":".","max_wal_senders":5}},"loop_wait":3,"ttl":20}' http://localhost:8008/config | jq .
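Rather than hard-coding the JSON body in every curl call, the PATCH payload for a single-parameter change can be generated by a small helper. A minimal sketch, assuming Patroni's REST API is listening on localhost:8008 as configured above; the `param_patch` function is an illustrative helper, not part of Patroni:

```shell
#Hypothetical helper: build the JSON body for changing one PostgreSQL parameter.
#Pass a JSON-quoted value (e.g. '"1000"'), or an unquoted null to reset it.
param_patch() {
  printf '{"postgresql":{"parameters":{"%s":%s}}}' "$1" "$2"
}

#Equivalent to the curl commands shown above:
#  curl -s -XPATCH -d "$(param_patch max_connections '"1000"')" http://localhost:8008/config | jq .
#  curl -s -XPATCH -d "$(param_patch max_connections null)"     http://localhost:8008/config | jq .
```

This keeps the quoting in one place, which matters because a malformed body is silently ignored by some HTTP clients before it ever reaches Patroni.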
#View past failovers/switchovers
[root@pg_node1 ~]# patronictl history
+----+----------+------------------------------+----------------------------------+
| TL | LSN      | Reason                       | Timestamp                        |
+----+----------+------------------------------+----------------------------------+
| 1  | 25210568 | no recovery target specified | 2021-06-21T15:24:42.878129+08:00 |
| 2  | 25211144 | no recovery target specified | 2021-06-21T15:33:23.462860+08:00 |
| 3  | 83886408 | no recovery target specified | 2021-06-23T09:28:25.304246+08:00 |
| 4  | 83886920 | no recovery target specified | 2021-06-23T09:35:14.536083+08:00 |
| 5  | 83887496 | no recovery target specified | 2021-06-23T09:37:56.880226+08:00 |
| 6  | 83888072 | no recovery target specified | 2021-06-23T09:59:13.584445+08:00 |
| 7  | 83888704 | no recovery target specified | 2021-06-23T12:39:14.897591+08:00 |
| 8  | 83970888 | no recovery target specified | 2021-06-23T13:55:26.207367+08:00 |
| 9  | 84069608 | no recovery target specified | 2021-06-23T14:24:48.030224+08:00 |
| 10 | 84183704 | no recovery target specified | 2021-06-23T14:41:45.263156+08:00 |
| 11 | 84187080 | no recovery target specified | 2021-06-23T15:13:26.627326+08:00 |
| 12 | 84190248 | no recovery target specified | 2021-06-23T15:17:32.836962+08:00 |
+----+----------+------------------------------+----------------------------------+

#Put the cluster into maintenance mode to prevent automatic failover
[root@pg_node1 ~]# patronictl pause
Success: cluster management is paused
[root@pg_node1 ~]# patronictl list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member | Host                | Role    | State   | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
| pg1    | 192.168.210.15:5432 | Leader  | running | 13 |           |
| pg2    | 192.168.210.81:5432 | Replica | running | 13 |       0.0 |
| pg3    | 192.168.210.33:5432 | Replica | running | 13 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
 Maintenance mode: on
#Log output:
Jun 23 16:00:02 pg_node1 patroni: 2021-06-23 16:00:02,006 INFO: Lock owner: pg1; I am pg1
Jun 23 16:00:02 pg_node1 patroni: 2021-06-23 16:00:02,018 INFO: PAUSE: no action. i am the leader with the lock
Jun 23 16:00:02 pg_node1 patroni: 2021-06-23 16:00:02,028 INFO: No PostgreSQL configuration items changed, nothing to reload.

#Maintenance finished; re-enable automatic failover
[root@pg_node1 ~]# patronictl resume
Success: cluster management is resumed
[root@pg_node1 ~]# patronictl list
+ Cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| Member | Host                | Role    | State   | TL | Lag in MB |
+--------+---------------------+---------+---------+----+-----------+
| pg1    | 192.168.210.15:5432 | Leader  | running | 13 |           |
| pg2    | 192.168.210.81:5432 | Replica | running | 13 |       0.0 |
| pg3    | 192.168.210.33:5432 | Replica | running | 13 |       0.0 |
+--------+---------------------+---------+---------+----+-----------+
#Log output:
Jun 23 16:02:32 pg_node1 patroni: 2021-06-23 16:02:32,006 INFO: Lock owner: pg1; I am pg1
Jun 23 16:02:32 pg_node1 patroni: 2021-06-23 16:02:32,012 INFO: PAUSE: no action. i am the leader with the lock
Jun 23 16:02:39 pg_node1 patroni: 2021-06-23 16:02:39,565 INFO: Lock owner: pg1; I am pg1
Jun 23 16:02:39 pg_node1 patroni: 2021-06-23 16:02:39,572 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
Jun 23 16:02:39 pg_node1 patroni: 2021-06-23 16:02:39,581 INFO: no action. i am the leader with the lock
Jun 23 16:02:39 pg_node1 patroni: 2021-06-23 16:02:39,588 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jun 23 16:02:49 pg_node1 patroni: 2021-06-23 16:02:49,552 INFO: Lock owner: pg1; I am pg1
Jun 23 16:02:49 pg_node1 patroni: 2021-06-23 16:02:49,558 INFO: no action. i am the leader with the lock
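Maintenance mode can also be toggled through the REST API: `pause` is a flag in Patroni's dynamic configuration, so a PATCH against `/config` has the same effect as `patronictl pause`/`resume`. A hedged sketch, assuming the REST API on localhost:8008; the `pause_payload` helper is illustrative, not part of Patroni:

```shell
#Hypothetical helper: JSON body that toggles Patroni's pause flag.
pause_payload() {
  printf '{"pause": %s}' "$1"   # $1 is "true" or "false"
}

#Enter maintenance mode (same effect as `patronictl pause`):
#  curl -s -XPATCH -d "$(pause_payload true)" http://localhost:8008/config | jq .
#Resume cluster management (same effect as `patronictl resume`):
#  curl -s -XPATCH -d "$(pause_payload false)" http://localhost:8008/config | jq .
```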

Fault scenarios and how Patroni handles them:

Fault location | Fault scenario | Patroni action
Standby | PG standby stopped | Restarts the PG standby service
Standby | Patroni on the standby stopped normally | Stops the PG standby
Standby | Patroni on the standby crashes | No action
Standby | Standby cannot reach etcd | No action
Standby | Node is not the leader but PG is in production (read-write) mode | Restarts PG and switches it into recovery mode as a standby
Standby | Standby host reboots | Patroni starts and brings up the PG standby
Primary | PG primary stopped | Starts PG; if startup takes longer than master_start_timeout, a switchover is performed
Primary | Patroni on the primary stopped normally | Shuts down the primary and elects a new primary from the standbys
Primary | Patroni on the primary crashes | The watchdog reboots the host; after restart, if the node re-acquires the leader lock there is no switchover, otherwise a new primary is elected
Primary | Primary cannot reach etcd | The primary demotes to a standby, triggering a failover
- | etcd failure | The primary demotes; every node in the cluster becomes a standby
- | Synchronous mode with no synchronous standby available | Temporarily switches to asynchronous mode; automatic failover is unavailable until synchronous mode is restored
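To script leader detection across the scenarios above, Patroni's REST API health endpoints can be polled: `GET /primary` returns HTTP 200 only on the current leader, and `GET /replica` only on a healthy standby. A minimal sketch using this cluster's node IPs; the helper functions are illustrative, not part of Patroni:

```shell
#Illustrative helpers for polling the Patroni REST API (port 8008, as opened
#in the firewall section above).
patroni_url() {
  # $1 = node IP, $2 = endpoint (e.g. primary, replica, patroni)
  echo "http://$1:8008/$2"
}

find_leader() {
  for ip in 192.168.210.15 192.168.210.81 192.168.210.33; do
    # /primary answers HTTP 200 only on the current leader
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$(patroni_url "$ip" primary)")
    [ "$code" = "200" ] && echo "$ip"
  done
}
```

Such a check is what an external load balancer (e.g. HAProxy) typically uses to route client traffic to the leader after any of the failovers in the table.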

References:
https://patroni.readthedocs.io/en/latest/README.html
https://github.com/zalando/patroni/blob/master/docs/SETTINGS.rst
http://blog.itpub.net/30496307/viewspace-2764349/
https://mp.weixin.qq.com/s/edvWkTb-WF7YyVAFz5GCfw


Last modified: 2021-06-26 15:48:37