[postgresql] repmgr集群 ERROR: unable to retrieve record for local node %n

原创手机用户0512 2023-05-29

1555

[postgresql] repmgr集群添加新节点后发现异常，信息也不同步。

[pg12@node3 ~]$ repmgr  -f /home/pg12/conf/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.5.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=192.168.5.102 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | standby |   running | ? node1  | default  | 100      | 1        | host=192.168.5.103 user=repmgr dbname=repmgr connect_timeout=2

尝试standby follow
按照文章
“[postgresql] repmgr集群状态显示 - node “node*” (ID: %n) is not attached to its upstream node”
尝试repmgr v失败

[pg12@node3 ~]$ repmgr standby follow -f /home/pg12/conf/repmgr.conf
ERROR: unable to retrieve record for local node 3
[pg12@node3 ~]$

[pg12@node3 ~]$ repmgr cluster show -f /home/pg12/conf/repmgr.conf
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.5.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=192.168.5.102 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | standby |   running | ? node1  | default  | 100      | 1        | host=192.168.5.103 user=repmgr dbname=repmgr connect_timeout=2
[pg12@node3 ~]$

显然无法使用standby follow达成同步。

尝试关闭节点，重新rejoin
如下即是尝试关闭问题node，然后尝试使用repmgr node rejoin解决问题。


[pg12@node3 ~]$repmgr -f /home/pg12/conf/repmgr.conf node service --action=stop   
DETAIL: executing server command "/home/pg12/soft/bin/pg_ctl  -D '/home/pg12/data' -W -m fast stop"
[pg12@node3 ~]$
[pg12@node3 ~]$ repmgr node rejoin  -f /home/pg12/conf/repmgr.conf  -d 'host=192.168.5.101 dbname=repmgr user=repmgr'  --config-fil 
es=postgresql.local.conf,postgresql.conf --verbose --force-rewind
NOTICE: using provided configuration file "/home/pg12/conf/repmgr.conf"
ERROR: unable to retrieve node record for the local node
HINT: check the local node is registered with the current primary "node1" (ID: 1)
Segmentation fault (core dumped)
[pg12@node3 ~]$ 
[pg12@node3 ~]$ repmgr cluster show -f /home/pg12/conf/repmgr.conf
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.5.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=192.168.5.102 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | standby |   running | ? node1  | default  | 100      | 1        | host=192.168.5.103 user=repmgr dbname=repmgr connect_timeout=2

显然无法实现恢复。

最后尝试注销，重新备考建设

注销node3节点的standby注册，然后重新克隆clone并注册standby：
repmgr -f /home/pg12/conf/repmgr.conf standby unregister
repmgr -f /home/pg12/conf/repmgr.conf cluster show

[pg12@node3 ~]$ repmgr  -f /home/pg12/conf/repmgr.conf standby unregister
INFO: connecting to local standby
INFO: connecting to primary database
NOTICE: unregistering node 3
INFO: standby unregistration complete
[pg12@node3 ~]$ repmgr -f /home/pg12/conf/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.5.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=192.168.5.102 user=repmgr dbname=repmgr connect_timeout=2

确认取消注册后清除目前数据库，进行克隆（repmgr standby clone）：

rm -rf /home/pg12/data
repmgr -h 192.168.5.101 -U repmgr -d repmgr standby clone -D /home/pg12/data --dry-run
repmgr -h 192.168.5.101 -U repmgr -d repmgr standby clone -D /home/pg12/data
touch $PGDATA/standby.signal

此处$PGDATA 为设置环境变量，此处就是/home/pg12/data。

[pg12@node3 ~]$ rm -rf  /home/pg12/data

#--dry-run 选项为预检查，可以发现潜在一些问题
[pg12@node3 ~]$ repmgr -h 192.168.5.101  -U repmgr -d repmgr standby clone -D /home/pg12/data --dry-run
NOTICE: destination directory "/home/pg12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.5.101 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 9 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: all prerequisites for "standby clone" are met

#--dry-run  运行后没有什么报错，正式进行克隆
[pg12@node3 ~]$ repmgr -h 192.168.5.101  -U repmgr -d repmgr standby clone  -D /home/pg12/data
NOTICE: destination directory "/home/pg12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.5.101 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: creating directory "/home/pg12/data"...
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  pg_basebackup -l "repmgr base backup"  -D /home/pg12/data -h 192.168.5.101 -p 5432 -U repmgr -X stream 
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /home/pg12/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"

# 根据主库primary克隆备库后先不启动备库，在数据库目录创建空文件$PGDATA/standby.signal（限定于postgresql 12版本，这个识别文件作用是启动时使数据库判定自己为standby库）
[pg12@node3 ~]$ touch $PGDATA/standby.signal
[pg12@node3 ~]$ pg_ctl -D $PGDATA start

# 查看pg_is_in_recovery()，判断是否以备库身份启动。
[pg12@node3 ~]$ psql -h 192.168.5.103 -U repmgr repmgr -c 'select * from pg_is_in_recovery();'
 pg_is_in_recovery 
-------------------
 t
(1 row)

检查备库正常启动后，注册备库（repmgr standby register）：

[pg12@node3 ~]$ repmgr standby register -f /home/pg12/conf/repmgr.conf
INFO: connecting to local node "node3" (ID: 3)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
ERROR: local node not attached to primary node 1
HINT: specify the actual upstream node id with --upstream-node-id, or use -F/--force to continue anyway
[pg12@node3 ~]$ repmgr standby register -f /home/pg12/conf/repmgr.conf --force 
INFO: connecting to local node "node3" (ID: 3)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
WARNING: local node not attached to primary node 1
NOTICE: -F/--force supplied, continuing anyway
INFO: standby registration complete
NOTICE: standby node "node3" (ID: 3) successfully registered
[pg12@node3 ~]$ repmgr cluster show -f /home/pg12/conf/repmgr.conf
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.5.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=192.168.5.102 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | standby |   running | ! node1  | default  | 100      | 1        | host=192.168.5.103 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - node "node3" (ID: 3) is not attached to its upstream node "node1" (ID: 1)

#确定成功组成，然后处理问题not attached to its upstream node

[pg12@node3 ~]$ repmgr standby follow -f /home/pg12/conf/repmgr.conf
NOTICE: attempting to find and follow current primary
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/F00A488, follow target lsn is 0/F00A488
NOTICE: setting node 3's upstream to node 1
NOTICE: stopping server using "/home/pg12/soft/bin/pg_ctl  -D '/home/pg12/data' -w -m fast stop"
2023-05-26 21:56:37.093 CST [33220] LOG:  received fast shutdown request
2023-05-26 21:56:37.093 CST [33220] LOG:  aborting any active transactions
2023-05-26 21:56:37.093 CST [33225] FATAL:  terminating walreceiver process due to administrator command
2023-05-26 21:56:37.094 CST [33222] LOG:  shutting down
2023-05-26 21:56:37.096 CST [33220] LOG:  database system is shut down
NOTICE: starting server using "/home/pg12/soft/bin/pg_ctl  -w -D '/home/pg12/data' start"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "node1" (ID: 1)
[pg12@node3 ~]$ repmgr cluster show -f /home/pg12/conf/repmgr.conf
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                             
----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.5.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=192.168.5.102 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | standby |   running | node1    | default  | 100      | 1        | host=192.168.5.103 user=repmgr dbname=repmgr connect_timeout=2
[pg12@node3 ~]$

最后发现集群信息显示完美解决。
同时建表t1进行测试：

[pg12@node3 ~]$ psql -h 192.168.5.103 -U repmgr postgres -c 'select * from t1;'
 id |  comm  
----+--------
  1 | zhang3
(1 row)

[pg12@node3 ~]$ psql -h 192.168.5.101 -U repmgr postgres -c 'insert into t1 values(2,'li4');'
INSERT 0 1
[pg12@node3 ~]$ psql -h 192.168.5.103 -U repmgr postgres -c 'select * from t1;'
 id |  comm  
----+--------
  1 | zhang3
  2 | li4
(2 rows)

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

[postgresql] repmgr集群 ERROR: unable to retrieve record for local node %n

评论