暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

使用repmgrd实现postgresql failover和auto failover

数据库架构 2019-09-24
949

前面的文章介绍了postgresql基于repmgr的高可用及切换方案,这篇文章主要聊聊通过repmgrd实现failoverauto failover

 

前提是部署好postgresql主从,同时部署好repmgr

    [postgres@node1 ~]$ repmgr cluster show
    ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
    ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
    1 | node1 | primary | * running | | default | 100 | 3 | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
    2 | node2 | standby | running | node1 | default | 100 | 3 | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

     

    failover

     

    停止主库,模拟主库故障

      [postgres@node1 ~]$ pg_ctl stop -D pgdata/
      waiting for server to shut down..... done
      server stopped

      备库查看是unreachable状态

        [postgres@node2 .ssh]$ repmgr cluster show
        ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
        ----+-------+---------+---------------+----------+----------+----------+----------+---------------------------------------------------------------
        1 | node1 | primary | ? unreachable | | default | 100 | ? | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
        2 | node2 | standby | running | ? node1 | default | 100 | 3 | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

        备库提升为主库

          [postgres@node2 ~]$ repmgr standby promote
          NOTICE: promoting standby to primary
          DETAIL: promoting server "node2" (ID: 2) using "pg_ctl -w -D '/pgdata' promote"
          waiting for server to promote.... done
          server promoted
          NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
          NOTICE: STANDBY PROMOTE successful
          DETAIL: server "node2" (ID: 2) was successfully promoted to primary

          新主库查看集群状态

            [postgres@node2 ~]$ repmgr cluster show
            ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
            ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
            1 | node1 | primary | - failed | | default | 100 | ? | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
            2 | node2 | primary | * running | | default | 100 | 4 | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

            WARNING: following issues were detected
            - unable to connect to node "node1" (ID: 1)

            原主库执行rejoin操作重新加入集群

              [postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
              [postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
              INFO: looking for configuration file in etc
              INFO: configuration file found at: "/etc/repmgr.conf"
              INFO: prerequisites for using pg_rewind are met
              INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
              NOTICE: executing pg_rewind
              DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
              NOTICE: 2 files copied to pgdata
              INFO: directory "/tmp/repmgr-config-archive-node1" deleted
              INFO: deleting "recovery.done"
              NOTICE: setting node 1's upstream to node 2
              WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
              DETAIL: PQping() returned "PQPING_NO_RESPONSE"
              NOTICE: starting server using "pg_ctl -w -D '/pgdata' start"
              INFO: demoted primary is pingable
              INFO: node 1 has attached to its upstream node
              NOTICE: NODE REJOIN successful
              DETAIL: node 1 is now attached to node 2

              查看集群状态

                [postgres@node1 pgdata]$ repmgr cluster show
                ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
                ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
                1 | node1 | standby | running | node2 | default | 100 | 3 | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
                2 | node2 | primary | * running | | default | 100 | 4 | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

                 

                auto failover

                 

                可以利用repmgrd进程实现自动的failover,首先要在repmgr.conf文件中将location参数设置为一致,不设置的话默认也是一致的。同时启动repmgrd必须postgres.conf配置文件中设置shared_preload_libraries='repmgr'

                 

                修改主备库repmgr.conf文件

                  failover=automatic
                  promote_command='/pgsql/bin/repmgr standby promote -f etc/repmgr.conf --log-to-file'
                  follow_command='/pgsql/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
                  log_file=/home/postgres/repmgrd.log
                  monitoring_history=true (启用监控参数)
                  monitor_interval_secs=5(定义监视数据间隔写入时间参数)
                  reconnect_attempts=10(故障转移之前,尝试重新连接主库次数(默认为6)参数)
                  reconnect_interval=5(每间隔5s尝试重新连接一次参数)

                  重启主备库使修改生效

                    [postgres@node1 ~]$ repmgr node service --action=restart
                    DETAIL: executing server command "pg_ctl -w -D '/pgdata' restart"

                    主备库启动repmgrd

                      [postgres@node1 ~]$ repmgrd –f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid
                      [2019-09-20 11:51:23] [NOTICE] redirecting logging output to "/home/postgres/repmgrd.log"

                      模拟主库故障

                        [postgres@node1 ~]$ pg_ctl stop -D /pgdata/
                        waiting for server to shut down..... done
                        server stopped

                        查看备库日志,发现已经升为主库

                          [2019-09-20 12:02:52] [NOTICE] promoting standby to primary
                          [2019-09-20 12:02:52] [DETAIL] promoting server "node2" (ID: 2) using "pg_ctl -w -D '/pgdata' promote"
                          [2019-09-20 12:02:52] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
                          [2019-09-20 12:02:52] [NOTICE] STANDBY PROMOTE successful
                          [2019-09-20 12:02:52] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
                          [2019-09-20 12:02:52] [INFO] 0 followers to notify
                          [2019-09-20 12:02:52] [INFO] switching to primary monitoring mode
                          [2019-09-20 12:02:52] [NOTICE] monitoring cluster primary "node2" (ID: 2)

                          查看cluster状态,备库已经升主

                            [postgres@node2 ~]$ repmgr cluster show
                            ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
                            ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
                            1 | node1 | primary | - failed | | default | 100 | ? | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
                            2 | node2 | primary | * running | | default | 100 | 5 | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

                            WARNING: following issues were detected
                            - unable to connect to node "node1" (ID: 1)

                            原主库执行rejoin加入集群

                              [postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
                              [postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
                              INFO: looking for configuration file in /etc
                              INFO: configuration file found at: "/etc/repmgr.conf"
                              INFO: prerequisites for using pg_rewind are met
                              INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
                              NOTICE: executing pg_rewind
                              DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
                              NOTICE: 2 files copied to /pgdata
                              INFO: directory "/tmp/repmgr-config-archive-node1" deleted
                              INFO: deleting "recovery.done"
                              NOTICE: setting node 1's upstream to node 2
                              WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
                              DETAIL: PQping() returned "PQPING_NO_RESPONSE"
                              NOTICE: starting server using "pg_ctl -w -D '/pgdata' start"
                              INFO: demoted primary is pingable
                              INFO: node 1 has attached to its upstream node
                              NOTICE: NODE REJOIN successful
                              DETAIL: node 1 is now attached to node 2

                              查看集群状态

                                [postgres@node1 ~]$ repmgr cluster show
                                ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
                                ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
                                1 | node1 | standby | running | node2 | default | 100 | 5 | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
                                2 | node2 | primary | * running | | default | 100 | 6 | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2


                                加油吧。长按下图二维码关注公众号更多精彩文章等着你。



                                文章转载自数据库架构,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

                                评论