昨天临近下班一位同事找到我说有个流复制的监控脚本异常,LAG 字段获取不到,于是我翻了一下脚本,发现了如下猫腻 👇🏻
[postgres@xiongcc ~]$ psql -c 'select client_addr as slave_addr,pg_wal_lsn_diff(sent_lsn,write_lsn) as sent_lag,pg_wal_lsn_diff(sent_lsn,replay_lsn) as apply_lag from pg_stat_replication' -X -t -d 'dbname=test port=5432 host=100.68.66.143 user=monitor_replication password=monitor_replication application_name=slave'
psql: FATAL: password authentication failed for user "monitor_replication"
而在另外一台机器上,同样的操作连接同样的服务器却可以正常获取出延时
[postgres@xiongcc ~]$ psql -c 'select client_addr as slave_addr,pg_wal_lsn_diff(sent_lsn,write_lsn) as sent_lag,pg_wal_lsn_diff(sent_lsn,replay_lsn) as apply_lag from pg_stat_replication' -X -t -d 'dbname=test port=5432 host=100.68.66.143 user=monitor_replication password=monitor_replication application_name=slave'
xxx| 0| 0
xxx| 0| 0
那为何同样的密码和用户时灵时不灵呢?
分析
经过分析,原来是 pg_hba.conf 搞的鬼!让我们看个例子,随便新建个用户
postgres=# create user xiongcc with password '123456';
CREATE ROLE
然后故意输入一个错误的密码连接
[postgres@xiongcc ~]$ psql -d 'dbname=postgres user=xiongcc password=smart'
psql (14.2)
Type "help" for help.
postgres=> \conninfo
You are connected to database "postgres" as user "xiongcc" via socket in "/tmp" at port "5432".
可以看到即使输入了一个错误的密码依旧可以连接,让我们检查一下 pg_hba.conf,从 10 以后提供了 pg_hba_file_rules 这个视图,顾名思义,查看认证规则
postgres=# select * from pg_hba_file_rules ;
line_number | type | database | user_name | address | netmask | auth_method | options | error
-------------+-------+---------------+-----------+-----------+-----------------------------------------+-------------+---------+-------
89 | local | {all} | {all} | | | trust | |
91 | host | {all} | {all} | 127.0.0.1 | 255.255.255.255 | trust | |
93 | host | {all} | {all} | ::1 | ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | trust | |
96 | local | {replication} | {all} | | | trust | |
97 | host | {replication} | {all} | 127.0.0.1 | 255.255.255.255 | trust | |
98 | host | {replication} | {all} | ::1 | ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | trust | |
(6 rows)
该视图有一个很方便的点在于假如 pg_hba.conf 文件中添加了一条错误的配置,reload 是不会打印出来的(不太优化,需要在日志中查看),只有在重启的时候才会报错,而 pg_hba_file_rules 视图则会很贴心得告诉你,看个栗子 👇🏻
[postgres@xiongcc ~]$ echo 'host all all 10.10.0.0/16 hello' >> pgdata/pg_hba.conf ---错误的认证方式
[postgres@xiongcc ~]$ pg_ctl -D pgdata/ reload
server signaled
[postgres@xiongcc ~]$ psql
psql (14.2)
Type "help" for help.
postgres=# select * from pg_hba_file_rules ;
line_number | type | database | user_name | address | netmask | auth_method | options | error
-------------+-------+---------------+-----------+-----------+-----------------------------------------+-------------+---------+---------------------------------------
89 | local | {all} | {all} | | | trust | |
91 | host | {all} | {all} | 127.0.0.1 | 255.255.255.255 | trust | |
93 | host | {all} | {all} | ::1 | ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | trust | |
96 | local | {replication} | {all} | | | trust | |
97 | host | {replication} | {all} | 127.0.0.1 | 255.255.255.255 | trust | |
98 | host | {replication} | {all} | ::1 | ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | trust | |
99 | | | | | | | | invalid authentication method "hello"
(7 rows)
很明了,invalid authentication method "hello"。
回到主题,因为我是通过 unix socket 登录的,因此会使用第一条规则 trust。那把 trust 改为 md5 再试一下(需要先将错误的配置注释掉,有任何错误都会导致 reload 失败)
[postgres@xiongcc ~]$ pg_ctl -D pgdata/ reload
server signaled
[postgres@xiongcc ~]$ psql -d 'dbname=postgres user=xiongcc password=smart' ---错误密码
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: password authentication failed for user "xiongcc"
[postgres@xiongcc ~]$ psql -d 'dbname=postgres user=xiongcc password=123456' ---正确密码
psql (14.2)
Type "help" for help.
postgres=>
果然,这一次就会提示密码错误了,因此该案例的原因就分析清楚了:主库的 pg_hba.conf 对其中一台服务器配置了 trust,而另外一台配置了 md5,因此即使在连接串里指定了 password 亦或是 psql -W 强制输入密码的方式,服务端都会忽略你的密码,无条件信任,免去了认证这个过程。
[postgres@xiongcc ~]$ psql -d 'dbname=postgres user=xiongcc' -W ---此处乱输入一通也可以登录
Password:
psql (14.2)
Type "help" for help.
postgres=>
小结
关于 pg_hba.conf 有几点需要注意的事项:
默认的匹配规则是从上往下,匹配到第一条满足的规则就不再继续往下匹配,这一点和 postgresql.conf 有所不同,后者是后面的配置覆盖前面的配置(假如还额外指定了 include_dir、include 的话另说),举个栗子,假如想实现某个 IP 禁止登录但是又不阻碍同网段的其他 IP 登录的话,需要这么配置 👇🏻 假如顺序反了那就起不到作用了。
host all all 10.10.1.132/32 reject ---具体地址放在前面
host all all 10.10.0.0/16 md5 ---大的网段放在后面还有一点就是特殊的 "replication" 权限,常用于流复制场景,需要在 database 列配置 replication
pg_basebackup: error: FATAL: no pg_hba.conf entry for replication connection from host "xxx", user "xxx", SSL off
That's all.




