适用版本:PostgreSQL 12 / 13 / 14 / 15 / 16 / 17
工作模式:手工巡检 / 脚本自动化
核心场景:基于pg_health_check.sh巡检脚本,系统化整理 PostgreSQL 数据库健康检查全量 SQL,涵盖从基础信息到性能调优的 16 大模块
前言
作为 PostgreSQL DBA,日常巡检是保障数据库稳定运行的核心工作之一。本文参考开源巡检脚本 pg_health_check.sh v1.0,将其内嵌的 68 个 SQL 语句按功能模块系统化整理,每个 SQL 均附带详细注释说明、输出字段解读及告警阈值建议。
全文共 16 个巡检模块,覆盖系统信息、基础信息、空间使用、连接与会话、性能采样、共享缓冲区、VACUUM 与膨胀、事务与年龄、WAL 与归档、锁等待、索引分析、用户权限、安全配置、错误日志、备份策略、健康总结等维度。
| 巡检模块 | SQL 数量 | 难度 | 执行频率 |
|---|---|---|---|
| 一、基础信息 | 8 | ⭐ | 每日 |
| 二、空间使用分析 | 5 | ⭐ | 每日 |
| 三、连接与会话分析 | 6 | ⭐⭐ | 每日/实时 |
| 四、性能采样分析 | 5 | ⭐⭐⭐ | 实时采样 |
| 五、共享缓冲区与后台写入 | 3 | ⭐⭐ | 每日 |
| 六、VACUUM 与膨胀分析 | 5 | ⭐⭐⭐ | 每日 |
| 七、数据库年龄与事务分析 | 5 | ⭐⭐⭐ | 每日 |
| 八、WAL 与归档分析 | 5 | ⭐⭐ | 每日 |
| 九、锁等待分析 | 4 | ⭐⭐ | 实时 |
| 十、索引分析 | 5 | ⭐⭐ | 每周 |
| 十一、用户与权限审计 | 5 | ⭐ | 每周 |
| 十二、安全配置检查 | 3 | ⭐⭐ | 每月 |
| 十三、错误与慢查询日志 | 2 | ⭐⭐ | 每日 |
| 十四、备份策略检查 | 0(OS 层) | ⭐ | 每日 |
| 十五、健康检查汇总 | 7 | ⭐⭐ | 每日 |
| 合计 | 68 | - | - |
一、PostgreSQL 基础信息采集
1.1 查询数据库版本
-- 输出版本号、编译选项、操作系统信息等完整字符串
SELECT version();
输出示例:
PostgreSQL 14.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514, 64-bit
1.2 查询实例唯一标识
-- system_identifier 是初始化时生成的全局唯一标识
-- 用于确认主备是否来自同一实例、恢复后标识是否变化
SELECT system_identifier FROM pg_control_system();
小贴士: 该值在
pg_control文件中,实例整个生命周期不变,可用于校验 PITR 恢复后的实例身份。
1.3 数据库列表及大小
-- datname: 数据库名
-- size: 可读格式的大小
-- conn_limit: 连接数上限 (-1 表示无限制)
-- age: 事务 ID 年龄(越大越接近回卷风险)
-- age_remain: 距离 2^31 上限的剩余空间
SELECT datname AS database_name,
pg_size_pretty(pg_database_size(oid)) AS size,
datconnlimit AS conn_limit,
age(datfrozenxid) AS age,
2^31 - age(datfrozenxid) AS age_remain
FROM pg_database
ORDER BY pg_database_size(oid) DESC;
关键指标说明:
| 字段 | 含义 | 告警阈值 |
|---|---|---|
| age | 事务年龄 | > 2 亿 (2×10⁸) 需关注 |
| age_remain | 剩余可用事务 ID | < 5 亿需紧急处理 |
1.4 已安装扩展
-- 查看所有非系统内置扩展及其所属数据库和版本
SELECT d.datname AS database_name,
e.extname AS extension_name,
e.extversion AS version
FROM pg_extension e
JOIN pg_database d ON true
WHERE e.extname NOT IN ('pg_catalog', 'information_schema')
ORDER BY d.datname, e.extname;
1.5 用户定义的表空间
-- spcname: 表空间名称
-- location: 文件系统路径(NULL 表示默认表空间)
-- size: 表空间总占用大小
SELECT spcname AS tablespace_name,
pg_tablespace_location(oid) AS location,
pg_size_pretty(pg_tablespace_size(oid)) AS size
FROM pg_tablespace
ORDER BY pg_tablespace_size(oid) DESC;
1.6 用户修改过的参数(非默认值)
-- 筛选 source 不为 default 和 override 的参数
-- 即用户通过 postgresql.conf、ALTER SYSTEM 或 ALTER SET 修改过的参数
SELECT name AS parameter,
setting AS current_value,
unit,
boot_val AS default_value,
source
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;
DBA 必看: 这是快速了解实例有哪些自定义配置的最佳方式,建议每次巡检必查。
1.7 数据库级或角色级定制参数
-- 通过 ALTER DATABASE ... SET 或 ALTER ROLE ... SET 设置的参数
-- 这些参数仅对特定数据库或角色生效
SELECT d.datname AS database_name,
r.rolname AS role_name,
p.setconfig AS settings
FROM pg_db_role_setting p
LEFT JOIN pg_database d ON p.setdatabase = d.oid
LEFT JOIN pg_roles r ON p.setrole = r.oid;
1.8 对象统计概览
-- 统计用户创建的各类对象数量
-- relkind: r=普通表, v=视图, i=索引, S=序列, m=物化视图
SELECT '普通表' AS type, COUNT(*) AS cnt
FROM pg_class WHERE relkind = 'r' AND relnamespace > 11
UNION ALL
SELECT '视图', COUNT(*) FROM pg_class WHERE relkind = 'v' AND relnamespace > 11
UNION ALL
SELECT '索引', COUNT(*) FROM pg_class WHERE relkind = 'i' AND relnamespace > 11
UNION ALL
SELECT '序列', COUNT(*) FROM pg_class WHERE relkind = 'S' AND relnamespace > 11
UNION ALL
SELECT '物化视图', COUNT(*) FROM pg_class WHERE relkind = 'm' AND relnamespace > 11
UNION ALL
SELECT '触发器', COUNT(*) FROM pg_trigger WHERE tgisinternal = false;
二、空间使用分析
2.1 表空间大小排名
-- 查看所有表空间的磁盘占用情况
SELECT spcname AS tablespace,
pg_size_pretty(pg_tablespace_size(oid)) AS total_size,
CASE WHEN pg_tablespace_location(oid) IS NULL
THEN 'default location' ELSE pg_tablespace_location(oid)
END AS location
FROM pg_tablespace
ORDER BY pg_tablespace_size(oid) DESC;
2.2 数据库大小排名
-- 排除模板库,按大小降序排列
SELECT datname AS database_name,
pg_size_pretty(pg_database_size(oid)) AS size,
pg_database_size(oid) AS size_bytes
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY size_bytes DESC;
2.3 TOP 20 大表
-- 排除系统 schema,返回表大小和估算行数
-- reltuples 是优化器统计值,可能与实际行数有偏差(需 ANALYZE 后准确)
SELECT
n.nspname AS schema_name,
c.relname AS table_name,
pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
pg_relation_size(c.oid) AS size_bytes,
c.reltuples::bigint AS estimated_rows
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
ORDER BY pg_relation_size(c.oid) DESC
LIMIT 20;
2.4 TOP 20 大索引
-- 返回索引所属表、索引名、大小等信息
-- 通过 pg_index 关联索引与基表的关系
SELECT
n.nspname AS schema_name,
c.relname AS table_name,
i.relname AS index_name,
pg_size_pretty(pg_relation_size(i.oid)) AS index_size,
pg_relation_size(i.oid) AS size_bytes
FROM pg_class i
JOIN pg_namespace n ON n.oid = i.relnamespace
JOIN pg_index ix ON ix.indexrelid = i.oid
JOIN pg_class c ON c.oid = ix.indrelid
WHERE i.relkind = 'i'
AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
ORDER BY pg_relation_size(i.oid) DESC
LIMIT 20;
2.5 表膨胀率概览(索引/表大小比)
-- index_to_table_ratio 反映索引占表的比例
-- 比例过高可能存在冗余索引,过低可能缺少必要索引
-- 仅统计大于 1MB 的表
SELECT
n.nspname AS schema_name,
c.relname AS table_name,
pg_size_pretty(pg_relation_size(c.oid)) AS current_size,
ROUND(100.0 * pg_relation_size(c.oid) /
NULLIF(pg_total_relation_size(c.oid) - pg_relation_size(c.oid), 0), 2) AS index_to_table_ratio
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
AND pg_relation_size(c.oid) > 1024 * 1024
ORDER BY pg_relation_size(c.oid) DESC
LIMIT 20;
三、连接与会话分析
3.1 当前连接状态汇总
-- state 可能值: active, idle, idle in transaction,
-- idle in transaction (aborted), fastpath function call, disabled
SELECT
COALESCE(state, 'unknown') AS state,
COUNT(*) AS connection_count
FROM pg_stat_activity
GROUP BY state
ORDER BY connection_count DESC;
状态含义速查:
| state | 含义 | 是否需要关注 |
|---|---|---|
| active | 正在执行查询 | 正常 |
| idle | 空闲等待新命令 | 正常 |
| idle in transaction | 事务中空闲 | ⚠️ 可能持有锁 |
| idle in transaction (aborted) | 事务中出错未回滚 | 🔴 需立即处理 |
3.2 连接详情(活跃前 20,按等待时间排序)
-- seconds_in_state: 在当前状态持续的时间(秒)
-- current_query: 截取前 100 字符的当前 SQL
SELECT
pid,
usename AS username,
datname AS database,
state,
client_addr,
application_name,
EXTRACT(EPOCH FROM (now() - state_change))::int AS seconds_in_state,
LEFT(REPLACE(query, chr(10), ' '), 100) AS current_query
FROM pg_stat_activity
WHERE state IS NOT NULL
ORDER BY state_change ASC
LIMIT 20;
3.3 连接数统计与余量
-- max_connections: 最大允许连接数
-- reserved_for_superuser: 为超级用户保留的连接槽位
-- available_for_normal: 普通用户剩余可用连接数
SELECT
MAX(setting)::int AS max_connections,
(SELECT COUNT(*) FROM pg_stat_activity) AS current_connections,
(SELECT setting FROM pg_settings WHERE name = 'superuser_reserved_connections')::int AS reserved_for_superuser,
MAX(setting)::int - (SELECT COUNT(*) FROM pg_stat_activity) AS available_for_normal
FROM pg_settings
WHERE name = 'max_connections';
告警建议: 当
available_for_normal < 10时应关注,< 0 则意味着普通用户已无法新建连接。
3.4 各用户连接数与限额
-- rolconnlimit: 角色级别的连接数限制 (-1 表示不限制)
-- current_connects: 当前该用户的实际连接数
SELECT
a.rolname AS username,
a.rolconnlimit AS conn_limit,
COALESCE(b.connects, 0) AS current_connects
FROM pg_authid a
LEFT JOIN (
SELECT usename, COUNT(*) AS connects
FROM pg_stat_activity
GROUP BY usename
) b ON a.rolname = b.usename
WHERE a.rolname NOT IN ('postgres')
ORDER BY b.connects DESC NULLS LAST;
3.5 各数据库连接数与限额
-- 类似用户维度,从数据库维度查看连接分布
SELECT
a.datname AS database_name,
a.datconnlimit AS conn_limit,
COALESCE(b.connects, 0) AS current_connects
FROM pg_database a
LEFT JOIN (
SELECT datname, COUNT(*) AS connects
FROM pg_stat_activity
GROUP BY datname
) b ON a.datname = b.datname
WHERE a.datname NOT IN ('template0', 'template1')
ORDER BY b.connects DESC NULLS LAST;
3.6 长时间空闲连接(空闲超 5 分钟)
-- 空闲超过 5 分钟的连接可能是应用连接池泄漏
-- 可考虑配置 idle_in_transaction_session_timeout 自动清理
SELECT
pid,
usename AS username,
datname AS database,
client_addr,
application_name,
EXTRACT(EPOCH FROM (now() - state_change))::int AS idle_seconds
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '5 minutes'
ORDER BY state_change ASC
LIMIT 20;
四、性能采样分析(两次快照对比)
原理说明: 本模块通过前后两次采样
pg_stat_database的差值计算实时 QPS、TPS 和命中率,类似 Oracle AWR 快照机制。
4.1 采样 — 缓冲区块访问总量
-- 总缓冲区访问次数(命中 + 未命中)
-- 需间隔 10~30 秒采集两次,做差计算增量
SELECT blks_hit + blks_read FROM pg_stat_database WHERE datname = current_database();
4.2 采样 — 缓冲区命中次数
-- 仅缓冲区命中次数,与 4.1 配合计算命中率
SELECT blks_hit FROM pg_stat_database WHERE datname = current_database();
4.3 采样 — 元组变更总量
-- 插入 + 更新 + 删除的元组总数(累计值)
SELECT (n_tup_ins + n_tup_upd + n_tup_del) FROM pg_stat_database WHERE datname = current_database();
4.4 采样 — 事务提交回滚总数
-- 提交 + 回滚的事务总数(累计值)
SELECT (xact_commit + xact_rollback) FROM pg_stat_database WHERE datname = current_database();
使用方式: 间隔 10~30 秒后重复执行 4.1 ~ 4.4,通过差值计算:
- Buffer Hit Ratio = hit_delta / blks_delta,建议 > 99%
- TPS = xact_delta / interval_seconds
- Tuple Changes/sec = tup_delta / interval_seconds
4.5 数据库级性能指标汇总
-- 单次查询即可获取当前数据库的关键性能累计指标
-- 注意是累计值,需两次采样做差才有实时意义
SELECT
datname,
xact_commit,
xact_rollback,
blks_read,
blks_hit,
ROUND(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_ratio,
tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted,
conflicts,
deadlocks,
blk_read_time::int AS read_ms,
blk_write_time::int AS write_ms,
temp_files,
temp_bytes
FROM pg_stat_database
WHERE datname = current_database();
关键指标告警阈值:
| 指标 | 建议范围 | 告警条件 |
|---|---|---|
| cache_hit_ratio | > 99% | < 95% 需增加 shared_buffers |
| deadlocks | 0 | > 0 需排查业务逻辑 |
| conflicts | 低 | 备库冲突过高需关注 |
| temp_files | 0 | 大量临时文件说明 work_mem 不足 |
五、共享缓冲区与后台写入(BGWriter)
5.1 内存与 IO 相关关键参数
-- 这些参数直接影响内存使用和磁盘写入策略
SELECT name, setting, unit, short_desc
FROM pg_settings
WHERE name IN (
'shared_buffers', -- 共享缓冲区大小
'effective_cache_size', -- 优化器估计的可用缓存
'work_mem', -- 排序/Hash 操作内存上限
'maintenance_work_mem', -- 维护操作(VACUUM/CREATE INDEX)内存
'temp_buffers', -- 临时表使用的缓冲区
'max_connections', -- 最大连接数
'bgwriter_delay', -- BGWriter 轮询间隔(ms)
'bgwriter_lru_maxpages', -- BGWriter 每次 LRU 刷写的最大页数
'bgwriter_lru_multiplier', -- BGWriter 刷写倍率
'bgwriter_flush_after', -- BGWriter 累积刷写字节阈值
'checkpoint_completion_target', -- 检查点完成目标
'checkpoint_timeout', -- 检查点间隔时间
'min_wal_size', -- WAL 最小大小
'max_wal_size', -- WAL 最大大小
'wal_buffers' -- WAL 缓冲区大小
)
ORDER BY name;
参数调优建议:
| 参数 | 推荐值 | 说明 |
|---|---|---|
| shared_buffers | 物理内存的 25% | 不超过 40%,Linux 上受 shmmax 限制 |
| effective_cache_size | 物理内存的 50%~75% | 仅供优化器参考,不实际分配 |
| work_mem | (shared_buffers / max_connections) × 0.25 | 并发高时需适当降低 |
| maintenance_work_mem | 256MB~1GB | VACUUM 和 CREATE INDEX 使用 |
5.2 BGWriter 统计信息
-- 查看后台写入进程的完整统计信息
-- checkpoints_timed: 计划内检查点次数
-- checkpoints_req: 请求触发的检查点次数(过多说明 WAL 写入过快)
-- buffers_checkpoint: 检查点刷写的缓冲区数
-- buffers_clean: BGWriter 主动刷写的缓冲区数
-- buffers_backend: 后端进程自行刷写的缓冲区数(过高说明 BGWriter 不够积极)
SELECT * FROM pg_stat_bgwriter;
进阶技巧:
buffers_backend过高意味着后端进程经常需要自己刷脏页,说明 BGWriter 不够积极。可适当调低bgwriter_delay或调高bgwriter_lru_maxpages。
5.3 数据库级缓冲区命中统计
-- 逐库查看缓存命中率、死锁、IO 时间等
SELECT
s.datname AS database,
s.numbackends AS backends,
s.xact_commit AS commits,
s.xact_rollback AS rollbacks,
s.blks_read,
s.blks_hit,
ROUND(100.0 * s.blks_hit / NULLIF(s.blks_hit + s.blks_read, 0), 2) AS cache_hit_ratio,
s.tup_returned, s.tup_fetched, s.tup_inserted, s.tup_updated, s.tup_deleted,
s.conflicts,
s.deadlocks,
s.blk_read_time::int AS read_ms,
s.blk_write_time::int AS write_ms,
s.temp_files AS temp_file_count,
s.temp_bytes AS temp_bytes
FROM pg_stat_database s
WHERE s.datname NOT IN ('template0', 'template1')
ORDER BY s.datname;
六、VACUUM 与膨胀分析
6.1 Autovacuum 配置参数
-- 检查自动清理是否开启及相关阈值设置
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN (
'autovacuum', -- 是否开启自动清理
'autovacuum_max_workers', -- 自动清理工作进程数
'autovacuum_naptime', -- 自动清理轮询间隔
'autovacuum_vacuum_threshold', -- 触发 VACUUM 的死元组数阈值
'autovacuum_vacuum_scale_factor',-- 触发 VACUUM 的死元组比例
'autovacuum_analyze_threshold', -- 触发 ANALYZE 的变更行数阈值
'autovacuum_analyze_scale_factor',-- 触发 ANALYZE 的变更比例
'autovacuum_vacuum_cost_delay', -- 自动清理的延迟(ms)
'autovacuum_vacuum_cost_limit', -- 自动清理的成本限制
'vacuum_cleanup_index_scale_factor' -- 索引清理的比例因子
)
ORDER BY name;
关键参数建议:
| 参数 | 默认值 | 建议值 | 说明 |
|---|---|---|---|
| autovacuum | on | on | 生产环境必须开启 |
| autovacuum_vacuum_scale_factor | 0.2 | 0.05~0.1 | 大表建议更低 |
| autovacuum_max_workers | 3 | 4~6 | 表多时适当增加 |
| autovacuum_naptime | 60s | 30s~60s | 高频写入库可缩短 |
6.2 表膨胀 Top15(需 VACUUM)
-- dead_ratio_pct: 死元组占活元组的百分比
-- 值过高说明 VACUUM 不及时,需要关注
SELECT
schemaname,
relname AS table_name,
n_dead_tup AS dead_tuples,
n_live_tup AS live_tuples,
CASE WHEN n_live_tup > 0
THEN ROUND(100.0 * n_dead_tup / n_live_tup, 2)
ELSE 0
END AS dead_ratio_pct,
last_vacuum,
last_autovacuum,
vacuum_count,
autovacuum_count
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 15;
告警阈值:
dead_ratio_pct > 10%需要关注,> 50% 应立即 VACUUM。
6.3 未及时 VACUUM 的表
-- 超过 7 天未执行 VACUUM 且死元组 > 10000 的表
-- "从未 VACUUM" 的标记说明自动清理从未触及该表
SELECT
schemaname,
relname AS table_name,
n_dead_tup,
last_autovacuum,
CASE WHEN last_autovacuum IS NULL AND last_vacuum IS NULL
THEN '从未 VACUUM'
ELSE ''
END AS note
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
AND (last_autovacuum IS NULL OR last_autovacuum < now() - interval '7 days')
AND (last_vacuum IS NULL OR last_vacuum < now() - interval '7 days')
ORDER BY n_dead_tup DESC
LIMIT 10;
6.4 表膨胀详细分析(膨胀倍数 Top10)
-- bloat_ratio 反映索引空间与表空间的比例
-- 比例异常高说明索引可能严重膨胀
-- 仅分析大于 10MB 的表
SELECT
schemaname,
relname AS table_name,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
ROUND(100.0 * pg_relation_size(relid) /
NULLIF(pg_total_relation_size(relid) - pg_relation_size(relid), 0), 2) AS bloat_ratio
FROM pg_stat_user_tables
WHERE pg_relation_size(relid) > 10 * 1024 * 1024
ORDER BY bloat_ratio DESC NULLS LAST
LIMIT 10;
6.5 索引膨胀 Top10
-- 扫描次数少于 100 次但体积超过 1MB 的索引
-- avg_bytes_per_scan: 每次扫描平均消耗字节数,过大说明索引效率低
SELECT
n.nspname AS schemaname,
i.relname AS index_name,
c.relname AS table_name,
pg_size_pretty(pg_relation_size(i.oid)) AS index_size,
s.idx_scan,
s.idx_tup_read,
s.idx_tup_fetch,
pg_relation_size(i.oid) / NULLIF(s.idx_scan + 1, 0) AS avg_bytes_per_scan
FROM pg_stat_user_indexes s
JOIN pg_class i ON i.oid = s.indexrelid
JOIN pg_class c ON c.oid = s.relid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE s.idx_scan < 100
AND pg_relation_size(i.oid) > 1024 * 1024
AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(i.oid) DESC
LIMIT 10;
VACUUM 模块建议:
- Dead tuples 过多会导致查询变慢,建议调整
autovacuum_vacuum_scale_factor(推荐 0.05~0.1) - 大表建议设置单独的
autovacuum_vacuum_cost_delay或手工定期 VACUUM - 未使用索引占用空间且影响写入性能,建议评估后删除
七、数据库年龄与事务分析
7.1 数据库年龄
-- current_age: 事务 ID 年龄,即距离冻结点的距离
-- age_remain: 距离 2^31 上限的剩余事务数
-- 当 age 接近 2^31 时会发生事务 ID 回卷,数据库将只读
SELECT
datname AS database_name,
age(datfrozenxid) AS current_age,
2^31 - age(datfrozenxid) AS age_remain,
CASE WHEN age(datfrozenxid) > 2^30
THEN '⚠️ 告警: 年龄即将耗尽'
WHEN age(datfrozenxid) > 2^29
THEN '⚠️ 警告: 年龄接近上限'
ELSE '正常'
END AS status
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY age(datfrozenxid) DESC;
🔴 XID 回卷风险是 PostgreSQL 最严重的故障之一! 当
age_remain接近 0 时,数据库将停止接受写入。务必定期检查此指标。
7.2 表年龄 Top10
-- 表级别的冻结年龄,age > 100 万才显示
-- last_vacuum_time: 最近一次 VACUUM(含自动)的时间
SELECT
n.nspname AS schemaname,
c.relname AS table_name,
c.relkind,
age(c.relfrozenxid) AS current_age,
2^31 - age(c.relfrozenxid) AS age_remain,
GREATEST(s.last_vacuum, s.last_autovacuum) AS last_vacuum_time
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relkind IN ('r', 't', 'm')
AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
AND age(c.relfrozenxid) > 1000000
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
7.3 长事务(运行超 30 分钟)
-- 长事务会阻止 VACUUM 清理死元组,是膨胀的主要诱因
-- backend_xid / backend_xmin: 表示该后端正在持有事务快照
SELECT
pid,
usename AS username,
datname AS database,
state,
EXTRACT(EPOCH FROM (now() - xact_start))::int / 60 AS xact_minutes,
EXTRACT(EPOCH FROM (now() - query_start))::int / 60 AS query_minutes,
LEFT(REPLACE(query, chr(10), ' '), 120) AS query
FROM pg_stat_activity
WHERE state NOT IN ('idle')
AND (backend_xid IS NOT NULL OR backend_xmin IS NOT NULL)
AND xact_start < now() - interval '30 minutes'
ORDER BY xact_start;
处理建议: 确认后可使用
SELECT pg_terminate_backend(pid)终止长事务,同时排查应用代码中的事务边界问题。
7.4 两阶段提交(2PC)预提交事务
-- 2PC 事务长时间未提交会阻止 VACUUM,导致表膨胀
-- prepared_minutes: 预提交事务已等待的分钟数
SELECT
transaction,
gid,
prepared,
owner,
database AS db_name,
EXTRACT(EPOCH FROM (now() - prepared)) / 60 AS prepared_minutes
FROM pg_prepared_xacts
ORDER BY prepared;
7.5 复制槽位(可能导致 WAL 积压)
-- 不活跃的复制槽会阻止 WAL 文件回收,导致 pg_wal 膨胀
-- lag_bytes: 当前 WAL 位置与槽位重启 LSN 的字节差
SELECT
slot_name,
plugin,
slot_type,
COALESCE(d.datname, 'N/A') AS database_name,
active,
restart_lsn,
confirmed_flush_lsn,
CASE WHEN restart_lsn IS NOT NULL AND pg_is_in_recovery() = false
THEN pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
)
ELSE NULL
END AS lag_bytes
FROM pg_replication_slots rs
LEFT JOIN pg_database d ON rs.database::oid = d.oid;
⚠️ 重要:
active = false的复制槽是最常见的 WAL 膨胀原因,必须及时清理:SELECT pg_drop_replication_slot('slot_name');
八、WAL 与归档分析
8.1 WAL 配置参数
-- 核心参数解读:
-- wal_level: replica 支持流复制,logical 支持逻辑解码
-- archive_mode: on 才能实现 PITR
-- max_wal_size: 控制 WAL 总量上限,过小导致频繁检查点
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN (
'wal_level', 'max_wal_senders', 'max_replication_slots',
'wal_keep_size', 'min_wal_size', 'max_wal_size',
'archive_mode', 'archive_command', 'archive_timeout',
'wal_compression', 'wal_log_hints', 'wal_init_zero'
)
ORDER BY name;
8.2 归档统计
-- archived_count: 已成功归档的 WAL 数
-- failed_count: 归档失败次数(非 0 需立即排查)
-- last_archived_wal / last_archived_time: 最近归档信息
SELECT * FROM pg_stat_archiver;
8.3 当前 WAL 位置
-- current_lsn: 当前写入的 WAL 位置
-- current_wal_file: 当前正在写入的 WAL 文件名
-- bytes_in_wal: 自起始以来的 WAL 字节数
SELECT
pg_current_wal_lsn() AS current_lsn,
pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file,
pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'::pg_lsn) AS bytes_in_wal;
8.4 流复制状态(主库视角)
-- 主库查看所有备库的复制延迟
-- lag_bytes: 主库已发送但备库尚未回放的 WAL 字节数
-- sync_state: sync=同步, async=异步, potential=候选同步
SELECT
client_addr,
usename,
application_name,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
sync_state
FROM pg_stat_replication;
备库视角:
-- 在备库上执行,查看 WAL 接收和回放状态
-- replay_lag: 接收到回放的延迟
SELECT
pid,
status,
receive_start_lsn,
received_lsn,
last_msg_send_time,
last_msg_receipt_time,
latest_end_lsn,
slot_name,
pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag
FROM pg_stat_wal_receiver;
九、锁等待分析
9.1 当前锁统计
-- 按 locktype / relation / mode 分组统计锁数量
-- locktype 类型: relation, extend, page, tuple, transactionid, virtualxid, object, userlock, advisory
SELECT
locktype,
database::regclass AS database,
relation::regclass AS relation,
page, tuple,
virtualxid, transactionid,
granted,
mode,
COUNT(*) AS lock_count
FROM pg_locks
WHERE locktype NOT IN ('virtualxid')
GROUP BY 1,2,3,4,5,6,7,8,9
ORDER BY lock_count DESC, granted
LIMIT 30;
9.2 等待中的锁(谁阻塞了谁)
-- 找到被阻塞的会话及其阻塞源
-- waiting_pid: 被阻塞的会话 PID
-- blocking_pid: 持有锁的会话 PID
SELECT
w.pid AS waiting_pid,
w.locktype AS waiting_locktype,
COALESCE(w.relation::regclass::text, w.transactionid::text, w.virtualxid) AS waiting_object,
w.mode AS waiting_mode,
w.granted AS waiting_granted,
b.pid AS blocking_pid,
b.usename AS blocking_user,
LEFT(REPLACE(b.query, chr(10), ' '), 80) AS blocking_query
FROM pg_locks w
JOIN pg_stat_activity b ON b.pid = w.pid
WHERE NOT w.granted
ORDER BY w.pid;
9.3 长时间持锁(超 10 秒)
-- 已授予但持锁时间超 10 秒的会话
-- 可能影响其他事务的并发性能
SELECT
l.pid,
l.locktype,
l.mode,
l.relation::regclass,
l.granted,
a.usename,
a.datname,
a.query,
EXTRACT(EPOCH FROM (now() - a.xact_start))::int AS xact_seconds,
EXTRACT(EPOCH FROM (now() - a.query_start))::int AS query_seconds
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE l.granted
AND a.query_start < now() - interval '10 seconds'
ORDER BY a.query_start;
9.4 死锁检测
-- 查看各数据库的死锁累计次数
-- 仅显示死锁数 > 0 的数据库
SELECT
datname,
SUM(deadlocks)::int AS deadlock_count
FROM pg_stat_database
GROUP BY datname
HAVING SUM(deadlocks) > 0;
死锁排查建议: 死锁通常是应用事务逻辑问题(交叉更新、缺少 SELECT FOR UPDATE 等),需在应用层修复。
deadlocks是累计值,重启后清零。
十、索引分析
10.1 未使用索引 Top20
-- idx_scan = 0 说明自上次统计重置后从未使用过
-- 体积 > 64KB 才显示(过滤小索引)
-- 这些索引只占用空间、拖慢 DML,建议评估删除
SELECT
n.nspname AS schemaname,
c.relname AS table_name,
i.relname AS index_name,
pg_size_pretty(pg_relation_size(i.oid)) AS index_size,
s.idx_scan,
s.idx_tup_read,
s.idx_tup_fetch
FROM pg_stat_user_indexes s
JOIN pg_class i ON i.oid = s.indexrelid
JOIN pg_class c ON c.oid = s.relid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE s.idx_scan = 0
AND pg_relation_size(i.oid) > 65536
AND i.relname NOT LIKE 'pg_toast_%'
AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(i.oid) DESC
LIMIT 20;
10.2 索引使用率偏低(扫描 1~100 次但占用空间大)
-- bytes_per_scan: 每次扫描消耗的字节数
-- 体积 > 10MB 且扫描极少的索引性价比极低
SELECT
n.nspname AS schemaname,
c.relname AS table_name,
i.relname AS index_name,
pg_size_pretty(pg_relation_size(i.oid)) AS index_size,
s.idx_scan,
pg_relation_size(i.oid) / NULLIF(s.idx_scan, 0) AS bytes_per_scan
FROM pg_stat_user_indexes s
JOIN pg_class i ON i.oid = s.indexrelid
JOIN pg_class c ON c.oid = s.relid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE s.idx_scan BETWEEN 1 AND 100
AND pg_relation_size(i.oid) > 10 * 1024 * 1024
AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(i.oid) DESC
LIMIT 10;
10.3 重复索引
-- 查找同一列集上有多个相同定义的非唯一、非主键索引
-- 重复索引纯属浪费,应删除多余的那个
SELECT
t.relname AS table_name,
array_agg(DISTINCT i.indexrelid::regclass) AS duplicate_indexes,
pg_get_indexdef(i.indexrelid) AS index_definition,
COUNT(*) AS count
FROM pg_index i
JOIN pg_class t ON t.oid = i.indrelid
WHERE i.indisunique = false
AND i.indisprimary = false
GROUP BY t.relname, i.indexrelid, pg_get_indexdef(i.indexrelid)
HAVING COUNT(*) > 1;
10.4 无效索引
-- indisvalid = false 的索引是 CONCURRENTLY 创建失败后遗留的
-- 不会被查询使用,但仍然占用磁盘空间
-- 处理: DROP INDEX CONCURRENTLY index_name; 然后重建
SELECT
n.nspname AS schemaname,
c.relname AS table_name,
i.relname AS invalid_index_name,
ix.indexrelid::regclass AS invalid_index
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class c ON c.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE ix.indisvalid = false
AND n.nspname NOT IN ('pg_catalog', 'information_schema');
10.5 索引过多的表(> 5 个索引)
-- 索引过多会加重写入开销和表膨胀
-- 建议单表索引数控制在 5 个以内
SELECT
n.nspname AS schemaname,
c.relname AS table_name,
COUNT(*) AS index_count,
pg_size_pretty(SUM(pg_relation_size(i.oid))) AS total_index_size,
pg_size_pretty(pg_relation_size(c.oid)) AS table_size
FROM pg_stat_user_indexes s
JOIN pg_class i ON i.oid = s.indexrelid
JOIN pg_class c ON c.oid = s.relid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY n.nspname, c.relname, c.oid
HAVING COUNT(*) > 5
ORDER BY COUNT(*) DESC
LIMIT 15;
十一、用户与权限审计
11.1 用户列表
-- rolsuper: 是否超级用户
-- rolcanlogin: 是否允许登录
-- rolvaliduntil: 密码有效期
-- password_status: 自动判断密码是否过期/即将过期
SELECT
rolname AS username,
rolsuper AS is_superuser,
rolinherit AS inherits,
rolcreaterole AS can_create_role,
rolcreatedb AS can_create_db,
rolcanlogin AS can_login,
rolreplication AS is_replication,
rolconnlimit AS conn_limit,
rolvaliduntil AS valid_until,
CASE WHEN rolvaliduntil < now() THEN '⚠️ 已过期'
WHEN rolvaliduntil < now() + interval '30 days' THEN '⚠️ 即将过期'
ELSE '正常'
END AS password_status
FROM pg_roles
WHERE rolname NOT IN ('pg_read_all_settings', 'pg_read_all_stats',
'pg_stat_scan_tables', 'pg_signal_backend')
ORDER BY rolcanlogin DESC, rolname;
11.2 超级用户检查
-- 生产环境应尽量减少超级用户数量
-- 应用账号不应使用 superuser 权限
SELECT
rolname AS superuser_name,
oid,
rolvaliduntil
FROM pg_authid
WHERE rolsuper = true
ORDER BY rolname;
11.3 密码过期检查
-- 检查所有可登录用户的密码有效期
-- 永不过期(NULL) 的用户建议设置有效期
SELECT
rolname AS username,
rolvaliduntil,
CASE WHEN rolvaliduntil IS NULL THEN '永不过期'
WHEN rolvaliduntil < now() THEN '⚠️ 已过期'
WHEN rolvaliduntil < now() + interval '30 days' THEN '⚠️ 30天内过期'
ELSE '正常'
END AS status
FROM pg_roles
WHERE rolcanlogin = true
AND rolname != 'postgres'
ORDER BY rolvaliduntil;
11.4 public 权限检查
-- 检查是否有对 public 开放了非默认权限
-- public 权限过宽是常见安全隐患
SELECT
n.nspname AS schema_name,
c.relname AS object_name,
c.relkind AS object_type,
'非默认权限' AS privilege_type
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_default_acl d ON d.defaclnamespace = n.oid
WHERE d.defaclacl::text ~ '.*=.*/.*'
AND n.nspname NOT IN ('pg_catalog', 'information_schema');
11.5 默认权限检查
-- 查看 ALTER DEFAULT PRIVILEGES 设置的默认权限
-- 确保新建对象不会自动继承过宽权限
SELECT n.nspname, d.defaclnamespace::regnamespace AS schema_name,
d.defaclobjtype, d.defaclacl
FROM pg_default_acl d
JOIN pg_namespace n ON n.oid = d.defaclnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
LIMIT 20;
十二、安全配置检查
12.1 安全相关参数
-- SSL、密码加密、日志审计等安全参数一览
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN (
'ssl', 'ssl_ca_file', 'ssl_cert_file', 'ssl_key_file', 'ssl_ciphers',
'password_encryption', 'bcrypt_rounds',
'shared_preload_libraries',
'log_connections', 'log_disconnections', 'log_duration',
'log_line_prefix', 'log_statement', 'log_min_duration_statement',
'log_checkpoints', 'log_lock_waits', 'log_temp_files',
'row_security', 'enable_row_security'
)
ORDER BY name;
12.2 SSL 配置检查
-- 当 ssl = on 时查看 SSL 证书配置
-- 生产环境强烈建议开启 SSL 加密
SELECT name, setting FROM pg_settings
WHERE name LIKE 'ssl_%' AND setting != '';
12.3 密码加密方式
-- password_encryption 应为 scram-sha-256
-- md5 已不安全,新版本默认 scram-sha-256
-- 升级步骤: SET password_encryption = 'scram-sha-256'; 然后 ALTER USER xxx PASSWORD 'xxx';
SELECT name, setting FROM pg_settings
WHERE name = 'password_encryption';
安全加固清单:
| 检查项 | 推荐值 | 风险等级 |
|---|---|---|
| ssl | on | 🔴 生产必须 |
| password_encryption | scram-sha-256 | 🔴 md5 已不安全 |
| log_connections | on | 🟡 审计需要 |
| log_disconnections | on | 🟡 审计需要 |
| log_statement | ddl | 🟡 至少记录 DDL |
| log_min_duration_statement | 1000 (ms) | 🟢 慢查询审计 |
十三、错误与慢查询日志
13.1 日志配置参数
-- logging_collector: 是否开启日志收集器
-- log_min_duration_statement: 慢查询阈值(-1 表示关闭)
-- log_rotation_age/size: 日志轮转策略
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN (
'log_destination', 'logging_collector', 'log_directory',
'log_filename', 'log_rotation_age', 'log_rotation_size',
'log_min_messages', 'log_min_error_statement', 'log_min_duration_statement',
'log_connections', 'log_disconnections', 'log_duration',
'log_line_prefix', 'log_statement', 'log_temp_files',
'log_lock_waits', 'log_checkpoints', 'log_autovacuum_min_duration'
)
ORDER BY name;
13.2 pg_stat_statements Top 20(需安装扩展)
-- 按 total_exec_time 降序排列,定位最耗时的 SQL
-- calls: 执行次数
-- mean_exec_time: 平均执行时间(ms)
-- shared_blks_read/hit: 共享缓冲区读写块数
-- 需先: CREATE EXTENSION pg_stat_statements;
-- 并在 shared_preload_libraries 中加入 pg_stat_statements
SELECT
LEFT(query, 150) AS query,
calls,
total_exec_time,
mean_exec_time,
max_exec_time,
rows,
shared_blks_hit,
shared_blks_read,
shared_blks_written
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
安装提示: 需在
postgresql.conf中设置shared_preload_libraries = 'pg_stat_statements',然后重启实例并执行CREATE EXTENSION pg_stat_statements;
十四、备份策略检查
本模块为 OS 层面检查,无 SQL 脚本,通过 shell 命令验证备份状态。
检查清单:
| 检查项 | 方法 | 说明 |
|---|---|---|
| 备份目录 | ls -la /data/backup/ |
确认备份文件存在且更新 |
| 最近备份时间 | stat /data/backup/latest.dump |
48 小时内有新备份 |
| crontab 定时任务 | crontab -l \| grep pg_dump |
确认有定时备份 |
| WAL 归档 | ls $PGDATA/pg_wal/archive_status/ |
无 .ready 残留 |
| 备份恢复测试 | 定期执行 PITR 演练 | 至少每季度一次 |
十五、健康检查汇总
以下 7 条汇总 SQL 快速评估实例整体健康状态。
15.1 Autovacuum 是否开启
SELECT setting FROM pg_settings WHERE name = 'autovacuum';
-- 期望值: on
15.2 归档模式是否开启
SELECT setting FROM pg_settings WHERE name = 'archive_mode';
-- 期望值: on(生产环境)
15.3 SSL 是否启用
SELECT setting FROM pg_settings WHERE name = 'ssl';
-- 期望值: on(生产环境)
15.4 密码加密方式
SELECT setting FROM pg_settings WHERE name = 'password_encryption';
-- 期望值: scram-sha-256
15.5 连接数余量
SELECT
MAX(setting)::int AS max_connections,
(SELECT COUNT(*) FROM pg_stat_activity) AS current_connections
FROM pg_settings
WHERE name = 'max_connections';
-- current / max < 80% 为健康
15.6 缓冲区命中率
SELECT ROUND(100.0 * SUM(blks_hit) / NULLIF(SUM(blks_hit) + SUM(blks_read), 0), 2) AS hit_ratio_pct
FROM pg_stat_database
WHERE datname = current_database();
-- > 99% 优秀,> 95% 正常,< 90% 需优化
15.7 死锁次数
SELECT SUM(deadlocks)::int AS total_deadlocks FROM pg_stat_database;
-- 期望值: 0(累计值,重启后清零)
健康检查汇总速查表
| 检查项 | 正常值 | 告警条件 | 处理建议 |
|---|---|---|---|
| Autovacuum | on | off | 立即开启 |
| Archive Mode | on | off | 开启归档支持 PITR |
| SSL | on | off | 生产环境必须开启 |
| 密码加密 | scram-sha-256 | md5 | 升级加密方式 |
| 连接数使用率 | < 80% | > 90% | 增加 max_connections 或优化连接池 |
| Buffer 命中率 | > 99% | < 95% | 增加 shared_buffers 或优化查询 |
| 死锁次数 | 0 | > 0 | 排查应用事务逻辑 |
| 数据库年龄 | < 10 亿 | > 15 亿 | 执行 VACUUM FREEZE |
| 复制槽 Lag | < 1GB | > 10GB | 检查备库状态或清理无效槽 |
| 未归档 WAL | 0 | > 0 | 排查归档命令 |
总结 & 注意事项
- 执行权限:本文所有 SQL 均需要超级用户(
postgres)或pg_read_all_stats角色权限才能获取完整信息 - 性能影响:所有查询均基于统计视图(
pg_stat_*、pg_settings),对生产库无性能影响;仅性能采样模块需要 10~30 秒间隔 - 版本差异:部分视图在 PostgreSQL 12 以下版本不可用(如
pg_stat_wal_receiver在 9.6+ 可用,pg_control_system()在 12+ 可用) - 统计重置:
pg_stat_*视图为累计值,实例重启后清零;可使用pg_stat_reset()手动重置 - XID 回卷:这是 PostgreSQL 特有的严重风险,Oracle/MySQL DBA 转型时需重点关注,务必确保 autovacuum 正常运行
- WAL 膨胀:不活跃的复制槽和失败的归档是 WAL 膨胀的两大主因,巡检时重点检查
- 连接泄漏:
idle in transaction是 PG 特有的问题(Oracle 中不存在),建议配置idle_in_transaction_session_timeout - 定期巡检:建议每日自动执行巡检脚本,周报对比关键指标变化趋势
版本差异说明:
- PostgreSQL 12+:所有 SQL 均可直接使用
- PostgreSQL 9.6~11:
pg_control_system()不可用,需用pg_controldata命令替代;wal_keep_size参数名为wal_keep_segments- PostgreSQL 14+:
pg_stat_statements中total_time更名为total_exec_time




