暂无图片
暂无图片
10
暂无图片
暂无图片
暂无图片

PostgreSQL DBA 巡检必备:68 个常用 SQL 脚本

原创 lucky、糯米汤圆 2026-04-28
537

适用版本:PostgreSQL 12 / 13 / 14 / 15 / 16 / 17
工作模式:手工巡检 / 脚本自动化
核心场景:基于 pg_health_check.sh 巡检脚本,系统化整理 PostgreSQL 数据库健康检查全量 SQL,涵盖从基础信息到性能调优的 16 大模块


前言

作为 PostgreSQL DBA,日常巡检是保障数据库稳定运行的核心工作之一。本文参考开源巡检脚本 pg_health_check.sh v1.0,将其内嵌的 68 个 SQL 语句按功能模块系统化整理,每个 SQL 均附带详细注释说明、输出字段解读及告警阈值建议。

全文共 16 个巡检模块,覆盖系统信息、基础信息、空间使用、连接与会话、性能采样、共享缓冲区、VACUUM 与膨胀、事务与年龄、WAL 与归档、锁等待、索引分析、用户权限、安全配置、错误日志、备份策略、健康总结等维度。

巡检模块 SQL 数量 难度 执行频率
一、基础信息 8 每日
二、空间使用分析 5 每日
三、连接与会话分析 6 ⭐⭐ 每日/实时
四、性能采样分析 5 ⭐⭐⭐ 实时采样
五、共享缓冲区与后台写入 3 ⭐⭐ 每日
六、VACUUM 与膨胀分析 5 ⭐⭐⭐ 每日
七、数据库年龄与事务分析 5 ⭐⭐⭐ 每日
八、WAL 与归档分析 5 ⭐⭐ 每日
九、锁等待分析 4 ⭐⭐ 实时
十、索引分析 5 ⭐⭐ 每周
十一、用户与权限审计 5 每周
十二、安全配置检查 3 ⭐⭐ 每月
十三、错误与慢查询日志 2 ⭐⭐ 每日
十四、备份策略检查 0(OS 层) 每日
十五、健康检查汇总 7 ⭐⭐ 每日
合计 68 - -

一、PostgreSQL 基础信息采集

1.1 查询数据库版本

-- 输出版本号、编译选项、操作系统信息等完整字符串 SELECT version();

输出示例:

PostgreSQL 14.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514, 64-bit

1.2 查询实例唯一标识

-- system_identifier 是初始化时生成的全局唯一标识 -- 用于确认主备是否来自同一实例、恢复后标识是否变化 SELECT system_identifier FROM pg_control_system();

小贴士: 该值在 pg_control 文件中,实例整个生命周期不变,可用于校验 PITR 恢复后的实例身份。

1.3 数据库列表及大小

-- datname: 数据库名 -- size: 可读格式的大小 -- conn_limit: 连接数上限 (-1 表示无限制) -- age: 事务 ID 年龄(越大越接近回卷风险) -- age_remain: 距离 2^31 上限的剩余空间 SELECT datname AS database_name, pg_size_pretty(pg_database_size(oid)) AS size, datconnlimit AS conn_limit, age(datfrozenxid) AS age, 2^31 - age(datfrozenxid) AS age_remain FROM pg_database ORDER BY pg_database_size(oid) DESC;

关键指标说明:

字段 含义 告警阈值
age 事务年龄 > 2 亿 (2×10⁸) 需关注
age_remain 剩余可用事务 ID < 5 亿需紧急处理

1.4 已安装扩展

-- 查看所有非系统内置扩展及其所属数据库和版本 SELECT d.datname AS database_name, e.extname AS extension_name, e.extversion AS version FROM pg_extension e JOIN pg_database d ON true WHERE e.extname NOT IN ('pg_catalog', 'information_schema') ORDER BY d.datname, e.extname;

1.5 用户定义的表空间

-- spcname: 表空间名称 -- location: 文件系统路径(NULL 表示默认表空间) -- size: 表空间总占用大小 SELECT spcname AS tablespace_name, pg_tablespace_location(oid) AS location, pg_size_pretty(pg_tablespace_size(oid)) AS size FROM pg_tablespace ORDER BY pg_tablespace_size(oid) DESC;

1.6 用户修改过的参数(非默认值)

-- 筛选 source 不为 default 和 override 的参数 -- 即用户通过 postgresql.conf、ALTER SYSTEM 或 ALTER SET 修改过的参数 SELECT name AS parameter, setting AS current_value, unit, boot_val AS default_value, source FROM pg_settings WHERE source NOT IN ('default', 'override') ORDER BY name;

DBA 必看: 这是快速了解实例有哪些自定义配置的最佳方式,建议每次巡检必查。

1.7 数据库级或角色级定制参数

-- 通过 ALTER DATABASE ... SET 或 ALTER ROLE ... SET 设置的参数 -- 这些参数仅对特定数据库或角色生效 SELECT d.datname AS database_name, r.rolname AS role_name, p.setconfig AS settings FROM pg_db_role_setting p LEFT JOIN pg_database d ON p.setdatabase = d.oid LEFT JOIN pg_roles r ON p.setrole = r.oid;

1.8 对象统计概览

-- 统计用户创建的各类对象数量 -- relkind: r=普通表, v=视图, i=索引, S=序列, m=物化视图 SELECT '普通表' AS type, COUNT(*) AS cnt FROM pg_class WHERE relkind = 'r' AND relnamespace > 11 UNION ALL SELECT '视图', COUNT(*) FROM pg_class WHERE relkind = 'v' AND relnamespace > 11 UNION ALL SELECT '索引', COUNT(*) FROM pg_class WHERE relkind = 'i' AND relnamespace > 11 UNION ALL SELECT '序列', COUNT(*) FROM pg_class WHERE relkind = 'S' AND relnamespace > 11 UNION ALL SELECT '物化视图', COUNT(*) FROM pg_class WHERE relkind = 'm' AND relnamespace > 11 UNION ALL SELECT '触发器', COUNT(*) FROM pg_trigger WHERE tgisinternal = false;

二、空间使用分析

2.1 表空间大小排名

-- 查看所有表空间的磁盘占用情况 SELECT spcname AS tablespace, pg_size_pretty(pg_tablespace_size(oid)) AS total_size, CASE WHEN pg_tablespace_location(oid) IS NULL THEN 'default location' ELSE pg_tablespace_location(oid) END AS location FROM pg_tablespace ORDER BY pg_tablespace_size(oid) DESC;

2.2 数据库大小排名

-- 排除模板库,按大小降序排列 SELECT datname AS database_name, pg_size_pretty(pg_database_size(oid)) AS size, pg_database_size(oid) AS size_bytes FROM pg_database WHERE datname NOT IN ('template0', 'template1') ORDER BY size_bytes DESC;

2.3 TOP 20 大表

-- 排除系统 schema,返回表大小和估算行数 -- reltuples 是优化器统计值,可能与实际行数有偏差(需 ANALYZE 后准确) SELECT n.nspname AS schema_name, c.relname AS table_name, pg_size_pretty(pg_relation_size(c.oid)) AS table_size, pg_relation_size(c.oid) AS size_bytes, c.reltuples::bigint AS estimated_rows FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE c.relkind = 'r' AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast') ORDER BY pg_relation_size(c.oid) DESC LIMIT 20;

2.4 TOP 20 大索引

-- 返回索引所属表、索引名、大小等信息 -- 通过 pg_index 关联索引与基表的关系 SELECT n.nspname AS schema_name, c.relname AS table_name, i.relname AS index_name, pg_size_pretty(pg_relation_size(i.oid)) AS index_size, pg_relation_size(i.oid) AS size_bytes FROM pg_class i JOIN pg_namespace n ON n.oid = i.relnamespace JOIN pg_index ix ON ix.indexrelid = i.oid JOIN pg_class c ON c.oid = ix.indrelid WHERE i.relkind = 'i' AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast') ORDER BY pg_relation_size(i.oid) DESC LIMIT 20;

2.5 表膨胀率概览(索引/表大小比)

-- index_to_table_ratio 反映索引占表的比例 -- 比例过高可能存在冗余索引,过低可能缺少必要索引 -- 仅统计大于 1MB 的表 SELECT n.nspname AS schema_name, c.relname AS table_name, pg_size_pretty(pg_relation_size(c.oid)) AS current_size, ROUND(100.0 * pg_relation_size(c.oid) / NULLIF(pg_total_relation_size(c.oid) - pg_relation_size(c.oid), 0), 2) AS index_to_table_ratio FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE c.relkind = 'r' AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast') AND pg_relation_size(c.oid) > 1024 * 1024 ORDER BY pg_relation_size(c.oid) DESC LIMIT 20;

三、连接与会话分析

3.1 当前连接状态汇总

-- state 可能值: active, idle, idle in transaction, -- idle in transaction (aborted), fastpath function call, disabled SELECT COALESCE(state, 'unknown') AS state, COUNT(*) AS connection_count FROM pg_stat_activity GROUP BY state ORDER BY connection_count DESC;

状态含义速查:

state 含义 是否需要关注
active 正在执行查询 正常
idle 空闲等待新命令 正常
idle in transaction 事务中空闲 ⚠️ 可能持有锁
idle in transaction (aborted) 事务中出错未回滚 🔴 需立即处理

3.2 连接详情(活跃前 20,按等待时间排序)

-- seconds_in_state: 在当前状态持续的时间(秒) -- current_query: 截取前 100 字符的当前 SQL SELECT pid, usename AS username, datname AS database, state, client_addr, application_name, EXTRACT(EPOCH FROM (now() - state_change))::int AS seconds_in_state, LEFT(REPLACE(query, chr(10), ' '), 100) AS current_query FROM pg_stat_activity WHERE state IS NOT NULL ORDER BY state_change ASC LIMIT 20;

3.3 连接数统计与余量

-- max_connections: 最大允许连接数 -- reserved_for_superuser: 为超级用户保留的连接槽位 -- available_for_normal: 普通用户剩余可用连接数 SELECT MAX(setting)::int AS max_connections, (SELECT COUNT(*) FROM pg_stat_activity) AS current_connections, (SELECT setting FROM pg_settings WHERE name = 'superuser_reserved_connections')::int AS reserved_for_superuser, MAX(setting)::int - (SELECT COUNT(*) FROM pg_stat_activity) AS available_for_normal FROM pg_settings WHERE name = 'max_connections';

告警建议:available_for_normal < 10 时应关注,< 0 则意味着普通用户已无法新建连接。

3.4 各用户连接数与限额

-- rolconnlimit: 角色级别的连接数限制 (-1 表示不限制) -- current_connects: 当前该用户的实际连接数 SELECT a.rolname AS username, a.rolconnlimit AS conn_limit, COALESCE(b.connects, 0) AS current_connects FROM pg_authid a LEFT JOIN ( SELECT usename, COUNT(*) AS connects FROM pg_stat_activity GROUP BY usename ) b ON a.rolname = b.usename WHERE a.rolname NOT IN ('postgres') ORDER BY b.connects DESC NULLS LAST;

3.5 各数据库连接数与限额

-- 类似用户维度,从数据库维度查看连接分布 SELECT a.datname AS database_name, a.datconnlimit AS conn_limit, COALESCE(b.connects, 0) AS current_connects FROM pg_database a LEFT JOIN ( SELECT datname, COUNT(*) AS connects FROM pg_stat_activity GROUP BY datname ) b ON a.datname = b.datname WHERE a.datname NOT IN ('template0', 'template1') ORDER BY b.connects DESC NULLS LAST;

3.6 长时间空闲连接(空闲超 5 分钟)

-- 空闲超过 5 分钟的连接可能是应用连接池泄漏 -- 可考虑配置 idle_in_transaction_session_timeout 自动清理 SELECT pid, usename AS username, datname AS database, client_addr, application_name, EXTRACT(EPOCH FROM (now() - state_change))::int AS idle_seconds FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '5 minutes' ORDER BY state_change ASC LIMIT 20;

四、性能采样分析(两次快照对比)

原理说明: 本模块通过前后两次采样 pg_stat_database 的差值计算实时 QPS、TPS 和命中率,类似 Oracle AWR 快照机制。

4.1 采样 — 缓冲区块访问总量

-- 总缓冲区访问次数(命中 + 未命中) -- 需间隔 10~30 秒采集两次,做差计算增量 SELECT blks_hit + blks_read FROM pg_stat_database WHERE datname = current_database();

4.2 采样 — 缓冲区命中次数

-- 仅缓冲区命中次数,与 4.1 配合计算命中率 SELECT blks_hit FROM pg_stat_database WHERE datname = current_database();

4.3 采样 — 元组变更总量

-- 插入 + 更新 + 删除的元组总数(累计值) SELECT (n_tup_ins + n_tup_upd + n_tup_del) FROM pg_stat_database WHERE datname = current_database();

4.4 采样 — 事务提交回滚总数

-- 提交 + 回滚的事务总数(累计值) SELECT (xact_commit + xact_rollback) FROM pg_stat_database WHERE datname = current_database();

使用方式: 间隔 10~30 秒后重复执行 4.1 ~ 4.4,通过差值计算:

  • Buffer Hit Ratio = hit_delta / blks_delta,建议 > 99%
  • TPS = xact_delta / interval_seconds
  • Tuple Changes/sec = tup_delta / interval_seconds

4.5 数据库级性能指标汇总

-- 单次查询即可获取当前数据库的关键性能累计指标 -- 注意是累计值,需两次采样做差才有实时意义 SELECT datname, xact_commit, xact_rollback, blks_read, blks_hit, ROUND(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_ratio, tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted, conflicts, deadlocks, blk_read_time::int AS read_ms, blk_write_time::int AS write_ms, temp_files, temp_bytes FROM pg_stat_database WHERE datname = current_database();

关键指标告警阈值:

指标 建议范围 告警条件
cache_hit_ratio > 99% < 95% 需增加 shared_buffers
deadlocks 0 > 0 需排查业务逻辑
conflicts 备库冲突过高需关注
temp_files 0 大量临时文件说明 work_mem 不足

五、共享缓冲区与后台写入(BGWriter)

5.1 内存与 IO 相关关键参数

-- 这些参数直接影响内存使用和磁盘写入策略 SELECT name, setting, unit, short_desc FROM pg_settings WHERE name IN ( 'shared_buffers', -- 共享缓冲区大小 'effective_cache_size', -- 优化器估计的可用缓存 'work_mem', -- 排序/Hash 操作内存上限 'maintenance_work_mem', -- 维护操作(VACUUM/CREATE INDEX)内存 'temp_buffers', -- 临时表使用的缓冲区 'max_connections', -- 最大连接数 'bgwriter_delay', -- BGWriter 轮询间隔(ms) 'bgwriter_lru_maxpages', -- BGWriter 每次 LRU 刷写的最大页数 'bgwriter_lru_multiplier', -- BGWriter 刷写倍率 'bgwriter_flush_after', -- BGWriter 累积刷写字节阈值 'checkpoint_completion_target', -- 检查点完成目标 'checkpoint_timeout', -- 检查点间隔时间 'min_wal_size', -- WAL 最小大小 'max_wal_size', -- WAL 最大大小 'wal_buffers' -- WAL 缓冲区大小 ) ORDER BY name;

参数调优建议:

参数 推荐值 说明
shared_buffers 物理内存的 25% 不超过 40%,Linux 上受 shmmax 限制
effective_cache_size 物理内存的 50%~75% 仅供优化器参考,不实际分配
work_mem (shared_buffers / max_connections) × 0.25 并发高时需适当降低
maintenance_work_mem 256MB~1GB VACUUM 和 CREATE INDEX 使用

5.2 BGWriter 统计信息

-- 查看后台写入进程的完整统计信息 -- checkpoints_timed: 计划内检查点次数 -- checkpoints_req: 请求触发的检查点次数(过多说明 WAL 写入过快) -- buffers_checkpoint: 检查点刷写的缓冲区数 -- buffers_clean: BGWriter 主动刷写的缓冲区数 -- buffers_backend: 后端进程自行刷写的缓冲区数(过高说明 BGWriter 不够积极) SELECT * FROM pg_stat_bgwriter;

进阶技巧: buffers_backend 过高意味着后端进程经常需要自己刷脏页,说明 BGWriter 不够积极。可适当调低 bgwriter_delay 或调高 bgwriter_lru_maxpages

5.3 数据库级缓冲区命中统计

-- 逐库查看缓存命中率、死锁、IO 时间等 SELECT s.datname AS database, s.numbackends AS backends, s.xact_commit AS commits, s.xact_rollback AS rollbacks, s.blks_read, s.blks_hit, ROUND(100.0 * s.blks_hit / NULLIF(s.blks_hit + s.blks_read, 0), 2) AS cache_hit_ratio, s.tup_returned, s.tup_fetched, s.tup_inserted, s.tup_updated, s.tup_deleted, s.conflicts, s.deadlocks, s.blk_read_time::int AS read_ms, s.blk_write_time::int AS write_ms, s.temp_files AS temp_file_count, s.temp_bytes AS temp_bytes FROM pg_stat_database s WHERE s.datname NOT IN ('template0', 'template1') ORDER BY s.datname;

六、VACUUM 与膨胀分析

6.1 Autovacuum 配置参数

-- 检查自动清理是否开启及相关阈值设置 SELECT name, setting, short_desc FROM pg_settings WHERE name IN ( 'autovacuum', -- 是否开启自动清理 'autovacuum_max_workers', -- 自动清理工作进程数 'autovacuum_naptime', -- 自动清理轮询间隔 'autovacuum_vacuum_threshold', -- 触发 VACUUM 的死元组数阈值 'autovacuum_vacuum_scale_factor',-- 触发 VACUUM 的死元组比例 'autovacuum_analyze_threshold', -- 触发 ANALYZE 的变更行数阈值 'autovacuum_analyze_scale_factor',-- 触发 ANALYZE 的变更比例 'autovacuum_vacuum_cost_delay', -- 自动清理的延迟(ms) 'autovacuum_vacuum_cost_limit', -- 自动清理的成本限制 'vacuum_cleanup_index_scale_factor' -- 索引清理的比例因子 ) ORDER BY name;

关键参数建议:

参数 默认值 建议值 说明
autovacuum on on 生产环境必须开启
autovacuum_vacuum_scale_factor 0.2 0.05~0.1 大表建议更低
autovacuum_max_workers 3 4~6 表多时适当增加
autovacuum_naptime 60s 30s~60s 高频写入库可缩短

6.2 表膨胀 Top15(需 VACUUM)

-- dead_ratio_pct: 死元组占活元组的百分比 -- 值过高说明 VACUUM 不及时,需要关注 SELECT schemaname, relname AS table_name, n_dead_tup AS dead_tuples, n_live_tup AS live_tuples, CASE WHEN n_live_tup > 0 THEN ROUND(100.0 * n_dead_tup / n_live_tup, 2) ELSE 0 END AS dead_ratio_pct, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count FROM pg_stat_user_tables WHERE n_dead_tup > 1000 ORDER BY n_dead_tup DESC LIMIT 15;

告警阈值: dead_ratio_pct > 10% 需要关注,> 50% 应立即 VACUUM。

6.3 未及时 VACUUM 的表

-- 超过 7 天未执行 VACUUM 且死元组 > 10000 的表 -- "从未 VACUUM" 的标记说明自动清理从未触及该表 SELECT schemaname, relname AS table_name, n_dead_tup, last_autovacuum, CASE WHEN last_autovacuum IS NULL AND last_vacuum IS NULL THEN '从未 VACUUM' ELSE '' END AS note FROM pg_stat_user_tables WHERE n_dead_tup > 10000 AND (last_autovacuum IS NULL OR last_autovacuum < now() - interval '7 days') AND (last_vacuum IS NULL OR last_vacuum < now() - interval '7 days') ORDER BY n_dead_tup DESC LIMIT 10;

6.4 表膨胀详细分析(膨胀倍数 Top10)

-- bloat_ratio 反映索引空间与表空间的比例 -- 比例异常高说明索引可能严重膨胀 -- 仅分析大于 10MB 的表 SELECT schemaname, relname AS table_name, pg_size_pretty(pg_relation_size(relid)) AS table_size, ROUND(100.0 * pg_relation_size(relid) / NULLIF(pg_total_relation_size(relid) - pg_relation_size(relid), 0), 2) AS bloat_ratio FROM pg_stat_user_tables WHERE pg_relation_size(relid) > 10 * 1024 * 1024 ORDER BY bloat_ratio DESC NULLS LAST LIMIT 10;

6.5 索引膨胀 Top10

-- 扫描次数少于 100 次但体积超过 1MB 的索引 -- avg_bytes_per_scan: 每次扫描平均消耗字节数,过大说明索引效率低 SELECT n.nspname AS schemaname, i.relname AS index_name, c.relname AS table_name, pg_size_pretty(pg_relation_size(i.oid)) AS index_size, s.idx_scan, s.idx_tup_read, s.idx_tup_fetch, pg_relation_size(i.oid) / NULLIF(s.idx_scan + 1, 0) AS avg_bytes_per_scan FROM pg_stat_user_indexes s JOIN pg_class i ON i.oid = s.indexrelid JOIN pg_class c ON c.oid = s.relid JOIN pg_namespace n ON n.oid = c.relnamespace WHERE s.idx_scan < 100 AND pg_relation_size(i.oid) > 1024 * 1024 AND n.nspname NOT IN ('pg_catalog', 'information_schema') ORDER BY pg_relation_size(i.oid) DESC LIMIT 10;

VACUUM 模块建议:

  • Dead tuples 过多会导致查询变慢,建议调整 autovacuum_vacuum_scale_factor(推荐 0.05~0.1)
  • 大表建议设置单独的 autovacuum_vacuum_cost_delay 或手工定期 VACUUM
  • 未使用索引占用空间且影响写入性能,建议评估后删除

七、数据库年龄与事务分析

7.1 数据库年龄

-- current_age: 事务 ID 年龄,即距离冻结点的距离 -- age_remain: 距离 2^31 上限的剩余事务数 -- 当 age 接近 2^31 时会发生事务 ID 回卷,数据库将只读 SELECT datname AS database_name, age(datfrozenxid) AS current_age, 2^31 - age(datfrozenxid) AS age_remain, CASE WHEN age(datfrozenxid) > 2^30 THEN '⚠️ 告警: 年龄即将耗尽' WHEN age(datfrozenxid) > 2^29 THEN '⚠️ 警告: 年龄接近上限' ELSE '正常' END AS status FROM pg_database WHERE datname NOT IN ('template0', 'template1') ORDER BY age(datfrozenxid) DESC;

🔴 XID 回卷风险是 PostgreSQL 最严重的故障之一!age_remain 接近 0 时,数据库将停止接受写入。务必定期检查此指标。

7.2 表年龄 Top10

-- 表级别的冻结年龄,age > 100 万才显示 -- last_vacuum_time: 最近一次 VACUUM(含自动)的时间 SELECT n.nspname AS schemaname, c.relname AS table_name, c.relkind, age(c.relfrozenxid) AS current_age, 2^31 - age(c.relfrozenxid) AS age_remain, GREATEST(s.last_vacuum, s.last_autovacuum) AS last_vacuum_time FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid WHERE c.relkind IN ('r', 't', 'm') AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast') AND age(c.relfrozenxid) > 1000000 ORDER BY age(c.relfrozenxid) DESC LIMIT 10;

7.3 长事务(运行超 30 分钟)

-- 长事务会阻止 VACUUM 清理死元组,是膨胀的主要诱因 -- backend_xid / backend_xmin: 表示该后端正在持有事务快照 SELECT pid, usename AS username, datname AS database, state, EXTRACT(EPOCH FROM (now() - xact_start))::int / 60 AS xact_minutes, EXTRACT(EPOCH FROM (now() - query_start))::int / 60 AS query_minutes, LEFT(REPLACE(query, chr(10), ' '), 120) AS query FROM pg_stat_activity WHERE state NOT IN ('idle') AND (backend_xid IS NOT NULL OR backend_xmin IS NOT NULL) AND xact_start < now() - interval '30 minutes' ORDER BY xact_start;

处理建议: 确认后可使用 SELECT pg_terminate_backend(pid) 终止长事务,同时排查应用代码中的事务边界问题。

7.4 两阶段提交(2PC)预提交事务

-- 2PC 事务长时间未提交会阻止 VACUUM,导致表膨胀 -- prepared_minutes: 预提交事务已等待的分钟数 SELECT transaction, gid, prepared, owner, database AS db_name, EXTRACT(EPOCH FROM (now() - prepared)) / 60 AS prepared_minutes FROM pg_prepared_xacts ORDER BY prepared;

7.5 复制槽位(可能导致 WAL 积压)

-- 不活跃的复制槽会阻止 WAL 文件回收,导致 pg_wal 膨胀 -- lag_bytes: 当前 WAL 位置与槽位重启 LSN 的字节差 SELECT slot_name, plugin, slot_type, COALESCE(d.datname, 'N/A') AS database_name, active, restart_lsn, confirmed_flush_lsn, CASE WHEN restart_lsn IS NOT NULL AND pg_is_in_recovery() = false THEN pg_size_pretty( pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) ) ELSE NULL END AS lag_bytes FROM pg_replication_slots rs LEFT JOIN pg_database d ON rs.database::oid = d.oid;

⚠️ 重要: active = false 的复制槽是最常见的 WAL 膨胀原因,必须及时清理:SELECT pg_drop_replication_slot('slot_name');


八、WAL 与归档分析

8.1 WAL 配置参数

-- 核心参数解读: -- wal_level: replica 支持流复制,logical 支持逻辑解码 -- archive_mode: on 才能实现 PITR -- max_wal_size: 控制 WAL 总量上限,过小导致频繁检查点 SELECT name, setting, short_desc FROM pg_settings WHERE name IN ( 'wal_level', 'max_wal_senders', 'max_replication_slots', 'wal_keep_size', 'min_wal_size', 'max_wal_size', 'archive_mode', 'archive_command', 'archive_timeout', 'wal_compression', 'wal_log_hints', 'wal_init_zero' ) ORDER BY name;

8.2 归档统计

-- archived_count: 已成功归档的 WAL 数 -- failed_count: 归档失败次数(非 0 需立即排查) -- last_archived_wal / last_archived_time: 最近归档信息 SELECT * FROM pg_stat_archiver;

8.3 当前 WAL 位置

-- current_lsn: 当前写入的 WAL 位置 -- current_wal_file: 当前正在写入的 WAL 文件名 -- bytes_in_wal: 自起始以来的 WAL 字节数 SELECT pg_current_wal_lsn() AS current_lsn, pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file, pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'::pg_lsn) AS bytes_in_wal;

8.4 流复制状态(主库视角)

-- 主库查看所有备库的复制延迟 -- lag_bytes: 主库已发送但备库尚未回放的 WAL 字节数 -- sync_state: sync=同步, async=异步, potential=候选同步 SELECT client_addr, usename, application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes, sync_state FROM pg_stat_replication;

备库视角:

-- 在备库上执行,查看 WAL 接收和回放状态 -- replay_lag: 接收到回放的延迟 SELECT pid, status, receive_start_lsn, received_lsn, last_msg_send_time, last_msg_receipt_time, latest_end_lsn, slot_name, pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag FROM pg_stat_wal_receiver;

九、锁等待分析

9.1 当前锁统计

-- 按 locktype / relation / mode 分组统计锁数量 -- locktype 类型: relation, extend, page, tuple, transactionid, virtualxid, object, userlock, advisory SELECT locktype, database::regclass AS database, relation::regclass AS relation, page, tuple, virtualxid, transactionid, granted, mode, COUNT(*) AS lock_count FROM pg_locks WHERE locktype NOT IN ('virtualxid') GROUP BY 1,2,3,4,5,6,7,8,9 ORDER BY lock_count DESC, granted LIMIT 30;

9.2 等待中的锁(谁阻塞了谁)

-- 找到被阻塞的会话及其阻塞源 -- waiting_pid: 被阻塞的会话 PID -- blocking_pid: 持有锁的会话 PID SELECT w.pid AS waiting_pid, w.locktype AS waiting_locktype, COALESCE(w.relation::regclass::text, w.transactionid::text, w.virtualxid) AS waiting_object, w.mode AS waiting_mode, w.granted AS waiting_granted, b.pid AS blocking_pid, b.usename AS blocking_user, LEFT(REPLACE(b.query, chr(10), ' '), 80) AS blocking_query FROM pg_locks w JOIN pg_stat_activity b ON b.pid = w.pid WHERE NOT w.granted ORDER BY w.pid;

9.3 长时间持锁(超 10 秒)

-- 已授予但持锁时间超 10 秒的会话 -- 可能影响其他事务的并发性能 SELECT l.pid, l.locktype, l.mode, l.relation::regclass, l.granted, a.usename, a.datname, a.query, EXTRACT(EPOCH FROM (now() - a.xact_start))::int AS xact_seconds, EXTRACT(EPOCH FROM (now() - a.query_start))::int AS query_seconds FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid WHERE l.granted AND a.query_start < now() - interval '10 seconds' ORDER BY a.query_start;

9.4 死锁检测

-- 查看各数据库的死锁累计次数 -- 仅显示死锁数 > 0 的数据库 SELECT datname, SUM(deadlocks)::int AS deadlock_count FROM pg_stat_database GROUP BY datname HAVING SUM(deadlocks) > 0;

死锁排查建议: 死锁通常是应用事务逻辑问题(交叉更新、缺少 SELECT FOR UPDATE 等),需在应用层修复。deadlocks 是累计值,重启后清零。


十、索引分析

10.1 未使用索引 Top20

-- idx_scan = 0 说明自上次统计重置后从未使用过 -- 体积 > 64KB 才显示(过滤小索引) -- 这些索引只占用空间、拖慢 DML,建议评估删除 SELECT n.nspname AS schemaname, c.relname AS table_name, i.relname AS index_name, pg_size_pretty(pg_relation_size(i.oid)) AS index_size, s.idx_scan, s.idx_tup_read, s.idx_tup_fetch FROM pg_stat_user_indexes s JOIN pg_class i ON i.oid = s.indexrelid JOIN pg_class c ON c.oid = s.relid JOIN pg_namespace n ON n.oid = c.relnamespace WHERE s.idx_scan = 0 AND pg_relation_size(i.oid) > 65536 AND i.relname NOT LIKE 'pg_toast_%' AND n.nspname NOT IN ('pg_catalog', 'information_schema') ORDER BY pg_relation_size(i.oid) DESC LIMIT 20;

10.2 索引使用率偏低(扫描 1~100 次但占用空间大)

-- bytes_per_scan: 每次扫描消耗的字节数 -- 体积 > 10MB 且扫描极少的索引性价比极低 SELECT n.nspname AS schemaname, c.relname AS table_name, i.relname AS index_name, pg_size_pretty(pg_relation_size(i.oid)) AS index_size, s.idx_scan, pg_relation_size(i.oid) / NULLIF(s.idx_scan, 0) AS bytes_per_scan FROM pg_stat_user_indexes s JOIN pg_class i ON i.oid = s.indexrelid JOIN pg_class c ON c.oid = s.relid JOIN pg_namespace n ON n.oid = c.relnamespace WHERE s.idx_scan BETWEEN 1 AND 100 AND pg_relation_size(i.oid) > 10 * 1024 * 1024 AND n.nspname NOT IN ('pg_catalog', 'information_schema') ORDER BY pg_relation_size(i.oid) DESC LIMIT 10;

10.3 重复索引

-- 查找同一列集上有多个相同定义的非唯一、非主键索引 -- 重复索引纯属浪费,应删除多余的那个 SELECT t.relname AS table_name, array_agg(DISTINCT i.indexrelid::regclass) AS duplicate_indexes, pg_get_indexdef(i.indexrelid) AS index_definition, COUNT(*) AS count FROM pg_index i JOIN pg_class t ON t.oid = i.indrelid WHERE i.indisunique = false AND i.indisprimary = false GROUP BY t.relname, i.indexrelid, pg_get_indexdef(i.indexrelid) HAVING COUNT(*) > 1;

10.4 无效索引

-- indisvalid = false 的索引是 CONCURRENTLY 创建失败后遗留的 -- 不会被查询使用,但仍然占用磁盘空间 -- 处理: DROP INDEX CONCURRENTLY index_name; 然后重建 SELECT n.nspname AS schemaname, c.relname AS table_name, i.relname AS invalid_index_name, ix.indexrelid::regclass AS invalid_index FROM pg_index ix JOIN pg_class i ON i.oid = ix.indexrelid JOIN pg_class c ON c.oid = ix.indrelid JOIN pg_namespace n ON n.oid = c.relnamespace WHERE ix.indisvalid = false AND n.nspname NOT IN ('pg_catalog', 'information_schema');

10.5 索引过多的表(> 5 个索引)

-- 索引过多会加重写入开销和表膨胀 -- 建议单表索引数控制在 5 个以内 SELECT n.nspname AS schemaname, c.relname AS table_name, COUNT(*) AS index_count, pg_size_pretty(SUM(pg_relation_size(i.oid))) AS total_index_size, pg_size_pretty(pg_relation_size(c.oid)) AS table_size FROM pg_stat_user_indexes s JOIN pg_class i ON i.oid = s.indexrelid JOIN pg_class c ON c.oid = s.relid JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname NOT IN ('pg_catalog', 'information_schema') GROUP BY n.nspname, c.relname, c.oid HAVING COUNT(*) > 5 ORDER BY COUNT(*) DESC LIMIT 15;

十一、用户与权限审计

11.1 用户列表

-- rolsuper: 是否超级用户 -- rolcanlogin: 是否允许登录 -- rolvaliduntil: 密码有效期 -- password_status: 自动判断密码是否过期/即将过期 SELECT rolname AS username, rolsuper AS is_superuser, rolinherit AS inherits, rolcreaterole AS can_create_role, rolcreatedb AS can_create_db, rolcanlogin AS can_login, rolreplication AS is_replication, rolconnlimit AS conn_limit, rolvaliduntil AS valid_until, CASE WHEN rolvaliduntil < now() THEN '⚠️ 已过期' WHEN rolvaliduntil < now() + interval '30 days' THEN '⚠️ 即将过期' ELSE '正常' END AS password_status FROM pg_roles WHERE rolname NOT IN ('pg_read_all_settings', 'pg_read_all_stats', 'pg_stat_scan_tables', 'pg_signal_backend') ORDER BY rolcanlogin DESC, rolname;

11.2 超级用户检查

-- 生产环境应尽量减少超级用户数量 -- 应用账号不应使用 superuser 权限 SELECT rolname AS superuser_name, oid, rolvaliduntil FROM pg_authid WHERE rolsuper = true ORDER BY rolname;

11.3 密码过期检查

-- 检查所有可登录用户的密码有效期 -- 永不过期(NULL) 的用户建议设置有效期 SELECT rolname AS username, rolvaliduntil, CASE WHEN rolvaliduntil IS NULL THEN '永不过期' WHEN rolvaliduntil < now() THEN '⚠️ 已过期' WHEN rolvaliduntil < now() + interval '30 days' THEN '⚠️ 30天内过期' ELSE '正常' END AS status FROM pg_roles WHERE rolcanlogin = true AND rolname != 'postgres' ORDER BY rolvaliduntil;

11.4 public 权限检查

-- 检查是否有对 public 开放了非默认权限 -- public 权限过宽是常见安全隐患 SELECT n.nspname AS schema_name, c.relname AS object_name, c.relkind AS object_type, '非默认权限' AS privilege_type FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace JOIN pg_default_acl d ON d.defaclnamespace = n.oid WHERE d.defaclacl::text ~ '.*=.*/.*' AND n.nspname NOT IN ('pg_catalog', 'information_schema');

11.5 默认权限检查

-- 查看 ALTER DEFAULT PRIVILEGES 设置的默认权限 -- 确保新建对象不会自动继承过宽权限 SELECT n.nspname, d.defaclnamespace::regnamespace AS schema_name, d.defaclobjtype, d.defaclacl FROM pg_default_acl d JOIN pg_namespace n ON n.oid = d.defaclnamespace WHERE n.nspname NOT IN ('pg_catalog', 'information_schema') LIMIT 20;

十二、安全配置检查

12.1 安全相关参数

-- SSL、密码加密、日志审计等安全参数一览 SELECT name, setting, short_desc FROM pg_settings WHERE name IN ( 'ssl', 'ssl_ca_file', 'ssl_cert_file', 'ssl_key_file', 'ssl_ciphers', 'password_encryption', 'bcrypt_rounds', 'shared_preload_libraries', 'log_connections', 'log_disconnections', 'log_duration', 'log_line_prefix', 'log_statement', 'log_min_duration_statement', 'log_checkpoints', 'log_lock_waits', 'log_temp_files', 'row_security', 'enable_row_security' ) ORDER BY name;

12.2 SSL 配置检查

-- 当 ssl = on 时查看 SSL 证书配置 -- 生产环境强烈建议开启 SSL 加密 SELECT name, setting FROM pg_settings WHERE name LIKE 'ssl_%' AND setting != '';

12.3 密码加密方式

-- password_encryption 应为 scram-sha-256 -- md5 已不安全,新版本默认 scram-sha-256 -- 升级步骤: SET password_encryption = 'scram-sha-256'; 然后 ALTER USER xxx PASSWORD 'xxx'; SELECT name, setting FROM pg_settings WHERE name = 'password_encryption';

安全加固清单:

检查项 推荐值 风险等级
ssl on 🔴 生产必须
password_encryption scram-sha-256 🔴 md5 已不安全
log_connections on 🟡 审计需要
log_disconnections on 🟡 审计需要
log_statement ddl 🟡 至少记录 DDL
log_min_duration_statement 1000 (ms) 🟢 慢查询审计

十三、错误与慢查询日志

13.1 日志配置参数

-- logging_collector: 是否开启日志收集器 -- log_min_duration_statement: 慢查询阈值(-1 表示关闭) -- log_rotation_age/size: 日志轮转策略 SELECT name, setting, short_desc FROM pg_settings WHERE name IN ( 'log_destination', 'logging_collector', 'log_directory', 'log_filename', 'log_rotation_age', 'log_rotation_size', 'log_min_messages', 'log_min_error_statement', 'log_min_duration_statement', 'log_connections', 'log_disconnections', 'log_duration', 'log_line_prefix', 'log_statement', 'log_temp_files', 'log_lock_waits', 'log_checkpoints', 'log_autovacuum_min_duration' ) ORDER BY name;

13.2 pg_stat_statements Top 20(需安装扩展)

-- 按 total_exec_time 降序排列,定位最耗时的 SQL -- calls: 执行次数 -- mean_exec_time: 平均执行时间(ms) -- shared_blks_read/hit: 共享缓冲区读写块数 -- 需先: CREATE EXTENSION pg_stat_statements; -- 并在 shared_preload_libraries 中加入 pg_stat_statements SELECT LEFT(query, 150) AS query, calls, total_exec_time, mean_exec_time, max_exec_time, rows, shared_blks_hit, shared_blks_read, shared_blks_written FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;

安装提示: 需在 postgresql.conf 中设置 shared_preload_libraries = 'pg_stat_statements',然后重启实例并执行 CREATE EXTENSION pg_stat_statements;


十四、备份策略检查

本模块为 OS 层面检查,无 SQL 脚本,通过 shell 命令验证备份状态。

检查清单:

检查项 方法 说明
备份目录 ls -la /data/backup/ 确认备份文件存在且更新
最近备份时间 stat /data/backup/latest.dump 48 小时内有新备份
crontab 定时任务 crontab -l \| grep pg_dump 确认有定时备份
WAL 归档 ls $PGDATA/pg_wal/archive_status/ 无 .ready 残留
备份恢复测试 定期执行 PITR 演练 至少每季度一次

十五、健康检查汇总

以下 7 条汇总 SQL 快速评估实例整体健康状态。

15.1 Autovacuum 是否开启

SELECT setting FROM pg_settings WHERE name = 'autovacuum'; -- 期望值: on

15.2 归档模式是否开启

SELECT setting FROM pg_settings WHERE name = 'archive_mode'; -- 期望值: on(生产环境)

15.3 SSL 是否启用

SELECT setting FROM pg_settings WHERE name = 'ssl'; -- 期望值: on(生产环境)

15.4 密码加密方式

SELECT setting FROM pg_settings WHERE name = 'password_encryption'; -- 期望值: scram-sha-256

15.5 连接数余量

SELECT MAX(setting)::int AS max_connections, (SELECT COUNT(*) FROM pg_stat_activity) AS current_connections FROM pg_settings WHERE name = 'max_connections'; -- current / max < 80% 为健康

15.6 缓冲区命中率

SELECT ROUND(100.0 * SUM(blks_hit) / NULLIF(SUM(blks_hit) + SUM(blks_read), 0), 2) AS hit_ratio_pct FROM pg_stat_database WHERE datname = current_database(); -- > 99% 优秀,> 95% 正常,< 90% 需优化

15.7 死锁次数

SELECT SUM(deadlocks)::int AS total_deadlocks FROM pg_stat_database; -- 期望值: 0(累计值,重启后清零)

健康检查汇总速查表

检查项 正常值 告警条件 处理建议
Autovacuum on off 立即开启
Archive Mode on off 开启归档支持 PITR
SSL on off 生产环境必须开启
密码加密 scram-sha-256 md5 升级加密方式
连接数使用率 < 80% > 90% 增加 max_connections 或优化连接池
Buffer 命中率 > 99% < 95% 增加 shared_buffers 或优化查询
死锁次数 0 > 0 排查应用事务逻辑
数据库年龄 < 10 亿 > 15 亿 执行 VACUUM FREEZE
复制槽 Lag < 1GB > 10GB 检查备库状态或清理无效槽
未归档 WAL 0 > 0 排查归档命令

总结 & 注意事项

  1. 执行权限:本文所有 SQL 均需要超级用户(postgres)或 pg_read_all_stats 角色权限才能获取完整信息
  2. 性能影响:所有查询均基于统计视图(pg_stat_*pg_settings),对生产库无性能影响;仅性能采样模块需要 10~30 秒间隔
  3. 版本差异:部分视图在 PostgreSQL 12 以下版本不可用(如 pg_stat_wal_receiver 在 9.6+ 可用,pg_control_system() 在 12+ 可用)
  4. 统计重置pg_stat_* 视图为累计值,实例重启后清零;可使用 pg_stat_reset() 手动重置
  5. XID 回卷:这是 PostgreSQL 特有的严重风险,Oracle/MySQL DBA 转型时需重点关注,务必确保 autovacuum 正常运行
  6. WAL 膨胀:不活跃的复制槽和失败的归档是 WAL 膨胀的两大主因,巡检时重点检查
  7. 连接泄漏idle in transaction 是 PG 特有的问题(Oracle 中不存在),建议配置 idle_in_transaction_session_timeout
  8. 定期巡检:建议每日自动执行巡检脚本,周报对比关键指标变化趋势

版本差异说明:

  • PostgreSQL 12+:所有 SQL 均可直接使用
  • PostgreSQL 9.6~11:pg_control_system() 不可用,需用 pg_controldata 命令替代;wal_keep_size 参数名为 wal_keep_segments
  • PostgreSQL 14+:pg_stat_statementstotal_time 更名为 total_exec_time
最后修改时间:2026-04-28 09:40:57
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

文章被以下合辑收录

评论