记一次 1000+每秒 UPDATE 引发的 Redo 风暴:AWR 定位 log file switch (checkpoint incomplete) 并落地解决

一、问题现象
12 月 8 日上午 11 点多,监控告警提示生产数据库磁盘使用率超过 90%。初步判断:磁盘使用突增,需要关注数据库整体负载与日志增长情况。
二、问题排查
2.1 操作系统磁盘空间
查看 OS 空间与 Oracle 目录占用,发现归档/恢复区相关目录增长较快(fast_recovery_area 约 43G)。
[root@md-db-out ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_root-lv_root 461G 399G 39G 92% /
tmpfs 32G 160K 32G 1% /dev/shm
/dev/sda1 485M 39M 421M 9% /boot
[root@md-db-out ~]# su - oracle
[oracle@md-db-out ~]$ cd /u01/app/oracle/
[oracle@md-db-out oracle]$ du -h --max-depth=1
4.0K ./checkpoints
55G ./diag
4.6G ./admin
70G ./oradata
43G ./fast_recovery_area
1.8M ./cfgtoollogs
4.4G ./product
177G .
[oracle@md-db-out oracle]$
虽然oradata目录不小但故障前没有添加过表空间数据文件,所以这次告警更像是“归档/恢复区增长过快”触发
2.2 归档日志切换频率异常
归档日志增长过快后,进一步统计日志切换频率,发现上午 9 点开始切换次数从 11 次 → 755 次(提升近 68 倍)。
[oracle@md-db-out oracle]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Mon Dec 8 11:27:01 2025
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> set lines 300
SQL> column h0 format 999
column h1 format 999
column h2 format 999
column h3 format 999
column h4 format 999
column h5 format 999
column h6 format 999
column h7 format 999
column h8 format 999
column h9 format 999
column h10 format 999
column h11 format 999
column h12 format 999
column h13 format 999
column h14 format 999
column h15 format 999
column h16 format 999
column h17 format 999
column h18 format 999
column h19 format 999
column h20 format 999
column h21 format 999
column h22 format 999
column h23 format 999
column avg format 999.99
column day format a6
SELECT TRUNC (first_time) "Date", TO_CHAR (first_time, 'Dy') "Day", COUNT (1) "Total",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '00', 1, 0)) h0,
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '01', 1, 0)) "h1",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '02', 1, 0)) "h2",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '03', 1, 0)) "h3",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '04', 1, 0)) "h4",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '05', 1, 0)) "h5",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '06', 1, 0)) "h6",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '07', 1, 0)) "h7",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '08', 1, 0)) "h8",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '09', 1, 0)) "h9",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '10', 1, 0)) "h10",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '11', 1, 0)) "h11",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '12', 1, 0)) "h12",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '13', 1, 0)) "h13",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '14', 1, 0)) "h14",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '15', 1, 0)) "h15",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '16', 1, 0)) "h16",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '17', 1, 0)) "h17",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '18', 1, 0)) "h18",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '19', 1, 0)) "h19",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '20', 1, 0)) "h20",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '21', 1, 0)) "h21",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '22', 1, 0)) "h22",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '23', 1, 0)) "h23", to_char(ROUND (COUNT (1) / 24, 2),'fm99999999990.00') "Avg"
FROM gv$log_history
WHERE first_time >= trunc(SYSDATE) - 7
and thread# = inst_id
GROUP BY TRUNC (first_time), TO_CHAR (first_time, 'Dy')
ORDER BY 1 DESC;
SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Date Day Total H0 h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13 h14 h15 h16 h17 h18 h19 h20 h21 h22 h23 Avg
------------------- ------ ---------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---------------
2025-12-08 00:00:00 Mon 1823 48 6 8 7 8 7 8 7 11 755 884 74 0 0 0 0 0 0 0 0 0 0 0 0 75.96
2025-12-07 00:00:00 Sun 480 38 8 9 8 6 8 9 7 9 9 9 8 8 8 9 8 9 8 8 9 7 7 10 261 20.00
2025-12-06 00:00:00 Sat 520 32 8 8 7 7 10 9 9 9 9 42 9 9 9 8 9 9 9 8 9 9 8 8 266 21.67
2025-12-05 00:00:00 Fri 568 33 9 11 8 7 9 9 9 9 19 19 18 13 11 15 17 23 16 11 10 7 9 10 266 23.67
2025-12-04 00:00:00 Thu 581 42 9 10 9 8 9 9 8 12 24 18 17 11 24 16 18 20 14 12 9 9 8 11 254 24.21
2025-12-03 00:00:00 Wed 585 55 8 10 8 7 10 9 10 12 14 19 20 11 13 19 20 20 17 11 10 8 9 10 255 24.38
2025-12-02 00:00:00 Tue 603 48 9 10 7 7 10 8 8 12 31 18 21 11 12 22 32 21 27 20 14 8 10 10 227 25.13
2025-12-01 00:00:00 Mon 554 51 7 7 6 7 7 8 7 10 14 18 17 11 13 16 18 17 17 14 8 8 9 11 253 23.08
8 rows selected.
SQL>
三、AWR报告分析
拉取10:00~11:00的AWR报告
SQL> @$ORACLE_HOME/rdbms/admin/awrrpt.sql
3.1 整体负载概况
DB Time:29,845s ≈ 497 分钟,AAS ≈ 29,845 / (59×60) ≈ 8.4 个活跃会话
DB CPU ≈ 6,541s → 每秒约 1.84 CPU 秒,用满 16 核只用了 约 11–12% 的 CPU
OS 负载:%busy ~ 11.8%,%idle ~ 88%,%iowait ~ 5.4%


结论:CPU 很空闲,系统总体是“等待型”而不是“CPU 打满型”。
3.2 Top Wait Events
查看Top Wait 事件,找到真正卡在哪里。Top 10 Foreground Events:
log file switch (checkpoint incomplete)
-
Waits:12,509
-
Total Wait Time:≈ 12,288s(占 DB time ~41%)
-
Avg wait:982 ms / 次
log file sync
- Waits:4,112,201
- Total Wait Time:≈ 6,929s(占 DB time ~23%)
- Avg wait:2 ms / 次
- 大约 1,160 次 commit/s 左右
Wait Class 汇总也印证了这一点:
- Configuration:45%(主要就是各种 log file switch、SQ contention 等)
- Commit:23%(log file sync)
- DB CPU:22%
- Concurrency:8.6%
- User I/O:7%
结论:绝大部分时间都耗在 redo / checkpoint 配置和并发控制上。

3.3 定位 SQL:高频 UPDATE
在 “SQL ordered by Elapsed Time” 中,你定位到 SQL ID:0vq0s6rm8fawn。它在 60 分钟内执行 3,763,703 次,约 1000+ 次/秒 UPDATE,直接带来大量 redo / 归档与频繁 log switch。

SQL如下:
UPDATE SOA_EX_DRUG SET DRUG_ID=:1 , CMM_NAME=:2 , CMMDT_NM=:3 , MED_CTGY_TYPE=:4 , NORM=:5 , MIN_PCKNG_CMPNY=:6 , APRV_NO=:7 , PRDCT_PLT=:8 , status=:9 , CNTRY_NAME=:10 , DEVICES=:11 , PRV_APRV_NO=:12 , CMM_NAME_PY=:13 , PRDCT_CREDIT=:14 , PRDCT_ID=:15 , PACK_QUANTITY=:16 , MGT_CAT_NAME=:17 , PROD_CAT_NAME=:18 , REGIST_FORM_NAME=:19 , CLASS_SUBCATALOG_V17_NAME=:20 , LEVEL1_PROD_CAT_V17_NAME=:21 , LEVEL2_PROD_CAT_V17_NAME=:22 , SPU_FLAG=:23 WHERE ID=:24
进一步用 dba_hist_sqlstat 验证:12/5 之前 executions 仅 2 万级,到了 12/8 突增到 300 万级(增长 167 倍)。
SQL> set linesize 300
SQL> select a.snap_id,b.begin_interval_time,a.executions_delta from dba_hist_sqlstat a,dba_hist_snapshot b where a.sql_id='0vq0s6rm8fawn' and a.snap_id=b.snap_id order by a.snap_id;
SNAP_ID BEGIN_INTERVAL_TIME EXECUTIONS_DELTA
---------- -------------------------------------------------------------------------------- ----------------
64968 02-DEC-25 02.00.04.126 PM 22446
64969 02-DEC-25 03.00.06.304 PM 35605
64989 03-DEC-25 11.00.53.713 AM 25956
65011 04-DEC-25 09.00.47.148 AM 22293
65018 04-DEC-25 04.00.05.309 PM 15736
65040 05-DEC-25 02.00.01.717 PM 5283
65042 05-DEC-25 04.00.06.122 PM 51054
65107 08-DEC-25 09.00.51.667 AM 3144734
65108 08-DEC-25 10.00.56.171 AM 3763703
65109 08-DEC-25 11.00.03.946 AM 318640
65110 08-DEC-25 12.00.35.037 PM 3151397
65111 08-DEC-25 01.00.43.053 PM 3999423
65112 08-DEC-25 02.00.29.114 PM 2629413
13 rows selected.
SQL>
一直到下午14点多查看oratop还是显示SQL Id:0vq0s6rm8fawn频繁执行。造成大量log file switch (checkpoint incomplete)等待。

结论:高频的更新SQL语句,导致log file switch (checkpoint incomplete)一小时内 12,509 次,平均每次接近 1 秒
四、问题解决
4.1 应用侧
第一步:与开发沟通,把高频 SQL 告知对方,推动应用修改。
应用修改后,日志切换频率正常。如下:
SQL> SELECT TRUNC (first_time) "Date", TO_CHAR (first_time, 'Dy') "Day", COUNT (1) "Total",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '00', 1, 0)) h0,
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '01', 1, 0)) "h1",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '02', 1, 0)) "h2",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '03', 1, 0)) "h3",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '04', 1, 0)) "h4",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '05', 1, 0)) "h5",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '06', 1, 0)) "h6",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '07', 1, 0)) "h7",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '08', 1, 0)) "h8",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '09', 1, 0)) "h9",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '10', 1, 0)) "h10",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '11', 1, 0)) "h11",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '12', 1, 0)) "h12",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '13', 1, 0)) "h13",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '14', 1, 0)) "h14",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '15', 1, 0)) "h15",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '16', 1, 0)) "h16",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '17', 1, 0)) "h17",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '18', 1, 0)) "h18",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '19', 1, 0)) "h19",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '20', 1, 0)) "h20",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '21', 1, 0)) "h21",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '22', 1, 0)) "h22",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '23', 1, 0)) "h23", to_char(ROUND (COUNT (1) / 24, 2),'fm99999999990.00') "Avg"
FROM gv$log_history
WHERE first_time >= trunc(SYSDATE) - 7
and thread# = inst_id
GROUP BY TRUNC (first_time), TO_CHAR (first_time, 'Dy')
ORDER BY 1 DESC; 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Date Day Total H0 h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13 h14 h15 h16 h17 h18 h19 h20 h21 h22 h23 Avg
------------------- ------ ---------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---------------
2025-12-10 00:00:00 Wed 333 56 9 10 7 7 8 7 9 18 22 20 16 13 12 16 18 19 16 10 7 8 7 9 9 13.88
2025-12-09 00:00:00 Tue 607 0 0 0 0 0 0 0 0 67 72 16 15 10 12 17 19 44 27 11 10 7 8 7 265 25.29
2025-12-08 00:00:00 Mon 4364 48 6 8 7 8 7 8 7 11 755 884 83 725 884 622 17 17 14 9 9 7 7 10 211 181.83
2025-12-07 00:00:00 Sun 480 38 8 9 8 6 8 9 7 9 9 9 8 8 8 9 8 9 8 8 9 7 7 10 261 20.00
2025-12-06 00:00:00 Sat 520 32 8 8 7 7 10 9 9 9 9 42 9 9 9 8 9 9 9 8 9 9 8 8 266 21.67
2025-12-05 00:00:00 Fri 568 33 9 11 8 7 9 9 9 9 19 19 18 13 11 15 17 23 16 11 10 7 9 10 266 23.67
2025-12-04 00:00:00 Thu 581 42 9 10 9 8 9 9 8 12 24 18 17 11 24 16 18 20 14 12 9 9 8 11 254 24.21
2025-12-03 00:00:00 Wed 585 55 8 10 8 7 10 9 10 12 14 19 20 11 13 19 20 20 17 11 10 8 9 10 255 24.38
SQL>
4.2 数据库侧:扩大 redo log size(从 50MB 扩到 2GB)
当时的 redo 配置:3 组、每组 50MB(默认)。然后新增 3 组 2GB(group 4/5/6),切换后删除旧组(1/2/3)。检查 checkpoint / mttr 参数,查看 FAST_START_MTTR_TARGET / LOG_CHECKPOINT_INTERVAL / LOG_CHECKPOINT_TIMEOUT避免设置得太小,导致频繁 checkpoint。
-- 1) 查看现状
SQL> select group#, bytes/1024/1024 mb, status from v$log;
GROUP# MB STATUS
---------- ---------- ----------------
1 50 CURRENT
2 50 INACTIVE
3 50 INACTIVE
SQL> col member for a80
SQL> select group#, member from v$logfile;
GROUP# MEMBER
---------- --------------------------------------------------------------------------------
1 /u01/app/oracle/oradata/mddb/redo01.log
2 /u01/app/oracle/oradata/mddb/redo02.log
3 /u01/app/oracle/oradata/mddb/redo03.log
-- 2) 新增大 redo
SQL> alter database add logfile group 4 '/u01/app/oracle/oradata/mddb/redo04.log' size 2g;
Database altered.
SQL> alter database add logfile group 5 '/u01/app/oracle/oradata/mddb/redo05.log' size 2g;
Database altered.
SQL> alter database add logfile group 6 '/u01/app/oracle/oradata/mddb/redo06.log' size 2g;
Database altered.
-- 3) 手工切换,确保旧组不再 ACTIVE/CURRENT,再删除
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
SQL> select group#, bytes/1024/1024 mb, status from v$log;
GROUP# MB STATUS
---------- ---------- ----------------
1 50 INACTIVE
2 50 INACTIVE
3 50 INACTIVE
4 2048 INACTIVE
5 2048 ACTIVE
6 2048 CURRENT
6 rows selected.
SQL> alter database drop logfile group 1;
Database altered.
SQL> alter database drop logfile group 2;
Database altered.
SQL> alter database drop logfile group 3;
Database altered.
-- 4) 验证
SQL> select group#, bytes/1024/1024 mb, status from v$log;
GROUP# MB STATUS
---------- ---------- ----------------
4 2048 INACTIVE
5 2048 ACTIVE
6 2048 CURRENT
-- 5) 检查参数,为默认,系统自动触发/控制 checkpoint
SQL> show parameter fast_start_mttr_target
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
fast_start_mttr_target integer 0
SQL> show parameter log_check
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
log_checkpoint_interval integer 0
log_checkpoint_timeout integer 1800
log_checkpoints_to_alert boolean FALSE
SQL>
五、总结
12 月 8 日 11 点生产库磁盘使用率超过 90%,排查发现主要由归档日志快速增长导致(/u01 下 fast_recovery_area 占用明显)。进一步查看日志切换,9 点后归档切换从 11 次飙升到 755 次。拉取 10:00–11:00 AWR 后确认系统属于“等待型”,主要等待集中在 log file switch (checkpoint incomplete)(12,509 次,约 41% DB time)和 log file sync(4,112,201 次,约 23% DB time)。同期 Top SQL 中 0vq0s6rm8fawn 在 60 分钟内执行 3,763,703 次(约 1000+ 次/秒),并且相较 12/5 执行次数突增 167 倍,是导致 redo/归档暴涨的关键诱因。处置上先通知开发处理高频 SQL,同时将 redo log 从 50MB 3 组扩容为 2GB 3 组并完成切换与清理,缓解日志切换与 checkpoint 压力。




