暂无图片
暂无图片
5
暂无图片
暂无图片
暂无图片

记录一次CPU使用率过高故障的分析与处理

原创 孙莹 2025-09-01
797

记录一次CPU使用率过高故障的分析与处理

_intel_CPU_sinking_into_a_hi.png

一、问题现象

下午接到系统运行卡顿的反馈后,首先对主机性能进行了检查,发现CPU使用率异常偏高,系统整体负载压力巨大(环境是CentOS7.9+Oracle11.2.0.4单机)。具体情况如下图所示:

20250901_144449_254.png

二、初步分析

通过sqlplus连接数据库后,查看当前会话等待事件,发现存在大量的resmgr:cpu quantum等待:

[oracle@qxdb ~]$ sqlplus / as sysdba SQL*Plus: Release 11.2.0.4.0 Production on Mon Sep 1 12:49:57 2025 Copyright (c) 1982, 2013, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> select event,count(*) from v$session group by event; EVENT COUNT(*) ---------------------------------------------------------------- ---------- SQL*Net message from client 51 resmgr:cpu quantum 398 Streams AQ: waiting for messages in the queue 2 read by other session 13 rdbms ipc message 16 smon timer 1 pmon timer 1 Streams AQ: qmn slave idle wait 1 latch free 2 SQL*Net message to client 1 Streams AQ: waiting for time management or cleanup tasks 1 EVENT COUNT(*) ---------------------------------------------------------------- ---------- Streams AQ: qmn coordinator idle wait 1 VKTM Logical Idle Wait 1 DIAG idle wait 2 VKRM Idle 1 15 rows selected. SQL>

resmgr:cpu quantum等待事件表明会话正在等待Oracle资源管理器(Resource Manager)分配CPU资源。

参考Oracle官方文档(MOS)的相关说明:

  • High “Resmgr:Cpu Quantum” Wait Events In 11g Even When Resource Manager Is Disabled [ID 949033.1]
  • 11g: Scheduler Maintenance Tasks or Autotasks [ID 756734.1]
  • Large Waits With The Wait Event “Resmgr:Cpu Quantum” [ID 806893.1]

三、初步处理

根据上述分析,尝试调整资源管理计划参数,并关闭自动任务以减少资源争用:

-- 检查当前资源管理计划 SQL> show parameter resource_manager_plan NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ resource_manager_plan string SCHEDULER[0x3009]:DEFAULT_MAIN -- 清空资源管理计划 TENANCE_PLAN SQL> alter system set resource_manager_plan=''; System altered. -- 清除维护窗口的资源计划 SQL> execute dbms_scheduler.set_attribute('WEEKNIGHT_WINDOW','RESOURCE_PLAN',''); PL/SQL procedure successfully completed. SQL> execute dbms_scheduler.set_attribute('WEEKEND_WINDOW','RESOURCE_PLAN',''); PL/SQL procedure successfully completed. SQL> execute dbms_scheduler.set_attribute('MONDAY_WINDOW','RESOURCE_PLAN',''); PL/SQL procedure successfully completed. SQL> execute dbms_scheduler.set_attribute('TUESDAY_WINDOW','RESOURCE_PLAN',''); PL/SQL procedure successfully completed. SQL> execute dbms_scheduler.set_attribute('WEDNESDAY_WINDOW','RESOURCE_PLAN',''); PL/SQL procedure successfully completed. SQL> execute dbms_scheduler.set_attribute('THURSDAY_WINDOW','RESOURCE_PLAN',''); PL/SQL procedure successfully completed. SQL> execute dbms_scheduler.set_attribute('FRIDAY_WINDOW','RESOURCE_PLAN',''); PL/SQL procedure successfully completed. SQL> execute dbms_scheduler.set_attribute('SATURDAY_WINDOW','RESOURCE_PLAN',''); PL/SQL procedure successfully completed. SQL> execute dbms_scheduler.set_attribute('SUNDAY_WINDOW','RESOURCE_PLAN',''); PL/SQL procedure successfully completed. -- 禁用自动空间顾问和SQL调优顾问任务 SQL> BEGIN DBMS_AUTO_TASK_ADMIN.DISABLE( client_name => 'auto space advisor', operation => NULL, window_name => NULL); END; / 2 3 4 5 6 7 PL/SQL procedure successfully completed. SQL> BEGIN DBMS_AUTO_TASK_ADMIN.DISABLE( client_name => 'sql tuning advisor', operation => NULL, window_name => NULL); END; / 2 3 4 5 6 7 PL/SQL procedure successfully completed. SQL> select client_name,status from dba_autotask_client; CLIENT_NAME STATUS ---------------------------------------------------------------- -------- auto optimizer stats collection ENABLED auto space advisor DISABLED sql tuning advisor DISABLED -- 重启数据库使配置生效 SQL> shut immediate Database closed. Database dismounted. ORACLE instance shut down. SQL> startup ORACLE instance started. Total System Global Area 7.2688E+10 bytes Fixed Size 2260728 bytes Variable Size 6979322120 bytes Database Buffers 6.5498E+10 bytes Redo Buffers 208642048 bytes Database mounted. Database opened. SQL> show parameter resource_manager_plan NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ resource_manager_plan string SQL>

然而,重启后CPU使用率仍然居高不下:

20250901_134814_315.png

四、深入分析

再次检查会话等待事件,发现等待类型已发生变化,出现了大量的latch freelatch: cache buffers chains等待:

SQL> select event,count(*) from v$session group by event; EVENT COUNT(*) ---------------------------------------------------------------- ---------- SQL*Net message from client 41 Streams AQ: waiting for messages in the queue 5 read by other session 3 null event 2 rdbms ipc message 15 latch: cache buffers chains 30 db file sequential read 7 pmon timer 1 smon timer 1 job scheduler coordinator slave wait 1 latch free 92 EVENT COUNT(*) ---------------------------------------------------------------- ---------- Streams AQ: qmn slave idle wait 1 SQL*Net message to client 1 Disk file operations I/O 1 Streams AQ: qmn coordinator idle wait 1 VKTM Logical Idle Wait 1 Streams AQ: waiting for time management or cleanup tasks 1 DIAG idle wait 2 18 rows selected. SQL>

通过关联SQL_ID定位到具体问题语句:

SQL> select event,sql_id,count(1) from v$session where event='latch free' group by event,sql_id 2 ; EVENT SQL_ID ---------------------------------------------------------------- ------------- COUNT(1) ---------- latch free d4guxxn00hu01 1 latch free gm684ycvk15qd 1 latch free 0z45dukpd4nan 144 EVENT SQL_ID ---------------------------------------------------------------- ------------- COUNT(1) ---------- latch free gptqjj4hx10rm 1 SQL> select sql_text from gv$sql where sql_id='0z45dukpd4nan'; SQL_TEXT -------------------------------------------------------------------------------- SELECT COUNT(1) FROM SCM_INOUT_LOT_HISTORY_EXT WHERE STATDATE = TRUNC(LAST_DAY(A DD_MONTHS(SYSDATE,-1))) AND COMPID = :B2 AND OWNERID = :B1 SQL>

同时发现数据库中存在大量作业运行:

SQL> select count(1) from dba_jobs; COUNT(1) ---------- 6201 SQL> select count(1) from dba_jobs_runnings; COUNT(1) ---------- 16 SQL>

20250901_job.png
分析问题SQL的执行计划,发现对SCM_INOUT_LOT_HISTORY_EXT表(约2亿条记录)进行了全表扫描:

SQL> conn MED_GKCS Enter password: Connected. SQL> set autotrace trace exp SQL> SELECT COUNT(1) FROM SCM_INOUT_LOT_HISTORY_EXT WHERE STATDATE = TRUNC(LAST_DAY(ADD_MONTHS(SYSDATE,-1))) AND COMPID = 2 AND OWNERID = 1; Execution Plan ---------------------------------------------------------- Plan hash value: 343819289 -------------------------------------------------------------------------------- ---------------- | Id | Operation | Name | Rows | Bytes | Cost (% CPU)| Time | -------------------------------------------------------------------------------- ---------------- | 0 | SELECT STATEMENT | | 1 | 14 | 10820 (1)| 00:02:10 | | 1 | SORT AGGREGATE | | 1 | 14 | | | |* 2 | TABLE ACCESS FULL| SCM_INOUT_LOT_HISTORY_EXT | 1 | 14 | 10820 (1)| 00:02:10 | -------------------------------------------------------------------------------- ---------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter("OWNERID"=1 AND "STATDATE"=TRUNC(LAST_DAY(ADD_MONTHS(SYSDATE@!,-1) )) AND "COMPID"=2) SQL> set autot off SQL> SELECT COUNT(1) FROM SCM_INOUT_LOT_HISTORY_EXT; COUNT(1) ---------- 195323892 SQL>

五、根本原因与最终处理

综合以上分析,确定故障的根本原因:

  • 开发人员误操作导致作业异常重复创建,产生大量并发进程

  • 关键SQL语句缺乏合适索引,导致全表扫描

  • 大量并发全表扫描操作引发缓冲链闩锁争用

采取以下处理措施:

1.紧急重启数据库并停止监听:

SQL> conn / as sysdba Connected. SQL> shut immediate Database closed. Database dismounted. ORACLE instance shut down. SQL> exit Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options [oracle@qxdb ~]$ lsnrctl stop LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 01-SEP-2025 13:30:31 Copyright (c) 1991, 2013, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=172.16.168.135)(PORT=1521))) The command completed successfully [oracle@qxdb ~]$

2.创建缺失索引并更新统计信息:

[oracle@qxdb ~]$ sqlplus / as sysdba SQL*Plus: Release 11.2.0.4.0 Production on Mon Sep 1 13:30:36 2025 Copyright (c) 1982, 2013, Oracle. All rights reserved. Connected to an idle instance. SQL> startup ORACLE instance started. Total System Global Area 7.2688E+10 bytes Fixed Size 2260728 bytes Variable Size 6979322120 bytes Database Buffers 6.5498E+10 bytes Redo Buffers 208642048 bytes Database mounted. Database opened. SQL> conn MED_GKCS Enter password: Connected. SQL> create index idx_scm_inout_lot_hist_ext_st on SCM_INOUT_LOT_HISTORY_EXT(STATDATE); Index created. SQL> exec dbms_stats.gather_table_stats('MED_GKCS','SCM_INOUT_LOT_HISTORY_EXT',cascade=>true); PL/SQL procedure successfully completed. SQL>

3.验证执行计划已改用索引:

SQL> set autotrace trace exp SQL> SELECT COUNT(1) FROM SCM_INOUT_LOT_HISTORY_EXT WHERE STATDATE = TRUNC(LAST_DAY(ADD_MONTHS(SYSDATE,-1))) AND COMPID = 2 AND OWNERID = 1; Execution Plan ---------------------------------------------------------- Plan hash value: 651601031 -------------------------------------------------------------------------------- ------------------------------ | Id | Operation | Name | Rows | B ytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------- ------------------------------ | 0 | SELECT STATEMENT | | 1 | 14 | 7753 (1)| 00:01:34 | | 1 | SORT AGGREGATE | | 1 | 14 | | | |* 2 | TABLE ACCESS BY INDEX ROWID| SCM_INOUT_LOT_HISTORY_EXT | 1 | 14 | 7753 (1)| 00:01:34 | |* 3 | INDEX RANGE SCAN | IDX_SCM_INOUT_LOT_HIST_EXT_ST | 542K| | 1467 (1)| 00:00:18 | -------------------------------------------------------------------------------- ------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter("OWNERID"=1 AND "COMPID"=2) 3 - access("STATDATE"=TRUNC(LAST_DAY(ADD_MONTHS(SYSDATE@!,-1)))) SQL> exit Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options [oracle@qxdb ~]$

4.重启监听并清理异常作业:

[oracle@qxdb ~]$ lsnrctl start LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 01-SEP-2025 13:59:38 Copyright (c) 1991, 2013, Oracle. All rights reserved. Starting /app/oracle/product/11.2.0.4/db_1/bin/tnslsnr: please wait... TNSLSNR for Linux: Version 11.2.0.4.0 - Production System parameter file is /app/oracle/product/11.2.0.4/db_1/network/admin/listener.ora Log messages written to /app/oracle/diag/tnslsnr/qxdb/listener/alert/log.xml Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.168.135)(PORT=1521))) Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=172.16.168.135)(PORT=1521))) STATUS of the LISTENER ------------------------ Alias LISTENER Version TNSLSNR for Linux: Version 11.2.0.4.0 - Production Start Date 01-SEP-2025 13:59:38 Uptime 0 days 0 hr. 0 min. 0 sec Trace Level off Security ON: Local OS Authentication SNMP OFF Listener Parameter File /app/oracle/product/11.2.0.4/db_1/network/admin/listener.ora Listener Log File /app/oracle/diag/tnslsnr/qxdb/listener/alert/log.xml Listening Endpoints Summary... (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.168.135)(PORT=1521))) Services Summary... Service "csqxcs" has 1 instance(s). Instance "csqxcs", status UNKNOWN, has 1 handler(s) for this service... The command completed successfully [oracle@qxdb ~]$ -- 联系开发人员确认并清理异常作业 [oracle@qxdb ~]$ sqlplus / as sysdba SQL*Plus: Release 11.2.0.4.0 Production on Mon Sep 1 14:00:01 2025 Copyright (c) 1982, 2013, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> select count(1) from dba_jobs_running; COUNT(1) ---------- 0 SQL> select count(1) from dba_jobs; COUNT(1) ---------- 16 SQL>

六、处理结果

系统恢复正常运行,CPU使用率降至正常水平:

20250901_165355_776.png

七、经验总结

本次CPU使用率过高故障是由多方面因素共同导致的:

  • 直接原因:开发误操作导致作业异常重复创建,产生大量并发进程

  • 性能瓶颈:关键SQL缺乏适当索引,导致全表扫描和缓冲链闩锁争用

  • 加剧因素:资源管理器配置可能放大了资源争用问题

处理经验:

  1. 系统性能问题需要综合多方面因素进行分析,不能仅关注表面现象
  2. 等待事件分析是定位数据库性能问题的重要方法
  3. 对于大规模数据表,适当的索引设计至关重要
  4. 建立健全的监控体系,能够及时发现作业异常等异常情况
  5. 重要的配置变更和作业调度应建立严格的审核机制

以上就是本次故障处理全过程,分享出来希望能给各位小伙伴一些帮助☺️。

最后修改时间:2025-09-02 10:11:38
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论