暂无图片
暂无图片
7
暂无图片
暂无图片
暂无图片

记录一次CPU使用率过高故障分析SQL解决案例

原创 孙莹 2024-06-20
916

06204.jpeg

今天处理一个阿里云ECS上自建Oracle数据库,CPU飙升到100%的问题,把整个过程分享给各位小伙伴,希望能帮助到您。

问题现象

早上接到业务系统运维人员电话,告知系统很卡无法运行。问题数据库是一台在阿里云ECS上自建的Oracle数据库。登录到操作系统上排查,先把操作系统环境看一下。4cpu16G内存100G的磁盘还有剩余。操作系统环境基本上没有问题。

Last login: Thu Jun 20 09:03:44 2024 from 223.166.7.46 Welcome to Alibaba Cloud Elastic Compute Service ! [root@iZctgmtjcrowglZ ~]# su - oracle [oracle@iZctgmtjcrowglZ ~]$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 79 Model name: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz Stepping: 1 CPU MHz: 2499.996 BogoMIPS: 4999.99 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 40960K NUMA node0 CPU(s): 0-3 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt nt_good arat [oracle@iZctgmtjcrowglZ ~]$ free -h total used free shared buff/cache available Mem: 15G 614M 5.3G 430M 9.5G 14G Swap: 8.0G 0B 8.0G [oracle@iZctgmtjcrowglZ ~]$ df -h Filesystem Size Used Avail Use% Mounted on devtmpfs 7.7G 0 7.7G 0% /dev tmpfs 7.8G 431M 7.3G 6% /dev/shm tmpfs 7.8G 828K 7.8G 1% /run tmpfs 7.8G 0 7.8G 0% /sys/fs/cgroup /dev/vda1 99G 72G 23G 76% / tmpfs 1.6G 0 1.6G 0% /run/user/54321 tmpfs 1.6G 0 1.6G 0% /run/user/0 [oracle@iZctgmtjcrowglZ ~]$

用oratop检查数据库发现CPU使用率100%,大部分都在等待db file sequential read,再用查看前十位使用CPU的进程,发现很多查询SQL的进程耗CPU。都指向了5w3u3qw4sp938这个sql_id。

6201.jpg

[oracle@iZctgmtjcrowglZ ~]$ ps aux|head -1;ps aux|grep -v PID|sort -rn -k +3|head -10 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND oracle 5431 69.1 1.3 6559252 211708 ? Rs 08:38 39:11 oracleorcl (LOCAL=NO) oracle 6075 62.6 0.8 6558228 137344 ? Rs 08:48 29:13 oracleorcl (LOCAL=NO) oracle 5447 60.9 1.2 6560788 195392 ? Rs 08:38 34:21 oracleorcl (LOCAL=NO) oracle 6687 53.7 0.8 6558232 141320 ? Ss 08:57 19:47 oracleorcl (LOCAL=NO) oracle 6077 52.4 1.6 6562856 268668 ? Rs 08:48 24:26 oracleorcl (LOCAL=NO) oracle 7447 46.2 1.0 6559252 177336 ? Rs 09:07 12:28 oracleorcl (LOCAL=NO) oracle 6689 42.9 1.1 6560804 186488 ? Rs 08:57 15:50 oracleorcl (LOCAL=NO) oracle 7454 37.2 1.6 6561316 264144 ? Ss 09:07 10:02 oracleorcl (LOCAL=NO) oracle 7812 4.6 1.3 6554140 214660 ? Ss 09:12 1:01 oracleorcl (LOCAL=NO) oracle 9232 1.0 0.0 155464 3896 pts/3 R+ 09:34 0:00 ps aux [oracle@iZctgmtjcrowglZ ~]$ sqlplus / as sysdba SQL*Plus: Release 11.2.0.4.0 Production on Thu Jun 20 09:35:13 2024 Copyright (c) 1982, 2013, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> select sql_id from v$session where paddr in (select addr from v$process where spid in ('5447','7454','5431','6075','6689')); SQL_ID ------------- 5w3u3qw4sp938 5w3u3qw4sp938 5w3u3qw4sp938 5w3u3qw4sp938 5w3u3qw4sp938 SQL> SELECT count(*) from v$session where sql_id = '5w3u3qw4sp938'; COUNT(*) ---------- 8 SQL>

问题分析

既然已经定位到SQL,那查出具体SQL语句和绑定变量,再用绑定变量看一下执行计划,带入常量发现执行了1个多小时还未出结果,选择取消。

SQL> select sql_text from v$sql where sql_id='5w3u3qw4sp938'; SQL_TEXT -------------------------------------------------------------------------------- SELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINI D = ISGB.ID AND SSH.GENINBILLNO = :B2 AND SSH.GENIOTYPE = :B1 AND SSH.STOPFLAG = '00' SQL> set lineszie 200 SQL> col name for a20 SQL> col value_string for a40 SQL> select name,datatype_string,value_string from V$SQL_BIND_CAPTURE where SQL_ID='5w3u3qw4sp938'; NAME DATATYPE_STRING VALUE_STRING -------------------- ------------------------------ ---------------------------------------- :B2 VARCHAR2(128) 240619-118247943692633 :B1 NUMBER 22201 SQL> conn CMSDTP Enter password: Connected. SQL> VARIABLE B1 NUMBER; SQL> VARIABLE B2 VARCHAR2; SQL> SELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINID = ISGB.ID AND SSH.GENINBILLNO = :B2 AND SSH.GENIOTYPE = :B1 AND SSH.STOPFLAG = '00'; COUNT(1) ---------- 0 Execution Plan ---------------------------------------------------------- Plan hash value: 343233996 -------------------------------------------------------------------------------- ---------------------- | Id | Operation | Name | Rows | Bytes | C ost (%CPU)| Time | -------------------------------------------------------------------------------- ---------------------- | 0 | SELECT STATEMENT | | 1 | 18 | 0 (0)| 00:00:01 | | 1 | SORT AGGREGATE | | 1 | 18 | | | | 2 | NESTED LOOPS | | 1 | 18 | 0 (0)| 00:00:01 | | 3 | NESTED LOOPS | | 1 | 18 | 0 (0)| 00:00:01 | | 4 | INDEX FULL SCAN | PK_INPT_SCM_GEN_BILL | 1 | 13 | 0 (0)| 00:00:01 | |* 5 | INDEX RANGE SCAN | IDX_SCM_SALBILL_GIO | 1 | | 0 (0)| 00:00:01 | |* 6 | TABLE ACCESS BY INDEX ROWID| SCM_SALBILL_HDR | 1 | 5 | 0 (0)| 00:00:01 | -------------------------------------------------------------------------------- ---------------------- Predicate Information (identified by operation id): --------------------------------------------------- 5 - access("SSH"."GENIOTYPE"=TO_NUMBER(:B1)) 6 - filter("SSH"."GENINID" IS NOT NULL AND "SSH"."GENINBILLNO"=:B2 AND "SSH"."STOPFLAG"='00' AND "SSH"."GENINID"="ISGB"."ID") Statistics ---------------------------------------------------------- 8 recursive calls 0 db block gets 339 consistent gets 333 physical reads 0 redo size 525 bytes sent via SQL*Net to client 520 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed SQL> SET AUTOT OFF SQL> SELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINID = ISGB.ID AND SSH.GENINBILLNO = '240619-118247943692633' AND SSH.GENIOTYPE = 22201 AND SSH.STOPFLAG = '00'; ^CSELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINID = ISGB.ID AND SSH.GENINBILLNO = '240619-118247943692633' AND SSH.GENIOTYPE = 22201 AND SSH.STOPFLAG = '00' * ERROR at line 1: ORA-01013: user requested cancel of current operation SQL>

同时我们用AWRSQRPT产生报告,可以发现大量CPU Time消耗在IO等待上。同时执行计划也和上面一样。

6202.jpg

继续查看两个表数据量发现并不大,为什么要执行那么长时间耗那么多CPU?按照上面执行计划,继续查发现IDX_SCM_SALBILL_GIO索引居然没有了。我们再来看看这个表上和SQL相关的字段索引情况。发现和SQL相关的SCM_SALBILL_HDR表上字段有太多索引。后来问开发人员,昨天他们又对这些字段添加索引了。至此我们大致可以判断添加过度的索引,没有做分析表,统计信息缺失导致执行计划错误,产生大量CPU和IO。

SQL> select count(1) from INPT_SCM_GEN_BILL; COUNT(1) ---------- 178760 SQL> select count(1) from SCM_SALBILL_HDR; COUNT(1) ---------- 182387 SQL> select owner,index_name,table_name from dba_indexes where index_name='IDX_SCM_SALBILL_GIO'; no rows selected SQL> col COLUMN_NAME for a40 SQL> select table_owner,table_name,index_name,column_name from dba_ind_columns where table_owner='CMSDTP' and table_name='SCM_SALBILL_HDR' and column_name in ('GENINID','GENINBILLNO','GENIOTYPE'); TABLE_OWNER TABLE_NAME INDEX_NAME COLUMN_NAME ------------------------------ ------------------------------ ---------------------------------------- ---------------------------------------- CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_GENINID GENINID CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_GIBN GENINBILLNO CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GGGS GENINID CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GGGS GENINBILLNO CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GGGS GENIOTYPE CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GEN1 GENIOTYPE CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GEN1 GENINID CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GENIOTYPE GENIOTYPE CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GENIOTYPE GENINBILLNO 9 rows selected. SQL> col OBJECT_NAME for a40 SQL> select owner,object_name,status,created from dba_objects where object_name in (select index_name from dba_ind_columns where table_owner='CMSDTP' and table_name='SCM_SALBILL_HDR' and column_name in ('GENINID','GENINBILLNO','GENIOTYPE')); OWNER OBJECT_NAME STATUS CREATED ------------------------------ ---------------------------------------- ------- ------------ CMSDTP IDX_SCM_SALBILL_GENINID VALID 22-APR-24 CMSDTP IDX_SCM_SALBILL_GIBN VALID 19-JUN-24 CMSDTP IDX_SCM_SALBILL_HDR_GGGS VALID 19-JUN-24 CMSDTP IDX_SCM_SALBILL_HDR_GEN1 VALID 23-SEP-22 CMSDTP IDX_SCM_SALBILL_HDR_GENIOTYPE VALID 23-SEP-22 SQL>

处理过程

找到问题所在后,清理多余的索引,创建和SQL相关字段合理的索引,再对表分析一下更新统计信息。重新查询秒出结果,解决问题。

SQL> drop index CMSDTP.IDX_SCM_SALBILL_GENINID; Index dropped. SQL> drop index CMSDTP.IDX_SCM_SALBILL_GIBN; Index dropped. SQL> drop index CMSDTP.IDX_SCM_SALBILL_HDR_GENIOTYPE; Index dropped. SQL> drop index CMSDTP.IDX_SCM_SALBILL_HDR_GEN1; Index dropped. SQL> drop index CMSDTP.IDX_SCM_SALBILL_HDR_GGGS; Index dropped. SQL> select table_owner,table_name,index_owner,index_name,column_name from dba_ind_columns where table_owner='CMSDTP' and table_name='SCM_SALBILL_HDR' and column_name in ('GENINID','GENINBILLNO','GENIOTYPE'); no rows selected SQL> create index CMSDTP.IDX_SCM_SALBILL_HDR_GEN1 on CMSDTP.SCM_SALBILL_HDR(GENINID,GENIOTYPE); Index created. SQL> create index CMSDTP.IDX_SCM_SALBILL_HDR_GENIOTYPE on CMSDTP.SCM_SALBILL_HDR(GENINBILLNO,GENIOTYPE); Index created. SQL> conn / as sysdba Connected. SQL> exec dbms_stats.gather_table_stats('CMSDTP','SCM_SALBILL_HDR',cascade=>true); PL/SQL procedure successfully completed. SQL> conn CMSDTP Enter password: Connected. SQL> SET AUTOTRACE ON SQL> SELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINID = ISGB.ID AND SSH.GENINBILLNO = '240619-118247943692633' AND SSH.GENIOTYPE = 22201 AND SSH.STOPFLAG = '00'; COUNT(1) ---------- 1 Execution Plan ---------------------------------------------------------- Plan hash value: 46481803 ---------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 48 | 1 (0)| 00:00:01 | | 1 | SORT AGGREGATE | | 1 | 48 | | | | 2 | NESTED LOOPS | | 1 | 48 | 1 (0)| 00:00:01 | | 3 | NESTED LOOPS | | 1 | 48 | 1 (0)| 00:00:01 | | 4 | INDEX FULL SCAN | PK_INPT_SCM_GEN_BILL | 1 | 13 | 0 (0)| 00:00:01 | |* 5 | INDEX RANGE SCAN | IDX_SCM_SALBILL_HDR_GEN1 | 1 | | 1 (0)| 00:00:01 | |* 6 | TABLE ACCESS BY INDEX ROWID| SCM_SALBILL_HDR | 1 | 35 | 1 (0)| 00:00:01 | ---------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 5 - access("SSH"."GENINID"="ISGB"."ID" AND "SSH"."GENIOTYPE"=22201) 6 - filter("SSH"."GENINBILLNO"='240619-118247943692633' AND "SSH"."STOPFLAG"='00') Statistics ---------------------------------------------------------- 8 recursive calls 0 db block gets 654515 consistent gets 84959 physical reads 0 redo size 526 bytes sent via SQL*Net to client 520 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed SQL>

CPU使用率也恢复正常

06203.jpg

总结

本案例还是一条SQL引发的高CPU情况,是开发人员没有根据实际情况添加索引导致。这个案例整体的分析过程不算太繁琐,SQL语句也还算是简单,也没有用到10046事件来跟踪分析。但有几点还是要注意,索引应根据对应的SQL语句还有表的数据量来合理添加,添加后SQL涉及表的统计信息需要重新收集

最后修改时间:2024-06-21 10:04:52
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论