问题描述
上周正在开会讨论一套库的优化方案,突然有又收到了另一个性能告警,有套库的CPU突然暴涨负载非常高,随后其它业务因为CPU慢,业务逐渐积压,最终导致数据库主机命令都无响应, 最后定位了原因是与该SQL的一个对象的DDL drop partition, 因该SQL hard parse, 并且分区数量减少, 又因分区级统计信息不准确, CBO改变了执行计划使用了一个相对差的执行计划, 针对这个案例我做了大量的测试,这里对敏感信息做了处理简单的记录一下。
专家解答
如果遇到这个案例应该搞清楚几个问题:
1,什么原因使执行计划改变
2,如何最快的修正执行计划
3,如何避免
4,如何在测试环境还原问题现象
期间做了以下尝试
1, 尝试kill 该sql的部分session,但session增度超过kill的速度
2,查看了SQL的执行计划,从SQL text简单看那个执行计划是也算合理
3,AWR中以前未采集到该SQL的执行计划,SQL CACHE和手动EXPLIAN plan 分析也是现在运行的执行计划(index join)
4, 尝试对表执行了一个grant使SQL重解析,问题依旧
5,尝试用SQL TEXT去其它同类库执行发现执行计划确实不一样(没用index join),当然后来我们从ASH中查到该sql确实是刚刚改变的执行计划
6, 尝试用SQL profile手动指定执行计划,我以前用该手段屡试不爽但是这次还是有点意外,因为执行计划中使用index join, oracle给我们提供了index_join hint但是没有提供no_index_join hint, sql中没有使用Table alias name,所以我使用的Table name让CBO不去使用其中一个索引,用了no_index(tablename indexname), 当然事后也在twitter及OTN上向Jonathan Lewis大师请教,得到JL一些宝贵意见,因为当时没有太多时间,不成功就尝试其它方法。后面我会记录多个成功的方法。
7, 收集表的统计信息,因为这是一个分区表,表级已超过了2亿记录,也不能删统计信息,因为此时动态采样对于大表更动弹不得, 所以先对表以分区为单位收集, 因为当前系统极慢,短时间不可能收集完,丢一个空口去收。
8, 尝试手动set几个分区索引的统计信息,因为没有参考但还执行计划还是未改变
就在尝试第8步过程中,没有再给机会实例重启了。往往考验都是这样,当你接手时系统已病入膏肓,一大群人会围在你身边不停的问怎么样了,顶着压力还要淡定条理的处理问题,再没太多的机会尝试,几分钟可能实例就宕掉了。当然如果您有好的建议可以去我的weibo或twitter或mail告诉我。
数据库起动后,立即收集了表及索引的统计信息,explian 确认执行计划已改变不再使用index join 和其它库相同,起应用故障恢复。
先来看一下问题时的SQL及执行计划
SQL> @xi 4x2qck7u5m4mk % eXplain the execution plan for sqlid 4x2qck7u5m4mk child %... PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------------ SQL_ID 4x2qck7u5m4mk, child number 4 ------------------------------------- SELECT ATTRID, ATTRVALUE FROM ANBOB_T1 WHERE RECOID = :RECOID AND REGION = 399 Plan hash value: 3978804789 --------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | E-Rows | Pstart| Pstop | OMem | 1Mem | Used-Mem | --------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | | | | | | |* 1 | VIEW | index$_join$_001 | 1387K| | | | | | |* 2 | HASH JOIN | | | | | 1000K| 1000K| 927K (0)| | 3 | PARTITION RANGE ITERATOR| | 1387K| 65 | 74 | | | | |* 4 | INDEX RANGE SCAN | IDX_ANBOB_T1 | 1387K| 65 | 74 | | | | | 5 | PARTITION RANGE ITERATOR| | 1387K| 65 | 74 | | | | |* 6 | INDEX FAST FULL SCAN | IDX_ANBOB_T1_VALUE | 1387K| 65 | 74 | | | | --------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("RECOID"=TO_NUMBER(:RECOID)) 2 - access(ROWID=ROWID) 4 - access("RECOID"=TO_NUMBER(:RECOID)) 6 - filter("REGION"=399) Note ----- - dynamic sampling used for this statement (level=2) SQL> @ind ANBOB.ANBOB_T1 Display indexes where table or index name matches %ANBOB.ANBOB_T1%... TABLE_OWNER TABLE_NAME INDEX_NAME POS# COLUMN_NAME DSC --------------- ------------- ------------------ ---- ------------------------------ ---- ANBOB ANBOB_T1 IDX_ANBOB_T1 1 RECOID IDX_ANBOB_T1_VALUE 1 ATTRID 2 ATTRVALUE 3 REGION INDEX_OWNER TABLE_NAME INDEX_NAME IDXTYPE UNIQ STATUS PART TEMP H LFBLKS NDK NUM_ROWS CLUF LAST_ANALYZED DEGREE VISIBILIT -------------------- ------------------------------ ------------------------------ ---------- ---- -------- ---- ---- -- ---------- ------------- ---------- ---------- ----------------- ------ --------- ANBOB ANBOB_T1 IDX_ANBOB_T1 NORMAL NO N/A YES N 1 0 0 0 0 20141214 22:30:50 1 VISIBLE ANBOB_T1 IDX_ANBOB_T1_VALUE NORMAL NO N/A YES N 1 0 0 0 0 20150527 00:25:59 1 VISIBLE
Note:
就是一个简单的select语句,SQL涉及表的所有字段都有索引,不过分布在两个索引上,ORACLE 优化器会选择了使用index join的执行计划,避免了对table的访问路径, 因当时cpu过高,查询慢没成功记录两个索引的分区信息,不过可以确认表有个索引最后一次收集是20141214。Index join 也叫index hash join ,只指从多个索引中通过hash join取到值减少对表的访问。INDEX JOIN 的执行计划存在bug,从上面可以看到两个索引及hash后的e-rows是相同的,关于Index JOIN原理不再多描述。
#收集统计信息后的执行计划 SQL> explain plan for SELECT ATTRID, ATTRVALUE FROM ANBOB.ANBOB_T1 WHERE RECOID = '1111' AND REGION = 399; Explained. SQL> @x2 PLAN_TABLE_OUTPUT ----------------------------------------------------------------------------------------------------------- Plan hash value: 491038779 ---------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | ---------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 28 | 22 (0)| 00:00:01 | | | | 1 | PARTITION RANGE ITERATOR | | 1 | 28 | 22 (0)| 00:00:01 | 65 | 74 | |* 2 | TABLE ACCESS BY LOCAL INDEX ROWID| ANBOB_T1 | 1 | 28 | 22 (0)| 00:00:01 | 65 | 74 | |* 3 | INDEX RANGE SCAN | IDX_ANBOB_T1 | 1 | | 21 (0)| 00:00:01 | 65 | 74 | ---------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter("REGION"=399) 3 - access("RECOID"=1111)
Note:
可以看到CBO使用了另一执行计划,不再使用INDEX_JOIN,COST非常小且响应时间非常快, 细心的可能发现index join的执行计划中谓词中有TO_NUMBER, 那应该是程序绑定变量类型可能是非number类型, CBO窥探到了变量,所以不是因为变量的问题使用了hash join的执行计划。就是因为统计信息的问题。刚好我们这个数据库有BCV同步,现在就可以在测试库上还原问题,再分析,其实只要还原统计信息就可以。下面我们在测试库还原这个问题。
SQL> alter session set NLS_TIMESTAMP_TZ_FORMAT='yyyy-mm-dd:hh24:mi:ss'; Session altered. SQL> @tab ANBOB.ANBOB_T1 Show tables matching condition "%ANBOB.ANBOB_T1%" (if schema is not specified then current user's tables only are shown)... OWNER TABLE_NAME TYPE NUM_ROWS BLOCKS EMPTY AVGSPC ROWLEN TAB_LAST_ANALYZED DEGREE COMPRESS -------------------- ------------------------------ ---- ------------ ------------- --------- ------ ------ ----------------- ---------- -------- ANBOB ANBOB_T1 PTAB 283829133 1796808 0 0 35 20160217 16:50:55 1 SQL> select table_name,PARTITION_NAME,stats_update_time from dba_tab_stats_history where owner='ANBOB' AND table_name='ANBOB_T1'; TABLE_NAME PARTITION_NAME STATS_UPDATE_TIME --------------------- ------------------------------ -------------------------- ... ANBOB_T1 PART_398_MAX 2016-02-17:16:50:55 ANBOB_T1 PART_399_201507 2016-02-17:16:50:26 ANBOB_T1 PART_399_201507 2016-02-17:16:50:55 ANBOB_T1 PART_399_201508 2016-02-17:16:50:26 ANBOB_T1 PART_399_201508 2016-02-17:16:50:55 ANBOB_T1 PART_399_201509 2016-02-17:16:50:26 ANBOB_T1 PART_399_201509 2016-02-17:16:50:55 ANBOB_T1 PART_399_201510 2016-02-17:16:50:26 ANBOB_T1 PART_399_201510 2016-02-17:16:50:55 ANBOB_T1 PART_399_201511 2016-02-17:16:50:26 ANBOB_T1 PART_399_201511 2016-02-17:16:50:55 ANBOB_T1 PART_399_201512 2016-02-17:16:50:26 ANBOB_T1 PART_399_201512 2016-02-17:16:50:55 ANBOB_T1 PART_399_201601 2016-02-17:16:50:26 ANBOB_T1 PART_399_201601 2016-02-17:16:50:55 ANBOB_T1 PART_399_201602 2016-02-17:16:50:26 ... SQL> select DBMS_STATS.GET_STATS_HISTORY_AVAILABILITY from dual; GET_STATS_HISTORY_AVAILABILITY --------------------------------------------------------------------------- 2016-01-23:09:28:05 SQL> select DBMS_STATS.GET_STATS_HISTORY_RETENTION from dual; GET_STATS_HISTORY_RETENTION --------------------------- 31
Tip:
统计信息历史保留期当前是31天也是默认值,我们可以恢复到问题时刻的统计信息,对表统计信息的保留周期也可以修改使用DBMS_STATS.ALTER_STATS_HISTORY_RETENTION,注意恢复后还可以恢复回来,恢复操作不会修改dba_tab_stats_history的记录。
— 恢复到问题时间点时的统计信息
SQL> begin dbms_stats.restore_table_stats('ANBOB','ANBOB_T1', to_timestamp('2016-02-17:15:00:55','yyyy-mm-dd:hh24:mi:ss')); end; / PL/SQL procedure successfully completed. SQL> @tab ANBOB.ANBOB_T1 Show tables matching condition "%ANBOB.ANBOB_T1%" (if schema is not specified then current user's tables only are shown)... OWNER TABLE_NAME TYPE NUM_ROWS BLOCKS EMPTY AVGSPC ROWLEN TAB_LAST_ANALYZED DEGREE COMPRESS --------- --------------- ---- ------------ ------------- --------- ------ ------ ----------------- ---------- -------- ANBOB ANBOB_T1 PTAB 1 SQL> @ind ANBOB.ANBOB_T1 Display indexes where table or index name matches %ANBOB.ANBOB_T1%... TABLE_OWNER TABLE_NAME INDEX_NAME POS# COLUMN_NAME DSC --------------- ------------- -------------------------- ---- ------------------------------ ---- ANBOB ANBOB_T1 IDX_ANBOB_T1 1 RECOID IDX_ANBOB_T1_VALUE 1 ATTRID 2 ATTRVALUE 3 REGION INDEX_OWNER TABLE_NAME INDEX_NAME IDXTYPE UNIQ STATUS PART TEMP H LFBLKS NDK NUM_ROWS CLUF LAST_ANALYZED DEGREE VISIBILIT -------------------- ------------------------------ ------------------------------ ---------- ---- -------- ---- ---- -- ---------- ------------- ---------- ---------- ----------------- ------ --------- ANBOB ANBOB_T1 IDX_ANBOB_T1 NORMAL NO N/A YES N 1 0 0 0 0 20141214 22:30:50 1 VISIBLE ANBOB_T1 IDX_ANBOB_T1_VALUE NORMAL NO N/A YES N 1 0 0 0 0 20150527 00:25:59 1 VISIBLE SQL> @tabpart2 ANBOB.ANBOB_T1 TABLE_OWNER TABLE_NAME POS PARTITION_NAME NUM_ROWS BLOCKS LAST_ANALYZED GLO INT ------------------ ---------------- ---------- --------------------- ---------- ---------- ----------------- --- --- ... ANBOB 65 PART_399_201507 NO NO ANBOB 66 PART_399_201508 NO NO ANBOB 67 PART_399_201509 NO NO ANBOB 68 PART_399_201510 NO NO ANBOB 69 PART_399_201511 NO NO ANBOB 70 PART_399_201512 NO NO ANBOB 71 PART_399_201601 NO NO ANBOB 72 PART_399_201602 NO NO ANBOB 73 PART_399_MAX NO NO ... SQL> @indpart ANBOB.IDX_ANBOB_T1 PARTITION_POSITION INDEX_NAME PARTITION_NAME H LFBLKS NDK NUM_ROWS LAST_ANALYZED STATUS ------------------ -------------------- ------------------- -- ---------- ------ ---------- ----------------- -------- ... 65 IDX_ANBOB_T1 PART_399_201507 1 0 0 0 20150625 01:23:33 USABLE 66 IDX_ANBOB_T1 PART_399_201508 USABLE 67 IDX_ANBOB_T1 PART_399_201509 USABLE 68 IDX_ANBOB_T1 PART_399_201510 USABLE 69 IDX_ANBOB_T1 PART_399_201511 USABLE 70 IDX_ANBOB_T1 PART_399_201512 USABLE 71 IDX_ANBOB_T1 PART_399_201601 1 0 0 0 20151222 16:31:44 USABLE 72 IDX_ANBOB_T1 PART_399_201602 USABLE 73 IDX_ANBOB_T1 PART_399_MAX USABLE ... SQL> @indpart ANBOB.IDX_ANBOB_T1_VALUE PARTITION_POSITION INDEX_NAME PARTITION_NAME H LFBLKS NDK NUM_ROWS LAST_ANALYZED STATUS ------------------ ---------------------- ----------------- -- ---------- ------ ---------- ----------------- -------- ... 65 IDX_ANBOB_T1_VALUE PART_399_201507 1 0 0 0 20150625 01:23:33 USABLE 66 IDX_ANBOB_T1_VALUE PART_399_201508 USABLE 67 IDX_ANBOB_T1_VALUE PART_399_201509 USABLE 68 IDX_ANBOB_T1_VALUE PART_399_201510 USABLE 69 IDX_ANBOB_T1_VALUE PART_399_201511 USABLE 70 IDX_ANBOB_T1_VALUE PART_399_201512 USABLE 71 IDX_ANBOB_T1_VALUE PART_399_201601 1 0 0 0 20151222 16:31:44 USABLE 72 IDX_ANBOB_T1_VALUE PART_399_201602 USABLE 73 IDX_ANBOB_T1_VALUE PART_399_MAX USABLE ... SQL> explain plan for SELECT /*anbob*/ ATTRID, ATTRVALUE FROM ANBOB.ANBOB_T1 WHERE RECOID = :RECOID AND REGION = 399; Explained. SQL> @x2 PLAN_TABLE_OUTPUT --------------------------------------------------------------------------------------- Plan hash value: 3978804789 ------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | ------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 985K| 178M| 197 (1)| 00:00:03 | | | |* 1 | VIEW | index$_join$_001 | 985K| 178M| 197 (1)| 00:00:03 | | | |* 2 | HASH JOIN | | | | | | | | | 3 | PARTITION RANGE ITERATOR| | 985K| 178M| 136 (92)| 00:00:02 | 65 | 74 | |* 4 | INDEX RANGE SCAN | IDX_ANBOB_T1 | 985K| 178M| 136 (92)| 00:00:02 | 65 | 74 | | 5 | PARTITION RANGE ITERATOR| | 985K| 178M| 232 (1)| 00:00:03 | 65 | 74 | |* 6 | INDEX FAST FULL SCAN | IDX_ANBOB_T1_VALUE | 985K| 178M| 232 (1)| 00:00:03 | 65 | 74 | ------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("RECOID"=TO_NUMBER(:RECOID)) 2 - access(ROWID=ROWID) 4 - access("RECOID"=TO_NUMBER(:RECOID)) 6 - filter("REGION"=399) Note ----- - dynamic sampling used for this statement (level=2)
Note:
到这里故障时的SQL的执行计划恢复了,我们又恢复了index join执行计划, 因篇幅原因先到这里,下篇会继续使用sql profile修正该执行计划…