故障描述:
客户数据库Oracle 12.1.0.2 RAC某个节点16点30分钟左右SQL_ID gr5jm3av0n5sx出现了大量的cursor:mutex x、cursor:mutex s等待,CPU使用率接近100%
系统环境:Oracle 12.1C RAC HP-UNIX
1.数据库出现大量的cursor:mutex x和cursor :mutex s等待

2.查看相关SQL_ID gr5jm3av0n5sx的相关信息
SQL_ID gr5jm3av0n5sx, child number 107
-------------------------------------
UPDATE T_BPM_PROCESSINSTANCE T_BPM_PROCESSINSTANCE
SET PROCESSSTATE = :p1,
PREVIOUSSTATE = :p2,
CLUSTERNODEID = :p3,
CURACTIVITYDEFID = :p4,
CURACTIVITYID = :p5,
PREVACTIVITYINSTID = :p6,
CURACTIVITYSTATE = :p7,
ELEMENTLOOPCOUNTER = :p8,
STARTTIME = :p9,
LASTUPDATETIME = :p10,
ACTIVITYCOUNT = :p11,
CURSEQUENCENUM = :p12,
TIMERSTATE = :p13,
PREDACTIVITIES = :p14,
SUSPENDACTIVITIES = :p15,
SKIPPEDACTIVITIES = :p16,
JOINEDELEMENTIDS = :p17,
JOINEDTRANSITIONS = :p18,
METADATA = :p19,
ISEXECUTING = :p20,
ASYNCELEMENTDEFID = :p21,
ASYNCRETRYCOUNT = :p22,
ASYNCELEMENTINSTID = :p23,
CURELMNTSTARTTIME = :p24,
BUSINESSSTATUS = :p25,
PI_EXTN_05 = :p26,
PI_EXTN_08 = :p27,
PI_EXTN_12 = :p28,
PI_EXTN_13 = :p29,
PI_EXTN_21 = :p30,
PI_EXTN_26 = :p31,
PI_EXTN_27 = :p32
WHERE PROCESSINSTANCEID = :p33
AND L2_PARTITION_KEY = :p34
Plan hash value: 2404250912
--------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | | | 4 (100)| | | |
| 1 | UPDATE | T_BPM_PROCESSINSTANCE | | | | | | |
|* 2 | TABLE ACCESS BY GLOBAL INDEX ROWID| T_BPM_PROCESSINSTANCE | 1 | 369 | 4 (0)| 00:00:01 | ROWID | ROWID |
|* 3 | INDEX UNIQUE SCAN | PK_T_BPM_PROCESSINSTANCE | 1 | | 3 (0)| 00:00:01 | | |
--------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("L2_PARTITION_KEY"=:P34)
3 - access("PROCESSINSTANCEID"=:P33)
sys@TRDDB1>select version_count from v$sqlarea where sql_id='gr5jm3av0n5sx';
VERSION_COUNT
-------------
108
该SQL的历史执行信息:
BEGIN_TIME INSTANCE_NUMBER MODULE PLAN_HASH_VALUE EXEC PER_GET PER_ROWS TIME_MS PER_READ
------------------- --------------- --------------------------------------------- --------------- ---------- ---------- ---------- ---------- ----------
2019-06-04 16:45:24 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 359560 11 1 .49 0
2019-06-04 16:45:23 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 775550 12 1 46.17 0
2019-06-04 16:30:04 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 489342 26 1 4984.1 0
2019-06-04 16:30:04 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 322380 12 1 .46 0
2019-06-04 16:15:31 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 717255 19 1 1652.25 0
2019-06-04 16:15:31 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 298229 12 1 .46 0
2019-06-04 16:00:26 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 691853 12 1 .79 0
2019-06-04 16:00:26 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 330588 11 1 .45 0
2019-06-04 15:45:20 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 679735 12 1 .81 0
2019-06-04 15:45:20 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 381596 12 1 .43 0
2019-06-04 15:30:16 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 775566 13 1 20.56 0
2019-06-04 15:30:15 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 359154 11 1 .42 0
2019-06-04 15:15:10 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 751541 13 1 19.75 0
2019-06-04 15:15:10 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 483641 11 1 .44 0
2019-06-04 15:00:05 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 437762 13 1 .46 0
2019-06-04 15:00:05 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 367565 12 1 .43 0
2019-06-04 14:45:21 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 436110 13 1 .83 0
2019-06-04 14:45:21 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 345284 12 1 .43 0
2019-06-04 14:30:16 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 546521 13 1 2.03 0
2019-06-04 14:30:16 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 336119 12 1 .43 0
2019-06-04 14:15:11 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 455908 12 1 .5 0
2019-06-04 14:15:11 1 ddsserver@wbesngx202 (TNS V1-V3) 2404250912 370684 11 1 .42 0
2019-06-04 14:00:06 2 ddsserver@wbesesb203 (TNS V1-V3) 2404250912 418343 11 1 .42 0
3、cursor:mutex x、cursor:mutex s等待事件
cursor:mutex x、cursor:mutex s是属于解析类等待事件,是因为解析该SQL时引起(软解析也会有该等待事件)oracle有如下的解释
cursor: mutex S
A session waits on this event when it is requesting a mutex in shared mode, when another session is currently holding a this mutex in exclusive mode on the same cursor object.
Cursor: mutex S情景:
Wants: Shared mode on Mutex of Parent / Child
• Change the reference count (“in flux”)= “new guy is interested / spinning”
• Parent examination
• When finding a cursor to execute, the parent must be examined. The examination of the parent is performed using the mutex, cursor: mutex S.
• When the parent cursor has many child cursors involved, this waits will come as the server process has to traverse the entire list of child cursors under the parent to find a match.
• Mutex is in the parent cursor.
cursor: mutex X:
The session requests the mutex for a cursor object in exclusive mode, and it must wait because the resource is busy. The mutex is busy because either the mutex is being held in exclusive mode by another session or the mutex is being held shared by one or more sessions. The existing mutex holder(s) must release the mutex before the mutex can be granted exclusively.
cursor: mutex X情景:
Wants: Exclusive mode on Mutex of Parent / Child
• Building a new cursor under a parent
• Although this operation is cheaper, building many cursors under a parent cursor is not recommended.
• Capture SQL bind data(PEEK)
• Build or Update statistics blocks,Modify cursor-related statistics
• Mutex is in the parent cursor.

4、parent cursor出现high version count:绑定变量不匹配、统计信息引起、绑定变量长度不一致


绑定变量类型不一致或者分级主要发生在以下几个变量:P4、P5、P6、P9、P14、P21

5、High Waits On cursor: mutex X After Upgrading The Database to 12c, 18c & 19c (Doc ID 2625815.1)
APPLIES TO:
Oracle Database - Enterprise Edition - Version 12.1.0.2 to 19.5.0.0.0 [Release 12.1 to 18]
Information in this document applies to any platform.
SYMPTOMS
• High waits on cursor: mutex X after upgrading the database to 12.1.0.2, 12.2.0.1, 18c or 19c.
• CPU utilization spikes to 100% due to the mutex waits leading to database hang or application outage.
• Huge version count is seen for the SQL statements during the problem period.
• Login or connection storm is observed during the problem.
• Huge sleeps on functions kkscsAddChildNode & kkscsPruneChild as per ‘Mutex Sleep Summary’ from AWR report.
Mutex Type Location Sleeps Wait Time (ms)
• Cursor Parent kkscsAddChildNode [KKSPRTLOC34] 302,348,122 140,622
• Cursor Parent kkscsPruneChild [KKSPRTLOC35] 4,763,695 -493,302
• Cursor Pin kkslce [KKSCHLPIN2] 1,232,845 659,278
• Hanganalyze trace shows most of the chains running the same cursor from different processes waiting on ‘cursor: mutex X’ with the below stack.
The stack trace:
<-kgxExclusive<-kkscsAddChildNode<-kxscod<-kkscsCompareBinds<-kkscscid_bnd_eval<-kkscsCheckCriteria<-kkscsCheckCursor<-kkscsSearchChildList<-kksfbc<-
CHANGES
Upgrading the database to 12c and higher.
CAUSE
This is due to below internal bug which is fixed in 20.1.
Unpublished Bug 28889389 : HIGH WAITS ON CURSOR: MUTEX X AFTER UPGRADE TO 12.1.0.2
This bug is hit after having observed that multiple unique sessions are waiting for a parent cursor mutex in exclusive mode on the same cursor under the following stack:-
ksfwaitctx()+28<-kgxWait()+1164<-kgxExclusive()+652<-kkscsAddChildNode()
‘cursor: mutex X’ would have been taken on the parent cursor to find out the matching child cursor for that particular SQL. Some kind of logon storm where all sessions are executing the same cursor at the same time leading to this mutex contention. The child cursor sharing code has been optimized to reduce the Mutex Contention so that exclusive mutex is not acquired for searching the child cursors for match under the same parent.
SOLUTION
To Apply the Patch 28889389 which has the optimized code for cursor mutex while searching the parent cursor for the match.
可能触碰到上述BUG
6、建议:
1 采取定期purge version count child的方式来规避(也通过规范绑定变量和控制统计信息收集策略来控制产生child cursor的数量)
2 设置隐含参数_cursor_obsolete_threshold控制子游标数量,可以设置为100
3 将该SQL进一步打散
将频繁执行的SQL分区拆解,分散竞争,如以下SQL通过注释将同一条SQL分解为2条,就分散了竞争:
select /*SQL 1*/ a from t_a where id=?
select /*SQL 2*/ a from t_a where id=?
4 打上相关patchset规避Oracle Bug引起的cursor竞争。




