问题描述
数据库一节点出现了大量的library cache lock和library cache: mutex X,并且都是同一条sql96yc3rmxjfatn产生的。
最终通过应用端停止这条sql恢复故障。
问题分析
2.1 SQL执行情况
从listener日志(实时采样)和dba_hist视图(10秒采样间隔)中可以看到:
在10:15:16时,数据库一切正常 ----dba_hist视图 在10:15:17时,开始产生了大量链接。----listener日志 在10:15:26时,数据库开始有大量的library cache事件 ----dba_hist视图
1)listener日志
分钟级别:
219-SEP-202010:10
319-SEP-202010:12
319-SEP-202010:13
219-SEP-202010:14
149719-SEP-202010:15
819-SEP-202010:16
219-SEP-202010:18
319-SEP-202010:19
1219-SEP-202010:21
18019-SEP-202010:22
17219-SEP-202010:23
25819-SEP-202010:24
20819-SEP-202010:25
28219-SEP-202010:26
59019-SEP-202010:27
8919-SEP-202010:28
18119-SEP-202010:29
98419-SEP-202010:30
秒级:
5419-SEP-202010:15:17 ----突然发起链接,之前没有链接
9519-SEP-202010:15:18
8219-SEP-202010:15:19
5319-SEP-202010:15:20
219-SEP-202010:15:21
519-SEP-202010:15:23
1919-SEP-202010:15:24
5119-SEP-202010:15:25
6219-SEP-202010:15:26
7319-SEP-202010:15:27
8419-SEP-202010:15:28
8319-SEP-202010:15:29
5819-SEP-202010:15:30
5019-SEP-202010:15:31
5319-SEP-202010:15:32
5219-SEP-202010:15:33
5619-SEP-202010:15:34
5119-SEP-202010:15:35
2919-SEP-202010:15:36
2)DBA_HIST视图
在10:15:26开始出现问题
SQL> select to_char(sample_time,'yyyymmdd hh24:mi:ss') sample_time,sql_id,event
2 from dba_hist_active_sess_history
3 where to_char(sample_time,'yyyymmdd hh24:mi:ss') between '20200919 10:15:00' and '20200919 10:16:00'
4 order by sample_time;
SAMPLE_TIME SQL_ID EVENT
----------------- ------------- ----------------------------------------------------------------
2020091910:15:06 log file sync
2020091910:15:06 log file sync
2020091910:15:08 amxqfhpq8pqjk
2020091910:15:18738pan92hsv3k
2020091910:15:18 db file parallel write
2020091910:15:18 amxqfhpq8pqjk
2020091910:15:2696yc3rmxjfatn library cache: mutex X ---突然发生问题
2020091910:15:2696yc3rmxjfatn library cache: mutex X
2020091910:15:2696yc3rmxjfatn library cache: mutex X
2020091910:15:2696yc3rmxjfatn library cache: mutex X
2020091910:15:2696yc3rmxjfatn library cache: mutex X
SQL> select to_char(sample_time,'yyyymmdd hh24:mi:ss') sample_time,event,count(*)
2 from dba_hist_active_sess_history
3 where to_char(sample_time,'yyyymmdd hh24:mi:ss') between '20200919 10:00:00' and '20200919 10:16:00'
4 and event='library cache lock'
5groupby sample_time,event
6 order by sample_time;
SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20200919 10:07:25 library cache lock 1
20200919 10:15:26 library cache lock 181
20200919 10:15:36 library cache lock 1079
20200919 10:15:46 library cache lock 1140
20200919 10:15:56 library cache lock 501
SQL> select to_char(sample_time,'yyyymmdd hh24:mi:ss') sample_time,event,count(*)
2 from dba_hist_active_sess_history
3 where to_char(sample_time,'yyyymmdd hh24:mi:ss') between '20200919 10:00:00' and '20200919 10:16:00'
4 and event='library cache: mutex X'
5groupby sample_time,event
6 order by sample_time;
SAMPLE_TIME EVENT COUNT(*)
----------------- ---------------------------------------------------------------- ----------
20200919 10:15:26 library cache: mutex X 286
20200919 10:15:36 library cache: mutex X 14
20200919 10:15:46 library cache: mutex X 326
20200919 10:15:56 library cache: mutex X 1054
SQL> select to_char(sample_time,'yyyymmdd hh24:mi:ss') sample_time,sql_id,count(*)
2 from dba_hist_active_sess_history
3 where to_char(sample_time,'yyyymmdd hh24:mi:ss') between '20200919 10:00:00' and '20200919 10:16:00'
4 and sql_id='96yc3rmxjfatn'
5groupby sample_time,sql_id
6 order by sample_time;
SAMPLE_TIME SQL_ID COUNT(*)
----------------- ------------- ----------
20200919 10:00:04 96yc3rmxjfatn 1
20200919 10:00:24 96yc3rmxjfatn 1
20200919 10:01:34 96yc3rmxjfatn 1
20200919 10:02:54 96yc3rmxjfatn 1
20200919 10:04:04 96yc3rmxjfatn 1
20200919 10:04:14 96yc3rmxjfatn 1
20200919 10:04:34 96yc3rmxjfatn 1
20200919 10:04:44 96yc3rmxjfatn 1
20200919 10:05:14 96yc3rmxjfatn 2
20200919 10:05:34 96yc3rmxjfatn 1
20200919 10:06:15 96yc3rmxjfatn 1
。
。
。
。
SAMPLE_TIME SQL_ID COUNT(*)
----------------- ------------- ----------
20200919 10:12:26 96yc3rmxjfatn 1
20200919 10:14:46 96yc3rmxjfatn 1
20200919 10:14:56 96yc3rmxjfatn 1
20200919 10:15:26 96yc3rmxjfatn 476
20200919 10:15:36 96yc3rmxjfatn 1093
20200919 10:15:46 96yc3rmxjfatn 1467
20200919 10:15:56 96yc3rmxjfatn 1556
2.2 游标不共享问题
1)v$sql_shared_cursor
监控一段问题时间sql96yc3rmxjfatn的version数量,发现在不断增加,达到了16000:
5:30:26 SQL> select count(*) from v$sql_shared_cursor where sql_id='96yc3rmxjfatn';
COUNT(*)
----------
15847
15:30:37 SQL>
COUNT(*)
----------
15952
15:31:56 SQL>
COUNT(*)
----------
15957
15:31:58 SQL>
COUNT(*)
----------
15958
15:31:59 SQL>
COUNT(*)
----------
15962
15:32:03 SQL> 15:32:03 SQL>
COUNT(*)
----------
15968
15:32:09 SQL>
COUNT(*)
----------
15968
查询游标不共享的原因BIND_MISMATCH和BIND_LENGTH_UPGRADEABLE:
。。。。
FORCE_HARD_PARSE: 0
EXPLAIN_PLAN_CURSOR: 0
BUFFERED_DML_MISMATCH: 0
PDML_ENV_MISMATCH: 0
INST_DRTLD_MISMATCH: 0
SLAVE_QC_MISMATCH: 0
TYPECHECK_MISMATCH: 0
AUTH_CHECK_MISMATCH: 0
BIND_MISMATCH: 85
DESCRIBE_MISMATCH: 0
LANGUAGE_MISMATCH: 0
。。。。
REMOTE_MAPPING_MISMATCH: 0
LOAD_RUNTIME_HEAP_FAILED: 0
HASH_MATCH_FAILED: 0
PURGED_CURSOR: 0
BIND_LENGTH_UPGRADEABLE: 45
2)v$sql_bind_capture
从v$sql_bind_capture中查询到的sql96yc3rmxjfatn绑定变量数据,发现确实有长度不一致的问题。
查询结果节选:
SQL> select sql_id,child_number,position,datatype_string,max_length from v$sql_bind_capture where sql_id='96yc3rmxjfatn' order by position;
SQL_ID CHILD_NUMBER POSITION DATATYPE_STRING MAX_LENGTH
------------- ------------ ---------- ------------------------------------------------------------ ----------
96yc3rmxjfatn 44 1 VARCHAR2(128) 128
96yc3rmxjfatn 43 1 VARCHAR2(128) 128
96yc3rmxjfatn 42 1 VARCHAR2(128) 128
96yc3rmxjfatn 41 1 VARCHAR2(128) 128
96yc3rmxjfatn 40 1 VARCHAR2(128) 128
96yc3rmxjfatn 39 1 VARCHAR2(128) 128
96yc3rmxjfatn 38 1 VARCHAR2(128) 128
96yc3rmxjfatn 37 1 VARCHAR2(128) 128
96yc3rmxjfatn 36 1 VARCHAR2(128) 128
96yc3rmxjfatn 35 1 VARCHAR2(128) 128
96yc3rmxjfatn 34 1 VARCHAR2(128) 128
。。。。
SQL_ID CHILD_NUMBER POSITION DATATYPE_STRING MAX_LENGTH
------------- ------------ ---------- ------------------------------------------------------------ ----------
96yc3rmxjfatn 38 6 VARCHAR2(32) 32
96yc3rmxjfatn 37 6 VARCHAR2(128) 128
96yc3rmxjfatn 36 6 VARCHAR2(128) 128
96yc3rmxjfatn 35 6 VARCHAR2(128) 128
96yc3rmxjfatn 34 6 VARCHAR2(128) 128
96yc3rmxjfatn 33 6 VARCHAR2(128) 128
96yc3rmxjfatn 32 6 VARCHAR2(128) 128
96yc3rmxjfatn 31 6 VARCHAR2(128) 128
96yc3rmxjfatn 30 6 VARCHAR2(32) 32
96yc3rmxjfatn 29 6 VARCHAR2(32) 32
96yc3rmxjfatn 28 6 VARCHAR2(128) 128
SQL_ID CHILD_NUMBER POSITION DATATYPE_STRING MAX_LENGTH
------------- ------------ ---------- ------------------------------------------------------------ ----------
96yc3rmxjfatn 27 6 VARCHAR2(128) 128
96yc3rmxjfatn 26 6 VARCHAR2(128) 128
96yc3rmxjfatn 25 6 VARCHAR2(32) 32
96yc3rmxjfatn 24 6 VARCHAR2(32) 32
96yc3rmxjfatn 23 6 VARCHAR2(32) 32
96yc3rmxjfatn 22 6 VARCHAR2(128) 128
96yc3rmxjfatn 21 6 VARCHAR2(32) 32
96yc3rmxjfatn 20 6 VARCHAR2(32) 32
96yc3rmxjfatn 19 6 VARCHAR2(128) 128
96yc3rmxjfatn 18 6 VARCHAR2(128) 128
96yc3rmxjfatn 17 6 VARCHAR2(128) 128
。。。。
SQL_ID CHILD_NUMBER POSITION DATATYPE_STRING MAX_LENGTH
------------- ------------ ---------- ------------------------------------------------------------ ----------
96yc3rmxjfatn 9 32 VARCHAR2(128) 128
96yc3rmxjfatn 8 32 VARCHAR2(32) 32
96yc3rmxjfatn 7 32 VARCHAR2(32) 32
96yc3rmxjfatn 6 32 VARCHAR2(128) 128
96yc3rmxjfatn 5 32 VARCHAR2(32) 32
96yc3rmxjfatn 4 32 VARCHAR2(128) 128
96yc3rmxjfatn 3 32 VARCHAR2(32) 32
96yc3rmxjfatn 2 32 VARCHAR2(32) 32
96yc3rmxjfatn 1 32 VARCHAR2(32) 32
96yc3rmxjfatn 0 32 VARCHAR2(128) 128
96yc3rmxjfatn 44 33 VARCHAR2(32) 32
。。。。
SQL_ID CHILD_NUMBER POSITION DATATYPE_STRING MAX_LENGTH
------------- ------------ ---------- ------------------------------------------------------------ ----------
96yc3rmxjfatn 6 84 TIMESTAMP 11
96yc3rmxjfatn 5 84 VARCHAR2(32) 32
96yc3rmxjfatn 7 84 VARCHAR2(32) 32
96yc3rmxjfatn 8 84 TIMESTAMP 11
96yc3rmxjfatn 10 84 VARCHAR2(32) 32
96yc3rmxjfatn 9 84 VARCHAR2(32) 32
96yc3rmxjfatn 11 84 VARCHAR2(32) 32
96yc3rmxjfatn 12 84 VARCHAR2(32) 32
96yc3rmxjfatn 14 84 VARCHAR2(32) 32
96yc3rmxjfatn 13 84 VARCHAR2(32) 32
96yc3rmxjfatn 15 84 VARCHAR2(32) 32
。。。。
问题总结
经查询和沟通,因业务的变更,导致此条sql的执行并发量比之前要大。
同时通过查询问题发生时这条sql在v$sql_shared_cursor视图中的记录,发现如下几点:
视图中关于sql96yc3rmxjfatn的记录条数不断的在增加,这证明sql96yc3rmxjfatn没有共享游标,并且不停的产生新的游标。
不共享游标的原因是BIND_MISMATCH和BIND_LENGTH_UPGRADEABLE,既绑定变量的长度有变化导致了游标不能共享。
通过v$sql_bind_capture视图中抓到的sql96yc3rmxjfatn绑定使用情况,也证明了上述信息。
综上所诉,由于绑定变量长度总是在变化,导致了sql不能共享游标,当出现高并发的时候,sql在不停的解析,并没有共享游标,从而产生了大量的library cache lock和library cache: mutex X,从而引发了此次问题。
后续建议
4.1 当前数据库版本为19.8,是20年7月14号发布的最新版本,在此版本上没有搜寻到类似的已知bug
4.2 建议应用规范绑定变量的长度
4.3 可以通过以下两个方法临时解决这个问题
建议在进行详尽测试后在生产上实施。
方法一
可以设置event 10503临时解决绑定变量长度不一致导致的游标不能共享问题,这个event的作用是绑定变量长度小于设置的值就视为长度一致:
alter system set events '10503 trace name context forever, level 4000';
或者:
alter system set event = '10503 trace name context forever, level 4000' scope=spfile;
这个设置不需要重启数据库。
方法二
可以设置_cursor_obsolete_threshold为100,这个参数的作用是当SQL版本超过这个参数设定后,直接舍弃这个游标,重新解析:
alter system set "_cursor_obsolete_threshold"=100 scope=spfile;
这个设置需要重启数据库。

本文作者:白炀斌(上海新炬中北团队)
本文来源:“IT那活儿”公众号





