暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

Wow,Index block have 169 ITLS ?

原创 Roger 2014-09-28
541
一个朋友的exadata环境遇到这个问题ora-00600错误,该错误其实很常见,比较有意思的事情是在后面。我们首先来看看alert log信息,如下:
Sun Sep 28 10:52:13 2014
Errors in file /u01/app/oracle/diag/rdbms/dbm/dbm1/trace/dbm1_ora_3924.trc (incident=2295022):
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/dbm/dbm1/incident/incdir_2295022/dbm1_ora_3924_i2295022.trc
Sun Sep 28 10:52:30 2014
Dumping diagnostic data in directory=[cdmp_20140928105230], requested by (instance=1, osid=3924), summary=[incident=2295022].

如果没有遇到过这个错误的朋友,可能第一时间不知道是什么意思。标准的解释是这样的:The ora-600 [kdsgrp1] error is thrown when a fetch operation fails to find the expected row. The error is hit in memory and so may be a memory only error or an error that results from corruption on disk.
而产生这个错误的原因可能有如下几种:
Lost writes 
Parallel DML issues
Index corruption
Data block corruption
Consistent read [CR] issues
Buffer cache corruption

我遇到的情况,通常来讲是跟Index有关系,我们先来分析下对应的trace文件。
*** ACTION NAME:() 2014-09-28 10:52:13.809

Dump continued from file: /u01/app/oracle/diag/rdbms/dbm/dbm1/trace/dbm1_ora_3924.trc
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []

========= Dump for incident 2295022 (ORA 600 [kdsgrp1]) ========

*** 2014-09-28 10:52:13.810
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=53dnb22ubgync) -----
select ps_provision_id,
bms_subscription_id,
ps_operate_type,
ps_priority,
ps_cur_retry_times,
to_char(ps_last_retry_time, 'YYYYMMDDHH24MISS'),
ps_fail_reason,
ps_extend_feature,
ps_detail_status,
bms_subscription_no,
ps_max_retry_times,
ps_node_id,
ps_provision_status,
bms_svc_id,
bms_product_id,
bms_customer_id,
bms_user_name,
bms_user_password
from ps_provision_9
where ps_priority = :v1
and bms_subscription_begin_time <= to_date(:v2, 'YYYYMMDDHH24MISS')
and (ps_provision_status = 0 or ps_provision_status = 1 or
ps_provision_status = 4 or ps_provision_status = 5)
and bms_svc_id in (:s0, :s1, :s2, :s3)
order by ps_provision_id

只是知道执行这个SQL的时候触发了这个错误,从trace文件中我们直接搜索Plan Table可以定位到SQL的执行计划:
============
Plan Table
============
-----------------------------------------------------+-----------------------------------+
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------+-----------------------------------+
| 0 | SELECT STATEMENT | | | | 11 | |
| 1 | TABLE ACCESS BY INDEX ROWID | PS_PROVISION_9| 2 | 1388 | 11 | 00:00:01 |
| 2 | INDEX FULL SCAN | I402000_9_1_PK| 8 | | 2 | 00:00:01 |
-----------------------------------------------------+-----------------------------------+

Content of other_xml column
===========================
db_version : 11.2.0.3
parse_schema : OBS61
plan_hash : 3017981752
plan_hash_2 : 2407233167
Outline Data:
/*+
BEGIN_OUTLINE_DATA
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('11.2.0.3')
DB_VERSION('11.2.0.3')
ALL_ROWS
OUTLINE_LEAF(@"SEL$1")
INDEX(@"SEL$1" "PS_PROVISION_9"@"SEL$1" ("PS_PROVISION_9"."PS_PROVISION_ID"))
END_OUTLINE_DATA
*/

我们可以看到,该SQL运行时确实走Index,也是为什么会有这个问题了,如果要解决很简单,一般来讲rebuild Index即可。
下面我们分析下为什么会报这个错误? 需要分析另外一个trace,内容如下:
*** ACTION NAME:() 2014-09-28 10:50:26.270

* kdsgrp1-1: *************************************************
row 0x04412f00.2 continuation at
0x04412f00.2 file# 17 block# 77568 slot 2 not found
KDSTABN_GET: 0 ..... ntab: 1
curSlot: 2 ..... nrows: 3
kdsgrp - dump CR block dba=0x04412f00
Block header dump: 0x04412f00
Object id on Block? Y
seg/obj: 0x4637 csc: 0x19.d6b0b312 itc: 2 flg: E typ: 1 - DATA
brn: 1 bdba: 0x4c10b38 ver: 0x01 opc: 0
inc: 0 exflg: 0

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x002d.005.0039d1ac 0x0005743d.eb62.21 --U- 1 fsc 0x02d1.d6b0b962
0x02 0x0012.003.00496a50 0x0005672e.80e6.34 C--- 0 scn 0x0019.d6b0a9aa
bdba: 0x04412f00
data_block_dump,data header at 0xcf3ee064
===============
tsiz: 0x1f98
hsiz: 0x18
pbl: 0xcf3ee064
76543210
flag=--------
ntab=1
nrow=3
frre=0
fsbo=0x18
fseo=0x31a
avsp=0x19d8
tosp=0x1cab
0xe:pti[0] nrow=3 offs=0
0x12:pri[0] sfll=-1
0x14:pri[1] offs=0x5ed
0x16:pri[2] offs=0x31a
block_row_dump:
tab 0, row 1, @0x5ed
tl: 725 fb: --H-FL-- lb: 0x0 cc: 25
col 0: [ 6] c5 09 4e 16 24 36
col 1: [ 7] 78 72 05 04 01 01 01
col 2: [ 6] c5 05 01 3a 47 0c
col 3: [ 5] c4 09 02 01 03
col 4: [ 2] c1 17
col 5: [11] 31 35 37 33 34 35 36 30 39 39 30
col 6: [16] 38 64 32 61 37 34 34 37 62 34 36 37 66 35 61 65
col 7: [ 6] c5 04 13 1c 59 1e
col 8: [ 2] c1 03
col 9: [ 7] 78 72 09 1c 0b 32 33
col 10: [ 1] 80
col 11: [ 7] 78 72 09 1c 0b 32 33
col 12: [592]
7b 7b 7b 61 69 2d 46 6f 72 65 69 67 6e 2d 52 6f 61 6d 20 7b 30 7d 7d 20 7b
。。。。。。
34 37 62 34 36 37 66 35 61 65 7d 7d 7d 20 7b 7d 7d
col 13: *NULL*
col 14: [ 1] 80
col 15: [ 1] 80
col 16: [ 1] 80
col 17: [ 6] c5 09 4e 16 24 36
col 18: [ 1] 80
col 19: [ 2] c1 04
col 20: [ 3] c2 05 34
col 21: [ 1] 80
col 22: [ 1] 80
col 23: [ 7] 78 72 09 1c 0b 32 33
col 24: [ 3] c2 05 34
tab 0, row 2, @0x31a
tl: 2 fb: --HDFL-- lb: 0x1
end_of_block_dump

从上面的一句 file# 17 block# 77568 slot 2 not found,我们就比较清楚了。访问这个block的时候出现问题了。这里的slot其实是指的这个block的row 2. 从这里来看,我们发现这个block的第row=2 这行数据已经被delete了。
从这个block的itl来看,第1个ITL对应的事务已经commit了状态为U,但是还有个row的lock byte还没有被修改过来。根据UBA的地址,我们搜索trace,还可以看到这个block的redo相关信息:
这里根据前面的block中的XID:0x002d.005.0039d1ac 进行搜索:
REDO RECORD - Thread:2 RBA: 0x00217c.00373c1a.0054 LEN: 0x04b4 VLD: 0x01
SCN: 0x0019.d6b0b312 SUBSCN: 1 09/28/2014 10:49:53
(LWN RBA: 0x00217c.00373c18.0010 LEN: 0013 NST: 0001 SCN: 0x0019.d6b0b307)
CHANGE #1 TYP:0 CLS:105 AFN:4 DBA:0x000001c0 OBJ:4294967295 SCN:0x0019.d6b0b2e6 SEQ:1 OP:5.2 ENC:0 RBL:0
ktudh redo: slt: 0x0005 sqn: 0x0039d1ac flg: 0x0012 siz: 960 fbi: 0
uba: 0x0005743d.eb62.21 pxid: 0x0000.000.00000000
CHANGE #2 TYP:0 CLS:106 AFN:4 DBA:0x0005743d OBJ:4294967295 SCN:0x0019.d6b0b2d3 SEQ:3 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 960 spc: 4782 flg: 0x0012 seq: 0xeb62 rec: 0x21
xid: 0x002d.005.0039d1ac
ktubl redo: slt: 5 rci: 0 opc: 11.1 [objn: 17975 objd: 17975 tsn: 10]
Undo type: Regular undo Begin trans Last buffer split: No
Temp Object: No
Tablespace Undo: No
0x00000000 prev ctl uba: 0x0005743d.eb62.11
prev ctl max cmt scn: 0x0019.d6b07e7c prev tx cmt scn: 0x0019.d6b07e84
txn start scn: 0xffff.ffffffff logon user: 34 prev brb: 357433 prev bcl: 0 BuExt idx: 0 flg2: 0
KDO undo record:
KTB Redo
op: 0x04 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: L itl: xid: 0x000b.01a.005a3434 uba: 0x00055a71.22ce.39
flg: C--- lkc: 0 scn: 0x0019.d6b09f3b
KDO Op code: IRP row dependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x04412f00 hdba: 0x02c4f9d2
itli: 1 ispac: 0 maxfr: 4858
tabn: 0 slot: 2(0x2) size/delt: 723
fb: --H-FL-- lb: 0x0 cc: 25
null:
01234567890123456789012345678901234567890123456789012345678901234567890123456789
-------------N-----------
col 0: [ 6] c5 09 4e 16 24 5e
col 1: [ 7] 78 72 03 08 01 01 01
col 2: [ 6] c5 04 53 5d 25 4a
col 3: [ 5] c4 09 02 01 03
col 4: [ 2] c1 17
col 5: [11] 31 35 31 37 32 31 36 36 36 38 37
col 6: [16] 37 31 36 63 31 33 37 30 30 35 61 61 65 34 38 64
col 7: [ 6] c5 04 0d 31 05 0a
col 8: [ 2] c1 03
col 9: [ 7] 78 72 09 1c 0b 32 35
col 10: [ 1] 80
col 11: [ 7] 78 72 09 1c 0b 32 35
col 12: [592]
7b 7b 7b 61 69 2d 44 79 6e 61 6d 69 63 2d 50 61 73 73 77 6f 72 64 20 7b 34
。。。。。。
37 30 30 35 61 61 65 34 38 64 7d 7d 7d 20 7b 7d 7d
col 13: *NULL*
col 14: [ 1] 80
col 15: [ 1] 80
col 16: [ 1] 80
col 17: [ 6] c5 09 4e 16 24 5e
col 18: [ 1] 80
col 19: [ 2] c1 04
col 20: [ 2] c1 1c
col 21: [ 1] 80
col 22: [ 1] 80
col 23: [ 7] 78 72 09 1c 0b 32 35
col 24: [ 2] c1 1c
CHANGE #3 TYP:0 CLS:1 AFN:17 DBA:0x04412f00 OBJ:17975 SCN:0x0019.d6b0a9aa SEQ:1 OP:11.3 ENC:0 RBL:0
KTB Redo
op: 0x11 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: F xid: 0x002d.005.0039d1ac uba: 0x0005743d.eb62.21
Block cleanout record, scn: 0x0019.d6b0b312 ver: 0x01 opt: 0x02, entries follow...
itli: 2 flg: 2 scn: 0x0019.d6b0a9aa
KDO Op code: DRP row dependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x04412f00 hdba: 0x02c4f9d2
itli: 1 ispac: 0 maxfr: 4858
tabn: 0 slot: 2(0x2)

这里的change# 2的opcode是5.1,是针对undo block的操作,换句话讲这里insert的记录就是上面的data block被delete数据的前镜像。接着下面的change# 的opcode是11.3 ,表示drop row,本质上就是指的前面的数据块的delete操作。
比较怪异的事情是,根据XID:  0x002d.005.0039d1ac 搜索,在数据块产生的redo下面搜索不到对应的index block的操作。
既然是这样,那我们来看下,删除的那条记录在Index block中的是否还存在。
既然是跟Index有关系,那么来看看Index Block的信息,由于Index column是第一列,因此我们直接看index block中是否还存在这行记录即可:c5 09 4e 16 24 5e ,搜索到如下的信息:
        buffer tsn: 12 rdba: 0x110025f4 (68/9716)
scn: 0x0019.d6b1d8e8 seq: 0x00 flg: 0x00 tail: 0xd8e80600
frmt: 0x02 chkval: 0x0000 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
Dump of memory from 0x000000061AA80000 to 0x000000061AA82000
61AA80000 0000A206 110025F4 D6B1D8E8 00000019 [.....{39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}..........]
。。。。。。。
61AA81FF0 30313533 30363430 02C102FF D8E80600 [35100460........]
Block header dump: 0x110025f4
Object id on Block? Y
seg/obj: 0x55d7 csc: 0x19.d6b0f5df itc: 169 flg: E typ: 2 - INDEX
brn: 0 bdba: 0x110025f0 ver: 0x01 opc: 0
inc: 0 exflg: 0

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0038.005.008efa30 0x0006608f.e460.01 CB-- 0 scn 0x0019.d6ababa7
0x02 0x0009.014.00b1250a 0x000650d7.8bfb.0a C--- 0 scn 0x0019.d6b0aa20
0x03 0x0045.014.0089d73e 0x000662f6.5fa7.24 C--- 0 scn 0x0019.d6b091e5
0x04 0x0051.012.002c919e 0x00066769.ea22.21 C--- 0 scn 0x0019.d6b0ac12
0x05 0x0045.008.0089d72d 0x000662f8.5fa7.02 C--- 0 scn 0x0019.d6b0aa87
0x06 0x002b.011.0085dd6c 0x000666f9.d64d.17 C--- 0 scn 0x0019.d6b0ac32
0x07 0x004d.009.003f1820 0x000665ed.156f.2e C--- 0 scn 0x0019.d6b0acc5
0x08 0x0002.00c.0089fd56 0x000676b0.e0f4.25 C--- 0 scn 0x0019.d6b0adf8
。。。。。。
0x9d 0x0018.019.00550ddb 0x0005711e.0a17.18 C-U- 0 scn 0x0019.d6b049e5
0x9e 0x0006.000.0091749a 0x000663b9.1b95.21 C--- 0 scn 0x0019.d6b0a513
0x9f 0x002c.00f.0083cb36 0x00056cb0.556b.14 C-U- 0 scn 0x0019.d6b070e7
0xa0 0x004f.009.004e686a 0x000673b5.38b3.2a C--- 0 scn 0x0019.d6b0ca89
0xa1 0x0001.01b.00f890b2 0x00067734.051a.1d C--- 0 scn 0x0019.d6b04e71
0xa2 0x0013.018.0035707a 0x000566c4.f620.38 C--- 0 scn 0x0019.d6b0cb64
0xa3 0x0029.021.017e347a 0x00065caf.0164.18 C--- 0 scn 0x0019.d6b0d07b
0xa4 0x0008.01c.008928ee 0x00066d9d.0eaa.04 C--- 0 scn 0x0019.d6b0cd03
0xa5 0x0003.007.00c466e2 0x000678b5.c63c.16 C--- 0 scn 0x0019.d6b0cfaf
0xa6 0x0053.009.0036e0a5 0x00067b94.e044.22 C--- 0 scn 0x0019.d6b0d7ea
0xa7 0x0036.00f.02427574 0x000664d2.fccb.3a C--- 0 scn 0x0019.d6b05538
0xa8 0x0035.015.020d5114 0x00066252.f868.35 C--- 0 scn 0x0019.d6b0d752
0xa9 0x0053.018.0036e0b3 0x00067b94.e044.2e C--- 0 scn 0x0019.d6b0d9c2
Leaf block dump
===============
header address 26217025548=0x61aa8100c
kdxcolev 0
KDXCOLEV Flags = - - -
kdxcolok 0
kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y
kdxconco 1
kdxcosdc 9
kdxconro 115
kdxcofbo 266=0x10a
kdxcofeo 1311=0x51f
kdxcoavs 2034
kdxlespl 0
kdxlende 0
kdxlenxt 0=0x0
kdxleprv 143951530=0x89486aa
kdxledsz 6
kdxlebksz 4024
row#0[1311] flag: ------, lock: 0, len=14, data:(6): 03 c0 7c a7 00 00
col 0; len 5; (5): c5 09 4e 16 1d
row#1[1325] flag: ------, lock: 0, len=15, data:(6): 0e 46 01 36 00 00
col 0; len 6; (6): c5 09 4e 16 1d 04
row#2[1340] flag: ------, lock: 0, len=15, data:(6): 03 c0 7c 95 00 01
col 0; len 6; (6): c5 09 4e 16 1d 1d
row#3[1355] flag: ------, lock: 0, len=15, data:(6): 1b c5 0f 26 00 07
col 0; len 6; (6): c5 09 4e 16 1d 44
。。。。。。
row#95[1685] flag: ------, lock: 0, len=15, data:(6): 04 41 2f 00 00 02
col 0; len 6; (6): c5 09 4e 16 24 5e
。。。。。。。
row#113[3214] flag: ------, lock: 0, len=15, data:(6): 02 45 ea 82 00 02
col 0; len 6; (6): c5 09 4e 16 27 09
row#114[3184] flag: ------, lock: 0, len=15, data:(6): 02 86 72 f9 00 01
col 0; len 6; (6): c5 09 4e 16 27 17
----- end of leaf block dump -----

我们可以看到该键值还存在该index block中,也就是说Index的信息还没有被更新。(有可能是buffer cache的问题)

看到上面的信息,我也被吓了一跳!我靠,第一次见一个index block中居然高达169个ITL。理论上来讲这不是不可能,因为毕竟是主键Index,键值又是number,所占据的空间少。 不过即使是这样,还是把我吓尿了!根据dba地址确认,确实是属于查询报错的table的主键index!
SYS@dbm> Select owner, segment_name, segment_type, partition_name,tablespace_name
2 From dba_extents
3 Where relative_fno = 68
4 And 9716 between block_id and (block_id+blocks-1);

OWNER SEGMENT_NAME SEGMENT_TYPE PARTITION_NAME TABLESPACE_NAME
------------- ------------------ ------------- -------------- ---------------
xxx61 I402000_9_1_PK INDEX xxx_INDEX2

我这里不得不怀疑是buffer cache corruption或者是lost writes了。这个问题不排除是bug导致。

总的来讲,这个问题处理是比较简单的。 另外,针对这个错误,我查询了一下,相关的bug也不少。
关于这个ora-00600 错误,大家可以参考如下2篇mos文章!
Detailed Steps for Troubleshooting ORA-00600 [kdsgrp1] (ID 1492150.1)

Causes and Solutions for ora-600 [kdsgrp1] (ID 1332252.1)
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论