问题描述
一个朋友的exadata环境遇到这个问题ora-00600错误,该错误其实很常见,比较有意思的事情是在后面。我们首先来看看alert log信息,如下:
Sun Sep 28 10:52:13 2014 Errors in file /u01/app/oracle/diag/rdbms/dbm/dbm1/trace/dbm1_ora_3924.trc (incident=2295022): ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/dbm/dbm1/incident/incdir_2295022/dbm1_ora_3924_i2295022.trc Sun Sep 28 10:52:30 2014 Dumping diagnostic data in directory=[cdmp_20140928105230], requested by (instance=1, osid=3924), summary=[incident=2295022].
专家解答
如果没有遇到过这个错误的朋友,可能第一时间不知道是什么意思。标准的解释是这样的:The ora-600 [kdsgrp1] error is thrown when a fetch operation fails to find the expected row. The error is hit in memory and so may be a memory only error or an error that results from corruption on disk.
而产生这个错误的原因可能有如下几种:
Lost writes Parallel DML issues Index corruption Data block corruption Consistent read [CR] issues Buffer cache corruption
我遇到的情况,通常来讲是跟Index有关系,我们先来分析下对应的trace文件。
*** ACTION NAME:() 2014-09-28 10:52:13.809 Dump continued from file: /u01/app/oracle/diag/rdbms/dbm/dbm1/trace/dbm1_ora_3924.trc ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], [] ========= Dump for incident 2295022 (ORA 600 [kdsgrp1]) ======== *** 2014-09-28 10:52:13.810 dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0) ----- Current SQL Statement for this session (sql_id=53dnb22ubgync) ----- select ps_provision_id, bms_subscription_id, ps_operate_type, ps_priority, ps_cur_retry_times, to_char(ps_last_retry_time, 'YYYYMMDDHH24MISS'), ps_fail_reason, ps_extend_feature, ps_detail_status, bms_subscription_no, ps_max_retry_times, ps_node_id, ps_provision_status, bms_svc_id, bms_product_id, bms_customer_id, bms_user_name, bms_user_password from ps_provision_9 where ps_priority = :v1 and bms_subscription_begin_time <= to_date(:v2, 'YYYYMMDDHH24MISS') and (ps_provision_status = 0 or ps_provision_status = 1 or ps_provision_status = 4 or ps_provision_status = 5) and bms_svc_id in (:s0, :s1, :s2, :s3) order by ps_provision_id
只是知道执行这个SQL的时候触发了这个错误,从trace文件中我们直接搜索Plan Table可以定位到SQL的执行计划:
============ Plan Table ============ -----------------------------------------------------+-----------------------------------+ | Id | Operation | Name | Rows | Bytes | Cost | Time | -----------------------------------------------------+-----------------------------------+ | 0 | SELECT STATEMENT | | | | 11 | | | 1 | TABLE ACCESS BY INDEX ROWID | PS_PROVISION_9| 2 | 1388 | 11 | 00:00:01 | | 2 | INDEX FULL SCAN | I402000_9_1_PK| 8 | | 2 | 00:00:01 | -----------------------------------------------------+-----------------------------------+ Content of other_xml column =========================== db_version : 11.2.0.3 parse_schema : OBS61 plan_hash : 3017981752 plan_hash_2 : 2407233167 Outline Data: /*+ BEGIN_OUTLINE_DATA IGNORE_OPTIM_EMBEDDED_HINTS OPTIMIZER_FEATURES_ENABLE('11.2.0.3') DB_VERSION('11.2.0.3') ALL_ROWS OUTLINE_LEAF(@"SEL$1") INDEX(@"SEL$1" "PS_PROVISION_9"@"SEL$1" ("PS_PROVISION_9"."PS_PROVISION_ID")) END_OUTLINE_DATA */
我们可以看到,该SQL运行时确实走Index,也是为什么会有这个问题了,如果要解决很简单,一般来讲rebuild Index即可。
下面我们分析下为什么会报这个错误? 需要分析另外一个trace,内容如下:
*** ACTION NAME:() 2014-09-28 10:50:26.270 * kdsgrp1-1: ************************************************* row 0x04412f00.2 continuation at 0x04412f00.2 file# 17 block# 77568 slot 2 not found KDSTABN_GET: 0 ..... ntab: 1 curSlot: 2 ..... nrows: 3 kdsgrp - dump CR block dba=0x04412f00 Block header dump: 0x04412f00 Object id on Block? Y seg/obj: 0x4637 csc: 0x19.d6b0b312 itc: 2 flg: E typ: 1 - DATA brn: 1 bdba: 0x4c10b38 ver: 0x01 opc: 0 inc: 0 exflg: 0 Itl Xid Uba Flag Lck Scn/Fsc 0x01 0x002d.005.0039d1ac 0x0005743d.eb62.21 --U- 1 fsc 0x02d1.d6b0b962 0x02 0x0012.003.00496a50 0x0005672e.80e6.34 C--- 0 scn 0x0019.d6b0a9aa bdba: 0x04412f00 data_block_dump,data header at 0xcf3ee064 =============== tsiz: 0x1f98 hsiz: 0x18 pbl: 0xcf3ee064 76543210 flag=-------- ntab=1 nrow=3 frre=0 fsbo=0x18 fseo=0x31a avsp=0x19d8 tosp=0x1cab 0xe:pti[0] nrow=3 offs=0 0x12:pri[0] sfll=-1 0x14:pri[1] offs=0x5ed 0x16:pri[2] offs=0x31a block_row_dump: tab 0, row 1, @0x5ed tl: 725 fb: --H-FL-- lb: 0x0 cc: 25 col 0: [ 6] c5 09 4e 16 24 36 col 1: [ 7] 78 72 05 04 01 01 01 col 2: [ 6] c5 05 01 3a 47 0c col 3: [ 5] c4 09 02 01 03 col 4: [ 2] c1 17 col 5: [11] 31 35 37 33 34 35 36 30 39 39 30 col 6: [16] 38 64 32 61 37 34 34 37 62 34 36 37 66 35 61 65 col 7: [ 6] c5 04 13 1c 59 1e col 8: [ 2] c1 03 col 9: [ 7] 78 72 09 1c 0b 32 33 col 10: [ 1] 80 col 11: [ 7] 78 72 09 1c 0b 32 33 col 12: [592] 7b 7b 7b 61 69 2d 46 6f 72 65 69 67 6e 2d 52 6f 61 6d 20 7b 30 7d 7d 20 7b 。。。。。。 34 37 62 34 36 37 66 35 61 65 7d 7d 7d 20 7b 7d 7d col 13: *NULL* col 14: [ 1] 80 col 15: [ 1] 80 col 16: [ 1] 80 col 17: [ 6] c5 09 4e 16 24 36 col 18: [ 1] 80 col 19: [ 2] c1 04 col 20: [ 3] c2 05 34 col 21: [ 1] 80 col 22: [ 1] 80 col 23: [ 7] 78 72 09 1c 0b 32 33 col 24: [ 3] c2 05 34 tab 0, row 2, @0x31a tl: 2 fb: --HDFL-- lb: 0x1 end_of_block_dump
从上面的一句 file# 17 block# 77568 slot 2 not found,我们就比较清楚了。访问这个block的时候出现问题了。这里的slot其实是指的这个block的row 2. 从这里来看,我们发现这个block的第row=2 这行数据已经被delete了。
从这个block的itl来看,第1个ITL对应的事务已经commit了状态为U,但是还有个row的lock byte还没有被修改过来。根据UBA的地址,我们搜索trace,还可以看到这个block的redo相关信息:
这里根据前面的block中的XID:0x002d.005.0039d1ac 进行搜索:
REDO RECORD - Thread:2 RBA: 0x00217c.00373c1a.0054 LEN: 0x04b4 VLD: 0x01 SCN: 0x0019.d6b0b312 SUBSCN: 1 09/28/2014 10:49:53 (LWN RBA: 0x00217c.00373c18.0010 LEN: 0013 NST: 0001 SCN: 0x0019.d6b0b307) CHANGE #1 TYP:0 CLS:105 AFN:4 DBA:0x000001c0 OBJ:4294967295 SCN:0x0019.d6b0b2e6 SEQ:1 OP:5.2 ENC:0 RBL:0 ktudh redo: slt: 0x0005 sqn: 0x0039d1ac flg: 0x0012 siz: 960 fbi: 0 uba: 0x0005743d.eb62.21 pxid: 0x0000.000.00000000 CHANGE #2 TYP:0 CLS:106 AFN:4 DBA:0x0005743d OBJ:4294967295 SCN:0x0019.d6b0b2d3 SEQ:3 OP:5.1 ENC:0 RBL:0 ktudb redo: siz: 960 spc: 4782 flg: 0x0012 seq: 0xeb62 rec: 0x21 xid: 0x002d.005.0039d1ac ktubl redo: slt: 5 rci: 0 opc: 11.1 [objn: 17975 objd: 17975 tsn: 10] Undo type: Regular undo Begin trans Last buffer split: No Temp Object: No Tablespace Undo: No 0x00000000 prev ctl uba: 0x0005743d.eb62.11 prev ctl max cmt scn: 0x0019.d6b07e7c prev tx cmt scn: 0x0019.d6b07e84 txn start scn: 0xffff.ffffffff logon user: 34 prev brb: 357433 prev bcl: 0 BuExt idx: 0 flg2: 0 KDO undo record: KTB Redo op: 0x04 ver: 0x01 compat bit: 4 (post-11) padding: 1 op: L itl: xid: 0x000b.01a.005a3434 uba: 0x00055a71.22ce.39 flg: C--- lkc: 0 scn: 0x0019.d6b09f3b KDO Op code: IRP row dependencies Disabled xtype: XA flags: 0x00000000 bdba: 0x04412f00 hdba: 0x02c4f9d2 itli: 1 ispac: 0 maxfr: 4858 tabn: 0 slot: 2(0x2) size/delt: 723 fb: --H-FL-- lb: 0x0 cc: 25 null: 01234567890123456789012345678901234567890123456789012345678901234567890123456789 -------------N----------- col 0: [ 6] c5 09 4e 16 24 5e col 1: [ 7] 78 72 03 08 01 01 01 col 2: [ 6] c5 04 53 5d 25 4a col 3: [ 5] c4 09 02 01 03 col 4: [ 2] c1 17 col 5: [11] 31 35 31 37 32 31 36 36 36 38 37 col 6: [16] 37 31 36 63 31 33 37 30 30 35 61 61 65 34 38 64 col 7: [ 6] c5 04 0d 31 05 0a col 8: [ 2] c1 03 col 9: [ 7] 78 72 09 1c 0b 32 35 col 10: [ 1] 80 col 11: [ 7] 78 72 09 1c 0b 32 35 col 12: [592] 7b 7b 7b 61 69 2d 44 79 6e 61 6d 69 63 2d 50 61 73 73 77 6f 72 64 20 7b 34 。。。。。。 37 30 30 35 61 61 65 34 38 64 7d 7d 7d 20 7b 7d 7d col 13: *NULL* col 14: [ 1] 80 col 15: [ 1] 80 col 16: [ 1] 80 col 17: [ 6] c5 09 4e 16 24 5e col 18: [ 1] 80 col 19: [ 2] c1 04 col 20: [ 2] c1 1c col 21: [ 1] 80 col 22: [ 1] 80 col 23: [ 7] 78 72 09 1c 0b 32 35 col 24: [ 2] c1 1c CHANGE #3 TYP:0 CLS:1 AFN:17 DBA:0x04412f00 OBJ:17975 SCN:0x0019.d6b0a9aa SEQ:1 OP:11.3 ENC:0 RBL:0 KTB Redo op: 0x11 ver: 0x01 compat bit: 4 (post-11) padding: 1 op: F xid: 0x002d.005.0039d1ac uba: 0x0005743d.eb62.21 Block cleanout record, scn: 0x0019.d6b0b312 ver: 0x01 opt: 0x02, entries follow... itli: 2 flg: 2 scn: 0x0019.d6b0a9aa KDO Op code: DRP row dependencies Disabled xtype: XA flags: 0x00000000 bdba: 0x04412f00 hdba: 0x02c4f9d2 itli: 1 ispac: 0 maxfr: 4858 tabn: 0 slot: 2(0x2)
这里的change# 2的opcode是5.1,是针对undo block的操作,换句话讲这里insert的记录就是上面的data block被delete数据的前镜像。接着下面的change# 的opcode是11.3 ,表示drop row,本质上就是指的前面的数据块的delete操作。
比较怪异的事情是,根据XID: 0x002d.005.0039d1ac 搜索,在数据块产生的redo下面搜索不到对应的index block的操作。
既然是这样,那我们来看下,删除的那条记录在Index block中的是否还存在。
既然是跟Index有关系,那么来看看Index Block的信息,由于Index column是第一列,因此我们直接看index block中是否还存在这行记录即可:c5 09 4e 16 24 5e ,搜索到如下的信息:
buffer tsn: 12 rdba: 0x110025f4 (68/9716) scn: 0x0019.d6b1d8e8 seq: 0x00 flg: 0x00 tail: 0xd8e80600 frmt: 0x02 chkval: 0x0000 type: 0x06=trans data Hex dump of block: st=0, typ_found=1 Dump of memory from 0x000000061AA80000 to 0x000000061AA82000 61AA80000 0000A206 110025F4 D6B1D8E8 00000019 [.....%..........] 。。。。。。。 61AA81FF0 30313533 30363430 02C102FF D8E80600 [35100460........] Block header dump: 0x110025f4 Object id on Block? Y seg/obj: 0x55d7 csc: 0x19.d6b0f5df itc: 169 flg: E typ: 2 - INDEX brn: 0 bdba: 0x110025f0 ver: 0x01 opc: 0 inc: 0 exflg: 0 Itl Xid Uba Flag Lck Scn/Fsc 0x01 0x0038.005.008efa30 0x0006608f.e460.01 CB-- 0 scn 0x0019.d6ababa7 0x02 0x0009.014.00b1250a 0x000650d7.8bfb.0a C--- 0 scn 0x0019.d6b0aa20 0x03 0x0045.014.0089d73e 0x000662f6.5fa7.24 C--- 0 scn 0x0019.d6b091e5 0x04 0x0051.012.002c919e 0x00066769.ea22.21 C--- 0 scn 0x0019.d6b0ac12 0x05 0x0045.008.0089d72d 0x000662f8.5fa7.02 C--- 0 scn 0x0019.d6b0aa87 0x06 0x002b.011.0085dd6c 0x000666f9.d64d.17 C--- 0 scn 0x0019.d6b0ac32 0x07 0x004d.009.003f1820 0x000665ed.156f.2e C--- 0 scn 0x0019.d6b0acc5 0x08 0x0002.00c.0089fd56 0x000676b0.e0f4.25 C--- 0 scn 0x0019.d6b0adf8 。。。。。。 0x9d 0x0018.019.00550ddb 0x0005711e.0a17.18 C-U- 0 scn 0x0019.d6b049e5 0x9e 0x0006.000.0091749a 0x000663b9.1b95.21 C--- 0 scn 0x0019.d6b0a513 0x9f 0x002c.00f.0083cb36 0x00056cb0.556b.14 C-U- 0 scn 0x0019.d6b070e7 0xa0 0x004f.009.004e686a 0x000673b5.38b3.2a C--- 0 scn 0x0019.d6b0ca89 0xa1 0x0001.01b.00f890b2 0x00067734.051a.1d C--- 0 scn 0x0019.d6b04e71 0xa2 0x0013.018.0035707a 0x000566c4.f620.38 C--- 0 scn 0x0019.d6b0cb64 0xa3 0x0029.021.017e347a 0x00065caf.0164.18 C--- 0 scn 0x0019.d6b0d07b 0xa4 0x0008.01c.008928ee 0x00066d9d.0eaa.04 C--- 0 scn 0x0019.d6b0cd03 0xa5 0x0003.007.00c466e2 0x000678b5.c63c.16 C--- 0 scn 0x0019.d6b0cfaf 0xa6 0x0053.009.0036e0a5 0x00067b94.e044.22 C--- 0 scn 0x0019.d6b0d7ea 0xa7 0x0036.00f.02427574 0x000664d2.fccb.3a C--- 0 scn 0x0019.d6b05538 0xa8 0x0035.015.020d5114 0x00066252.f868.35 C--- 0 scn 0x0019.d6b0d752 0xa9 0x0053.018.0036e0b3 0x00067b94.e044.2e C--- 0 scn 0x0019.d6b0d9c2 Leaf block dump =============== header address 26217025548=0x61aa8100c kdxcolev 0 KDXCOLEV Flags = - - - kdxcolok 0 kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y kdxconco 1 kdxcosdc 9 kdxconro 115 kdxcofbo 266=0x10a kdxcofeo 1311=0x51f kdxcoavs 2034 kdxlespl 0 kdxlende 0 kdxlenxt 0=0x0 kdxleprv 143951530=0x89486aa kdxledsz 6 kdxlebksz 4024 row#0[1311] flag: ------, lock: 0, len=14, data:(6): 03 c0 7c a7 00 00 col 0; len 5; (5): c5 09 4e 16 1d row#1[1325] flag: ------, lock: 0, len=15, data:(6): 0e 46 01 36 00 00 col 0; len 6; (6): c5 09 4e 16 1d 04 row#2[1340] flag: ------, lock: 0, len=15, data:(6): 03 c0 7c 95 00 01 col 0; len 6; (6): c5 09 4e 16 1d 1d row#3[1355] flag: ------, lock: 0, len=15, data:(6): 1b c5 0f 26 00 07 col 0; len 6; (6): c5 09 4e 16 1d 44 。。。。。。 row#95[1685] flag: ------, lock: 0, len=15, data:(6): 04 41 2f 00 00 02 col 0; len 6; (6): c5 09 4e 16 24 5e 。。。。。。。 row#113[3214] flag: ------, lock: 0, len=15, data:(6): 02 45 ea 82 00 02 col 0; len 6; (6): c5 09 4e 16 27 09 row#114[3184] flag: ------, lock: 0, len=15, data:(6): 02 86 72 f9 00 01 col 0; len 6; (6): c5 09 4e 16 27 17 ----- end of leaf block dump -----
我们可以看到该键值还存在该index block中,也就是说Index的信息还没有被更新。(有可能是buffer cache的问题)
看到上面的信息,我也被吓了一跳!我靠,第一次见一个index block中居然高达169个ITL。理论上来讲这不是不可能,因为毕竟是主键Index,键值又是number,所占据的空间少。 不过即使是这样,还是把我吓尿了!根据dba地址确认,确实是属于查询报错的table的主键index!
SYS@dbm> Select owner, segment_name, segment_type, partition_name,tablespace_name 2 From dba_extents 3 Where relative_fno = 68 4 And 9716 between block_id and (block_id+blocks-1); OWNER SEGMENT_NAME SEGMENT_TYPE PARTITION_NAME TABLESPACE_NAME ------------- ------------------ ------------- -------------- --------------- xxx61 I402000_9_1_PK INDEX xxx_INDEX2
我这里不得不怀疑是buffer cache corruption或者是lost writes了。这个问题不排除是bug导致。
总的来讲,这个问题处理是比较简单的。 另外,针对这个错误,我查询了一下,相关的bug也不少。
关于这个ora-00600 错误,大家可以参考如下2篇mos文章!
Detailed Steps for Troubleshooting ORA-00600 [kdsgrp1] (ID 1492150.1)