问题描述
一个朋友公司的数据库出现异常,没有备份。 数据库open之后很快就crash掉,如下的alert log的信息:
Tue Jul 08 22:53:03 2014 SMON: enabling cache recovery [13803] Successfully onlined Undo Tablespace 2. Undo initialization finished serial:0 start:7751194 end:7751284 diff:90 (0 seconds) Verifying file header compatibility for 11g tablespace encryption.. Verifying 11g file header compatibility for tablespace encryption completed SMON: enabling tx recovery Database Characterset is ZHS16GBK Errors in file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_smon_13788.trc (incident=56461): ORA-00600: internal error code, arguments: [17182], [0x2B4E2EB8D7A0], [], [], [], [], [], [], [], [], [], [] Incident details in: /home/oracle/diag/rdbms/xxxx/xxxx/incident/incdir_56461/xxxx_smon_13788_i56461.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. No Resource Manager plan active replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC Tue Jul 08 22:53:04 2014 QMNC started with pid=70, OS id=13956 Completed: alter database open Tue Jul 08 22:53:05 2014 db_recovery_file_dest_size of 4977 MB is 0.00% used. This is a user-specified limit on the amount of space that will be used by this database for recovery-related files, and does not reflect the amount of space available in the underlying filesystem or ASM diskgroup. Tue Jul 08 22:53:05 2014 Dumping diagnostic data in directory=[cdmp_20140708225305], requested by (instance=1, osid=13788 (SMON)), summary=[incident=56461]. Block recovery from logseq 2855, block 100 to scn 104606896 Recovery of Online Redo Log: Thread 1 Group 5 Seq 2855 Reading mem 0 Mem# 0: /backup/oradata/xxxx/redo05.log Block recovery completed at rba 2855.131.16, scn 0.104606897 ORACLE Instance xxxx (pid = 32) - Error 600 encountered while recovering transaction (14, 2) on object 15113. Errors in file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_smon_13788.trc: ORA-00600: internal error code, arguments: [17182], [0x2B4E2EB8D7A0], [], [], [], [], [], [], [], [], [], [] Tue Jul 08 22:53:05 2014 Starting background process CJQ0 Tue Jul 08 22:53:05 2014 CJQ0 started with pid=73, OS id=13984 Dumping diagnostic data in directory=[cdmp_20140708225306], requested by (instance=1, osid=13788 (SMON)), summary=[abnormal process termination]. Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97EE385, kghrst()+1835] [flags: 0x0, count: 1] Errors in file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_smon_13788.trc (incident=56462): ORA-07445: exception encountered: core dump [kghrst()+1835] [SIGSEGV] [ADDR:0x0] [PC:0x97EE385] [SI_KERNEL(general_protection)] [] ORA-00600: internal error code, arguments: [17182], [0x2B4E2EB8D7A0], [], [], [], [], [], [], [], [], [], [] Incident details in: /home/oracle/diag/rdbms/xxxx/xxxx/incident/incdir_56462/xxxx_smon_13788_i56462.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Errors in file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_smon_13788.trc (incident=56463): ORA-00600: internal error code, arguments: [kghfrempty:ds], [0x2B4E2EB8D790], [], [], [], [], [], [], [], [], [], [] ORA-07445: exception encountered: core dump [kghrst()+1835] [SIGSEGV] [ADDR:0x0] [PC:0x97EE385] [SI_KERNEL(general_protection)] [] ORA-00600: internal error code, arguments: [17182], [0x2B4E2EB8D7A0], [], [], [], [], [], [], [], [], [], [] Incident details in: /home/oracle/diag/rdbms/xxxx/xxxx/incident/incdir_56463/xxxx_smon_13788_i56463.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Errors in file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_smon_13788.trc (incident=56464): ORA-00600: internal error code, arguments: [kghfrempty:ds], [0x2B4E2EB8D790], [], [], [], [], [], [], [], [], [], [] ORA-07445: exception encountered: core dump [kghrst()+1835] [SIGSEGV] [ADDR:0x0] [PC:0x97EE385] [SI_KERNEL(general_protection)] [] ORA-00600: internal error code, arguments: [17182], [0x2B4E2EB8D7A0], [], [], [], [], [], [], [], [], [], [] Incident details in: /home/oracle/diag/rdbms/xxxx/xxxx/incident/incdir_56464/xxxx_smon_13788_i56464.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Setting Resource Manager plan SCHEDULER[0x32DA]:DEFAULT_MAINTENANCE_PLAN via scheduler window Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter Tue Jul 08 22:53:09 2014 Starting background process VKRM Tue Jul 08 22:53:09 2014 VKRM started with pid=69, OS id=13988 Dumping diagnostic data in directory=[cdmp_20140708225309], requested by (instance=1, osid=13788 (SMON)), summary=[incident=56463]. Errors in file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_smon_13788.trc (incident=56465): ORA-00600: internal error code, arguments: [kghfrempty:ds], [0x2B4E2EB8D790], [], [], [], [], [], [], [], [], [], [] ORA-07445: exception encountered: core dump [kghrst()+1835] [SIGSEGV] [ADDR:0x0] [PC:0x97EE385] [SI_KERNEL(general_protection)] [] ORA-00600: internal error code, arguments: [17182], [0x2B4E2EB8D7A0], [], [], [], [], [], [], [], [], [], [] Incident details in: /home/oracle/diag/rdbms/xxxx/xxxx/incident/incdir_56465/xxxx_smon_13788_i56465.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Dumping diagnostic data in directory=[cdmp_20140708225310], requested by (instance=1, osid=13788 (SMON)), summary=[incident=56465]. Errors in file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_smon_13788.trc (incident=56466): ORA-00600: internal error code, arguments: [kghfrempty:ds], [0x2B4E2EB8D790], [], [], [], [], [], [], [], [], [], [] ORA-07445: exception encountered: core dump [kghrst()+1835] [SIGSEGV] [ADDR:0x0] [PC:0x97EE385] [SI_KERNEL(general_protection)] [] ORA-00600: internal error code, arguments: [17182], [0x2B4E2EB8D7A0], [], [], [], [], [], [], [], [], [], [] Incident details in: /home/oracle/diag/rdbms/xxxx/xxxx/incident/incdir_56466/xxxx_smon_13788_i56466.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Dumping diagnostic data in directory=[cdmp_20140708225311], requested by (instance=1, osid=13788 (SMON)), summary=[incident=56466]. Errors in file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_smon_13788.trc (incident=56467): ORA-00600: internal error code, arguments: [kghfrempty:ds], [0x2B4E2EB8D790], [], [], [], [], [], [], [], [], [], [] ORA-07445: exception encountered: core dump [kghrst()+1835] [SIGSEGV] [ADDR:0x0] [PC:0x97EE385] [SI_KERNEL(general_protection)] [] ORA-00600: internal error code, arguments: [17182], [0x2B4E2EB8D7A0], [], [], [], [], [], [], [], [], [], [] Incident details in: /home/oracle/diag/rdbms/xxxx/xxxx/incident/incdir_56467/xxxx_smon_13788_i56467.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Tue Jul 08 22:53:12 2014 Begin automatic SQL Tuning Advisor run for special tuning task "SYS_AUTO_SQL_TUNING_TASK" Tue Jul 08 22:53:13 2014 Errors in file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_smon_13788.trc (incident=56468): ORA-00600: internal error code, arguments: [kghfrempty:ds], [0x2B4E2EB8D790], [], [], [], [], [], [], [], [], [], [] ORA-07445: exception encountered: core dump [kghrst()+1835] [SIGSEGV] [ADDR:0x0] [PC:0x97EE385] [SI_KERNEL(general_protection)] [] ORA-00600: internal error code, arguments: [17182], [0x2B4E2EB8D7A0], [], [], [], [], [], [], [], [], [], [] Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. PMON (ospid: 13666): terminating the instance due to error 474 System state dump requested by (instance=1, osid=13666 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /home/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_diag_13684_20140708225315.trc Tue Jul 08 22:53:16 2014 Dumping diagnostic data in directory=[cdmp_20140708225315], requested by (instance=1, osid=13666 (PMON)), summary=[abnormal instance termination]. Tue Jul 08 22:53:17 2014 Instance terminated by PMON, pid = 13666
专家解答
可以看出,是smon进程无法对某个事务进行恢复导致,最后Pmon进程强行终止了Instance。
具体来讲,是无法恢复这个事务:transaction (14, 2) on object 15113
从信息来看,是smon 回滚事务存在异常,那么我们来分析下smon 进程的trace文件,确认下为什么无法恢复(14,2)这个事务。
根据事务(14,2),直接搜索:000e.002 定位到如下信息:
<pre class="brush:plain"> ************ dump undo block ********** Dump of buffer cache at level 3 for tsn=2 rdba=12630748 BH (0x1d9ef41a48) file#: 3 rdba: 0x00c0badc (3/47836) class: 44 ba: 0x1d84096000 set: 322 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,0 dbwrid: 1 obj: -1 objn: 0 tsn: 2 afn: 3 hint: f hash: [0x2127e3c4e0,0x2127e3c4e0] lru: [0x1d9ef41c70,0x1d9ef41a00] ckptq: [NULL] fileq: [NULL] objq: [0x1d9ef43b10,0x2110560590] objaq: [0x1d9ef41f18,0x2110560580] st: XCURRENT md: NULL fpin: 'ktuwh23: ktubko' tch: 1 flags: LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [65535] buffer tsn: 2 rdba: 0x00c0badc (3/47836) scn: 0x0000.06382fde seq: 0x05 flg: 0x04 tail: 0x2fde0205 frmt: 0x02 chkval: 0x5fc6 type: 0x02=KTU UNDO BLOCK Hex dump of block: st=0, typ_found=1 Dump of memory from 0x0000001D84096000 to 0x0000001D84098000 ......省略部分内容 ******************************************************************************** UNDO BLK: xid: 0x000e.002.00034175 seq: 0x35dc cnt: 0x5 irb: 0x1 icl: 0x0 flg: 0x0000 Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset --------------------------------------------------------------------------- 0x01 0x1f28 0x02 0x1ec8 0x03 0x1e38 0x04 0x1db8 0x05 0x1d1c *----------------------------- * Rec #0x1 slt: 0x02 objn: 15113(0x00003b09) objd: 15555 tblspc: 56(0x00000038) * Layer: 11 (Row) opc: 1 rci 0x00 Undo type: Regular undo Last buffer split: No Temp Object: No Tablespace Undo: No rdba: 0x00c0badb *----------------------------- KDO undo record: KTB Redo op: 0x02 ver: 0x01 compat bit: 4 (post-11) padding: 1 op: C uba: 0x00c0badb.35dc.2f KDO Op code: IRP row dependencies Disabled xtype: XA flags: 0x00000000 bdba: 0x468069a2 hdba: 0x46800c91 itli: 2 ispac: 0 maxfr: 4858 tabn: 0 slot: 108(0x6c) size/delt: 64 fb: --H-FL-- lb: 0x0 cc: 16 null: 01234567890123456789012345678901234567890123456789012345678901234567890123456789 ------NNNNN--NN- col 0: [12] 33 35 38 30 34 34 36 33 32 36 30 36 col 1: [ 1] 20 col 2: [ 9] 64 65 6c 69 76 65 72 65 64 col 3: [ 5] 30 32 31 30 30 col 4: [ 4] c9 cf ba a3 col 5: [ 1] 30 col 6: *NULL* col 7: *NULL* col 8: *NULL* col 9: *NULL* col 10: *NULL* col 11: [ 7] 78 72 04 04 11 23 02 col 12: [ 7] 78 72 04 04 11 23 02 col 13: *NULL* col 14: *NULL* col 15: [ 0] *----------------------------- * Rec #0x2 slt: 0x02 objn: 15113(0x00003b09) objd: 15555 tblspc: 56(0x00000038) * Layer: 11 (Row) opc: 1 rci 0x01 Undo type: Regular undo User Undo Applied Last buffer split: No Temp Object: No Tablespace Undo: No rdba: 0x00000000 *----------------------------- KDO undo record: KTB Redo op: 0x02 ver: 0x01 compat bit: 4 (post-11) padding: 1 op: C uba: 0x00c0badc.35dc.01 KDO Op code: IRP row dependencies Disabled xtype: XA flags: 0x00000000 bdba: 0x468069a2 hdba: 0x46800c91 itli: 2 ispac: 0 maxfr: 4858 tabn: 0 slot: 107(0x6b) size/delt: 9 fb: --H----- lb: 0x0 cc: 0 nrid: 0x468069b9.7 null: *----------------------------- * Rec #0x3 slt: 0x02 objn: 15404(0x00003c2c) objd: 15556 tblspc: 4(0x00000004) * Layer: 10 (Index) opc: 22 rci 0x02 Undo type: Regular undo User Undo Applied Last buffer split: No Temp Object: No Tablespace Undo: No rdba: 0x00000000 *----------------------------- index undo for leaf key operations KTB Redo op: 0x04 ver: 0x01 compat bit: 4 (post-11) padding: 1 op: L itl: xid: 0x000e.00f.00034021 uba: 0x00c07ed9.35d5.18 flg: C--- lkc: 0 scn: 0x0000.0635da5a Dump kdilk : itl=2, kdxlkflg=0x25 sdc=2 indexid=0x100008a block=0x01002052 (kdxlre): restore leaf row (clear leaf delete flags) number of keys: 2 key sizes: 18 18 key :(36): 05 30 32 31 30 30 ff 0a 00 00 3c c3 46 80 69 a2 00 6c 05 30 32 31 30 30 ff 0a 00 00 3c c3 46 80 69 a2 00 6e selflock: (1): 00 bitmap: (1): fc *----------------------------- * Rec #0x4 slt: 0x02 objn: 15139(0x00003b23) objd: 15559 tblspc: 57(0x00000039) * Layer: 10 (Index) opc: 22 rci 0x03 Undo type: Regular undo User Undo Applied Last buffer split: No Temp Object: No Tablespace Undo: No rdba: 0x00000000 *----------------------------- index undo for leaf key operations KTB Redo op: 0x04 ver: 0x01 compat bit: 4 (post-11) padding: 1 op: L itl: xid: 0x000e.00e.0001da74 uba: 0x00c144ff.2156.16 flg: C--- lkc: 0 scn: 0x0000.04d6e213 Dump kdilk : itl=2, kdxlkflg=0x25 sdc=2 indexid=0x474000a2 block=0x47400ffe (kdxlre): restore leaf row (clear leaf delete flags) number of keys: 2 key sizes: 9 9 key :(18): 01 20 06 46 80 69 a2 00 6c 01 20 06 46 80 69 a2 00 6e selflock: (1): 00 bitmap: (1): fc *----------------------------- * Rec #0x5 slt: 0x02 objn: 15134(0x00003b1e) objd: 15560 tblspc: 57(0x00000039) * Layer: 10 (Index) opc: 22 rci 0x04 Undo type: Regular undo User Undo Applied Last buffer split: No Temp Object: No Tablespace Undo: No rdba: 0x00000000 *----------------------------- index undo for leaf key operations KTB Redo op: 0x04 ver: 0x01 compat bit: 4 (post-11) padding: 1 op: L itl: xid: 0x000d.002.000287af uba: 0x00c1b5ce.37e8.1f flg: C--- lkc: 0 scn: 0x0000.05e39111 Dump kdilk : itl=2, kdxlkflg=0x25 sdc=1 indexid=0x47400082 block=0x47800db3 (kdxlre): restore leaf row (clear leaf delete flags) number of keys: 2 key sizes: 13 13 key :(26): 0c 33 35 38 30 34 34 36 33 32 35 38 38 0c 33 35 38 30 34 34 36 33 32 36 30 36 keydata/bitmap: (20): 00 00 3c c3 46 80 69 a2 00 6e 00 00 3c c3 46 80 69 a2 00 6c selflock: (1): 00 bitmap: (1): fc kcra_dump_redo_internal: skipped for critical process ORACLE Instance xxxx (pid = 32) - Error 600 encountered while recovering transaction (14, 2) on object 15113.
可以看到,(14,2)这个事务包含了5个undo record,也就是需要进行5个操作,然后到最后record 1的时候无法进行了。
真的是这样吗?
我们首先来看下该undo block的dump信息:
xid: 0x000e.002.00034175 seq: 0x35dc cnt: 0x5 irb: 0x1
从这里我们可以看出,这个undo block包含了5个undo record,然而针对该事务而言,起点是irb:0x1,也就是第一个undo record。
这里明显是有问题的,我们可以看到如下的几个undo recor的rci值分别是:0x04 0x03 0x02 0x01 0x00
所以,从undo record的rci值来判断,这个事务应该是从undo record 5开始rollback到undo record 1结束。
但是undo block中提到的irb值居然是0x1 ? 这里irb:0x1 是第一个undo record,其rci值本身就为0了,怎么回滚?
因此,很明显这里是存在问题。
所以如果想要临时解决该数据库open之后不crash的问题,那么可以直接修改下这个undo block.修改下irb或rci的值。
备注:朋友这里的库,其他还有其他的文件存在坏块,这里暂不讨论。因为没有其他信息了。