客户反映说一个库前几天因为磁盘损坏,导致操作系统宕机。几经折腾终于把盘恢复了,却又发现数据库无法启动了。并且已经使用trace里面的backup controlfile重建了控制文件,但还是无法正常启动.
检查之后,mount实例,尝试recover database 并 open 数据库:
recover database until cancel using backup controlfile;因为controlfile是重建出来的,所以当前的controlfile并不知道哪个在线日志是current的,需要手动指定。
Thu Apr 16 13:01:14 2015ALTER DATABASE RECOVER database until cancel using backup controlfile Thu Apr 16 13:01:14 2015Media Recovery StartWARNING! Recovering data file 1 from a fuzzy file. If not the current fileit might be an online backup taken without entering the begin backup command.WARNING! Recovering data file 2 from a fuzzy file. If not the current fileit might be an online backup taken without entering the begin backup command....WARNING! Recovering data file 75 from a fuzzy file. If not the current fileit might be an online backup taken without entering the begin backup command.ORA-279 signalled during: ALTER DATABASE RECOVER database until cancel usin...Thu Apr 16 13:02:31 2015ALTER DATABASE RECOVER LOGFILE 'E:\datafile\redo01.log' Thu Apr 16 13:02:31 2015Media Recovery Log E:\datafile\redo01.logErrors with log E:\datafile\redo01.logORA-339 signalled during: ALTER DATABASE RECOVER LOGFILE 'E:\datafile\red...Thu Apr 16 13:02:31 2015ALTER DATABASE RECOVER CANCEL ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...Thu Apr 16 13:03:02 2015ALTER DATABASE RECOVER database until cancel using backup controlfile Thu Apr 16 13:03:02 2015Media Recovery StartWARNING! Recovering data file 1 from a fuzzy file. If not the current fileit might be an online backup taken without entering the begin backup command.WARNING! Recovering data file 2 from a fuzzy file. If not the current fileit might be an online backup taken without entering the begin backup command....WARNING! Recovering data file 75 from a fuzzy file. If not the current fileit might be an online backup taken without entering the begin backup command.ORA-279 signalled during: ALTER DATABASE RECOVER database until cancel usin...Thu Apr 16 13:03:16 2015ALTER DATABASE RECOVER LOGFILE 'E:\datafile\redo02.log' Thu Apr 16 13:03:16 2015Media Recovery Log E:\datafile\redo02.logErrors with log E:\datafile\redo02.logORA-339 signalled during: ALTER DATABASE RECOVER LOGFILE 'E:\datafile\red...Thu Apr 16 13:03:16 2015ALTER DATABASE RECOVER CANCEL ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...Thu Apr 16 13:03:34 2015ALTER DATABASE RECOVER database until cancel using backup controlfile Thu Apr 16 13:03:34 2015Media Recovery StartWARNING! Recovering data file 1 from a fuzzy file. If not the current fileit might be an online backup taken without entering the begin backup command.WARNING! Recovering data file 2 from a fuzzy file. If not the current fileit might be an online backup taken without entering the begin backup command....WARNING! Recovering data file 75 from a fuzzy file. If not the current fileit might be an online backup taken without entering the begin backup command.ORA-279 signalled during: ALTER DATABASE RECOVER database until cancel usin...Thu Apr 16 13:03:49 2015ALTER DATABASE RECOVER LOGFILE 'E:\datafile\redo03.log' Thu Apr 16 13:03:49 2015Media Recovery Log E:\datafile\redo03.logIncomplete recovery applied all redo ever generated.Recovery completed through change 14081497748113Media Recovery CompleteCompleted: ALTER DATABASE RECOVER LOGFILE 'E:\datafile\red多次尝试后发现redo03.log是所需要的current group member,尝试alter database open resetlogs,但是出现ORA-00600 [4000]错误:
Thu Apr 16 08:00:40 2015Errors in file d:\oracle\admin\kf2\udump\kf2_ora_144.trc:ORA-00600: 内部错误代码,参数: [4000], [3], [], [], [], [], [], []
Thu Apr 16 08:00:40 2015Errors in file d:\oracle\admin\kf2\udump\kf2_ora_144.trc:ORA-00704: 引导程序进程失败
ORA-00704: 引导程序进程失败
ORA-00600: 内部错误代码,参数: [4000], [3], [], [], [], [], [], []
Thu Apr 16 08:00:40 2015Error 704 happened during db open, shutting down databaseUSER: terminating instance due to error 704Thu Apr 16 08:00:40 2015Errors in file d:\oracle\admin\kf2\bdump\kf2_pmon_5172.trc:ORA-00704: bootstrap process failure以上ORA-00600: [4000], [3],说明在使用usn=3的回滚段rollback数据块时发现rollback segment存在错误,且伴随有ORA-00704: bootstrap process failure错误,说明需要回滚的数据块是bootstrap需要的自举对象。
一般来说bootstrap object需要做rollback或cleanup,而apply undo数据时,我们是无法使用_corrupted_rollback_segments,_offline_rollback_segments或10513事件来阻止ORA-00600: [4000]的发生。
不过还是抱着试一试的心理尝试一下:
*._allow_resetlogs_corruption=trueevent=('10510 trace name context forever,level 1','10511 trace name context forever,level 2','10512 trace name context forever,level 1','10513 trace name context forever,level 2')*._corrupted_rollback_segments=(_SYSSMU3$)*._offline_rollback_segments=(_SYSSMU3$)再次尝试启动数据库:
SMON: enabling cache recoveryThu Apr 16 08:10:20 2015Errors in file d:\oracle\admin\kf2\udump\kf2_ora_5148.trc:ORA-00600: 内部错误代码,参数: [4000], [3], [], [], [], [], [], []
Thu Apr 16 08:10:20 2015Errors in file d:\oracle\admin\kf2\udump\kf2_ora_5148.trc:ORA-00704: 引导程序进程失败
ORA-00704: 引导程序进程失败
ORA-00600: 内部错误代码,参数: [4000], [3], [], [], [], [], [], []
确实还是无法避免bootstrap对象发生ORA-00600:[4000]错误。没办法,要想恢复这个数据库就必须要解决这个bootstrap对象。
先来看一下ORA-00600:[4000]内部错误的trace日志:
Dump file d:\oracle\admin\kf2\udump\kf2_ora_5148.trcThu Apr 16 08:10:08 2015ORACLE V9.2.0.8.0 - Production vsnsta=0vsnsql=12 vsnxtr=3Windows 2000 Version 5.2 Service Pack 2, CPU type 586Oracle9i Enterprise Edition Release 9.2.0.8.0 - ProductionWith the Partitioning, OLAP and Oracle Data Mining optionsJServer Release 9.2.0.8.0 - ProductionWindows 2000 Version 5.2 Service Pack 2, CPU type 586Instance name: kf2 Redo thread mounted by this instance: 1 Oracle process number: 12 Windows thread id: 5148, image: ORACLE.EXE *** SESSION ID:(9.3) 2015-04-16 08:10:08.515Start recovery at thread 1 ckpt scn 14081497668136 logseq 1 block 2*** 2015-04-16 08:10:10.265Media Recovery Log D:\ORACLE\ORA92\RDBMS\ARC00001.001*** 2015-04-16 08:10:10.281Media Recovery Log D:\ORACLE\ORA92\RDBMS\ARC00001.001*** 2015-04-16 08:10:18.015*** 2015-04-16 08:10:20.140ksedmp: internal or fatal errorORA-00600: 内部错误代码,参数: [4000], [3], [], [], [], [], [], []
Current SQL statement for this session:select ctime, mtime, stime from obj$ where obj# = :1引发错误的数据块头信息:
Block header dump: 0x0040007a Object id on Block? Y seg/obj: 0x12 csc: 0xcce.9aabd24f itc: 1 flg: - typ: 1 - DATA fsl: 0 fnx: 0x0 ver: 0x01 Itl Xid Uba Flag Lck Scn/Fsc0x01 0x0003.000.0001a69c 0x0080002d.33df.01 --U- 1 fsc 0x0000.9aabd250 data_block_dump,data header at 0x34270044通过上面的trace文件内容,我们知道:
1. 引发ORA-00600:[4000], [3]错误的语句是”select ctime, mtime, stime from obj$ where obj# = :1″。这是一条常用的递归SQL语句,查询的对象是非常重要的bootstrap自举对象OBJ$基表,这说明需要cleanup的块是OBJ$表上的。
2. 引发ORA-00600:[4000], [3]错误的数据块是1号数据文件的122块,seg/obj为0×12,块类型为Data,且存在有一条ITL entry:
SQL> select DBMS_UTILITY.data_block_address_file (TO_NUMBER ('0040007a','xxxxxxxx')) file_no, 2 DBMS_UTILITY.data_block_address_block (TO_NUMBER ('0040007a','xxxxxxxx')) block_no 3 from dual; FILE_NO BLOCK_NO---------- ---------- 1 122 SQL> Itl Xid Uba Flag Lck Scn/Fsc0x01 0x0003.000.0001a69c 0x0080002d.33df.01 --U- 1 fsc 0x0000.9aabd250因为该致命内部错误涉及到的对象是重要的Bootstrap表OBJ$,无法使用传统的例如_corrupted_rollback_segments,_offline_rollback_segments或10513事件来阻止ORA-00600: [4000]的发生,必须使用块修改工具BBED来修改存在问题的数据块将ITL事务槽的FLAG从U修改为C(Commit),手工提交该事务。
事务状态标识:
TRANSACTION_COMMITED = 0x08;
TRANSACTION_UPBOUND = 0x02;
TRANSACTION_ACTIVE = 0x01;Flag= -U- 即TRANSACTION_UPBOUND时,flag值为0×02,需要将该字节修改为TRANSACTION_COMMITED = 0×08;
下面用bbed修改system01.dbf文件。注意修改前一定要先备份。
E:\datafile>bbed filename=system01.dbf password=blockedit blocksize=8192 mode=edit BBED: Release 2.0.0.0.0 - Limited Production on 星期四 4月 16 09:58:02 2015
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. ************* !!! For Oracle Internal Use only !!! *************** BBED> map File: system01.dbf (0) Block: 1 Dba:0x00000000------------------------------------------------------------ Data File Header struct kcvfh, 360 bytes @0 ub4 tailchk @8188 BBED> set block 122 BLOCK# 122 BBED> map File: system01.dbf (0) Block: 122 Dba:0x00000000------------------------------------------------------------ KTB Data Block (Table/Cluster) struct kcbh, 20 bytes @0 struct ktbbh, 48 bytes @20 struct kdbh, 14 bytes @68 struct kdbt[1], 4 bytes @82 sb2 kdbr[108] @86 ub1 freespace[859] @302 ub1 rowdata[7027] @1161 ub4 tailchk @8188 BBED> p ktbbhstruct ktbbh, 48 bytes @20 ub1 ktbbhtyp @20 0x01 (KDDBTDATA) union ktbbhsid, 4 bytes @24 ub4 ktbbhsg1 @24 0x00000012 ub4 ktbbhod1 @24 0x00000012 struct ktbbhcsc, 8 bytes @28 ub4 kscnbas @28 0x9aabd24f ub2 kscnwrp @32 0x0cce b2 ktbbhict @36 1 ub1 ktbbhflg @38 0x02 (NONE) ub1 ktbbhfsl @39 0x00 ub4 ktbbhfnx @40 0x00000000 struct ktbbhitl[0], 24 bytes @44 struct ktbitxid, 8 bytes @44 ub2 kxidusn @44 0x0003 ub2 kxidslt @46 0x0000 ub4 kxidsqn @48 0x0001a69c struct ktbituba, 8 bytes @52 ub4 kubadba @52 0x0080002d ub2 kubaseq @56 0x33df ub1 kubarec @58 0x01 ub2 ktbitflg @60 0x2001 (KTBFUPB) union _ktbitun, 2 bytes @62 b2 _ktbitfsc @62 0 ub2 _ktbitwrp @62 0x0000 ub4 ktbitbas @64 0x9aabd250 BBED> set offset 61 OFFSET 61BBED> set count 16 COUNT 16BBED> d File: system01.dbf (0) Block: 122 Offsets: 61 to 76 Dba:0x00000000------------------------------------------------------------------------ 20000050 d2ab9a00 016c00ff ffea0045 <32 bytes per line> BBED> m x 0x80 File: system01.dbf (0) Block: 122 Offsets: 61 to 76 Dba:0x00000000------------------------------------------------------------------------ 80000050 d2ab9a00 016c00ff ffea0045 <32 bytes per line> BBED> sum applyCheck value for File 0, Block 122:current = 0xb0d6, required = 0xb0d6 BBED>再次尝试打开数据库,出现ORA-00600:[2256]错误:
Thu Apr 16 14:43:57 2015Errors in file d:\oracle\admin\kf2\udump\kf2_ora_5744.trc:ORA-00600: 内部错误代码,参数: [2256], [3278], [3221225472], [3278], [3221225539], [], [], []
Thu Apr 16 14:43:57 2015Errors in file d:\oracle\admin\kf2\udump\kf2_ora_5744.trc:ORA-00600: 内部错误代码,参数: [2256], [3278], [3221225472], [3278], [3221225539], [], [], []
Thu Apr 16 14:43:57 2015Error 600 happened during db open, shutting down databaseUSER: terminating instance due to error 600一般ORA-00600:[2256]错误可以通过10015 ADJUST_SCN事件来推进SCN解决。
MOS里面对ORA-00600:[2256]的各个参数解释如下:
Arg [a] Current SCN WRAPArg [b] Current SCN BASEArg [c] dependent SCN WRAPArg [d] dependent SCN BASE Arg [e] Where present this is the DBA where the dependent SCN came from.我们的日志中的各个参数值如下:
ORA-00600: 内部错误代码,参数: [2662], [3278], [2594951966], [3278], [2595857063], [4235250], [], []
根据level的计算规则算出需要推进的level大小:
Level = Arg[c] * 4 + Arg[d]/1024/1024/1024 = 13115
alter session set events '10015 trace name adjust_scn level 13115';再次open数据库,alert日志如下:
Thu Apr 16 13:47:49 2015alter database open resetlogsThu Apr 16 13:47:49 2015RESETLOGS is being done without consistancy checks. This may resultin a corrupted database. The database should be recreated.RESETLOGS after incomplete recovery UNTIL CHANGE 14081497748203Resetting resetlogs activation ID 1527068138 (0x5b0535ea)Online log 1 of thread 1 was previously clearedOnline log 2 of thread 1 was previously clearedThu Apr 16 13:47:51 2015Assigning activation ID 1527030262 (0x5b04a1f6)Thread 1 opened at log sequence 1 Current log# 3 seq# 1 mem# 0: E:\DATAFILE\REDO03.LOG Current log# 3 seq# 1 mem# 1: E:\DATAFILE\REDO06.LOGSuccessful open of redo thread 1Thu Apr 16 13:47:52 2015SMON: enabling cache recoveryThu Apr 16 13:47:52 2015Debugging event used to advance scn to 14082124021760Dictionary check beginningDictionary check completeThu Apr 16 13:47:55 2015SMON: enabling tx recoveryThu Apr 16 13:47:55 2015Database Characterset is ZHS16CGB231280Hex dump of Absolute File 1, Block 44180 in trace file d:\oracle\admin\kf2\udump\kf2_ora_5052.trc***Corrupt block relative dba: 0x0040ac94 (file 1, block 44180)Fractured block found during buffer readData in bad block - type: 6 format: 2 rdba: 0x0040ac94 last change scn: 0x0cce.9ab97c09 seq: 0x1 flg: 0x06 consistency value in tail: 0xae1f0601 check value in block header: 0xc9d7, computed block checksum: 0xc30f spare1: 0x0, spare2: 0x0, spare3: 0x0***Reread of rdba: 0x0040ac94 (file 1, block 44180) found same corrupted dataThu Apr 16 13:47:55 2015Errors in file d:\oracle\admin\kf2\udump\kf2_ora_5052.trc:ORA-01578: ORACLE 数据块损坏(文件号1,块号44180)
ORA-01110: 数据文件 1: 'E:\DATAFILE\SYSTEM01.DBF'
Error 1578 happened during db open, shutting down databaseUSER: terminating instance due to error 1578可以看到,这时的报错已经不一样了,这是因为system01.dbf文件中还有其他的坏块从而导致down库的。
用dbv检查system01.dbf文件:
E:\datafile>dbv file='E:\datafile\SYSTEM01.dbf' blocksize=8192 DBVERIFY: Release 9.2.0.8.0 - Production on 星期四 4月 16 13:59:11 2015
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. DBVERIFY - 验证正在开始 : FILE = E:\datafile\SYSTEM01.dbf
Block Checking: DBA = 4194426, Block Type = KTB-managed data blockdata header at 0x1162244kdbchk: row locked by non-existent transaction table=0 slot=46 lockid=1 ktbbhitc=1页 122 失败,校验代码为 6101
汇入的页43268 - 可能是介质损坏
***Corrupt block relative dba: 0x0040a904 (file 1, block 43268)Fractured block found during dbv:Data in bad block - type: 6 format: 2 rdba: 0x0040a904 last change scn: 0x0cce.9ab97baa seq: 0x1 flg: 0x06 consistency value in tail: 0xaf060601 check value in block header: 0x43af, computed block checksum: 0xd8a5 spare1: 0x0, spare2: 0x0, spare3: 0x0*** 汇入的页44052 - 可能是介质损坏
***Corrupt block relative dba: 0x0040ac14 (file 1, block 44052)Fractured block found during dbv:Data in bad block - type: 6 format: 2 rdba: 0x0040ac14 last change scn: 0x0cce.9aaba70d seq: 0x1 flg: 0x06 consistency value in tail: 0x748e0601 check value in block header: 0x7766, computed block checksum: 0x91c3 spare1: 0x0, spare2: 0x0, spare3: 0x0*** 汇入的页44180 - 可能是介质损坏
***Corrupt block relative dba: 0x0040ac94 (file 1, block 44180)Fractured block found during dbv:Data in bad block - type: 6 format: 2 rdba: 0x0040ac94 last change scn: 0x0cce.9ab97c09 seq: 0x1 flg: 0x06 consistency value in tail: 0xae1f0601 check value in block header: 0xc9d7, computed block checksum: 0xc30f spare1: 0x0, spare2: 0x0, spare3: 0x0*** 汇入的页47652 - 可能是介质损坏
***Corrupt block relative dba: 0x0040ba24 (file 1, block 47652)Fractured block found during dbv:Data in bad block - type: 6 format: 2 rdba: 0x0040ba24 last change scn: 0x0cce.9aabaf06 seq: 0x1 flg: 0x06 consistency value in tail: 0x7baa0601 check value in block header: 0xf2dc, computed block checksum: 0x9ec4 spare1: 0x0, spare2: 0x0, spare3: 0x0*** DBVERIFY - 验证完成
检查的页总数 :128000
处理的页总数(数据):31569
失败的页总数(数据):1
处理的页总数(索引):5268
失败的页总数(索引):0
处理的页总数(其它):1546
处理的总页数 (段) : 0
失败的总页数 (段) : 0
空的页总数 :89613
标记为损坏的总页数:4
汇入的页总数 :4
Highest block SCN : 14081498655836 (3278.2595859548) E:\datafile>检查出有4个坏块(43268,44180,44052,47652)。
再次用bbed工具修复坏块:
E:\datafile>bbed filename=system01.dbf password=blockedit blocksize=8192 mode=edit BBED: Release 2.0.0.0.0 - Limited Production on 星期四 4月 16 15:03:53 2015
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. ************* !!! For Oracle Internal Use only !!! *************** BBED> set block 43268 BLOCK# 43268 BBED> p kcbhstruct kcbh, 20 bytes @0 ub1 type_kcbh @0 0x06 ub1 frmt_kcbh @1 0x02 ub1 spare1_kcbh @2 0x00 ub1 spare2_kcbh @3 0x00 ub4 rdba_kcbh @4 0x0040a904 ub4 bas_kcbh @8 0x9ab97baa ub2 wrp_kcbh @12 0x0cce ub1 seq_kcbh @14 0x01 ub1 flg_kcbh @15 0x06 (KCBHFDLC, KCBHFCKV) ub2 chkval_kcbh @16 0x43af ub2 spare3_kcbh @18 0x0000 BBED> p tailchkub4 tailchk @8188 0xaf060601正常情况下tail check=scn base+block type+scn seq,因此taikchk的值应该是7baa + 06 + 01,即0×7baa0601。而我们的当前值是0xaf060601。
根据little-endian的规则,0×7baa0601应该写成:01 06 aa 7b
BBED> m /x 0106aa7b offset 8188 File: system01.dbf (0) Block: 43268 Offsets: 8188 to 8191 Dba:0x00000000------------------------------------------------------------------------ 0106aa7b <32 bytes per line> BBED> sum applyCheck value for File 0, Block 43268:current = 0x4fa6, required = 0x4fa6 BBED> verifyDBVERIFY - 验证正在启动
FILE =system01.dbfBLOCK = 43268 DBVERIFY - 验证完成
检查的总块数:1
已处理的总块数(数据):1
无法处理的总块数(数据):0
已处理的总块数(索引):0
无法处理的总块数(索引):0
空的总块数:0
标记为损坏的总数块:0
汇入的块总数:0
验证后,43268的块已经修复。用同样的方法将另外3个块修复。
再次打开数据库:
SQL> alter database open resetlogs;数据库已更改。SQL>终于启动起来了。
不过还需要重建undo表空间,并立马将数据exp导出备份好。事实上,数据字典还是有所损坏,部分数据查询时还有错误,在此不再赘述。




