一、问题现象
测试环境遇到的问题比较有意思记录一下。
测试环境虚拟机异常测试主机崩了,重启后db启动报错;
[root@t2 ~]# su - oracle
Last login: Sat May 7 02:07:05 PDT 2022
[oracle@t2 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Sat May 7 02:07:54 2022
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select status from v$instance;
STATUS
------------
MOUNTED
SQL>
SQL> archive log list
Database log mode Archive Mode
Automatic archival Enabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 20
Next log sequence to archive 22
Current log sequence 22
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1],
[22], [47672], [47759], [], [], [], [], [], [], []
二、问题分析
2.1查询Db alert
--db alert
Sat May 07 02:09:00 2022
alter database open
Beginning crash recovery of 1 threads
Started redo scan
Completed redo scan
read 146 KB redo, 38 data blocks need recovery
Errors in file /u01/app/oracle/diag/rdbms/t2/t2/trace/t2_ora_18806.trc
(incident=68556):
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [22],
[47672], [47759], [], [], [], [], [], [], []
Incident details in:
/u01/app/oracle/diag/rdbms/t2/t2/incident/incdir_68556/t2_ora_18806_i68556.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Aborting crash recovery due to error 600
Errors in file /u01/app/oracle/diag/rdbms/t2/t2/trace/t2_ora_18806.trc:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [22],
[47672], [47759], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/t2/t2/trace/t2_ora_18806.trc:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [22],
[47672], [47759], [], [], [], [], [], [], []
ORA-600 signalled during: alter database open...
Sat May 07 02:09:00 2022
Dumping diagnostic data in directory=[cdmp_20220507020900], requested by (instance=1,
osid=18806), summary=[incident=68556].
Sat May 07 02:09:02 2022
Sweep [inc][68556]: completed
Sweep [inc2][68556]: completed
Trace日志
[oracle@t2 trace]$ more /u01/app/oracle/diag/rdbms/t2/t2/trace/t2_ora_18806.trc
Trace file /u01/app/oracle/diag/rdbms/t2/t2/trace/t2_ora_18806.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/dbhome_1
System name: Linux
Node name: t2
Release: 3.10.0-957.el7.x86_64
Version: #1 SMP Thu Oct 4 20:48:51 UTC 2018
Machine: x86_64
VM name: VMWare Version: 6
Instance name: t2
Redo thread mounted by this instance: 1
Oracle process number: 19
Unix process pid: 18806, image: oracle@t2 (TNS V1-V3)
*** 2022-05-07 02:09:00.080
*** SESSION ID:(1.9) 2022-05-07 02:09:00.080
*** CLIENT ID:() 2022-05-07 02:09:00.080
*** SERVICE NAME:() 2022-05-07 02:09:00.080
*** MODULE NAME:(sqlplus@t2 (TNS V1-V3)) 2022-05-07 02:09:00.080
*** ACTION NAME:() 2022-05-07 02:09:00.080
==**Thread 1 checkpoint: logseq 22, block 2, scn 14115731
cache-low rba: logseq 22, block 47379
on-disk rba: logseq 22, block 47759, scn 14134141
start recovery at logseq 22, block 47379, scn 0**==
*** 2022-05-07 02:09:00.084
Started writing zeroblks thread 1 seq 22 blocks 47672-47679
*** 2022-05-07 02:09:00.084
Completed writing zeroblks thread 1 seq 22
==== Redo read statistics for thread 1 ====
Total physical reads (from disk and memory): 4096Kb
-- Redo read_disk statistics --
Read rate (ASYNC): 146Kb in 0.01s => 14.31 Mb/sec
Longest record: 1Kb, moves: 0/423 (0%)
Change moves: 28/351 (7%), moved: 0Mb
Longest LWN: 20Kb, moves: 0/198 (0%), moved: 0Mb
Last redo scn: 0x0000.00d7ab23 (14134051)
----------------------------------------------
----- Recovery Hash Table Statistics ---------
Hash table buckets = 262144
Longest hash chain = 1
Average hash chain = 38/38 = 1.0
Max compares per lookup = 1
Avg compares per lookup = 89/666 = 0.1
----------------------------------------------
WARNING! Crash recovery of thread 1 seq 22 is
ending at redo block 47672 but should not have ended before
redo block 47759
Incident 68556 created, dump file:
/u01/app/oracle/diag/rdbms/t2/t2/incident/incdir_68556/t2_ora_18806_i68556.trc
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [22],
[47672], [47759], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [22],
[47672], [47759], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [22],
[47672], [47759], [], [], [], [], [], [], []
2.2 匹配MOS文章
SQL> select file#,CHECKPOINT_CHANGE# from v$datafile_header order by 1;
FILE# CHECKPOINT_CHANGE#
---------- ------------------
1 14115731
2 14115731
3 14115731
4 14115731
5 14115731
6 14115731
7 14115731
7 rows selected.
SQL> select CHECKPOINT_CHANGE#,CONTROLFILE_SEQUENCE#,CONTROLFILE_CHANGE# from v$database;
CHECKPOINT_CHANGE# CONTROLFILE_SEQUENCE# CONTROLFILE_CHANGE#
------------------ --------------------- -------------------
14115731 1100600118 14134088
SQL> select * from v$log;
GROUP# THREAD# SEQUENCE# ARC STATUS FIRST_CHANGE# NEXT_CHANGE#
---------- ---------- ---------- ---------- ---------- ---------- --- -
1 1 22 NO CURRENT 14115731 281474976710655
3 1 21 YES INACTIVE 14115663 14115731
2 1 20 YES INACTIVE 14075524 14115663
----- < Stack Trace -----
kgdsdst() < ksedst1()+103 < ksedst()+39 <
dbkedDefDump()+2746 < ksedmp()+41 < ksfdmp()+69 <
dbgexPhaseII()+1764 < dbgexProcessError() < dbgeExecuteForError <
dbgePostErrorKGE()+ < dbkePostKGE_kgsf()+ < kgeadse()+390 <
kgerinv_internal()+ < kgerinv()+38 < kgeasnmierr()+148 <
kcratr_odr_check()+ < kcratr()+1202 < kctrec()+4518 <
kcvcrv()+5715 < kcfopd()+1060 < adbdrv()+52169 <
opiexe()+18724 < opiosq0()+3932 <
kpooprx()+274 < kpoal8()+842 < opiodr()+917 <
ttcpip()+2183 < opitsk()+1710 < opiino()+969 <
opiodr()+917 < opidrv()+570 < sou2o()+103 <
opimai_real()+133 < ssthrdmain()+265 < main()+201 <
__libc_start_main < _start
Alter database open fails with ORA-00600 kcratr_nab_less_than_odr (Doc ID 1296264.1)
Oracle Database - Enterprise Edition - Version 11.2.0.1 to 11.2.0.1 [Release 11.2]
There was a power failure causing logical corruption in controlfile
This Problem is caused by Storage Problem of the Database Files.
The Subsystem (eg. SAN) crashed while the Database was open.
The Database then crashed because the Database Files were not accessible anymore.
This caused a lost Write into the Online RedoLogs and/or causing logical corruption
in controlfile so Instance Recovery is not possible and raising the ORA-600.
----- Call Stack Trace -----
ksedst1 <- ksedst <- dbkedDefDump <- ksedmp <- dbgexPhaseII <- dbgexProcessError <-
dbgePostErrorKGE <- kgeasnmierr <- kcratr_odr_check <- kcratr <- kctrec <- kcvcrv <-
kcfopd <- adbdrv <- opiexe <- opiosq0 <- kpoal8 <- opiodr <- ttcpip <- opitsk <-
opiino <- opiodr <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main <- start
我们看到ORA-600的第一个参数是一致的并且mos提供的函数堆栈也大部分一致基本上非常匹配了
2.3 问题处理
Option a: Do cancel based reocvery, and apply 'current online redolog' manually
-----------
SQL>Startup mount ;
SQL>Show parameter control_files
Query 1
------------
SQL> select a.member, a.group#, b.status from v$logfile a ,v$log b where
a.group#=b.group# and b.status='CURRENT' ;
Note down the name of the redo log
MEMBER GROUP# STATUS
------------------------------------------------- ----------------
/u01/app/oracle/oradata/t2dir/redo01.log 1 CURRENT
SQL> Shutdown abort ;
Take a OS Level backup of the controlfile
(This is to ensure we have a backup of current state of controlfile)
SQL> host cp /u01/app/oracle/oradata/t2/control01.ctl
/u01/app/oracle/oradata/t2/control_20220507.bak
SQL> Startup mount ;
SQL> select CHECKPOINT_CHANGE#,CONTROLFILE_SEQUENCE#,CONTROLFILE_CHANGE#
from v$database;
CHECKPOINT_CHANGE# CONTROLFILE_SEQUENCE# CONTROLFILE_CHANGE#
------------------ --------------------- -------------------
14115731 1100600123 14134088
SQL> select file#,CHECKPOINT_CHANGE# from v$datafile_header order by 1;
FILE# CHECKPOINT_CHANGE#
---------- ------------------
1 14115731
2 14115731
3 14115731
4 14115731
5 14115731
6 14115731
7 14115731
7 rows selected.
SQL> recover database using backup controlfile until cancel ;
Enter location of redo log shown as current in Query 1 when prompted for recovery
Hit Enter
SQL> recover database using backup controlfile until cancel ;
ORA-00279: change 14115731 generated at 05/06/2022 02:15:01 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/fast_recovery_area/T2/archivelog/2022_05_07/o1_mf_1_22_%u_.arc
ORA-00280: change 14115731 for thread 1 is in sequence #22
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/u01/app/oracle/oradata/t2dir/redo01.log
Log applied.
Media recovery complete.
--Dbalert
ORA-279 signalled during: ALTER DATABASE RECOVER database
using backup controlfile until cancel ...
Sat May 07 02:40:32 2022
ALTER DATABASE RECOVER LOGFILE '/u01/app/oracle/oradata/t2dir/redo01.log'
Media Recovery Log /u01/app/oracle/oradata/t2dir/redo01.log
Sat May 07 02:40:48 2022
Incomplete recovery applied all redo ever generated.
Recovery completed through change 14134088 time 05/06/2022 06:09:07
Media Recovery Complete (t2)
Completed: ALTER DATABASE RECOVER LOGFILE '/u01/app/oracle/oradata/t2dir/redo01.log'
--检查点信息无变化!
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open
SQL> Alter database open resetlogs ;
SQL> select file#,CHECKPOINT_CHANGE# from v$datafile_header order by 1;
FILE# CHECKPOINT_CHANGE#
---------- ------------------
1 14154717
2 14154717
3 14154717
4 14154717
5 14154717
6 14154717
7 14154717
7 rows selected.
SQL> select file#,CHECKPOINT_CHANGE# from v$datafile order by 1;
FILE# CHECKPOINT_CHANGE#
---------- ------------------
1 14154717
2 14154717
3 14154717
4 14154717
5 14154717
6 14154717
7 14154717
7 rows selected.
上述是使用current log recover后,open resetlogs的方式对redo进行clear
2.4 第二种方法,重建控制文件
因此回头看Oracle MOS的文章
Alter database open fails with ORA-00600 kcratr_nab_less_than_odr (Doc ID 1296264.1)
第二种方法,使用的是备份控制文件to trace,随后以Noresetlogs的方式open db.
$ rman target /
rman> spool log to '/tmp/rman.log';
rman> list backup ;
rman> exit
SQL> Show parameter control_files
Keep this location handy.
SQL> oradebug setmypid
SQL> Alter session set tracefile_identifier='controlfilerecreate' ;
SQL> Alter database backup controlfile to trace noresetlogs;
SQL> oradebug tracefile_name ;
--> This command will give the path and name of the trace file
Go to this location ,Open this trace file and select the controlfile
recreation script with NOResetlogs option
SQL> Shutdown immediate;
Rename the existing controlfile to <originalname>_old
---> This is Important as we need to have a backup of existing controlfile
since we plan to recreate it
SQL> Startup nomount
Now run the Controlfile recreation script with NO Resetlogs option.
SQL> Alter database open ;
For database version 10g and above
Once database is opened you can recatalog the rman backup information present
in the list /tmp/rman.log using
Rman> Catalog start with '<location of backupiece>' ;
If backups are on tape, and you are not using a catalog, backups
can be cataloged using information in:
How to Catalog Tape Backup Pieces (Doc ID 550082.1)
https://blog.csdn.net/hnlivingyang/article/details/84779039
2.5 相关链接及案例

https://blog.csdn.net/h082602/article/details/74910701
这个操作是重建控制文件随后进行recover database,open noresetlogs的方式!
根据之前的控制文件的测试,使用rman备份恢复后的控制文件默认是使用resetlogs的方式!
因此重建控制文件一般是to trace,随后手工执行控制文件创建脚本进行重建!
https://www.modb.pro/doc/35112
三、Cache-Low RBA 与 On-Disk RBA 的学习
比较凑巧,学习看书看到了01盖总的案例,和本文的一样,并且当时盖总是没有Mos文章的情况下按照个人理解解决了问题,膜拜大佬…emm是2010年的文章,学习一下也加深一下理解
1.7 Cache-Low RBA 与 On-Disk RBA 的恢复证明
下载地址-免费~https://www.modb.pro/download/26242.2.2.
那么追加的小测试主要是制造一定的脏块,我们都知道Oracle的数据写和日志写不是同时写入的,Oracle奉行日志优先所有的数据变化日志先写,而对于数据cache的修改是允许异步慢慢写入磁盘,因此就存在单数据库down的时候,启动数据库需要恢复一定的数据!
那么本次创建一个测试表,随后dump控制文件to trace日志,我们看看控制文件中记录的信息!随后数据库abort,在start肯定会存在一定的恢复! 因为memory和disk的数据不是完全一致! 然后观察恢复的scn的信息!!!
3.1 控制文件dump
[oracle@t3 ~]$ sqlplus yz/yz
drop table yz;
create table yz as select * from dba_objects;
[oracle@t3 ~]$ sqlplus / as sysdba
SQL> oradebug setmypid;
Statement processed.
SQL> oradebug dump controlf 3;
Statement processed.
SQL> oradebug tracefile_name;
/u01/app/oracle/diag/rdbms/dgtest001/dgtest001/trace/dgtest001_ora_19476.trc
SQL> shutdown abort;
SQL> startup
ORACLE instance started.
Total System Global Area 730714112 bytes
Fixed Size 2256832 bytes
Variable Size 452984896 bytes
Database Buffers 272629760 bytes
Redo Buffers 2842624 bytes
Database mounted.
Database opened.
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE 11.2.0.4.0 Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production
https://www.askmac.cn/archives
/%e8%bd%ac%e5%82%a8%e6%8e%a7%e5%88%b6%e6%96%87%e4%bb%b6%e4%bf%a1%e6%81%af.html
3.2数据库检查点
***************************************************************************
DATABASE ENTRY
***************************************************************************
(size = 316, compat size = 316, section max = 1, section in-use = 1,
last-recid= 0, old-recno = 0, last-recno = 0)
(extent = 1, blkno = 1, numrecs = 1)
03/27/2020 08:30:47
DB Name "T2"
Database flags = 0x50404001 0x00001200
Controlfile Creation Timestamp 03/27/2020 08:30:47
Incmplt recovery scn: 0x0000.00000000
Resetlogs scn: 0x0000.000e2006 Resetlogs Timestamp 03/27/2020 08:30:49
Prior resetlogs scn: 0x0000.00000001 Prior resetlogs Timestamp 08/24/2013 11:37:30
Redo Version: compatible=0xb200400
#Data files = 4, #Online files = 4
Database checkpoint: Thread=1 scn: 0x0000.018506b4 --10进制SCN 25495220
Threads: #Enabled=1, #Open=1, Head=1, Tail=1
enabled threads: 01000000 00000000 00000000 00000000 00000000 00000000
进制转换
https://www.sojson.com/hexconvert/16to10.html
3.3cache rba and disk rba
***************************************************************************
CHECKPOINT PROGRESS RECORDS
***************************************************************************
(size = 8180, compat size = 8180, section max = 11, section in-use = 0,
last-recid= 0, old-recno = 0, last-recno = 0)
(extent = 1, blkno = 2, numrecs = 11)
THREAD #1 - status:0x2 flags:0x0 dirty:237
low cache rba:(0x6d1.3.0) on disk rba:(0x6d2.6097.0)
on disk scn: 0x0000.018516bf 05/10/2022 03:32:23 --scn 25499327
resetlogs scn: 0x0000.000e2006 03/27/2020 08:30:49
heartbeat: 1104313927 mount id: 2710581651
把这里的 RBA 信息简单分析一下:
RBA 信息 Log Sequence Blcok Number
Low Cache RBA 0x6d1.3.0 0x6d1 =1745 3=3
On Disk RBA 0x6d2.6097.0 0x6d2 =1746 6097=24727
3.4 数据库恢复trace日志
书中说Oracle start恢复有trace日志,但是日志太多了,测试环境通过
cat *ora*.trc|grep -50 rba
快速定位类似的trace日志
[oracle@t3 trace]$ cat /u01/app/oracle/diag/rdbms/dgtest001/dgtest001/
trace/dgtest001_ora_19578.trc Trace file
/u01/app/oracle/diag/rdbms/dgtest001/dgtest001/trace/dgtest001_ora_19578.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/dbhome_1
System name: Linux
Node name: t3
Release: 3.10.0-957.el7.x86_64
Version: #1 SMP Thu Oct 4 20:48:51 UTC 2018
Machine: x86_64
VM name: VMWare Version: 6
Instance name: dgtest001
Redo thread mounted by this instance: 1
Oracle process number: 19
Unix process pid: 19578, image: oracle@t3 (TNS V1-V3)
*** 2022-05-10 03:33:06.391
*** SESSION ID:(1.5) 2022-05-10 03:33:06.391
*** CLIENT ID:() 2022-05-10 03:33:06.391
*** SERVICE NAME:() 2022-05-10 03:33:06.391
*** MODULE NAME:(sqlplus@t3 (TNS V1-V3)) 2022-05-10 03:33:06.391
*** ACTION NAME:() 2022-05-10 03:33:06.391
Thread 1 checkpoint: logseq 1745, block 2, scn 25495220
cache-low rba: logseq 1745, block 3
on-disk rba: logseq 1746, block 24738, scn 25499338
start recovery at logseq 1745, block 3, scn 0
*** 2022-05-10 03:33:06.401
Started writing zeroblks thread 1 seq 1746 blocks 24739-24746
*** 2022-05-10 03:33:06.401
Completed writing zeroblks thread 1 seq 1746
==== Redo read statistics for thread 1 ====
Total physical reads (from disk and memory): 16442Kb
-- Redo read_disk statistics --
Read rate (ASYNC): 12426Kb in 0.01s => 1213.53 Mb/sec
Longest record: 8Kb, moves: 0/8629 (0%)
Change moves: 0/160 (0%), moved: 0Mb
Longest LWN: 1053Kb, moves: 3/989 (0%), moved: 2Mb
Last redo scn: 0x0000.018516ca (25499338)
----------------------------------------------
----- Recovery Hash Table Statistics ---------
Hash table buckets = 262144
Longest hash chain = 1
Average hash chain = 237/237 = 1.0
Max compares per lookup = 1
Avg compares per lookup = 11066/11463 = 1.0
----------------------------------------------
*** 2022-05-10 03:33:06.405
KCRA: start recovery claims for 237 data blocks
*** 2022-05-10 03:33:06.589
KCRA: blocks processed = 237/237, claimed = 237, eliminated = 0
*** 2022-05-10 03:33:06.590
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1745 Reading mem 0
*** 2022-05-10 03:33:06.591
Recovery of Online Redo Log: Thread 1 Group 2 Seq 1746 Reading mem 0
*** 2022-05-10 03:33:06.625
Completed redo application of 1.33MB
*** 2022-05-10 03:33:06.760
Completed recovery checkpoint
----- Recovery Hash Table Statistics ---------
Hash table buckets = 262144
Longest hash chain = 1
Average hash chain = 237/237 = 1.0
Max compares per lookup = 1
Avg compares per lookup = 11143/11303 = 1.0
----------------------------------------------
Recovery sets nab of thread 1 seq 1746 to 24739 with 8 zeroblks
kwqmnich: current time:: 10: 33: 6: 0
kwqmnich: instance no 0 repartition flag 1
kwqmnich: initialized job cache structure
3.5 日志文件检查点信息
select * from v$log;
GROUP# SEQUENCE# STATUS FIRST_CHANGE# NEXT_CHANGE#
---------- ---------- ---------------------- ------------
1 1748 CURRENT 25519844 2.8147E+14
2 1746 INACTIVE 25495647 25519341
3 1747 INACTIVE 25519341 25519844
SQL> select THREAD#,SEQUENCE#,FIRST_CHANGE#,NEXT_CHANGE# from
v$archived_log where sequence#=1745;
THREAD# SEQUENCE# FIRST_CHANGE# NEXT_CHANGE#
---------- ---------- ------------- ------------
1 1745 25495219 25495647
我们来完整梳理一下这个流程:
1.SCN 25495220 数据库一致性检查点!
2. 数据库abort之前控制文件打印trace观察检查点
RBA 信息 Log Sequence Blcok Number
Low Cache RBA 0x6d1.3.0 0x6d1 =1745 3=3
On Disk RBA 0x6d2.6097.0 0x6d2 =1746 6097=24727
因为oracle间隔一定时间自动会触发检查点,因此部分不一致scn往前走是正常现象;
3.随后abort
4.start启动数据库观察恢复的trace
Thread 1 checkpoint: logseq 1745, block 2, scn 25495220
cache-low rba: logseq 1745, block 3
on-disk rba: logseq 1746, block 24738, scn 25499338
start recovery at logseq 1745, block 3, scn 0
也就是说Oracle恢复的scn是数据库一致性检查点scn 25495220 to on-disk rba 25499338,那么这个最大的scn是哪里来的呢? 其实就是current logfile,因此数据库启动就需要使用redo进行恢复,而cache-low的部分的blocks的数据就是还没有写入db file部分的数据了,脏块的内存块需要使用redo进行恢复使用。
那么如果存储问题导致写入的Blocks异常,但是控制文件又记录了这个block scn需要进行恢复,那么恢复就会出现600报错提示恢复到某个时间点异常无法完整的恢复!




