暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

【问题诊断】机房断电后Oracle数据库启动异常ORA-00600[kcratr_nab_less_than_odr]案例分享

618
本期将为大家分享”机房断电后Oracle数据库启动异常出现ORA-00600 错误”的案例。
关键字:机房断电、数据库无法启动、ORA-00600[kcratr_nab_less_than_odr]、ORA-01122、ORA-01110、ORA-01207、ORA-00338、ORA-00312、error 338

机房有时会因供电中断造成各种事故,服务器、存储设备和网络交换机等设备会直接断电关机。突然断电可能会导致数据库无法打开、数据文件坏块、控制文件损坏、日志文件损坏等等。下面将介绍下之前遇到的一个真实处理案例。数据库启动过程出现ORA-00600[kcratr_nab_less_than_odr],不能open数据库。错误信息如下所示。

1、接到故障反馈后,先了解故障情况和环境信息。数据库是11.2.0.1版本,未开启归档模式、数据库无备份、数据库未打补丁,Windows 7操作系统,4g低配置内存且过保的物理机。
    查看版本:opatch lsinventory
    查看日志状态:set linesize 1000
    col member for a70
    select * from v$log;
    select * from v$logfile;
    select a.member, a.group#, b.status
    from v$logfile a ,v$log b where a.group#=b.group# and b.status='CURRENT' ;

    2、查看当前在线重做日志信息。当前日志文件序号为2750。查看数据文件scn与控制文件scn号是一致的,值为68953003。

      select file#,to_char(checkpoint_change#),to_char(CHECKPOINT_TIME,'yyyy-mm-dd hh24:mi:ss') 
      from v$datafile;
      select file#,to_char(checkpoint_change#),to_char(CHECKPOINT_TIME,'yyyy-mm-dd hh24:mi:ss')
      from v$datafile_header;

      3、查看数据库alert日志,开库进行数据库一致性校验还原时,出现ORA-00600[kcratr_nab_less_than_odr]错误。

        alter database open
        Beginning crash recovery of 1 threads
        parallel recovery started with 3 processes
        Started redo scan
        Completed redo scan
        read 958 KB redo, 401 data blocks need recovery
        Errors in file d:\app\administrator\diag\rdbms\orcl\orcl\trace\orcl_ora_5212.trc (incident=128562):
        ORA-00600: ??????, ??: [kcratr_nab_less_than_odr], [1], [2750], [46011], [46073], [], [], [], [], [], [], []
        Incident details in: d:\app\administrator\diag\rdbms\orcl\orcl\incident\incdir_128562\orcl_ora_5212_i128562.trc
        Aborting crash recovery due to error 600
        Errors in file d:\app\administrator\diag\rdbms\orcl\orcl\trace\orcl_ora_5212.trc:
        ORA-00600: ??????, ??: [kcratr_nab_less_than_odr], [1], [2750], [46011], [46073], [], [], [], [], [], [], []
        Errors in file d:\app\administrator\diag\rdbms\orcl\orcl\trace\orcl_ora_5212.trc:
        ORA-00600: ??????, ??: [kcratr_nab_less_than_odr], [1], [2750], [46011], [46073], [], [], [], [], [], [], []
        ORA-600 signalled during: alter database open...
        Trace dumping is performing id=[cdmp_20221102185258]
        Wed Nov 02 18:53:55 2022
        Sweep [inc][128562]: completed
        Sweep [inc2][128562]: completed
        Wed Nov 02 19:04:31 2022
        Shutting down instance (immediate)
        Shutting down instance: further logons disabled
        Stopping background process MMNL
        Stopping background process MMON

        4、进一步查看Incident details的跟踪文件。通过alert和trace中的信息可以知道,数据库恢复块数据是从日志序号2750的块号为44095开始到46073,但是实际只能恢复rba到46011,从而导致数据库无法正常open。

        ending at redo block 46011 but should not have ended before redo block 46073

          Successfully allocated 3 recovery slaves
          Using 45 overflow buffers per recovery slave
          Thread 1 checkpoint: logseq 2750, block 2, scn 68953003
          cache-low rba: logseq 2750, block 44095
          on-disk rba: logseq 2750, block 46073, scn 68970890
          start recovery at logseq 2750, block 44095, scn 0
          。。。。WARNING! Crash recovery of thread 1 seq 2750 is
          ending at redo block 46011 but should not have ended before
          redo block 46073
          Incident 128562 created, dump file: d:\app\administrator\diag\rdbms\orcl\orcl\incident\incdir_128562\orcl_ora_5212_i128562.trc
          ORA-00600: ??????, ??: [kcratr_nab_less_than_odr], [1], [2750], [46011], [46073], [], [], [], [], [], [], []


          ORA-00600: ??????, ??: [kcratr_nab_less_than_odr], [1], [2750], [46011], [46073], [], [], [], [], [], [], []
          ORA-00600: ??????, ??: [kcratr_nab_less_than_odr], [1], [2750], [46011], [46073], [], [], [], [], [], [], []

          1、由于服务器异常断电,内存中的缓存数据未及时写联机日志文件,下次启动数据库时,需要做实例级恢复并校验数据块的一致性。根据ORA-600参数值,查阅MOS官方文档,找到对应的解决方案。

          2、根据官方文档,查阅跟踪文件,调用堆栈记录与1296264.1文档里描述一致。

            ========= Dump for incident 128562 (ORA 600 [kcratr_nab_less_than_odr]) ========


            *** 2022-11-02 18:52:57.721
            dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
            ----- Current SQL Statement for this session (sql_id=a01hp0psv0rrh) -----
            alter database open


            ----- Call Stack Trace -----
            ksedst1 <- ksedst <- dbkedDefDump <- ksedmp <- dbgexPhaseII <- dbgexProcessError
            <- dbgePostErrorKGE <- kgeasnmierr <- kcratr_odr_check <- kcratr <- kctrec
            <- kcvcrv <- kcfopd <- adbdrv <- opiexe <- opiosq0 <- kpoal8 <- opiodr <- ttcpip
            <- opitsk <- opiino <- opiodr <- opidrv <- sou2o <- opimai_real <- ssthrdmain
            <- main
            <- start

            3、根据官方知识库可以确认原因:由于断电导致控制文件出现逻辑损坏。

              There was a power failure causing logical corruption in controlfile
              This Problem is caused by Storage Problem of the Database Files.
              The Subsystem (eg. SAN) crashed while the Database was open.
              The Database then crashed because the Database Files were not accessible anymore.
              This caused a lost Write into the Online RedoLogs and/or causing logical corruption
              in controlfile so Instance Recovery is not possible and raising the ORA-600.

              4、官方提出两种解决方法:

              (1)通过恢复取消命令,手动输入当前状态重做日志进行恢复。由于数据库大小不到10G,我直接将数据关闭,并进行系统级冷备。如果数据库的数据文件较大,可考虑对控制文件及重做文件及system.dbf文件进行备份。

              启动至mount状态:startup mount ;

              执行恢复命令:recover database using backup controlfile until cancel ;

              从下面截图来看,数据库自动寻找归档日志,但是归档日志实际上是没打开的。所以找不到归档文件。

              (2)采用oradebug命令生成控制文件创建脚本。执行控制文件重建及开库操作后,数据库最终恢复到46011数据块,并正常打开数据库。

                ALTER DATABASE RECOVER  DATABASE  
                Media Recovery Start
                started logmerger process
                Parallel Media Recovery started with 4 slaves
                Wed Nov 02 19:21:52 2022
                Recovery of Online Redo Log: Thread 1 Group 2 Seq 2750 Reading mem 0
                Mem# 0: D:\APP\ADMINISTRATOR\ORADATA\ORCL\REDO02.LOG
                Completed: ALTER DATABASE RECOVER DATABASE
                Wed Nov 02 19:22:02 2022
                Archived Log entry 1 added for thread 1 sequence 2748 ID 0x5df22786 dest 1:
                Archived Log entry 2 added for thread 1 sequence 2749 ID 0x5df22786 dest 1:
                ALTER DATABASE OPEN
                Beginning crash recovery of 1 threads
                parallel recovery started with 3 processes
                Started redo scan
                Completed redo scan
                read 23004 KB redo, 0 data blocks need recovery
                Started redo application at
                Thread 1: logseq 2750, block 2, scn 68953003
                Recovery of Online Redo Log: Thread 1 Group 2 Seq 2750 Reading mem 0
                Mem# 0: D:\APP\ADMINISTRATOR\ORADATA\ORCL\REDO02.LOG
                Completed redo application of 0.00MB
                Completed crash recovery at
                Thread 1: logseq 2750, block 46011, scn 68990864---->最终恢复点。0 data blocks read, 0 data blocks written, 23004 redo k-bytes read
                Wed Nov 02 19:22:05 2022
                LGWR: STARTING ARCH PROCESSES
                Completed crash recovery at  Thread 1: logseq 2750, block 46011, scn 68990864---->最终恢复点。

                5、建议重建控制文件后,备份文件信息要注册回控制文件。

                  命令:RMAN> CATALOG START WITH '/home/oracle/ywzd';

                  6、最后对数据库进行一次RMAN全备。

                  故障场景一:客户机房非正常断电,待恢复供电后,数据库无法正常启动,alert告警日志出现ORA-00338 ORA-00312错误,即线程1的日志 2比控制文件更新。

                    Wed Mar 20 10:24:03 2024
                    ALTER DATABASE MOUNT
                    Successful mount of redo thread 1, with mount id 387518083
                    Database mounted in Exclusive Mode
                    Lost write protection disabled
                    Completed: ALIER DATABASE MOUNT
                    Wed Mar 20 10:24:07 2024
                    ALTER DATABASE OPEN;
                    Errors in file opt/oracle/diag/rdbms/ywzd/ywzd/trace/ywzd_ora_30995.tra:
                    ORA-00338: log 2 of thread 1 is more recent than control file
                    ORA-00312: online log 2 thread 1: 'redo.log'
                    USER(ospid: 30995):terminating the inatance due to error 338
                    Instance terminated by USER,pid=30995


                    The ORA-00338 normally indicates an incorrect control file may be used:


                    00338, 00000, "log %s of thread %s is more recent than control file"
                    // *Cause: The control file change sequence number in the log file is
                    // greater than the number in the control file. This implies that
                    // the wrong control file is being used. Note that repeatedly causing
                    // this error can make it stop happening without correcting the real
                    // problem. Every attempt to open the database will advance the
                    // control file change sequence number until it is great enough.
                    // *Action: Use the current control file or do backup control file recovery to
                    // make the control file current. Be sure to follow all restrictions
                    // on doing a backup control file recovery.

                    故障场景二:数据库无法正常启动,alert告警日志出现ORA-01207错误,即数据文件1比控制文件scn时间更新。

                      SYS> alter database open;
                      alter database open
                      *
                      ERROR at line 1:
                      ORA-01122: database file 1 failed verification check
                      ORA-01110: data file 1: '/u01/app/oracle/oradata/ywzd/system.dbf'
                      ORA-01207: file is more recent than control file - old control file

                      技术细节一:在恢复过程中,Cache-Low RBA和On-Disk RBA主导了恢复过程,Oracle的恢复从上一次成功的写出开始,也就是以Cache-Low RBA为起点,恢复至日志的最后成功记录,也就是以On-Disk RBA为终点。如果想了解具体的原理细节,可以学习下盖老师的两篇文章。

                      https://www.eygle.com/archives/2010/02/cache_low_rba_on_disk_rba.html
                      https://www.eygle.com/archives/2011/02/cache_low_rba.html

                      技术细节二:Oracle 的控制文件记录了当前数据库的结构信息,包含数据文件及日志文件的信息以及相关的状态、归档信息等。控制文件是一个二进制文件,一个控制文件只属于一个数据库。当数据库的物理结构发生改变时,Oracle会自动更新控制文件。当增加、重命名、删除一个数据文件或者一个重做日志文件时,Oracle 服务器进程会立即更新控制文件以反映数据库结构的变化。用户不能手工编辑控制文件,控制文件的修改由 Oracle 自动完成。

                      数据库的启动和正常运行都离不开控制文件(数据库在 mount 阶段读取控制文件,open 阶段一直使用),一定要备份控制文件,控制文件损坏将导致整个数据库损坏,数据库正常工作至少需要一个控制文件,生产库至少需要两个控制文件(多个控制文件之间是镜像关系),控制文件的位置和数量由初始化参数(control_files)决定。启动数据库时,Oracle 从初始化参数文件中获取控制文件的名字及位置,并打开控制文件,然后从控制文件中读取数据文件和重做日志文件的信息,最后打开数据库。数据库运行时,会更改控制文件。

                      重建控制步骤

                        方法一:SQL> oradebug setmypid
                        SQL> alter session set tracefile_identifier='controlfilerecreate' ;
                        SQL> alter database backup controlfile to trace noresetlogs;
                        SQL> oradebug tracefile_name ; 脚本位置和名称
                        SQL> shutdown immediate
                        备份当前控制文件,非常重要。SQL> startup nomount
                        SQL> @控制文件脚本.sql 创建控制文件
                        SQL> alter database open;
                        方法二:重建控制文件方法2
                        SQL> ALTER DATABASE BACKUP CONTROLFILE TO TRACE AS 'd:\controlfilerecreate.txt';
                        • Alter database open fails with ORA-00600 kcratr_nab_less_than_odr (文档 ID 1296264.1)
                        • ORA-00338, ORA-00312 Errors Reported In Alert Log (Doc ID 1377160.1)
                        • How to Recreate a Controlfile (Doc ID 735106.1)
                        以上就是本期关于“机房断电后Oracle数据库启动异常出现ORA-00600[kcratr_nab_less_than_odr]错误”的案例分享。希望能给大家带来帮助。
                        欢迎关注“数据库运维之道”公众号,一起学习数据库技术!
                        欢迎扫码进“数据库运维之道1群”,此群用于数据库技术交流,禁止发广告!

                        可以加我的微信,交个朋友或讨论数据库解决方案,请备注”姓名单位“,谢谢!


                        文章转载自数据库运维之道,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

                        评论