暂无图片
暂无图片
1
暂无图片
暂无图片
暂无图片

DATAGUARD的健康不容忽视

白鳝的洞穴 2020-01-20
1398

    大家都知道,dataguard是数据库安全保障的有效手段,甚至有些用户在搭建了Dataguard后都不对数据库进行备份,使用Dataguard作为数据保护的最后底线。

虽然对于这些用户而言,这些备库与主库一样重要,实际上在运维过程中,这些备库并没有得到与主库一样的重视,这种疏忽可能导致十分严重的后果。

下面就是老白在几年前参与过的一个案例。

这个案例最终只能通过对无法启动的主库中拯救出来的部分数据文件进行dul数据下载,然后补充到备库中,最终完成了数据修复。

但是由于坏块的存在,仍然丢失了上万条数据。

 

问题描述

某联通公司的一个系统主库由于电源故障存储宕机,电源故障恢复后由于存储存在坏块,数据文件大面积损坏,数据库无法启动。于是根据应急预案切换到备库。在切换过程中出现问题,在没有分析清楚问题的情况下采用强制打开数据库的方式,导致备库中的12个数据文件在备库打开后处于OFFLINE DROP状态,和这12个文件相关的部分表无法访问。采用临时措施将这些表改名后重建这些表暂时维持业务系统的运行。客户希望:

1、找出DG切换失败的原因

2、将相关受损的数据表28日前的历史数据抢救出来

3、针对DG任何进行可用性检查以及切换时的处理步骤提出建议

 

DG切换故障分析

 

从日志上看,28日凌晨3:38开始主库和备库的通讯出现了异常:

Thu Apr28 01:27:53 2011

MediaRecovery Waiting for thread 1 seq# 87118 (in transit)

Thu Apr28 03:38:53 2011

RFS:Possible network disconnect with primary database

Thu Apr28 04:24:47 2011

RFS:Possible network disconnect with primary database

Thu Apr28 04:26:35 2011

RFS:Possible network disconnect with primary database

Thu Apr28 05:37:44 2011

 alter database recover managed standbydatabase cancel

Thu Apr28 05:37:48 2011

MRP0: BackgroundMedia Recovery user canceled with status 16037

Recoveryinterrupted.

MRP0:Background Media Recovery process shutdown

 

 

5:37,操作人员进行了故障处理,首先结束恢复管理模式,然后以READ ONLY方式打开了数据库,该过程成功完成。

Thu Apr28 05:37:44 2011

 alter database recover managed standbydatabase cancel

Thu Apr28 05:37:48 2011

MRP0:Background Media Recovery user canceled with status 16037

Recoveryinterrupted.

MRP0: BackgroundMedia Recovery process shutdown

Thu Apr28 05:37:51 2011

ManagedStandby Recovery Cancelled

Completed:  alter database recover managed standbydatabase c

Thu Apr28 05:37:53 2011

 alter database open read only

 

6:16,数据库被成功关闭。

 

Thu Apr28 06:16:09 2011

Shuttingdown instance: further logons disabled

Shuttingdown instance (immediate)

Licensehigh water mark = 8

Thu Apr28 06:16:12 2011

ALTERDATABASE CLOSE NORMAL

Thu Apr28 06:16:12 2011

SMON:disabling cache recovery

Thu Apr28 06:16:13 2011

Completed:ALTER DATABASE CLOSE NORMAL

Thu Apr28 06:16:13 2011

ALTERDATABASE DISMOUNT

Completed:ALTER DATABASE DISMOUNT

ARCH:Archiving is disabled

Shuttingdown archive processes

Archivingis disabled

Thu Apr28 06:16:13 2011

ARCH shuttingdown

Thu Apr28 06:16:13 2011

ARCHshutting down

Thu Apr28 06:16:13 2011

ARC0:Archival stopped

随后数据库再次以STANDBY DB模式被MOUNT,然后备份了控制文件,随后成功关闭。

 

Thu Apr28 06:16:42 2011

StandbyDatabase mounted.

Completed:alter database mount standby database

Thu Apr28 06:17:15 2011

alterdatabase backup controlfile to trace as '/tmp/myconctltrace.ctl'

Completed:alter database backup controlfile to trace as '/tm

Thu Apr28 06:17:20 2011

alterdatabase backup controlfile to trace as '/tmp/myconctltrace.ctl.xxxxxx'

Completed:alter database backup controlfile to trace as '/tm

Shuttingdown instance: further logons disabled

Shuttingdown instance (immediate)

Licensehigh water mark = 3

Thu Apr28 06:17:30 2011

ALTERDATABASE CLOSE NORMAL

ORA-1109signalled during: ALTER DATABASE CLOSE NORMAL...

Thu Apr28 06:17:30 2011

ALTERDATABASE DISMOUNT

Completed:ALTER DATABASE DISMOUNT

 

6:21:53重建了控制文件。

Thu Apr28 06:21:53 2011

CREATECONTROLFILE REUSE DATABASE "ICD1" NORESETLOGS  ARCHIVELOG

--  SET STANDBY TO MAXIMIZE PERFORMANCE

    MAXLOGFILES 5

    MAXLOGMEMBERS 3

    MAXDATAFILES 233

    MAXINSTANCES 1

    MAXLOGHISTORY 11344

LOGFILE

  GROUP 1 (

    '/dev/vg_index/rora9_redo111',

    '/dev/vg_data/rora9_redo121'

  ) SIZE 280M,

  GROUP 2 (

    '/dev/vg_index/rora9_redo112',

    '/dev/vg_data/rora9_redo122'

  ) SIZE 280M,

  GROUP 3 (

    '/dev/vg_data/rora9_redo131',

    '/dev/vg_index/rora9_redo132'

  ) SIZE 280M

--STANDBY LOGFILE

--   GROUP 4 (

--     '/dev/vg_index/rora9_sgredo111',

--     '/dev/vg_data/rora9_sgredo121'

--   ) SIZE 280M,

--   GROUP 5 (

--     '/dev/vg_index/rora9_sgredo112',

--     '/dev/vg_data/rora9_sgredo122'

--   ) SIZE 280M

DATAFILE

  '/dev/vg_data/rora9_system',

  '/dev/vg_index/rora9_rbs1',

   ……

……

  '/dev/vg_index/rlv_data40'

CHARACTERSET ZHS16GBK

6:27:02进行了switchover。

Thu Apr28 06:25:30 2011

 alter database mount standby database

Thu Apr28 06:25:34 2011

Successfulmount of redo thread 1, with mount id 1176475930

Thu Apr28 06:25:34 2011

StandbyDatabase mounted.

Completed:  alter database mount standby database

Thu Apr28 06:27:02 2011

alterdatabase commit to switchover to primary

Thu Apr28 06:27:02 2011

ALTERDATABASE COMMIT TO SWITCHOVER TO PRIMARY

Databasenot recovered through End-Of-REDO

Databasenot recovered through End-Of-REDO

Switchover:Media recovery required - standby not in limbo

ORA-16139signalled during: alter database commit to switchover to primary...

Thu Apr28 06:27:15 2011

Restartingdead background process QMN0

QMN0started with pid=10

Thu Apr28 06:28:02 2011

ALTERDATABASE RECOVER  automatic standbydatabase 

Thu Apr28 06:28:02 2011

MediaRecovery Start

Thu Apr28 06:28:02 2011

MediaRecovery Log tongns/arch/1_87118.dbf

Errorswith log tongns/arch/1_87118.dbf

ORA-279signalled during: ALTER DATABASE RECOVER automatic standby database...

Thu Apr28 06:28:10 2011

ALTERDATABASE RECOVER    CONTINUE DEFAULT 

Thu Apr28 06:28:10 2011

MediaRecovery Log tongns/arch/1_87118.dbf

Errorswith log /tongns/arch/1_87118.dbf

ORA-308signalled during: ALTER DATABASE RECOVER   CONTINUE DEFAULT  ...

Thu Apr28 06:28:10 2011

ALTERDATABASE RECOVER    CONTINUE DEFAULT 

Thu Apr28 06:28:10 2011

MediaRecovery Log /tongns/arch/1_87118.dbf

Errorswith log /tongns/arch/1_87118.dbf

ORA-308signalled during: ALTER DATABASE RECOVER   CONTINUE DEFAULT  ...

Thu Apr28 06:28:10 2011

ALTERDATABASE RECOVER CANCEL

Thu Apr28 06:28:11 2011

MediaRecovery Cancelled

Completed:ALTER DATABASE RECOVER CANCEL

由于主库故障,因此SWITCHOVER操作在等待最新的REDO LOG信息时出现了故障,因此SWITCHOVER未能成功。

6:29:14,再次创建控制文件:

Thu Apr28 06:29:14 2011

CREATECONTROLFILE REUSE DATABASE "ICD1" NORESETLOGS  ARCHIVELOG

--  SET STANDBY TO MAXIMIZE PERFORMANCE

    MAXLOGFILES 5

    MAXLOGMEMBERS 3

    MAXDATAFILES 233

    MAXINSTANCES 1

    MAXLOGHISTORY 11344

LOGFILE

 

随后又多次创建控制文件。

 

6:32:34,以RESETLOGS方式重建控制文件

Thu Apr28 06:32:34 2011

CREATECONTROLFILE REUSE DATABASE "ICD1" RESETLOGS  ARCHIVELOG

 

 

由于多次重建控制文件均报错,在控制文件生成的TRACE中,包含下面的信息:

# Takefiles offline to match current control file.

ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data21' OFFLINE DROP;

ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data22' OFFLINE DROP;

ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data1' OFFLINE DROP;

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data23' OFFLINE DROP;

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data24' OFFLINE DROP;

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data25' OFFLINE DROP;

ALTER DATABASEDATAFILE '/dev/vg_index/rlv_data26' OFFLINE DROP;

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data27' OFFLINE DROP;

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data28' OFFLINE DROP;

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data29' OFFLINE DROP;

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data30' OFFLINE DROP;

 

 

为了顺利重建控制文件,6:38:27,DBA删除了有问题的文件:

Thu Apr28 06:38:27 2011

ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data21' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data21'

Thu Apr28 06:38:32 2011

ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data22' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data22'

Thu Apr28 06:38:32 2011

ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data1' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data1' O

Thu Apr28 06:38:32 2011

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data23' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data23'

Thu Apr28 06:38:32 2011

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data24' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data24'

Thu Apr28 06:38:32 2011

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data25' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data25'

Thu Apr 2806:38:32 2011

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data26' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data26'

Thu Apr28 06:38:32 2011

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data27' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data27'

Thu Apr28 06:38:32 2011

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data28' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data28'

Thu Apr28 06:38:32 2011

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data29' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data29'

Thu Apr28 06:38:34 2011

ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data30' OFFLINE DROP

Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data30'

Thu Apr28 06:39:52 2011

重建控制文件后,数据库可以顺利打开,打开后发现有些表无法访问,经检查发现这些文件均为执行了OFFLINE DROP命令的文件。

 

 

DG故障分析

由于备库已经OPEN,因此检查OFFLINE状态的文件,从文件头上我们可以看到:

NAME                           CHECKPOINT_CHANGE#CHECKPOIN STATUS

------------------------------------------------ --------- -------

/dev/vg_data/rlv_data21            12036594418511 10-FEB-11 OFFLINE

/dev/vg_data/rlv_data22            12036594418511 10-FEB-11 OFFLINE

/dev/vg_data/rlv_data1                          0           OFFLINE

/dev/vg_index/rlv_data23                        0           OFFLINE

/dev/vg_index/rlv_data24                        0           OFFLINE

/dev/vg_index/rlv_data25                        0           OFFLINE

/dev/vg_index/rlv_data26                        0           OFFLINE

/dev/vg_index/rlv_data27                        0           OFFLINE

/dev/vg_index/rlv_data28                        0           OFFLINE

/dev/vg_index/rlv_data29                        0           OFFLINE

/dev/vg_index/rlv_data30                        0           OFFLINE

检查控制文件:

NAME                           CHECKPOINT_CHANGE# CHECKPOINSTATUS

------------------------------------------------ --------- -------

/dev/vg_data/rlv_data21                         0           OFFLINE

/dev/vg_data/rlv_data22                         0           OFFLINE

/dev/vg_data/rlv_data1                          0           OFFLINE

/dev/vg_index/rlv_data23                        0           OFFLINE

/dev/vg_index/rlv_data24                        0           OFFLINE

/dev/vg_index/rlv_data25                        0           OFFLINE

/dev/vg_index/rlv_data26                        0           OFFLINE

/dev/vg_index/rlv_data27                        0           OFFLINE

/dev/vg_index/rlv_data28                        0           OFFLINE

/dev/vg_index/rlv_data29                        0           OFFLINE

/dev/vg_index/rlv_data30                        0           OFFLINE

 

 

 

从上述信息看,上述数据文件的状态是不正确的,即使不做OFFLINE DROP,也无法使用。有些文件甚至是空文件。

通过对相关文件在ALERT LOG中的日志信息的分析发现这些文件都是采用手工文件管理方式手工处理的,而不是DG数据文件自动管理产生的。从日志中可以看到:

  ALTER SYSTEM SETstandby_file_management='MANUAL' SCOPE=BOTH;

 

MonMar  7 20:35:49 2011

 alter database datafile '/ora10gsy/product/10.2/dbs/UNNAMED00109'offline drop

MonMar  7 20:35:49 2011

ORA-1516signalled during:  alter databasedatafile '/ora10gsy/product/10.2/d...

MonMar  7 20:36:11 2011

alterdatabase datafile '/opt/oracle/product/9.2/dbs/UNNAMED00109' offline drop

MonMar  7 20:36:12 2011

Completed:alter database datafile '/opt/oracle/product/9.2/d

MonMar  7 20:36:48 2011

alterdatabase rename file '/opt/oracle/product/9.2/dbs/UNNAMED00109' to  '/dev/vg_data/rlv_data21'

MonMar  7 20:36:48 2011

ORA-1511signalled during: alter database rename file '/opt/oracle/product/9....

MonMar  7 20:37:06 2011

ALTERSYSTEM SET standby_file_management='MANUAL' SCOPE=BOTH;

MonMar  7 20:37:09 2011

  alter database rename file'/opt/oracle/product/9.2/dbs/UNNAMED00109' to '/dev/vg_data/rlv_data21'

MonMar  7 20:37:09 2011

Completed:   alter database rename file'/opt/oracle/product/

MonMar  7 20:37:17 2011

ALTERSYSTEM SET standby_file_management='AUTO' SCOPE=BOTH;

首先standby_file_management被设置为manual,然后将数据文件进行重新命名。经过和相关操作人员确认,是因为主库创建新文件时,备库的裸设备不存在,从而导致在备库中生成了一个UNNAMEDxxxxxx的文件,为了将这个文件重新迁移到裸设备上,做了上述操作。

经过确认,采用这种方式处理数据文件缺少了一个步骤,就是执行alter database create datafile ... As '....';

从而导致该数据文件一直处于未完成修改的状态,因此DATAGUARD一直未对此文件进行REDO APPLY。并且在DATAGUARD中查看这些文件的状态是RECOVER状态。实际上这些文件大多数是空文件,里面并无任何数据。

DG故障分析结论

 

由于DATAGUARD中的这些文件本身存在问题,导致这类文件处于非正常状态,在DATAGURAD切换后,无法实现这些文件的切换。由于这些文件中有些文件需要1月份以来的REDO LOG,有些需要4月7日的REDO LOG,由于这些归档日志已经被删除,当时已经无条件进行修复。

 

针对DG/ADG运维的建议:

1、尽可能在主库添加数据文件前,在DG上准备好相关裸设备

2、如果出现忘记准备裸设备的情况,处理过程要严格按照官方要求(参考后面的建议部分)

3、定期检查DG,及时发现SATTUS为RECOVER状态的文件,及时进行纠正

4、针对备库,需要定期进行健康检查,查看REDO APPLY状态,检查ALERT LOG中是否存在异常



 

 


最后修改时间:2020-01-20 09:35:40
文章转载自 白鳝的洞穴,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论