大家都知道,dataguard是数据库安全保障的有效手段,甚至有些用户在搭建了Dataguard后都不对数据库进行备份,使用Dataguard作为数据保护的最后底线。
虽然对于这些用户而言,这些备库与主库一样重要,实际上在运维过程中,这些备库并没有得到与主库一样的重视,这种疏忽可能导致十分严重的后果。
下面就是老白在几年前参与过的一个案例。
这个案例最终只能通过对无法启动的主库中拯救出来的部分数据文件进行dul数据下载,然后补充到备库中,最终完成了数据修复。
但是由于坏块的存在,仍然丢失了上万条数据。
问题描述
某联通公司的一个系统主库由于电源故障存储宕机,电源故障恢复后由于存储存在坏块,数据文件大面积损坏,数据库无法启动。于是根据应急预案切换到备库。在切换过程中出现问题,在没有分析清楚问题的情况下采用强制打开数据库的方式,导致备库中的12个数据文件在备库打开后处于OFFLINE DROP状态,和这12个文件相关的部分表无法访问。采用临时措施将这些表改名后重建这些表暂时维持业务系统的运行。客户希望:
1、找出DG切换失败的原因
2、将相关受损的数据表28日前的历史数据抢救出来
3、针对DG任何进行可用性检查以及切换时的处理步骤提出建议
DG切换故障分析
从日志上看,28日凌晨3:38开始主库和备库的通讯出现了异常:
Thu Apr28 01:27:53 2011
MediaRecovery Waiting for thread 1 seq# 87118 (in transit)
Thu Apr28 03:38:53 2011
RFS:Possible network disconnect with primary database
Thu Apr28 04:24:47 2011
RFS:Possible network disconnect with primary database
Thu Apr28 04:26:35 2011
RFS:Possible network disconnect with primary database
Thu Apr28 05:37:44 2011
alter database recover managed standbydatabase cancel
Thu Apr28 05:37:48 2011
MRP0: BackgroundMedia Recovery user canceled with status 16037
Recoveryinterrupted.
MRP0:Background Media Recovery process shutdown
5:37,操作人员进行了故障处理,首先结束恢复管理模式,然后以READ ONLY方式打开了数据库,该过程成功完成。
Thu Apr28 05:37:44 2011
alter database recover managed standbydatabase cancel
Thu Apr28 05:37:48 2011
MRP0:Background Media Recovery user canceled with status 16037
Recoveryinterrupted.
MRP0: BackgroundMedia Recovery process shutdown
Thu Apr28 05:37:51 2011
ManagedStandby Recovery Cancelled
Completed: alter database recover managed standbydatabase c
Thu Apr28 05:37:53 2011
alter database open read only
6:16,数据库被成功关闭。
Thu Apr28 06:16:09 2011
Shuttingdown instance: further logons disabled
Shuttingdown instance (immediate)
Licensehigh water mark = 8
Thu Apr28 06:16:12 2011
ALTERDATABASE CLOSE NORMAL
Thu Apr28 06:16:12 2011
SMON:disabling cache recovery
Thu Apr28 06:16:13 2011
Completed:ALTER DATABASE CLOSE NORMAL
Thu Apr28 06:16:13 2011
ALTERDATABASE DISMOUNT
Completed:ALTER DATABASE DISMOUNT
ARCH:Archiving is disabled
Shuttingdown archive processes
Archivingis disabled
Thu Apr28 06:16:13 2011
ARCH shuttingdown
Thu Apr28 06:16:13 2011
ARCHshutting down
Thu Apr28 06:16:13 2011
ARC0:Archival stopped
随后数据库再次以STANDBY DB模式被MOUNT,然后备份了控制文件,随后成功关闭。
Thu Apr28 06:16:42 2011
StandbyDatabase mounted.
Completed:alter database mount standby database
Thu Apr28 06:17:15 2011
alterdatabase backup controlfile to trace as '/tmp/myconctltrace.ctl'
Completed:alter database backup controlfile to trace as '/tm
Thu Apr28 06:17:20 2011
alterdatabase backup controlfile to trace as '/tmp/myconctltrace.ctl.xxxxxx'
Completed:alter database backup controlfile to trace as '/tm
Shuttingdown instance: further logons disabled
Shuttingdown instance (immediate)
Licensehigh water mark = 3
Thu Apr28 06:17:30 2011
ALTERDATABASE CLOSE NORMAL
ORA-1109signalled during: ALTER DATABASE CLOSE NORMAL...
Thu Apr28 06:17:30 2011
ALTERDATABASE DISMOUNT
Completed:ALTER DATABASE DISMOUNT
6:21:53重建了控制文件。
Thu Apr28 06:21:53 2011
CREATECONTROLFILE REUSE DATABASE "ICD1" NORESETLOGS ARCHIVELOG
-- SET STANDBY TO MAXIMIZE PERFORMANCE
MAXLOGFILES 5
MAXLOGMEMBERS 3
MAXDATAFILES 233
MAXINSTANCES 1
MAXLOGHISTORY 11344
LOGFILE
GROUP 1 (
'/dev/vg_index/rora9_redo111',
'/dev/vg_data/rora9_redo121'
) SIZE 280M,
GROUP 2 (
'/dev/vg_index/rora9_redo112',
'/dev/vg_data/rora9_redo122'
) SIZE 280M,
GROUP 3 (
'/dev/vg_data/rora9_redo131',
'/dev/vg_index/rora9_redo132'
) SIZE 280M
--STANDBY LOGFILE
-- GROUP 4 (
-- '/dev/vg_index/rora9_sgredo111',
-- '/dev/vg_data/rora9_sgredo121'
-- ) SIZE 280M,
-- GROUP 5 (
-- '/dev/vg_index/rora9_sgredo112',
-- '/dev/vg_data/rora9_sgredo122'
-- ) SIZE 280M
DATAFILE
'/dev/vg_data/rora9_system',
'/dev/vg_index/rora9_rbs1',
……
……
'/dev/vg_index/rlv_data40'
CHARACTERSET ZHS16GBK
6:27:02进行了switchover。
Thu Apr28 06:25:30 2011
alter database mount standby database
Thu Apr28 06:25:34 2011
Successfulmount of redo thread 1, with mount id 1176475930
Thu Apr28 06:25:34 2011
StandbyDatabase mounted.
Completed: alter database mount standby database
Thu Apr28 06:27:02 2011
alterdatabase commit to switchover to primary
Thu Apr28 06:27:02 2011
ALTERDATABASE COMMIT TO SWITCHOVER TO PRIMARY
Databasenot recovered through End-Of-REDO
Databasenot recovered through End-Of-REDO
Switchover:Media recovery required - standby not in limbo
ORA-16139signalled during: alter database commit to switchover to primary...
Thu Apr28 06:27:15 2011
Restartingdead background process QMN0
QMN0started with pid=10
Thu Apr28 06:28:02 2011
ALTERDATABASE RECOVER automatic standbydatabase
Thu Apr28 06:28:02 2011
MediaRecovery Start
Thu Apr28 06:28:02 2011
MediaRecovery Log tongns/arch/1_87118.dbf
Errorswith log tongns/arch/1_87118.dbf
ORA-279signalled during: ALTER DATABASE RECOVER automatic standby database...
Thu Apr28 06:28:10 2011
ALTERDATABASE RECOVER CONTINUE DEFAULT
Thu Apr28 06:28:10 2011
MediaRecovery Log tongns/arch/1_87118.dbf
Errorswith log /tongns/arch/1_87118.dbf
ORA-308signalled during: ALTER DATABASE RECOVER CONTINUE DEFAULT ...
Thu Apr28 06:28:10 2011
ALTERDATABASE RECOVER CONTINUE DEFAULT
Thu Apr28 06:28:10 2011
MediaRecovery Log /tongns/arch/1_87118.dbf
Errorswith log /tongns/arch/1_87118.dbf
ORA-308signalled during: ALTER DATABASE RECOVER CONTINUE DEFAULT ...
Thu Apr28 06:28:10 2011
ALTERDATABASE RECOVER CANCEL
Thu Apr28 06:28:11 2011
MediaRecovery Cancelled
Completed:ALTER DATABASE RECOVER CANCEL
由于主库故障,因此SWITCHOVER操作在等待最新的REDO LOG信息时出现了故障,因此SWITCHOVER未能成功。
6:29:14,再次创建控制文件:
Thu Apr28 06:29:14 2011
CREATECONTROLFILE REUSE DATABASE "ICD1" NORESETLOGS ARCHIVELOG
-- SET STANDBY TO MAXIMIZE PERFORMANCE
MAXLOGFILES 5
MAXLOGMEMBERS 3
MAXDATAFILES 233
MAXINSTANCES 1
MAXLOGHISTORY 11344
LOGFILE
随后又多次创建控制文件。
6:32:34,以RESETLOGS方式重建控制文件
Thu Apr28 06:32:34 2011
CREATECONTROLFILE REUSE DATABASE "ICD1" RESETLOGS ARCHIVELOG
由于多次重建控制文件均报错,在控制文件生成的TRACE中,包含下面的信息:
# Takefiles offline to match current control file.
ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data21' OFFLINE DROP;
ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data22' OFFLINE DROP;
ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data1' OFFLINE DROP;
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data23' OFFLINE DROP;
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data24' OFFLINE DROP;
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data25' OFFLINE DROP;
ALTER DATABASEDATAFILE '/dev/vg_index/rlv_data26' OFFLINE DROP;
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data27' OFFLINE DROP;
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data28' OFFLINE DROP;
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data29' OFFLINE DROP;
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data30' OFFLINE DROP;
为了顺利重建控制文件,6:38:27,DBA删除了有问题的文件:
Thu Apr28 06:38:27 2011
ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data21' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data21'
Thu Apr28 06:38:32 2011
ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data22' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data22'
Thu Apr28 06:38:32 2011
ALTERDATABASE DATAFILE '/dev/vg_data/rlv_data1' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data1' O
Thu Apr28 06:38:32 2011
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data23' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data23'
Thu Apr28 06:38:32 2011
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data24' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data24'
Thu Apr28 06:38:32 2011
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data25' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data25'
Thu Apr 2806:38:32 2011
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data26' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data26'
Thu Apr28 06:38:32 2011
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data27' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data27'
Thu Apr28 06:38:32 2011
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data28' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data28'
Thu Apr28 06:38:32 2011
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data29' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data29'
Thu Apr28 06:38:34 2011
ALTERDATABASE DATAFILE '/dev/vg_index/rlv_data30' OFFLINE DROP
Completed:ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data30'
Thu Apr28 06:39:52 2011
重建控制文件后,数据库可以顺利打开,打开后发现有些表无法访问,经检查发现这些文件均为执行了OFFLINE DROP命令的文件。
DG故障分析
由于备库已经OPEN,因此检查OFFLINE状态的文件,从文件头上我们可以看到:
NAME CHECKPOINT_CHANGE#CHECKPOIN STATUS
------------------------------------------------ --------- -------
/dev/vg_data/rlv_data21 12036594418511 10-FEB-11 OFFLINE
/dev/vg_data/rlv_data22 12036594418511 10-FEB-11 OFFLINE
/dev/vg_data/rlv_data1 0 OFFLINE
/dev/vg_index/rlv_data23 0 OFFLINE
/dev/vg_index/rlv_data24 0 OFFLINE
/dev/vg_index/rlv_data25 0 OFFLINE
/dev/vg_index/rlv_data26 0 OFFLINE
/dev/vg_index/rlv_data27 0 OFFLINE
/dev/vg_index/rlv_data28 0 OFFLINE
/dev/vg_index/rlv_data29 0 OFFLINE
/dev/vg_index/rlv_data30 0 OFFLINE
检查控制文件:
NAME CHECKPOINT_CHANGE# CHECKPOINSTATUS
------------------------------------------------ --------- -------
/dev/vg_data/rlv_data21 0 OFFLINE
/dev/vg_data/rlv_data22 0 OFFLINE
/dev/vg_data/rlv_data1 0 OFFLINE
/dev/vg_index/rlv_data23 0 OFFLINE
/dev/vg_index/rlv_data24 0 OFFLINE
/dev/vg_index/rlv_data25 0 OFFLINE
/dev/vg_index/rlv_data26 0 OFFLINE
/dev/vg_index/rlv_data27 0 OFFLINE
/dev/vg_index/rlv_data28 0 OFFLINE
/dev/vg_index/rlv_data29 0 OFFLINE
/dev/vg_index/rlv_data30 0 OFFLINE
从上述信息看,上述数据文件的状态是不正确的,即使不做OFFLINE DROP,也无法使用。有些文件甚至是空文件。
通过对相关文件在ALERT LOG中的日志信息的分析发现这些文件都是采用手工文件管理方式手工处理的,而不是DG数据文件自动管理产生的。从日志中可以看到:
ALTER SYSTEM SETstandby_file_management='MANUAL' SCOPE=BOTH;
MonMar 7 20:35:49 2011
alter database datafile '/ora10gsy/product/10.2/dbs/UNNAMED00109'offline drop
MonMar 7 20:35:49 2011
ORA-1516signalled during: alter databasedatafile '/ora10gsy/product/10.2/d...
MonMar 7 20:36:11 2011
alterdatabase datafile '/opt/oracle/product/9.2/dbs/UNNAMED00109' offline drop
MonMar 7 20:36:12 2011
Completed:alter database datafile '/opt/oracle/product/9.2/d
MonMar 7 20:36:48 2011
alterdatabase rename file '/opt/oracle/product/9.2/dbs/UNNAMED00109' to '/dev/vg_data/rlv_data21'
MonMar 7 20:36:48 2011
ORA-1511signalled during: alter database rename file '/opt/oracle/product/9....
MonMar 7 20:37:06 2011
ALTERSYSTEM SET standby_file_management='MANUAL' SCOPE=BOTH;
MonMar 7 20:37:09 2011
alter database rename file'/opt/oracle/product/9.2/dbs/UNNAMED00109' to '/dev/vg_data/rlv_data21'
MonMar 7 20:37:09 2011
Completed: alter database rename file'/opt/oracle/product/
MonMar 7 20:37:17 2011
ALTERSYSTEM SET standby_file_management='AUTO' SCOPE=BOTH;
首先standby_file_management被设置为manual,然后将数据文件进行重新命名。经过和相关操作人员确认,是因为主库创建新文件时,备库的裸设备不存在,从而导致在备库中生成了一个UNNAMEDxxxxxx的文件,为了将这个文件重新迁移到裸设备上,做了上述操作。
经过确认,采用这种方式处理数据文件缺少了一个步骤,就是执行alter database create datafile ... As '....';
从而导致该数据文件一直处于未完成修改的状态,因此DATAGUARD一直未对此文件进行REDO APPLY。并且在DATAGUARD中查看这些文件的状态是RECOVER状态。实际上这些文件大多数是空文件,里面并无任何数据。
DG故障分析结论
由于DATAGUARD中的这些文件本身存在问题,导致这类文件处于非正常状态,在DATAGURAD切换后,无法实现这些文件的切换。由于这些文件中有些文件需要1月份以来的REDO LOG,有些需要4月7日的REDO LOG,由于这些归档日志已经被删除,当时已经无条件进行修复。
针对DG/ADG运维的建议:
1、尽可能在主库添加数据文件前,在DG上准备好相关裸设备
2、如果出现忘记准备裸设备的情况,处理过程要严格按照官方要求(参考后面的建议部分)
3、定期检查DG,及时发现SATTUS为RECOVER状态的文件,及时进行纠正
4、针对备库,需要定期进行健康检查,查看REDO APPLY状态,检查ALERT LOG中是否存在异常





