问题描述
前段时间因为硬件原因,standby DB crash,主库删除了日志且无备份,后期需要重建,该环境是主备两套11.2.0.3 2-NODES RAC ON EXADATA Machine ,存储在ASM, 数据库70TB 大小, 闲时每天1T归档日志, 忙时每天可达4T+归档, 之前朋友重建DG一直使用主机上一个shell 放后台,我查看了该shell 确实不算复杂,但是duplicate 一直hang在开始。
专家解答
下面我只是简单记录一下这个过程,及后期的DATAGUARD 相关的一些优化细节。因为数据库太大,这里使用了duplicate for standby from active DATAbase, 另个每天日志产生量大,所以在空间有限的情况下要格外注意速度。
附一下shell的调用
1,create pfile (not using spfile) DB_NAME=orarpt DB_UNIQUE_NAME=rptstby DB_BLOCK_SIZE=16384 2, shell adg_setup.sh export ORACLE_SID=orarpt2 rman msglog /home/oracle/work/adg_setup.log << EOF connect target sys/xxxxxx@rptpri2 connect auxiliary sys/xxxxx@rptstby run { allocate channel ch001 type disk; allocate channel ch002 type disk; allocate channel ch003 type disk; allocate channel ch004 type disk; allocate channel ch006 type disk; allocate channel ch007 type disk; allocate channel ch008 type disk; allocate auxiliary channel ch005 type disk;duplicate target DATAbase for standby from active database NOFILENAMECHECK spfile set db_unique_name 'rptstby' set db_create_file_dest '+DATA' set standby_file_management 'AUTO' set remote_listener 'anbob-scan:1521' set log_archive_dest_2 '' set log_archive_dest_state_2 'DEFER' set fal_server 'rptpri1','rptpri2' set fal_client 'rptstby' set control_files='+DATA/RPTSTBY/CONTROLFILE/ctrl1','+DATA/RPTSTBY/CONTROLFILE/ctrl2' set audit_file_dest '/oracle/app/oracle/admin/rptstby/adump' ; release channel ch001; release channel ch002; release channel ch003; release channel ch004; release channel ch005; release channel ch006; release channel ch007; release channel ch008; } EOF
Tip:
调用很简单nohup 放到后台,但是查看日志, 它睡的好沉…
... allocated channel: c8 channel c8: SID=834 device type=DISK Starting Duplicate Db at 04-FEB-16 contents of Memory Script: { backup as copy reuse targetfile '/oracle/app/oracle/product/11.2.0.3/dbhome_1/dbs/orapworarpt2' auxiliary format '/oracle/app/oracle/product/11.2.0.3/dbhome_1/dbs/orapworarpt2' targetfile '+DATA/orarpt/spfileorarpt.ora' auxiliary format '+DATA/rptstby/parameterfile/spfile.1123.886959551' ; sql clone "alter system set spfile= ''+DATA/rptstby/parameterfile/spfile.1123.886959551''"; } executing Memory Script Starting backup at 04-FEB-16 --hang here
note:
正常情况下早应该set newname, copy datafile了, 咨询了下其他DBA说真有几个小时不动的, 下班回家,第二天发现日志还在那里等着我,那就有点不正常了。
[oracle@anbob02 (orarpt2)work]$ ps -ef|grep adg oracle 20574 16423 0 Feb03 pts/0 00:00:00 sh adg_setup.sh oracle 20578 20574 0 Feb03 pts/0 00:00:02 rman msglog /home/oracle/work/adg_setup.log [oracle@anbob02 (orarpt2)work]$ strace -fp 20578 -o rman.trc -ttt Process 20578 attached - interrupt to quit [oracle@anbob02 (orarpt2)work]$ tail -f rman.trc 20578 1454551745.092009 nanosleep({10, 0}, NULL) = 0 20578 1454551755.092238 write(12, "\0)\0\0\6\0\0\0\0\0\3N\221\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551755.092309 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551755.093573 nanosleep({10, 0}, NULL) = 0 20578 1454551765.093786 write(12, "\0)\0\0\6\0\0\0\0\0\3N\222\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551765.093858 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551765.094551 nanosleep({10, 0}, NULL) = 0 20578 1454551775.094778 write(12, "\0)\0\0\6\0\0\0\0\0\3N\223\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551775.094842 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551775.095816 nanosleep({10, 0}, NULL) = 0 20578 1454551785.096035 write(12, "\0)\0\0\6\0\0\0\0\0\3N\224\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551785.096106 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551785.096849 nanosleep({10, 0}, NULL) = 0 20578 1454551795.097062 write(12, "\0)\0\0\6\0\0\0\0\0\3N\225\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551795.097139 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551795.097882 nanosleep({10, 0}, NULL) = 0 20578 1454551805.098073 write(12, "\0)\0\0\6\0\0\0\0\0\3N\226\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551805.098141 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551805.098974 nanosleep({10, 0}, NULL) = 0 20578 1454551815.099187 write(12, "\0)\0\0\6\0\0\0\0\0\3N\227\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551815.099260 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551815.100527 nanosleep({10, 0}, NULL) = 0 20578 1454551825.100707 write(12, "\0)\0\0\6\0\0\0\0\0\3N\230\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551825.100783 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551825.101540 nanosleep({10, 0}, NULL) = 0 20578 1454551835.101744 write(12, "\0)\0\0\6\0\0\0\0\0\3N\231\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551835.101813 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551835.102565 nanosleep({10, 0}, NULL) = 0 20578 1454551845.102764 write(12, "\0\35\0\0\6\0\0\0\0\0\3\2\232\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0", 29) = 29 20578 1454551845.102836 read(12, "\0\26\0\0\6\0\0\0\0\0\10\6\0\0\0\t\1\0\0\0\230!", 8208) = 22 20578 1454551845.103462 write(12, "\1B\0\0\6\0\0\0\0\0\3\3\233\6\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\26\1\0"..., 322) = 322 20578 1454551845.103521 read(12, "\0\222\0\0\6\0\0\0\0\0\4\1\0\0\0\231!\1\1\0\0\0\0\0\0\0\0\0\6\0\0\0"..., 8208) = 146 20578 1454551845.104163 write(12, "\1\305\0\0\6\0\0\0\0\0\3G\234\370\200\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 453) = 453 20578 1454551845.104215 read(12, "\0\254\0\0\6\0\0\0\0\0\4\1\0\0\0\232!\1\0\0\0\0{\5\0\0\0\0\6\0\0\0"..., 8208) = 172 20578 1454551845.108028 write(12, "\0*\0\0\6\0\0\0\0\0\3N\235\6\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\375\1"..., 42) = 42 20578 1454551845.108079 read(12, "\0\254\0\0\6\0\0\0\0\0\4\1\0\0\0\233!\1\0\0\0\0{\5\0\0\0\0\6\0\0\0"..., 8208) = 172 20578 1454551845.111745 write(12, "\0\21\0\0\6\0\0\0\0\0\3\10\236\6\0\0\0", 17) = 17 20578 1454551845.111793 read(12, "\0\21\0\0\6\0\0\0\0\0\t\1\0\0\0\234!", 8208) = 17 20578 1454551845.112397 write(12, "\0)\0\0\6\0\0\0\0\0\3N\237\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551845.112475 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551845.113188 nanosleep({10, 0}, NULL) = 0 20578 1454551855.113399 write(12, "\0)\0\0\6\0\0\0\0\0\3N\240\2\0\0\0\1\0\0\0\340\0\0\0\0\0\0\0\7\2\301"..., 41) = 41 20578 1454551855.113477 read(12, "\0\334\0\0\6\0\0\0\0\0\6\1\2\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 8208) = 220 20578 1454551855.114217 nanosleep({10, 0}, NULL) = 0 ...
note:
strace 看进程好像在一直循环nanosleep, 查看了句柄41 是socket. 下面查看rman 是否有堵塞和等待。
SQL> select S.SID, S.SERIAL#, S.INST_ID , P.SPID SPID, S.USERNAME, S.STATUS, S.OSUSER || '@' || S.MACHINE OSINFO, SUBSTR(S.PROGRAM,0,20) PROGRAM, S.EVENT, S.SECONDS_IN_WAIT SIW from gv$session s, gv$process p where s.type <> 'BACKGROUND' and s.inst_id = p.inst_id and s.paddr = p.addr and s.program like 'rman%' SID SERIAL# INST_ID SPID USERNAME STATUS OSINFO PROGRAM EVENT SIW ---- ---------- ---------- -------- -- ----------- - -------- -------------- --------------- ------------------------------ ---------- 147 61685 2 16192 SYS INACTIVE oracle@anbob02 rman@anbob02 SQL*Net message from client 4930 1014 26077 2 13700 SYS INACTIVE oracle@anbob02 rman@anbob02 SQL*Net message from client 4909 1833 40727 2 12810 SYS INACTIVE oracle@anbob02 rman@anbob02 SQL*Net message from client 4910 299 60835 2 15130 SYS INACTIVE oracle@anbob02 rman@anbob02 SQL*Net message from client 4933 2653 1403 2 13628 SYS INACTIVE oracle@anbob02 rman@anbob02 SQL*Net message from client 5 196 61379 2 16980 SYS INACTIVE oracle@anbob02 rman@anbob02 SQL*Net message from client 4928 576 45345 2 14183 SYS ACTIVE oracle@anbob02 rman@anbob02 remote db operation 4910 1016 27113 2 14593 SYS INACTIVE oracle@anbob02 rman@anbob02 SQL*Net message from client 4935 55 21301 2 17624 SYS INACTIVE oracle@anbob02 rman@anbob02 SQL*Net message from client 4927 2021 49333 2 15689 SYS INACTIVE oracle@anbob02 rman@anbob02 SQL*Net message from client 4932
Note:
以DG 和wait event 为关键字在MOS中发现RMAN “duplicate From Active Database” Hangs Between 2 Datacenters using Cisco Firewall ( Doc ID 2007835.1 ) ,但是网络确认近期无网络变动,且不存在firewall。 开了个SR也倾向于网络原因, 启用一些EVNET 排查,因使用duplicateauxiliary instance 要以pfile启动, pfile 中增加下面event:
*.event='10294 trace name context forever, level 1: 10298 trace name context forever,level 3:logon trace name krb_trace level 15'
同时启用rman 启用debug
rman debug trace=/tmp/rman_debug msglog=/home/oracle/work/adg_setup_weejar.log << EOF ... EOF
并且查看了duplicate 进程的 call stack
SQL> oradebug short_stack ksedsts()+461<-ksdxfstk()+32<-ksdxen_int()+4179<-ksdxen()+14<-opiodr()+916<-ttcpip()+2210<-opitsk()+1717<-opiino()+966<-opiodr()+916<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+252<-main()+201<-__libc_start_main()+244<-_start()+36 -- wait 10s , repeat 3 times , output same as above SQL> select * from v$iostat_network ; select * from v$iostat_network ; --无网络IO # 从上面的event 和debug 中trace 发现了一些,其它没找到有价值的信息 ksfdwtio:count=4294967295 aioflags=0x20 timeout=2147483647 ksfd_osmwat:count=4294967295 intr=0 posted=(nil) tout=2147483647 ksfdwat_internal:count=4294967295 timeout=32 aioflags=2147483647 ksfdwtio:count=4294967295 aioflags=0x20 timeout=2147483647 ksfd_osmwat:count=4294967295 intr=0 posted=(nil) tout=2147483647 ksfdwat_internal:count=4294967295 timeout=32 aioflags=2147483647 ksfdwtio:count=4294967295 aioflags=0x20 timeout=2147483647 ksfd_osmwat:count=4294967295 intr=0 posted=(nil) tout=2147483647 ksfdwat_internal:count=4294967295 timeout=32 aioflags=2147483647 ksfdwtio:count=4294967295 aioflags=0x20 timeout=2147483647 Number of resource hash buckets is 15115 Large Pages allocation failed (free: 22981 required: 192) 2016-02-19 13:30:28.626: [ default]failed to initialize skgp context 2016-02-19 13:30:28.626: [ default]slos op : sslssreghdlr 2016-02-19 13:30:28.626: [ default]slos dep : Error 0 (0) 2016-02-19 13:30:28.626: [ default]slos loc : sskgpinit1 2016-02-19 13:30:28.626: [ default]slos info: [ CLWAL]clsw_Initialize: OLR initlevel [30000] 2016-02-19 13:30:28.627: [ default]a_init: Unable to get log name. Retval:[-4]
后来尝试在pfile逐渐增加了些参数,还有些是因为错误加了其它项 最终
*.audit_file_dest='/oracle/app/oracle/admin/rptstby/adump' *.compatible='11.2.0.3.0' *.control_files='+DATA/rptstby/controlfile/control01.ctl','+DATA/rptstby/controlfile/control02.ctl' *.db_block_size=16384 *.db_create_file_dest='+DATA' *.db_domain='' *.db_file_multiblock_read_count=32 *.db_file_name_convert='+data/orarpt/datafile/','+data/rptstby/datafile/','+data/ORARPT/tempfile/','+data/rptstby/tempfile/','+flashts/ORARPT/datafile/','+data/rptstby/datafile/','+reco/ORARPT/datafile/','+reco/rptstby/datafile/','+DBFS_DG/orarpt/datafile/','+DBFS_DG/rptstby/datafile/' *.db_lost_write_protect='TYPICAL' *.db_name='orarpt' *.db_unique_name='rptstby' *.diagnostic_dest='/oracle/app/oracle' *.log_archive_dest_1='location=+reco' *.log_archive_dest_2='' *.log_archive_dest_state_2='DEFER' *.log_archive_format='arch_%t_%s_%r.arc' *.log_file_name_convert='+data/orarpt/onlinelog/','+data/rptstby/onlinelog/' *.standby_file_management='AUTO' *.utl_file_dir='*' *.cluster_database=false orarpt2.thread=2 orarpt1.thread=1 orarpt1.instance_number=1 orarpt2.instance_number=2 *.sga_max_size=40G *.sga_target=0 *.shared_pool_size=6G *.db_files=2000 *.pga_aggregate_target=10120855552 *.processes=2000
再启动rman duplicate 错误变成了
RMAN-03009: failure of backup command on d1 channel at 03/14/2016 13:50:07
ORA-00245: control file backup failed; target is likely on a local file system
这个问题的解决方法
RMAN> show all; .. CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/oracle/app/oracle/product/11.2.0.3/dbhome_1/dbs/snapcf_orarpt1.f'; # default 修改到共享磁盘中 RMAN> CONFIGURE SNAPSHOT CONTROLFILE NAME TO '+data/snapcf_orarpt1.f';
经过上面修改了pfile和修改自动备份controlfile 目录到ASM, rman终于顺利开始copy datafile了,因网络限制用了4天3夜把终于把数据copy完了, 剩下工作不在详细描述的如
create spfile in ASM from pfile, modify clustr_database=true, create pfile on other nodes with spfile in ASM. $GIRD_HOME/bin/srvctl add database -d rptstby -o $ORACLE_HOME -p "+DATA/rptstby/parameterfile/spfileorarpt2.ora" -n orarpt -r physical_standby -s mount srvctl add instance -d rptstby -i orarpt1 -n qdexa1db01 srvctl add instance -d rptstby -i orarpt2 -n qdexa1db02 srvctl modify database -d rptstby -a srvctl start database -d rptstby –start real-time redo apply alter database recover managed standby database using current logfile disconnect from session; --stop redo apply alter database recover managed standby database cancel;
NOTE:
刚开始日志相差太多,如果启用ADG, open standby read-only时可能不会成功。
下面这个有些优化技术可以加速日志传速及日志应用。
1, 日志启用压缩传输 — on primary DB
如 LOG_ARCHIVE_DEST_2=’SERVICE=xxxx COMPRESSION=ENABLE ….’
2, 增加归档进程
增加 log_archive_max_processes — on primary DB
增加 log_archive_max_processes = log_archive_max_processes on primary* number of threads — on Standby DB
3, 调整内存参数
增加 standby DB 的 buffer cache 值,要大于primary DB
4, 启用并行日志恢复, 并行度 #CPUs * 2
如: alter database recover managed standby database PARALLEL 48 using current logfile disconnect from session;
# 查询
— 查看primary 库上的日志进度
select thread#,max(sequence#) from v$archived_log group by thread#; select * from v$log;
— 查看standby 库,日志传速进程,应用进程
select t.*,arched-applied gap from (select thread#,max(sequence#) arched, max(decode(applied,'YES',sequence#,1)) applied from v$archived_log group by thread#) t; select to_char(START_TIME,'dd.mm.yyyy hh24:mi:ss') "Recover_start",to_char(item)||' = '||to_char(sofar)||' '||to_char(units)||' '|| to_char(TIMESTAMP,'dd.mm.yyyy hh24:mi') "Values" from v$recovery_progress where start_time=(select max(start_time) from v$recovery_progress);