暂无图片
暂无图片
2
暂无图片
暂无图片
暂无图片

【OGG故障处理】目标端trail文件不完整故障处理

原创 闫伟 2021-07-27
4522

【OGG故障处理】目标端trail文件不完整故障处理

本文档指导如何通过指定pump的seq重新传输trail文件。具体故障的分析步骤和思路

目录

一、故障描述

1.1 环境描述

客户有一套Oracle 11.2.0.4.0 aix系统的RAC通过OGG同步数据到一体机X5上,一体机是12c数据库RAC

源端端OGG版本:

[oracle@xzjbrac2:/ogg]$ggsci

Oracle GoldenGate Command Interpreter for Oracle
Version 12.1.2.1.10 21604177 OGGCORE_12.1.2.1.0OGGBP_PLATFORMS_150902.1337
AIX 6, ppc, 64bit (optimized), Oracle 11g on Sep  2 2015 16:29:21
Operating system character set identified as ISO-8859-1.

Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserve

目标端OGG版本:

[oracle@XXXXXXXX ogg]$ ggsci

Oracle GoldenGate Command Interpreter for Oracle
Version 12.1.2.1.10 21604177 OGGCORE_12.1.2.1.0OGGBP_PLATFORMS_150902.1337_FBO
Linux, x64, 64bit (optimized), Oracle 12c on Sep  2 2015 21:49:08
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.

1.2 故障描述

客户通知OGG数据没有同步,OGG的源端pump和对端的rep都出现报错

GGSCI (XXXXXXXX.xzjb.com) 3>  info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    ABENDED     REPOA       00:00:00      00:17:37    
REPLICAT    ABENDED     REPOB       00:00:00      00:11:16    
REPLICAT    RUNNING     REPOC       00:00:00      00:00:05    
REPLICAT    ABENDED     REPOD       00:00:00      00:13:26    
REPLICAT    ABENDED     REPOE       04:52:01      56:12:42    
REPLICAT    ABENDED     REPOG       00:00:00      00:09:16    
REPLICAT    ABENDED     REPOH       00:00:00      00:05:27 


GGSCI (xzjbrac2) 7> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     ABENDED     DPOA        00:07:41      01:02:10    
EXTRACT     ABENDED     DPOB        00:16:43      04:04:26    
EXTRACT     ABENDED     DPOC        07:43:48      01:35:22    
EXTRACT     RUNNING     DPOD        00:00:00      00:00:00    
EXTRACT     ABENDED     DPOE        07:51:41      01:11:14    
EXTRACT     ABENDED     DPOG        04:39:57      01:35:20    
EXTRACT     ABENDED     DPOH        00:00:00      01:06:05    
EXTRACT     RUNNING     EXTOA       00:00:02      00:00:07    
EXTRACT     RUNNING     EXTOB       00:00:03      00:00:00    
EXTRACT     RUNNING     EXTOC       00:00:03      00:00:06    
EXTRACT     RUNNING     EXTOD       00:00:03      00:00:02    
EXTRACT     RUNNING     EXTOE       00:00:03      00:00:00    
EXTRACT     RUNNING     EXTOG       00:00:03      00:00:09    
EXTRACT     RUNNING     EXTOH       00:00:03      00:00:08 

二、分析过程

2.1 首先检查rep进程的报告


目标端:
2021-07-26 10:21:10  ERROR   OGG-01028  Error 22 (Invalid argument) reading /dbfs/my_dbfs/ogg/dirdat/repoa/ra001461 (rba 486139617).
....

源端:
EXTRACT dpoa
PASSTHRU
RMTHOST XX.XX.XX.XXXX, MGRPORT 17809
RMTTRAIL /dbfs/my_dbfs/ogg/dirdat/repoa/ra

2021-07-26 11:42:39  ERROR   OGG-01031  There is a problem in network communication, a remote file problem, encryption keys for target and source do not match (if using ENCRYPT) or an unknown err
or. (Reply received is error 22 reading queue file /dbfs/my_dbfs/ogg/dirdat/repob/rb006473 at rba 144174060 (Invalid argument)).

通过错误分析应该是文件系统或者网络出现问题。

2.2 两端进行网络ping

通过ping发现没有丢包

2.3 测试文件系统

du: cannot access `dsg/ds2/rmp/cfg.finishseq.old': Invalid argument
du: cannot access `dsg/ds2/rmp/cfg.finishseq': Invalid argument
du: cannot access `dsg/ds2/rmp/2.572445.xf1': Invalid argument
du: cannot access `dsg/ds2/vcfsa': Invalid argument
du: cannot access `dsg/ds2/vcfsd': Invalid argument
du: cannot access `dsg/ds2/exp': Invalid argument
du: cannot access `dsg/ds2/config': Invalid argument
du: cannot access `dsg/ds2/bin': Invalid argument
du: cannot access `dsg/ds2/scripts': Invalid argument
du: cannot access `dsg/ds2/lock': Invalid argument
发现大量Invalid argument报错

ls: reading directory .: Invalid argument
[oracle@XXXXXXXX my_dbfs]$ ls
bck  db_ssjzh  dp  dsg  exp1  expdp  expdp1  expdp3  import.log  ogg  rman2ban  rsrc0723.dmp  shenji  yyaa01.dmp
[oracle@XXXXXXXX my_dbfs]$ ls
ls: reading directory .: Invalid argument
[oracle@XXXXXXXX my_dbfs]$ 



[root@XXXXXXXX ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   11G   18G  40% /
tmpfs                 756G  1.8M  756G   1% /dev/shm
/dev/sda1             488M   47M  406M  11% /boot
/dev/mapper/VGExaDb-LVDbOra1
                       99G   70G   24G  75% /u01
dbfs-dbfs@:/           12T  6.4T  5.4T  54% /dbfs

2.4 操作系统日志

2.4.1 主机性能

Tasks: 5751 total,   6 running, 5745 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.7%us,  0.5%sy,  0.0%ni, 92.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  792034780k total, 734541072k used, 57493708k free,   604200k buffers
Swap: 25165820k total, 25165820k used,        0k free, 202649560k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                             
185681 oracle    20   0  149g  39g  39g R 100.0  5.2  38:45.83 

2.4.2 主机日志

/my_dbfs/dsg/ds2/xdt/2_19829.xdt 1001
Jul 26 11:27:01 XXXXXXXX dbfs_client[41428]: ASSERT failure::
Jul 26 11:27:01 XXXXXXXX dbfs_client[41428]: Assert in FSO
Jul 26 11:27:01 XXXXXXXX dbfs_client[41428]: #012
Jul 26 11:27:01 XXXXXXXX dbfs_client[41428]: Async thread exception /my_dbfs/dsg/ds2/rmp/rowid/o0003e543/f3e594_284.rmp 1001
Jul 26 11:27:01 XXXXXXXX dbfs_client[41428]: ASSERT failure::
Jul 26 11:27:01 XXXXXXXX dbfs_client[41428]: Assert in FSO
Jul 26 11:27:01 XXXXXXXX dbfs_client[41428]: #012
Jul 26 11:27:01 XXXXXXXX dbfs_client[41428]: ASSERT failure::
Jul 26 11:27:01 XXXXXXXX rsyslogd-2177: imuxsock begins to drop messages from pid 41428 due to rate-limiting

发现dbfs_client有报错,所有一体机的/dbfs有问题,再次说明/dbfs不是一般文件系统,是通过数据库的dbfs表空间封装的文件系统, 这里操作过exadata都是了解的,不在这里过多描述

2.5 检查dbfs数据库情况

2.5.1 数据库alert日志

  Current log# 1 seq# 1646699 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_1.280.942352145
Thread 1 cannot allocate new log, sequence 1646700
Checkpoint not complete
  Current log# 1 seq# 1646699 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_1.280.942352145
Mon Jul 26 16:53:49 2021
Thread 1 advanced to log sequence 1646700 (LGWR switch)
  Current log# 2 seq# 1646700 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_2.281.942352145
Thread 1 cannot allocate new log, sequence 1646701
Checkpoint not complete
  Current log# 2 seq# 1646700 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_2.281.942352145
Mon Jul 26 16:53:53 2021
Thread 1 advanced to log sequence 1646701 (LGWR switch)
  Current log# 1 seq# 1646701 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_1.280.942352145
Thread 1 cannot allocate new log, sequence 1646702
Checkpoint not complete
  Current log# 1 seq# 1646701 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_1.280.942352145
Mon Jul 26 16:53:55 2021
Thread 1 advanced to log sequence 1646702 (LGWR switch)
  Current log# 2 seq# 1646702 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_2.281.942352145
Thread 1 cannot allocate new log, sequence 1646703
Checkpoint not complete
  Current log# 2 seq# 1646702 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_2.281.942352145
Mon Jul 26 16:53:56 2021
Thread 1 advanced to log sequence 1646703 (LGWR switch)
  Current log# 1 seq# 1646703 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_1.280.942352145
Thread 1 cannot allocate new log, sequence 1646704
Checkpoint not complete
  Current log# 1 seq# 1646703 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_1.280.942352145
Mon Jul 26 16:53:59 2021
Thread 1 advanced to log sequence 1646704 (LGWR switch)
  Current log# 2 seq# 1646704 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_2.281.942352145
Thread 1 cannot allocate new log, sequence 1646705
Checkpoint not complete
  Current log# 2 seq# 1646704 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_2.281.942352145
Mon Jul 26 16:54:01 2021
Thread 1 advanced to log sequence 1646705 (LGWR switch)
  Current log# 1 seq# 1646705 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_1.280.942352145
Thread 1 cannot allocate new log, sequence 1646706
Checkpoint not complete
  Current log# 1 seq# 1646705 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_1.280.942352145
Mon Jul 26 16:54:05 2021
Thread 1 advanced to log sequence 1646706 (LGWR switch)
  Current log# 2 seq# 1646706 mem# 0: +DATAC1/DBFSDB/ONLINELOG/group_2.281.942352145
Thread 1 cannot allocate new log, sequence 1646707
Checkpoint not complete

这里没有报错,这是redo切换过快,说明IO比较繁忙而已

2.5.2 查看表空间

tablespace_name  tablespace(M)	  used(M)	use% free_space(M)	free% max_block(M)
---------------- ------------- ---------- ---------- ------------- ---------- ------------
SYSTEM			   810	   805.75      99.48	      4.25	  .52		 4
SYSAUX			  6100	  5611.56      91.99	    488.44	 8.01	       295
DBFS		      12288000 10226641.4      83.22	2061358.56	16.78	      2048
UNDOTBS1		 11565	  7196.06      62.22	   4368.94	37.78	   3594.44
USERS			     5	     1.69	33.8	      3.31	 66.2	      3.31
UNDOTBS2		  2425	    21.25	 .88	   2403.75	99.12	      2358


没有发现表空间占满的情况导致IO

2.5.3 查看REDO切换

TO_CHAR(FIRST_TIME,'	THREAD#   COUNT(*)
-------------------- ---------- ----------
2021-07-14		      1       2820
2021-07-15		      1       2899
2021-07-16		      1       3738
2021-07-17		      1       2496
2021-07-18		      1       2363
2021-07-19		      1       2861
2021-07-20		      1       3537
2021-07-21		      1       8153
2021-07-22		      1      10094
2021-07-23		      1       4591
2021-07-24		      1        207

TO_CHAR(FIRST_TIME,'	THREAD#   COUNT(*)
-------------------- ---------- ----------
2021-07-25		      1 	56
2021-07-26		      1        869

2.5.4 查看用户状态

  
col default_tablespace for a21
col temporary_tablespace for a21
select username,account_status,default_tablespace,temporary_tablespace,created from dba_users;

发现DBFS用户虽然是open的但是提示要过期了,估计这个是故障点

2.5.5 查看对象状态

select * from dba_objects where owner='DBFS' and status <> 'VALID';
对象正常

三、处理文件系统问题

3.1 文件系统状态

[oracle@xzjbdbadm01 ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   11G   18G  39% /
tmpfs                 756G  1.8M  756G   1% /dev/shm
/dev/sda1             488M   47M  406M  11% /boot
/dev/mapper/VGExaDb-LVDbOra1
                       99G   70G   24G  75% /u01
dbfs-dbfs@:/           12T  6.4T  5.4T  54% /dbfs
[oracle@xzjbdbadm01 ~]$ 

3.2 查看使用情况

ps -ef|grep dbfs
ps -ef|grep dbfs|grep dsg|wc -l
发现有处理OGG使用还是DSG同步使用

3.3 尝试umount文件系统

fusermount -u /dbfs
一直报错再忙

3.4 通知DSG关闭进程

3.5 关闭OGG所有的进程

stop *

3.6 调整用户的密码策略

USERNAME		       ACCOUNT_STATUS							PROFILE
------------------------------ ---------------------------------------------------------------- --------------------
DBFS			       OPEN								DEFAULT

alter profile "DEFAULT" limit PASSWORD_GRACE_TIME UNLIMITED;

alter profile "DEFAULT" limit PASSWORD_LIFE_TIME UNLIMITED;

alter profile "DEFAULT" limit PASSWORD_LOCK_TIME UNLIMITED;

alter profile "DEFAULT" limit FAILED_LOGIN_ATTEMPTS UNLIMITED;

3.7 重启DBFS数据库

srvctl stop database -d dbfsdb
srvctl start database -d dbfsdb

3.8 重新mount /dbfs

[oracle@xzjbdbadm01 ~]$ more mountdbfs.sh 
#! /bin/bash
 
#       Please change the $ORACLE_SID, database username and password, and the mount point
#       to match your environment.
 
ORACLE_SID=dbfsdb1
 
ORAENV_ASK=NO
# Use full path for "local" oraenv to cater for (lack of) clean login env
ORACLE_HOME=/u01/app/oracle/product/12.1.0.2/dbhome_1
 
echo dbfs | nohup ${ORACLE_HOME}/bin/dbfs_client dbfs@ -o allow_other,direct_io,max_readahead=0,server_readahead=272,max_threads=32 /dbfs &

3.9 测试文件系统IO正常

四、解决OGG故障

4.1 ogg状态

GGSCI (xzjbdbadm01.xzjb.com) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     REPOA       00:00:00      00:00:06    
REPLICAT    ABENDED     REPOB       00:00:00      00:17:14    
REPLICAT    ABENDED     REPOC       00:00:00      00:17:15    
REPLICAT    RUNNING     REPOD       00:00:00      00:00:03    
REPLICAT    ABENDED     REPOE       04:52:01      61:29:55    
REPLICAT    RUNNING     REPOG       00:00:00      00:00:00    
REPLICAT    ABENDED     REPOH       00:00:00      00:17:14  

GGSCI (xzjbrac2) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     ABENDED     DPOA        01:38:38      00:01:39    
EXTRACT     ABENDED     DPOB        04:43:55      00:07:39    
EXTRACT     ABENDED     DPOC        07:43:48      02:05:47    
EXTRACT     RUNNING     DPOD        00:00:00      00:00:03    
EXTRACT     RUNNING     DPOE        07:51:41      01:41:39    
EXTRACT     ABENDED     DPOG        06:38:01      00:07:42    
EXTRACT     ABENDED     DPOH        00:17:03      00:07:32    
EXTRACT     RUNNING     EXTOA       00:00:02      00:00:08    
EXTRACT     RUNNING     EXTOB       00:00:02      00:00:05    
EXTRACT     RUNNING     EXTOC       00:00:02      00:00:05    
EXTRACT     RUNNING     EXTOD       00:00:03      00:00:09    
EXTRACT     RUNNING     EXTOE       00:00:02      00:00:07    
EXTRACT     RUNNING     EXTOG       00:00:02      00:00:05    
EXTRACT     RUNNING     EXTOH       00:00:02      00:00:03   

4.2 应用进程日志报错

2021-07-26 15:38:45  ERROR   OGG-01028  Incompatible record (104) in /dbfs/my_dbfs/ogg/dirdat/repoc/rc000611, rba 177237 (getting he
ader).

说明由于IO问题导致trail文件记录不完整。需要重新传输相关tail文件

4.3 查看log_scn

在一体机上
SQL> select group_name,log_csn from ggadmin.checkpointtable;

GROUP_NA LOG_CSN
-------- --------------------------------------------------------------------------------------------------------------------------------
REPOE	 18257138616670
REPOC	 18257199285861
REPOB	 18257221962417  
REPOG	 18257168264659
REPOD	 18256668306120
REPOA	 17283784496659
REPOH	 18257145755986
上面记录应用的最近csn

4.4 对比报错的目标和源端日志

源端
[oracle@xzjbrac2:/ogg/dirdat/extoc]$ls -ltr
total 1012888
-rw-r-----    1 oracle   oinstall  499999925 Jul 26 14:35 lc000641
-rw-r-----    1 oracle   oinstall   18324871 Jul 26 16:26 lc000642
目标端
[oracle@xzjbdbadm01 ~]$ ll /dbfs/my_dbfs/ogg/dirdat/repoc/rc000611
-rw-r----- 1 oracle oinstall 659375 Jul 24 05:53 /dbfs/my_dbfs/ogg/dirdat/repoc/rc000611

4.5 通过logdump查询log_scn

logdump
open /ogg/dirdat/extoc/lc000641
ggstoken detail
filter logcsn = 18257199285861
next
发现scn的记录在/ogg/dirdat/extoc/lc000641

4.6 查看pump的详细状态

GGSCI (xzjbrac2) 2> info dpoc detail

EXTRACT    DPOC      Last Started 2021-07-26 14:08   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:03 ago)
Process ID           53085114
Log Read Checkpoint  File ./dirdat/extoc/lc000642
                     2021-07-26 16:28:20.000000  RBA 18644813

  Target Extract Trails:

  Trail Name                                       Seqno        RBA     Max MB Trail Type

  /dbfs/my_dbfs/ogg/dirdat/repoc/rc                  680   87555160        500 RMTTRAIL  

  Extract Source                          Begin             End     
  
  这里说明对端的seq为680;本地当前是642的seq

4.7 修改pump指定到641重新传输

stop dpoc
alter dpoc, extseqno 641, extrba 0
alter dpoc, etrollover  --切换
切换后对端应该是681了

4.8 应用进程调整

ggsci > alter repoc, extseqno 681, extrba 0   -从681开始
ggsci > start repoc, aftercsn 18257199285861  --从log_scn重新追加

4.9 按照类似操作处理其他进程

五、参考文档

esync Recover Data ( Tables , Trails ) Trails Are Missing, Damaged, Incompatible, Corrupt or Suspect for Extract or Replicat (Doc ID 1614665.1)

OGG-01028,unexpected LOB Record Or Record Fragment In ./dirdat/ea000563,rba1215 (文档 ID 1599376.1)

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论