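Scope
Applies to all OGG versions.
Problem Summary
On November 11, 2022, a customer reported that a Replicat process on the OGG target had a large lag: its RBA was not advancing and the trail file was no longer being updated.
On the source, the capture Extract turned out to have ABENDED, which is consistent with the time the Replicat stopped replicating.
ggserr.log reported: Unable to write to file "./dirdat/ec005114" (error 28, No space left on device), so the directory had clearly run out of space. Checking the filesystem confirmed that the OGG filesystem was 100% full.
The mgr process was configured to purge expired trail files automatically, but no purging had taken place. Investigation showed that the capture Extract and the pump Extract on the source each had two Extract Trail paths registered, and the ones with Seqno 0 were invalid.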
Fault Analysis
1. A Replicat process on the OGG target shows a large lag.
2. First check the process RBA: it is not advancing.
REPLICAT REP1 Last Started 2022-05-31 10:20 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:00 ago)
Log Read Checkpoint File ./dirdat/ec005133
2022-11-11 08:37:30.897554 RBA 60803824
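"Not advancing" in step 2 means that repeated info calls keep returning the same trail file and RBA. A minimal sketch of that comparison, assuming the output of two info calls taken some time apart has already been captured as text (the function names read_checkpoint and is_stalled are hypothetical and not part of OGG):

```python
import re

def read_checkpoint(info_output: str) -> tuple[str, int]:
    """Extract (trail file, RBA) from captured GGSCI `info` output."""
    file_match = re.search(r"Log Read Checkpoint\s+File\s+(\S+)", info_output)
    rba_match = re.search(r"\bRBA\s+(\d+)", info_output)
    if file_match is None or rba_match is None:
        raise ValueError("checkpoint file or RBA not found in output")
    return file_match.group(1), int(rba_match.group(1))

def is_stalled(first_sample: str, second_sample: str) -> bool:
    """True when both samples report the same trail file and the same RBA."""
    return read_checkpoint(first_sample) == read_checkpoint(second_sample)

# Sample text copied from the output shown in step 2.
sample = """REPLICAT REP1 Last Started 2022-05-31 10:20 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:00 ago)
Log Read Checkpoint File ./dirdat/ec005133
2022-11-11 08:37:30.897554 RBA 60803824"""

print(read_checkpoint(sample))
print(is_stalled(sample, sample))
```

A stalled RBA alone does not say which side is at fault. It only tells you to keep looking, as the following steps do.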
3. Neither ggserr.log nor the process report file shows any errors, and v$session shows no OGG-related large-transaction sessions.
set linesize 145
set pagesize 11111
col username for a12
col PROGRAM for a18
col MACHINE for a15
col EVENT for a34
col TERMINAL for a15
col osuser for a15
col sql_id for a13
col STATUS for a8
col sid for 9999
col serial# for 9999999
select username,SID,SERIAL#,BLOCKING_INSTANCE,blocking_session,BLOCKING_SESSION_STATUS,STATUS,MACHINE,PROGRAM,TERMINAL,SQL_ID,EVENT from v$session where username is not null and event not like '%message%' and username='OGG' order by event;
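4. Check the trail file the Replicat is reading: its modification time matches the Replicat's Log Read Checkpoint time, which means the source has not delivered any trail data to the target since then.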
[oracle@host dirdat]$ ls -l ec005133
-rw-r----- 1 oracle oinstall 60975590 Nov 11 08:37 ec005133
[oracle@host dirdat]$
5. Check the processes on the source: the capture process has abended.
GGSCI (host) 1> info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     PUMP1       00:00:02      00:00:07
EXTRACT     ABENDED     EXT1        unknown       00:00:03
6. Check ggserr.log: the error messages show that the device has run out of space.
2022-11-11 08:13:39 ERROR OGG-01096 Oracle GoldenGate Capture for Oracle, ext1.prm: Unable to write to file "./dirdat/ec005114" (error 28, No space left on device).
2022-11-11 08:13:41 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, ext1.prm: PROCESS ABENDING.
7. Check filesystem usage: the OGG filesystem is indeed 100% full.
[oracle@host dirdat]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-LogVol_root
99G 8.5G 85G 10% /
tmpfs 505G 658M 505G 1% /dev/shm
/dev/sda1 240M 33M 195M 15% /boot
/dev/mapper/VolGroup-LogVol_oracle
99G 64G 30G 69% /u01
/dev/mapper/oggvg-lvogg
96G 96G 0 100% /oggfs
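The same check can be scripted so that a nearly full trail filesystem is noticed before Extract abends. A minimal sketch using Python's standard shutil.disk_usage; the 90% threshold and the function names are assumptions chosen for illustration, and the path would be /oggfs in this incident:

```python
import shutil

def percent_used(used: int, available: int) -> float:
    """Usage computed like df's Use% column: used / (used + available)."""
    total = used + available
    if total == 0:
        return 0.0
    return 100.0 * used / total

def filesystem_percent_used(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.free)

def needs_attention(path: str, threshold: float = 90.0) -> bool:
    """True when usage is at or above the (assumed) alert threshold."""
    return filesystem_percent_used(path) >= threshold

# "." is used so the sketch runs anywhere; point it at the OGG mount instead.
path = "."
print(f"{filesystem_percent_used(path):.1f}% used, alert={needs_attention(path)}")
```

df rounds its percentage up, so the value printed here can differ from df's by up to one point.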
8. Find out what is using the most space: it is the trail files.
[oracle@host oggfs]$ du -sh dirdat
93G dirdat
9. mgr is configured to purge expired trail files automatically, yet nothing has been purged.
GGSCI (host) 3> view param mgr
port 8899
DYNAMICPORTLIST 8899-9988
--autostart er *
autorestart extract *, retries 5, waitminutes 1
purgeoldextracts ./dirdat/*, usecheckpoints, minkeepdays 1
userid ogg, password ogg
purgeddlhistory minkeepdays 15, maxkeepdays 30
purgemarkerhistory minkeepdays 15, maxkeepdays 30
10. Check the exttrail information: each process has two Extract Trail paths, and the ones with Seqno 0 are invalid.
GGSCI (host) 1> info exttrail *
Extract Trail: /oggfs/dirdat/ec
Extract: PUMP1
Seqno: 0
RBA: 0
File Size: 100M
Extract Trail: ./dirdat/ec
Extract: PUMP1
Seqno: 5345
RBA: 37735873
File Size: 100M
Extract Trail: /oggfs/dirdat/ec
Extract: EXT1
Seqno: 0
RBA: 0
File Size: 100M
Extract Trail: ./dirdat/ec
Extract: EXT1
Seqno: 5328
RBA: 22457198
File Size: 100M
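Spotting the invalid entries by eye is easy with two processes but tedious with many. The sketch below parses captured `info exttrail *` text and flags entries that sit at Seqno 0 and RBA 0 while the same Extract has another trail registered. This is only a heuristic assumed for illustration (a trail that was just added also starts at 0), and the function names are hypothetical:

```python
from collections import defaultdict

def parse_exttrail(text: str) -> list[dict]:
    """Turn captured `info exttrail *` output into one dict per trail entry."""
    entries, current = [], None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Extract Trail":
            current = {"trail": value}
            entries.append(current)
        elif current is not None and key == "Extract":
            current["extract"] = value
        elif current is not None and key in ("Seqno", "RBA"):
            current[key.lower()] = int(value)
    return entries

def suspect_entries(entries: list[dict]) -> list[dict]:
    """Entries at Seqno 0 / RBA 0 whose Extract also owns another trail."""
    by_extract = defaultdict(list)
    for entry in entries:
        by_extract[entry.get("extract")].append(entry)
    suspects = []
    for trails in by_extract.values():
        if len(trails) > 1:
            suspects += [t for t in trails
                         if t.get("seqno") == 0 and t.get("rba") == 0]
    return suspects

# Text copied from the output shown in step 10.
exttrail_output = """Extract Trail: /oggfs/dirdat/ec
Extract: PUMP1
Seqno: 0
RBA: 0
File Size: 100M
Extract Trail: ./dirdat/ec
Extract: PUMP1
Seqno: 5345
RBA: 37735873
File Size: 100M
Extract Trail: /oggfs/dirdat/ec
Extract: EXT1
Seqno: 0
RBA: 0
File Size: 100M
Extract Trail: ./dirdat/ec
Extract: EXT1
Seqno: 5328
RBA: 22457198
File Size: 100M"""

for entry in suspect_entries(parse_exttrail(exttrail_output)):
    print(entry["extract"], entry["trail"])
```

Treat the result as a list of candidates to review, not as a list of entries that are safe to delete.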
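11. Try to delete the invalid Extract Trail path: it cannot be deleted while the processes are running. (Make absolutely sure that no other process is using the Extract Trail path you are about to delete.)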
GGSCI (host) 6> DELETE EXTTRAIL /oggfs/dirdat/ec
Cannot delete extract trail /oggfs/dirdat/ec, extract PUMP1 is running.
Cannot delete extract trail /oggfs/dirdat/ec, extract EXT1 is running.
12. Stop the capture and pump processes first.
GGSCI (host) 11> stop ext1
Sending STOP request to EXTRACT EXT1 ...
Request processed.
GGSCI (host) 12> stop pump1
Sending STOP request to EXTRACT PUMP1 ...
Request processed.
13. Delete the invalid Extract Trail path again.
GGSCI (host) 13> DELETE EXTTRAIL /oggfs/dirdat/ec
Deleting extract trail /oggfs/dirdat/ec for extract PUMP1
Deleting extract trail /oggfs/dirdat/ec for extract EXT1
14. Check the exttrail information again: the invalid Extract Trail paths have been removed.
GGSCI (host) 14> info exttrail *
Extract Trail: ./dirdat/ec
Extract: PUMP1
Seqno: 5345
RBA: 94634565
File Size: 100M
Extract Trail: ./dirdat/ec
Extract: EXT1
Seqno: 5328
RBA: 79370115
File Size: 100M
15. The log shows that mgr has started purging expired trail files automatically.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005000, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5000.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005001, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5001.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005002, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5002.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005003, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5003.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005004, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5004.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005005, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5005.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005006, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5006.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005007, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5007.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005008, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5008.
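Root Cause
On the source, the capture Extract and the pump Extract each had two Extract Trail paths registered, one of which was invalid. As a result, mgr could not purge expired trail files, the OGG filesystem filled up, and the capture process finally ABENDED with "No space left on device".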
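Why a single stale entry blocks all purging can be illustrated with a simplified model of the usecheckpoints rule. It is inferred from the "Oldest Chkpt Seqno 5329 > 5000" wording in the purge log and from the behavior seen in this incident, not from Manager's actual implementation: a trail file is treated as purgeable only when its sequence number is below the oldest checkpoint among all entries registered for that trail. The function name is hypothetical, and minkeepdays is ignored here.

```python
def purgeable(file_seqnos, checkpoint_seqnos):
    """Seqnos of trail files that lie below the oldest registered checkpoint."""
    checkpoints = list(checkpoint_seqnos)
    if not checkpoints:
        return []
    oldest = min(checkpoints)
    return [seqno for seqno in file_seqnos if seqno < oldest]

# Trail files ec005000 .. ec005008, the ones that appear in the purge log.
files_on_disk = range(5000, 5009)

# Before the fix: the stale entries at Seqno 0 drag the oldest checkpoint to 0.
before = purgeable(files_on_disk, [0, 5345, 0, 5328])

# After the fix: 5329 is the oldest checkpoint seqno reported in the purge log.
after = purgeable(files_on_disk, [5345, 5329])

print(len(before), len(after))
```

Under this model the stale Seqno 0 entries make every file look as if it were still needed, which matches the filesystem filling up despite purgeoldextracts being configured.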
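Solution
1. Stop the processes that have an invalid Extract Trail path (the path cannot be deleted while they are running).
2. Delete the invalid Extract Trail path from those processes.
3. Start the OGG processes again; the automatic trail purge configuration then takes effect.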




