– OGG 大事务处理 特大事务处理[应用进程]
故障分析1 :OGG应用进程延时非常高
Program Status Group Lag at Chkpt Time Since Chkpt
REPLICAT RUNNING R_07 00:00:04 39:27:19
故障分析2:检查进程日志位置,应用位置不变,应用检查点是6号文件
GGSCI (ogg_db1) 38> info R_07
REPLICAT R_07 Last Started 2021-08-26 09:18 Status RUNNING
INTEGRATED
Checkpoint Lag 00:00:04 (updated 40:24:14 ago)
Process ID 97630 -- 可以使用processid和v$session关联,查看执行的SQL
Log Read Checkpoint File ./dirdat/b3000000006
2021-08-24 17:38:31.000000 RBA 887034291
故障分析3:send status检查应用进程状态,命令被卡住无反应
GGSCI (ogg_db1) 52> send R_07,status
故障分析4:直接KILL掉应用进程,重新启动
GGSCI (ogg_db1) 53> kill R_07
GGSCI (ogg_db1) 54> start R_07
故障分析5:启用应用进程之后,info R_07 还是6号文件位置,send status在变,说明在应用大事务
GGSCI (ogg_db1) 60> send R_07,status
Sending STATUS request to REPLICAT R_07 ...
Current status: Processing data
Sequence #: 8
RBA: 1,403,433,736
14,800,801 records in current transaction.
STOP request pending end-of-transaction (14,800,801 records so far).
GGSCI (ogg_db1) 61> send R_07,status
Sending STATUS request to REPLICAT R_07 ...
Current status: Processing data
Sequence #: 9 -- sequence 和 rba在变
RBA: 874,973,822
19,601,669 records in current transaction.
STOP request pending end-of-transaction (19,601,669 records so far). -- 已经读入了1900万数据(未提交)
-- 故障分析6:检查目标数据库中应用进程
SYS@orcl(orcldg): 1> select sid,serial#,event,status from v$session where username='OGG_ADMI' and process=97630;
SID SERIAL# EVENT status
------- ------------------ ---------------------------------------- ----------
10988 3527 latch: shared pool ACTIVE
SYS@orcl(orcldg): 10988> @sesscli 10988
SID_SERIAL
INST_PID USERNAME OSUSER status EVENT machine TERMINAL PROGRAM CLIENT_INFO
---------------------- --------------- ---------- ---------- ------------------------------ --------------- ---------- -------------------- --------------------
10988,3527 1 260855 OGG_ADMI oracle ACTIVE Streams: resolve low memory co khfwpt-scoggdb- replicat@khfwpt-scog
SID_SERIAL
INST_PID SQL_ID STATE LAST_EXECTIME CALLET SEQ# BLOCKING
---------------------- -------------- ----------------- -------------- ------ ------ --------
10988,3527 1 260855 .2vh58kxmbt9gc WAITING 08-26 10:44:54 18S 14145
发现进程等待事件1: latch: shared pool
发现进程等待事件2: Streams: resolve low memory condition
未直接发现进程应用相关的SQL
-- 检查数据库负载和内存stream参数
SYS@orcl(orcldg): 1> !free -g
total used free shared buffers cached
Mem: 1008 462 546 1 0 11
-/+ buffers/cache: 449 558
Swap: 15 0 15
SYS@orcl(orcldg): 1> @sga
-- V$SGAINFO
NAME SIZE UN RES
------------------------------- ------------------ -- ---
Maximum SGA Size 408.2 GB No
Buffer Cache Size 319.5 GB Yes
Streams Pool Size 45.5 GB Yes
Shared Pool Size 35.5 GB Yes
Startup overhead in Shared Pool 21.1 GB No
Large Pool Size 3.5 GB Yes
Java Pool Size 3.5 GB Yes
Redo Buffers 684 MB No
Granule Size 512 MB No
Fixed SGA Size 2 MB No
Shared IO Pool Size 0 B Yes
Free SGA Memory Available 0 B
top - 10:55:25 up 4 days, 21:29, 4 users, load average: 12.52, 13.26, 13.86
Tasks: 7438 total, 16 running, 7422 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.0%us, 1.7%sy, 0.0%ni, 91.0%id, 0.1%wa, 0.0%hi, 0.2%si, 0.0%st
-- 故障分析7:检查UNDO,发现一个大事务,用到了8GB的UNDO空间
T-S TRA OSUSER
SEGMENT_NAME START_TIME SQL_ID SID_SERIAL STAT used_ublk SIZE_MB BLKS MACHINE PROGRAM TERMINAL USERNAME
------------------------- ----------------- -------------- ------------------------- ---- --------- ---------- ---------- ---------- --------------- ------------------------------ -------------------
_SYSSMU89_178821965$ 08/26/21 10:56:03 5k2jgfrr8nyav. 9244,1051 363181 A A 9 8 1024 orcl3 oracle@orcl3 ( UNKNOWN oracle OGG_ADMI
_SYSSMU77_682973236$ 08/26/21 09:21:46 9x4xnbb41m42r. 13764,7177 269166 A A 968193 7593 971904 orcl3 oracle@orcl3 ( UNKNOWN oracle OGG_ADMI
-- 检查大事务的SQL和进程:
SYS@orcl(orcldg): 1> @sql 9x4xnbb41m42r
DELETE /*+ restrict_all_ref_cons */ FROM "gz1"."MK_*_LOG" WHERE "FLOW_SN"=:1
SYS@orcl(orcldg): 1> @sesscli 13764
SID_SERIAL
INST_PID USERNAME OSUSER STATUS EVENT machine TERMINAL PROGRAM CLIENT_INFO
---------------------- --------------- ---------- ------------ ------------------------------ --------------- ---------- -------------------- --------------------
13764,7177 1 269166 OGG_ADMI oracle ACTIVE latch: row cache objects orcl3 UNKNOWN oracle@orcl3 (AS1E)
SID_SERIAL
INST_PID USERNAME OSUSER status EVENT machine TERMINAL PROGRAM CLIENT_INFO
---------------------- --------------- ---------- ---------- ------------------------------ --------------- ---------- -------------------- --------------------
13764,7177 1 269166 OGG_ADMI oracle ACTIVE gc current request orcl3 UNKNOWN oracle@orcl3 (AS1E)
1 row selected.
-- 故障分析8: 查询出来的表在OGG应用配置中匹配
-- 发现大事务操作的表在这个异常的进程中,基本定位问题
-- 和应用调协,这个一个不太重要的日志分区表,有13亿数据量,可以先跳过这张日志表的同步,待追平之后再启用这张表的同步,后续由应用自己处理数据表差异
MAP gz1.MK_*_LOG ,TARGET gz1.MK_*_LOG;
-- 解决方案步骤:
1 OGG端kill掉应用进程
GGSCI (ogg_db1) 53> kill R_07
2 数据库查看并杀掉大事务,等待完成回滚
SYS@orcl(orcldg): 1> ! kill -9 269166
SYS@orcl(orcldg): 1> ! kill -9 260855
3 应用进程中取消大事务表同步
-- MAP gz1.MK_*_LOG ,TARGET gz1.MK_*_LOG;
4 启动应用进程
GGSCI (ogg_db1) 53> start R_07
-- 跳过这个大事务日志表之后,应用队列文件从sequence# 6 直接跳到了sequence# 19 (2000M一个队列文件)
-- 之前重启应用进程,应用了3个小时,用了8GB的UNDO空间,日志文件读了2个多一点
5 待追平之后,重新启用日志表同步,停启应用进程
-- 加上冲突解决
MAP gz1.MK_*_LOG ,TARGET gz1.MK_*_LOG,RESOLVECONFLICT(INSERTROWEXISTS,(DEFAULT,OVERWRITE)),RESOLVECONFLICT(DELETEROWMISSING,(DEFAULT,DISCARD)),RESOLVECONFLICT(UPDATEROWMISSING,(DEFAULT,OVERWRITE));
-- 总结
1 特大分区日志表,删除数据操作维护,需要按照分区、子分区,按主键或rowid范围拆分成小事务再提交,或者直接truncate,避免delete操作
2 OGG对用大事务,长事务的支持不是很友好,应尽量避免大事务,长事务
最后修改时间:2021-08-26 14:44:11
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




