暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

OGG同步大库时复制进程延迟问题处理

原创 小林番薯 云和恩墨 2021-10-08
2760

1.背景

某银行一套1T的数据库在做ogg升级迁移时,复制进程存在很大的延迟,并且出现卡死现象,RBA不变,延迟时间也不变,原先认为可能是由于库比较大,只使用一个进程做同步而导致延迟,则先对复制进程进程拆分,拆分后发现其中一个进程依旧卡死,通过获取数据库当前卡死的sql分析,此库存在大量无主键表,且有大量的delete操作,出现全表扫描,从而使ogg复制进程所执行的sql存在性能问题。

2.现象

GGSCI (testhost) 16> info testrep1

REPLICAT   testrep1  Last Started 2021-09-16 11:51   Status RUNNING

INTEGRATED

Checkpoint Lag       55:31:35 (updated 02:34:51 ago)

Process ID           47229

Log Read Checkpoint  File ./dirdattest1/lt000000092

2021-09-14 04:20:11.159523  RBA 28231954

3.问题定位

3.1查询新库性能消耗情况

可通过不同方式:

1)获取AWR查看等待事件以及top sql

发现存在大量的delete操作

2)通过查询数据库当前执行时间最长和次数最多的sql

select *

from (select sa.SQL_TEXT,

sa.SQL_FULLTEXT,

sa.EXECUTIONS “执行次数”,

round(sa.ELAPSED_TIME / 1000000, 2) “总执行时间”,

round(sa.ELAPSED_TIME / 1000000 / sa.EXECUTIONS, 2) “平均执行时间”,

sa.COMMAND_TYPE,

sa.PARSING_USER_ID “用户ID”,

u.username “用户名”,

sa.HASH_VALUE

from v$sqlarea sa

left join all_users u

on sa.PARSING_USER_ID = u.user_id

where sa.EXECUTIONS > 0

order by (sa.ELAPSED_TIME / sa.EXECUTIONS) desc)

where rownum <= 10;

3)ogg使用logdump工具查看当前RBA在执行什么事务:

$ logdump

Logdump 467 >POS 28231954

Reading forward from RBA 28231954

Logdump 468 >n

2021/09/14 04:20:20.561.577 Metadata             Len 708 RBA 28231954

Table Name: TESTOWNER.TESTTABLE

3040 be02 0000 0100 0200 0b00 0200 4000 0100 0600 | 0@…@…

0100 0200 0800 0200 0400 0000 0000 0300 0200 0000 | …

0400 0200 0000 0500 0800 0000 0000 0000 0000 0600 | …

0400 0000 0000 0700 0200 0000 0800 0400 0000 0000 | …

0340 6e02 0000 0800 4e00 3a00 0c00 5043 4f4d 5041 | .@n…N.:…PCOMPA

4e59 434f 4445 0000 0800 0000 0800 0000 0800 0000 | NYCODE…

0000 0000 0800 0000 0000 0000 0100 0000 6000 0100 | …`…

Logdump 469 >n

2021/09/14 04:20:18.000.000 Delete              Len   138 RBA 28232746

Name: TESTOWNER,TESTTABLE  (TDR Index: 9)

Before Image:                                             Partition x0c   G  b

0000 0a00 0000 0038 0038 0038 0038 0100 0c00 0000 | …8.8.8.8…

0800 5171 4eab 5206 884c 0200 2d00 0000 2900 2ae6 | …QqN.R…L…-…).*.

a48d e789 a9e6 b2b9 2ae6 9bbc e892 82e7 89b9 e4bc | …*…

a6e5 b0bc e6a9 84e6 a684 e6b2 b9e7 a4bc e79b 9203 | …

0005 0000 0001 0031 0400 0a00 0000 0000 0000 0000 | …1…

6720 0500 0400 ffff 0000 0600 0800 0000 0400 3230 | g …20

3231 0700 0c00 0000 0800 3230 3231 3039 3133      | 21…20210913

Logdump 470 >n

一直执行n,后可看到当前事物在对什么表对象做什么事物操作

4)通过给gv$session视图查看ogg用户所执行的sql

select sid,serial#,inst_id,status,program,sql_id from gv$session where program like ‘%AS%’ and username=’GOLDENGATE’;

3.2新库获取sql的执行计划

新库:

select sql_id,sql_text from vsqlarea where upper(sql_text) like '%TESTTABLE%'; (或使用select sid,serial#,inst_id,status,program,sql_id from gvsession where program like ‘%AS%’ and username=’GOLDENGATE’;)

SQL_ID                       
------------------- 
SQL_TEXT                
----------------------------------------------------------------------- 
55mw6aktcu387                                 
DELETE /*+ restrict_all_ref_cons */ FROM "TESTOWNER"."TESTTABLE"                                 
WHERE "PCOMPANYCODE"=:1    AND "PCOMPANYNAME"=:2    AND "COMMODITYNAME"=:3    AN                                  
D "LINECOUNT"=:4    AND "AMOUNTTAX"=:5    AND "RN" is null AND "DATA_SEASON"=:6                                 
   AND "BATCH_DATE"=:7    and rownum < 2   
  

select plan_table_output from table(dbms_xplan.display_cursor(‘55mw6aktcu387’,0));

SQL_ID  55mw6aktcu387, child number 0                                 
-------------------------------------                                 
 DELETE /*+ restrict_all_ref_cons */ FROM                                 
TESTOWNER."TESTTABLE" WHERE "PCOMPANYCODE"=:1    AND                                  
PCOMPANYNAME=:2    AND "COMMODITYNAME"=:3    AND "LINECOUNT"=:4                                 
AND "AMOUNTTAX"=:5    AND "RN" is null AND "DATA_SEASON"=:6    AND                                  
BATCH_DATE=:7    and rownum < 2                                 
                                  
Plan hash value: 1033196827                        
--------------------------------------------------------------------------------------------------                                  
| Id  | Operation           | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |                                  
--------------------------------------------------------------------------------------------------                                  
|   0 | DELETE STATEMENT    |                            |       |       |   230K(100)|          |                                  
|   1 |  DELETE             | TESTTABLE |       |       |            |          |                                  
|*  2 |   COUNT STOPKEY     |                            |       |       |            |          |                                  
|*  3 |    TABLE ACCESS FULL| TESTTABLE |     1 |    82 |   230K  (1)| 00:00:10 |                                  
--------------------------------------------------------------------------------------------------                                  
                                  
Predicate Information (identified by operation id):                                 
---------------------------------------------------                                 
                                  
   2 - filter(ROWNUM<2)                                 
   3 - filter(("COMMODITYNAME"=:3 AND "AMOUNTTAX"=:5 AND "LINECOUNT"=:4 AND                                 
              "BATCH_DATE"=:7 AND "PCOMPANYNAME"=:2 AND "DATA_SEASON"=:6 AND "PCOMPANYCODE"=:1 AND "RN"                                 
              IS NULL))                
            

问题1:在源库,不存在新库此类的问题,那么我们需要看看新库复制进程所执行的sql,在源库是如何执行的。

3.3源库获取对应sql

通过表名和操作类型进行筛选出源库想匹配的sql

select sql_id,sql_text from v$sqlarea where upper(sql_text) like ‘%TESTTABLE%’;

SQL_ID                                  
--------------------------                                  
SQL_TEXT                                  
--------------------------------------------------------------------------------                                  
                                  
3zar2g3g5kmrp                                 
DELETE FROM TESTOWNER.TESTTABLE T WHERE SUBSTR(T.DATA_SEASON, 1, 4                                  
) = :B1                                 

获取源库sql的执行计划

select plan_table_output from table(dbms_xplan.display_cursor(‘3zar2g3g5kmrp’,0));

PLAN_HASH_VALUE CHILD_NUMBER  OPTIMIZER ID  OPERATION OBJECT_OWNER  OBJECT_NAME OBJECT_TYPE ROWS  BYTES COST  Cost (%CPU) TIME

474800996 ..  ALL_ROWS  0 DELETE STATEMENT                                                                          35376             
474800996 ..            1    DELETE                      TESTOWNER    TESTTABLE                      
474800996 ..            2       TABLE ACCESS FULL        TESTOWNER  TESTTABLE    TABLE   106869   641214     35376     0.62   425         

3.4对比俩端的sql

可以看出新库ogg的复制进程所执行的sql,将表所有列都做个where条件过滤,并且俩边都是走全表扫描,而新库所消耗为230k,源库消耗35376,新库的消耗是源库的6.5倍。

查看该表的主键和索引情况:

select index_name from dba_indexes where owner=’TESTOWNER’ and table_name=‘TESTTABLE’;

select CONSTRAINT_NAME,CONSTRAINT_TYPE from dba_constraints where owner=’ TESTOWNER’ and table_name=‘TESTTABLE’;

该表既没有主键,也没有索引,则ogg将该表的所有列都添加附加日志,所有列都被作为where条件过滤。

4.问题处理

4.1在新库对该表进程创建索引

查询表每列的选择率,选择率越高则越优作为创建索引的列,由于表无主键,我们可以选择多列索引

select * from

(

select A.TABLE_NAME,

A.OWNER,

A.column_name,

A.num_distinct,

B.NUM_ROWS,

TRUNC(A.num_distinct / B.NUM_ROWS * 100) pct,

row_number() over(partition by a.table_name order by A.num_distinct desc ) rn

from dba_tab_col_statistics A, DBA_TABLES B

WHERE A.TABLE_NAME = B.TABLE_NAME

AND A.table_name in (’‘TESTTABLE’,‘TESTTABLE2’)

) where rn <=4

ORDER BY table_name,num_distinct desc;

TABLE_NAME   OWNER                COLUMN_NAME                    NUM_DISTINCT   NUM_ROWS        PCT         RN                                  
------------------------------ -------------------- ------------------------------ ------------ ---------- ---------- ----------                                  
TESTTABLE     TESTOWNER               COMMODITYNAME                        62603109   69559010          90          1                                  
TESTTABLE     TESTOWNER               AMOUNTTAX                             75936   69559010          0          2                                  
TESTTABLE     TESTOWNER               LINECOUNT                              1269   69559010          0          3                                  
TESTTABLE     TESTOWNER               BATCH_DATE                              434   69559010          0          4        

创建索引

create index TESTOWNER.IDX_TESTTABLE on TESTOWNER.TESTTABLE (COMMODITYNAME,AMOUNTTAX,LINECOUNT) tablespace tablespace_name parallel 4 online;

alter index TESTOWNER.IDX_TESTTABLE  noparallel;

4.2再次查询新库sql的执行情况

select sql_id,sql_text from v$sqlarea where upper(sql_text) like ‘%TESTTABLE%’;

SQL_ID                                  
-------------                                 
SQL_TEXT                                  
--------------------------------------------------------------------------------                                  
55mw6aktcu387                                 
 DELETE /*+ restrict_all_ref_cons */ FROM "TESTOWNER"."TESTTABLE"                                 
WHERE "PCOMPANYCODE"=:1    AND "PCOMPANYNAME"=:2    AND "COMMODITYNAME"=:3    AN                                  
D "LINECOUNT"=:4    AND "AMOUNTTAX"=:5    AND "RN" is null AND "DATA_SEASON"=:6                                 
   AND "BATCH_DATE"=:7    and rownum < 2   

可查询sql的不同执行计划CHILD_NUMBER值进行比对

select sql_id,CHILD_NUMBER,PLAN_HASH_VALUE,last_active_time,EXECUTIONS,BUFFER_GETS,CPU_TIME/1e6,ELAPSED_TIME/1e6  from v$sql where sql_id=‘55mw6aktcu387’;

查询执行计划

select plan_table_output from table(dbms_xplan.display_cursor(‘55mw6aktcu387’,1));

SQL_ID  55mw6aktcu387, child number 1                                 
-------------------------------------                                 
 DELETE /*+ restrict_all_ref_cons */ FROM                                 
TESTOWNER."TESTTABLE" WHERE "PCOMPANYCODE"=:1    AND                                  
PCOMPANYNAME=:2    AND "COMMODITYNAME"=:3    AND "LINECOUNT"=:4                                 
AND "AMOUNTTAX"=:5    AND "RN" is null AND "DATA_SEASON"=:6    AND                                  
BATCH_DATE=:7    and rownum < 2                                 
                                  
Plan hash value:                                  
                                  
---------------------------------------------------------------------------------------------------------------------------                                     
| Id  | Operation                     | Name                                 | Rows  | Bytes | Cost (%CPU)| Time     |                                  
---------------------------------------------------------------------------------------------------------------------------                                     
|   0 | DELETE STATEMENT              |                                   |       |       |   190  (1) |          |                                  
|   1 |  DELETE                       | TESTTABLE                          |       |       |            |          |                                  
|*  2 |   COUNT STOPKEY               |                                   |       |       |            |          |                                  
|*  3 |    TABLE ACCESS BY INDEX ROWID| TESTTABLE                          |     1 |    82 |   190  (1) | 00:00:01 |     
|*  4 |      INDEX RANGE SCAN         | IDX_TESTTABLE               |     1 |    82 |   5  (1)   | 00:00:01 |                               
--------------------------------------------------------------------------------------------------------------------------- 

加索引后的sql消耗为190,消耗减少了1210倍。复制进程也开始恢复正常。

4.3查询统计信息情况

注意:在新库做同步时,会将触发器给禁用了,统计信息容易出现不准确的现象,在出现延迟,并执行计划正常时,我们需要确认一下统计信息的情况,必要时手动收集一下统计信息

select table_name,NUM_ROWS,LAST_ANALYZED from dba_tables where owner=’ TESTOWNER ’ and table_name=’ TESTTABLE ';

针对某个表进行收集统计信息:

exec dbms_stats.gather_table_stats(ownname=>‘TESTOWNER’,tabname=>‘TESTTABLE’,degree=>4);

按用户进行收集统计信息:

exec dbms_stats.gather_schema_stats(ownname=>‘TESTOWNER’,degree=>4);

5.扩展内容

5.1复制进程拆分

1.停止源端投递进程,并记录其csn相关信息(重要)

GGSCI>stop TESTREP1

Sending STOP request to REPLICAT TESTREP1 …

Request processed.

GGSCI>info TESTREP1

2.停止当前应用进程,并记录其csn相关信息

GGSCI>info  TESTREP1

REPLICAT   TESTREP1   Last Started 2016-08-31 22:09   Status RUNNING

Checkpoint Lag       00:00:00 (updated 00:00:07 ago)

Process ID           65116

Log Read Checkpoint  File ./dirdat/lr000001429

2016-09-21 15:27:14.147180  RBA 57049346

3.dblogin 登录数据库

GGSCI>dblogin USERID ogg, password xxxxxx

4.从老的应用进程参数配置中,将拆分出的表删除

GGSCI>edit params TESTREP1

5.为新的应用进程进行参数配置,将拆分出的表加入

GGSCI>edit params NEW1REP1

6.添加新的进程

GGSCI>add replicat NEW1REP1 integrated,exttrail ./dirdat/**

7.修改目标端应用进程开始抽取点(以原来的应用进程scn信息为准)

GGSCI > alter NEW1REP1,extseqno 1429,extrba 57049346

8.启动拆分后的两个(多个)进程并查看其状态

GGSCI > start NEW1REP1

GGSCI > info NEW1REP1

GGSCI > start TESTREP1

GGSCI > info TESTREP1

9.启动源端投递进程,并查看其状态

5.2ogg复制进程的开启10046事件

1,修改复制进程参数文件

添加一下参数,然后重启复制进程

sqlexec “alter session set tracefile_identifier=‘ogg_debug’”

sqlexec “alter session set statistivs_level=all”

sqlexec “alter session set max_dump_file_size_unlimited”

sq;exec “alter session set events ‘10046 trace name context forever,level 12’”

2.重启进程

3.查看跟踪文件

到数据库的日志文件目录里,查看ogg_bebug开头的trace文件

最后修改时间:2021-10-08 17:17:34
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论