1.背景
某银行一套1T的数据库在做ogg升级迁移时,复制进程存在很大的延迟,并且出现卡死现象,RBA不变,延迟时间也不变,原先认为可能是由于库比较大,只使用一个进程做同步而导致延迟,则先对复制进程进程拆分,拆分后发现其中一个进程依旧卡死,通过获取数据库当前卡死的sql分析,此库存在大量无主键表,且有大量的delete操作,出现全表扫描,从而使ogg复制进程所执行的sql存在性能问题。
2.现象
GGSCI (testhost) 16> info testrep1
REPLICAT testrep1 Last Started 2021-09-16 11:51 Status RUNNING
INTEGRATED
Checkpoint Lag 55:31:35 (updated 02:34:51 ago)
Process ID 47229
Log Read Checkpoint File ./dirdattest1/lt000000092
2021-09-14 04:20:11.159523 RBA 28231954
3.问题定位
3.1查询新库性能消耗情况
可通过不同方式:
1)获取AWR查看等待事件以及top sql
发现存在大量的delete操作
2)通过查询数据库当前执行时间最长和次数最多的sql
select *
from (select sa.SQL_TEXT,
sa.SQL_FULLTEXT,
sa.EXECUTIONS “执行次数”,
round(sa.ELAPSED_TIME / 1000000, 2) “总执行时间”,
round(sa.ELAPSED_TIME / 1000000 / sa.EXECUTIONS, 2) “平均执行时间”,
sa.COMMAND_TYPE,
sa.PARSING_USER_ID “用户ID”,
u.username “用户名”,
sa.HASH_VALUE
from v$sqlarea sa
left join all_users u
on sa.PARSING_USER_ID = u.user_id
where sa.EXECUTIONS > 0
order by (sa.ELAPSED_TIME / sa.EXECUTIONS) desc)
where rownum <= 10;
3)ogg使用logdump工具查看当前RBA在执行什么事务:
$ logdump
Logdump 467 >POS 28231954
Reading forward from RBA 28231954
Logdump 468 >n
2021/09/14 04:20:20.561.577 Metadata Len 708 RBA 28231954
Table Name: TESTOWNER.TESTTABLE
3040 be02 0000 0100 0200 0b00 0200 4000 0100 0600 | 0@…@…
0100 0200 0800 0200 0400 0000 0000 0300 0200 0000 | …
0400 0200 0000 0500 0800 0000 0000 0000 0000 0600 | …
0400 0000 0000 0700 0200 0000 0800 0400 0000 0000 | …
0340 6e02 0000 0800 4e00 3a00 0c00 5043 4f4d 5041 | .@n…N.:…PCOMPA
4e59 434f 4445 0000 0800 0000 0800 0000 0800 0000 | NYCODE…
0000 0000 0800 0000 0000 0000 0100 0000 6000 0100 | …`…
Logdump 469 >n
2021/09/14 04:20:18.000.000 Delete Len 138 RBA 28232746
Name: TESTOWNER,TESTTABLE (TDR Index: 9)
Before Image: Partition x0c G b
0000 0a00 0000 0038 0038 0038 0038 0100 0c00 0000 | …8.8.8.8…
0800 5171 4eab 5206 884c 0200 2d00 0000 2900 2ae6 | …QqN.R…L…-…).*.
a48d e789 a9e6 b2b9 2ae6 9bbc e892 82e7 89b9 e4bc | …*…
a6e5 b0bc e6a9 84e6 a684 e6b2 b9e7 a4bc e79b 9203 | …
0005 0000 0001 0031 0400 0a00 0000 0000 0000 0000 | …1…
6720 0500 0400 ffff 0000 0600 0800 0000 0400 3230 | g …20
3231 0700 0c00 0000 0800 3230 3231 3039 3133 | 21…20210913
Logdump 470 >n
一直执行n,后可看到当前事物在对什么表对象做什么事物操作
4)通过给gv$session视图查看ogg用户所执行的sql
select sid,serial#,inst_id,status,program,sql_id from gv$session where program like ‘%AS%’ and username=’GOLDENGATE’;
3.2新库获取sql的执行计划
新库:
select sql_id,sql_text from vsqlarea where upper(sql_text) like '%TESTTABLE%'; (或使用select sid,serial#,inst_id,status,program,sql_id from gvsession where program like ‘%AS%’ and username=’GOLDENGATE’;)
SQL_ID
-------------------
SQL_TEXT
-----------------------------------------------------------------------
55mw6aktcu387
DELETE /*+ restrict_all_ref_cons */ FROM "TESTOWNER"."TESTTABLE"
WHERE "PCOMPANYCODE"=:1 AND "PCOMPANYNAME"=:2 AND "COMMODITYNAME"=:3 AN
D "LINECOUNT"=:4 AND "AMOUNTTAX"=:5 AND "RN" is null AND "DATA_SEASON"=:6
AND "BATCH_DATE"=:7 and rownum < 2
select plan_table_output from table(dbms_xplan.display_cursor(‘55mw6aktcu387’,0));
SQL_ID 55mw6aktcu387, child number 0
-------------------------------------
DELETE /*+ restrict_all_ref_cons */ FROM
TESTOWNER."TESTTABLE" WHERE "PCOMPANYCODE"=:1 AND
PCOMPANYNAME=:2 AND "COMMODITYNAME"=:3 AND "LINECOUNT"=:4
AND "AMOUNTTAX"=:5 AND "RN" is null AND "DATA_SEASON"=:6 AND
BATCH_DATE=:7 and rownum < 2
Plan hash value: 1033196827
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | DELETE STATEMENT | | | | 230K(100)| |
| 1 | DELETE | TESTTABLE | | | | |
|* 2 | COUNT STOPKEY | | | | | |
|* 3 | TABLE ACCESS FULL| TESTTABLE | 1 | 82 | 230K (1)| 00:00:10 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(ROWNUM<2)
3 - filter(("COMMODITYNAME"=:3 AND "AMOUNTTAX"=:5 AND "LINECOUNT"=:4 AND
"BATCH_DATE"=:7 AND "PCOMPANYNAME"=:2 AND "DATA_SEASON"=:6 AND "PCOMPANYCODE"=:1 AND "RN"
IS NULL))
问题1:在源库,不存在新库此类的问题,那么我们需要看看新库复制进程所执行的sql,在源库是如何执行的。
3.3源库获取对应sql
通过表名和操作类型进行筛选出源库想匹配的sql
select sql_id,sql_text from v$sqlarea where upper(sql_text) like ‘%TESTTABLE%’;
SQL_ID
--------------------------
SQL_TEXT
--------------------------------------------------------------------------------
3zar2g3g5kmrp
DELETE FROM TESTOWNER.TESTTABLE T WHERE SUBSTR(T.DATA_SEASON, 1, 4
) = :B1
获取源库sql的执行计划
select plan_table_output from table(dbms_xplan.display_cursor(‘3zar2g3g5kmrp’,0));
PLAN_HASH_VALUE CHILD_NUMBER OPTIMIZER ID OPERATION OBJECT_OWNER OBJECT_NAME OBJECT_TYPE ROWS BYTES COST Cost (%CPU) TIME
474800996 .. ALL_ROWS 0 DELETE STATEMENT 35376
474800996 .. 1 DELETE TESTOWNER TESTTABLE
474800996 .. 2 TABLE ACCESS FULL TESTOWNER TESTTABLE TABLE 106869 641214 35376 0.62 425
3.4对比俩端的sql
可以看出新库ogg的复制进程所执行的sql,将表所有列都做个where条件过滤,并且俩边都是走全表扫描,而新库所消耗为230k,源库消耗35376,新库的消耗是源库的6.5倍。
查看该表的主键和索引情况:
select index_name from dba_indexes where owner=’TESTOWNER’ and table_name=‘TESTTABLE’;
select CONSTRAINT_NAME,CONSTRAINT_TYPE from dba_constraints where owner=’ TESTOWNER’ and table_name=‘TESTTABLE’;
该表既没有主键,也没有索引,则ogg将该表的所有列都添加附加日志,所有列都被作为where条件过滤。
4.问题处理
4.1在新库对该表进程创建索引
查询表每列的选择率,选择率越高则越优作为创建索引的列,由于表无主键,我们可以选择多列索引
select * from
(
select A.TABLE_NAME,
A.OWNER,
A.column_name,
A.num_distinct,
B.NUM_ROWS,
TRUNC(A.num_distinct / B.NUM_ROWS * 100) pct,
row_number() over(partition by a.table_name order by A.num_distinct desc ) rn
from dba_tab_col_statistics A, DBA_TABLES B
WHERE A.TABLE_NAME = B.TABLE_NAME
AND A.table_name in (’‘TESTTABLE’,‘TESTTABLE2’)
) where rn <=4
ORDER BY table_name,num_distinct desc;
TABLE_NAME OWNER COLUMN_NAME NUM_DISTINCT NUM_ROWS PCT RN
------------------------------ -------------------- ------------------------------ ------------ ---------- ---------- ----------
TESTTABLE TESTOWNER COMMODITYNAME 62603109 69559010 90 1
TESTTABLE TESTOWNER AMOUNTTAX 75936 69559010 0 2
TESTTABLE TESTOWNER LINECOUNT 1269 69559010 0 3
TESTTABLE TESTOWNER BATCH_DATE 434 69559010 0 4
创建索引
create index TESTOWNER.IDX_TESTTABLE on TESTOWNER.TESTTABLE (COMMODITYNAME,AMOUNTTAX,LINECOUNT) tablespace tablespace_name parallel 4 online;
alter index TESTOWNER.IDX_TESTTABLE noparallel;
4.2再次查询新库sql的执行情况
select sql_id,sql_text from v$sqlarea where upper(sql_text) like ‘%TESTTABLE%’;
SQL_ID
-------------
SQL_TEXT
--------------------------------------------------------------------------------
55mw6aktcu387
DELETE /*+ restrict_all_ref_cons */ FROM "TESTOWNER"."TESTTABLE"
WHERE "PCOMPANYCODE"=:1 AND "PCOMPANYNAME"=:2 AND "COMMODITYNAME"=:3 AN
D "LINECOUNT"=:4 AND "AMOUNTTAX"=:5 AND "RN" is null AND "DATA_SEASON"=:6
AND "BATCH_DATE"=:7 and rownum < 2
可查询sql的不同执行计划CHILD_NUMBER值进行比对
select sql_id,CHILD_NUMBER,PLAN_HASH_VALUE,last_active_time,EXECUTIONS,BUFFER_GETS,CPU_TIME/1e6,ELAPSED_TIME/1e6 from v$sql where sql_id=‘55mw6aktcu387’;
查询执行计划
select plan_table_output from table(dbms_xplan.display_cursor(‘55mw6aktcu387’,1));
SQL_ID 55mw6aktcu387, child number 1
-------------------------------------
DELETE /*+ restrict_all_ref_cons */ FROM
TESTOWNER."TESTTABLE" WHERE "PCOMPANYCODE"=:1 AND
PCOMPANYNAME=:2 AND "COMMODITYNAME"=:3 AND "LINECOUNT"=:4
AND "AMOUNTTAX"=:5 AND "RN" is null AND "DATA_SEASON"=:6 AND
BATCH_DATE=:7 and rownum < 2
Plan hash value:
---------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------------------------------
| 0 | DELETE STATEMENT | | | | 190 (1) | |
| 1 | DELETE | TESTTABLE | | | | |
|* 2 | COUNT STOPKEY | | | | | |
|* 3 | TABLE ACCESS BY INDEX ROWID| TESTTABLE | 1 | 82 | 190 (1) | 00:00:01 |
|* 4 | INDEX RANGE SCAN | IDX_TESTTABLE | 1 | 82 | 5 (1) | 00:00:01 |
---------------------------------------------------------------------------------------------------------------------------
加索引后的sql消耗为190,消耗减少了1210倍。复制进程也开始恢复正常。
4.3查询统计信息情况
注意:在新库做同步时,会将触发器给禁用了,统计信息容易出现不准确的现象,在出现延迟,并执行计划正常时,我们需要确认一下统计信息的情况,必要时手动收集一下统计信息
select table_name,NUM_ROWS,LAST_ANALYZED from dba_tables where owner=’ TESTOWNER ’ and table_name=’ TESTTABLE ';
针对某个表进行收集统计信息:
exec dbms_stats.gather_table_stats(ownname=>‘TESTOWNER’,tabname=>‘TESTTABLE’,degree=>4);
按用户进行收集统计信息:
exec dbms_stats.gather_schema_stats(ownname=>‘TESTOWNER’,degree=>4);
5.扩展内容
5.1复制进程拆分
1.停止源端投递进程,并记录其csn相关信息(重要)
GGSCI>stop TESTREP1
Sending STOP request to REPLICAT TESTREP1 …
Request processed.
GGSCI>info TESTREP1
2.停止当前应用进程,并记录其csn相关信息
GGSCI>info TESTREP1
REPLICAT TESTREP1 Last Started 2016-08-31 22:09 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:07 ago)
Process ID 65116
Log Read Checkpoint File ./dirdat/lr000001429
2016-09-21 15:27:14.147180 RBA 57049346
3.dblogin 登录数据库
GGSCI>dblogin USERID ogg, password xxxxxx
4.从老的应用进程参数配置中,将拆分出的表删除
GGSCI>edit params TESTREP1
5.为新的应用进程进行参数配置,将拆分出的表加入
GGSCI>edit params NEW1REP1
6.添加新的进程
GGSCI>add replicat NEW1REP1 integrated,exttrail ./dirdat/**
7.修改目标端应用进程开始抽取点(以原来的应用进程scn信息为准)
GGSCI > alter NEW1REP1,extseqno 1429,extrba 57049346
8.启动拆分后的两个(多个)进程并查看其状态
GGSCI > start NEW1REP1
GGSCI > info NEW1REP1
GGSCI > start TESTREP1
GGSCI > info TESTREP1
9.启动源端投递进程,并查看其状态
5.2ogg复制进程的开启10046事件
1,修改复制进程参数文件
添加一下参数,然后重启复制进程
sqlexec “alter session set tracefile_identifier=‘ogg_debug’”
sqlexec “alter session set statistivs_level=all”
sqlexec “alter session set max_dump_file_size_unlimited”
sq;exec “alter session set events ‘10046 trace name context forever,level 12’”
2.重启进程
3.查看跟踪文件
到数据库的日志文件目录里,查看ogg_bebug开头的trace文件




