
Deduplicating Data in Oracle Tables with Tens of Millions to Hundreds of Millions of Rows: Worked Examples

Original article by Oracle6, 2023-01-05

Introduction:
    Below are the pure-SQL and PL/SQL code examples that were collected, written, and adopted during one effort to deduplicate Oracle tables holding tens of millions to hundreds of millions of rows (hereafter "large-table deduplication"). For base tables of different orders of magnitude the author applied several deduplication strategies: some efficient, some less so, some relatively safe, some trading resources for speed, and so on. Given the author's limited skill and incomplete materials, errors are inevitable; corrections from readers are welcome.

Keywords: Oracle, data deduplication, pure SQL, PL/SQL stored procedure, Row_Number, count over

I. General Syntax Patterns

  1. Find duplicate rows by a single column

  Find the redundant duplicate rows in a table, where duplicates are judged by a single column (Id):

select * from table_name
 where Id in (select Id from table_name group by Id having count(Id) > 1);

  2. Delete duplicate rows by a single column

 Delete the redundant duplicate rows, judged by a single column (Id), keeping only the row with the smallest rowid:

DELETE FROM table_name
 WHERE id IN (SELECT id FROM table_name GROUP BY id HAVING COUNT(id) > 1)
   AND ROWID NOT IN (SELECT MIN(ROWID) FROM table_name GROUP BY id HAVING COUNT(*) > 1);
  3. Find duplicate rows by multiple columns

 Find the redundant duplicate rows in a table, judged by multiple columns:

select * from table_name a
 where (a.Id, a.seq) in
       (select Id, seq from table_name group by Id, seq having count(*) > 1);

  4. Delete duplicate rows by multiple columns

 Delete the redundant duplicate rows (multiple columns), keeping only the row with the smallest rowid:

delete from table_name a
 where (a.Id, a.seq) in
       (select Id, seq from table_name group by Id, seq having count(*) > 1)
   and rowid not in
       (select min(rowid) from table_name group by Id, seq having count(*) > 1);

  5. Find duplicate rows by multiple columns, excluding the kept row

 Find the redundant duplicate rows (multiple columns), excluding the row with the smallest rowid:

select * from table_name a
 where (a.Id, a.seq) in
       (select Id, seq from table_name group by Id, seq having count(*) > 1)
   and rowid not in
       (select min(rowid) from table_name group by Id, seq having count(*) > 1);

II. Specific Syntax Examples

  1. Delete duplicates with a correlated subquery

DELETE FROM table_name A
 WHERE A.rowid > ANY (
       SELECT B.rowid FROM table_name B
        WHERE A.col1 = B.col1 AND A.col2 = B.col2);

  2. Delete duplicates with RANK()

delete from $table_name where rowid in
( select "rowid" from
  ( select "rowid", rank_n from
    ( select rank() over (partition by $primary_key order by rowid) rank_n,
             rowid as "rowid"
        from $table_name
       where $primary_key in
             ( select $primary_key from $table_name
                group by $all_columns
               having count(*) > 1 ) ) )
  where rank_n > 1 );

  3. Delete with a self-join

delete from table_name a
 where a.rowid > any (
       select b.rowid from table_name b
        where a.col1 = b.col1 and a.col2 = b.col2);

  4. Delete with an analytic window function: group, order, and keep one row per group

delete from tablename
 where rowid in
       (select rowid from
          (select rowid,
                  row_number() over
                    (partition by col1, col2, col3 order by col1, col2, col3) dup
             from tablename)
         where dup > 1);
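As a runnable illustration of the ROW_NUMBER() pattern, the following Python/SQLite sketch applies the same delete (SQLite 3.25+ supports window functions and also has a rowid; tablename and col1..col3 mirror the placeholders above, the data is invented):

```python
import sqlite3

# Three identical rows under key ("a", 1, "x") and one singleton row.
con = sqlite3.connect(":memory:")
con.execute("create table tablename (col1, col2, col3)")
con.executemany("insert into tablename values (?,?,?)",
                [("a", 1, "x"), ("a", 1, "x"), ("a", 1, "x"), ("b", 2, "y")])

# Number rows within each key; every row with dup > 1 is a redundant copy.
con.execute("""
    delete from tablename where rowid in
      (select rid from
         (select rowid as rid,
                 row_number() over
                   (partition by col1, col2, col3 order by col1, col2, col3) dup
            from tablename)
        where dup > 1)""")

remaining = con.execute("select count(*) from tablename").fetchone()[0]
print(remaining)  # one row per distinct (col1, col2, col3) -> 2
```

Note that ordering by the same columns as the partition leaves it arbitrary which physical copy is kept; Section III's one-pass SQL swaps in `order by rowid desc` to pin down the survivor.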

  5. Delete with GROUP BY: keep only the last row (largest rowid)

delete from tablename t
 where t.rowid not in
       (select max(rowid) from tablename t1 group by t1.col_2, t1.col_3);

  6. Delete every copy of a duplicated key, keeping none: COUNT() OVER greater than 1

delete from tablename t
 where t.rowid in
       (select rid
          from (select t1.rowid rid,
                       count(1) over (partition by t1.col_2, t1.col_3) rn
                  from tablename t1) t1
         where t1.rn > 1);
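The keep-none semantics of COUNT() OVER can be checked with a small SQLite sketch (sample data invented): any key occurring more than once loses every one of its rows, unlike the ROW_NUMBER() delete, which keeps one.

```python
import sqlite3

# Keys (1,"a") and (3,"c") each appear twice; (2,"b") appears once.
con = sqlite3.connect(":memory:")
con.execute("create table tablename (col_2, col_3)")
con.executemany("insert into tablename values (?,?)",
                [(1, "a"), (1, "a"), (2, "b"), (3, "c"), (3, "c")])

# COUNT() OVER tags every row with its group size; rn > 1 matches ALL
# rows of a duplicated key, so none of them survive.
con.execute("""
    delete from tablename where rowid in
      (select rid from
         (select rowid as rid,
                 count(1) over (partition by col_2, col_3) rn
            from tablename)
        where rn > 1)""")

survivors = sorted(r[0] for r in con.execute("select col_2 from tablename"))
print(survivors)  # only the never-duplicated key remains -> [2]
```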

  7. Find all duplicate rows: a key repeated n-1 extra times shows all n of its rows

select *
  from (select t1.*,
               count(1) over (partition by t1.col_2, t1.col_3) rn
          from tablename t1) t1
 where t1.rn > 1;

  8. Find all duplicate rows: two-pass method

select *
  from tablename t
 where (t.col_2, t.col_3) in
       (select t1.col_2, t1.col_3
          from tablename t1
         group by t1.col_2, t1.col_3
        having count(1) > 1);

III. Three Deduplication Strategies for Very Large Tables

Strategy 1: build an intermediate table plus constraints and indexes, drop or alter the original table, then rename the new table into place (better performance)

-- ① Create the intermediate table from the deduplicated result

**SQL 1: two-pass method; each row drives a correlated-subquery rescan of the table, so it is inefficient**

create table temp as
select *
  from test_t a
 where a.rowid = (select min(b.rowid)
                    from test_t b
                   where a.name = b.name);

**SQL 2: one-pass method; a single scan filters out the first row per group, so it is efficient**

create table t2 nologging as
select *
  from t
 where rowid in
       (select rid from
          (select t.rowid rid,
                  row_number() over
                    (partition by col1, col2, col3 order by rowid desc) rn
             from t)
         where rn = 1);
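The one-pass approach can be sketched in SQLite as well (NOLOGGING is Oracle-specific and omitted; the table and data are invented). The new table keeps, per key, the row with the largest rowid, i.e. `order by rowid desc ... rn = 1`, which here corresponds to the most recently inserted copy:

```python
import sqlite3

# Key ("k", 1, "x") has an old and a new copy; ("m", 2, "y") is unique.
con = sqlite3.connect(":memory:")
con.execute("create table t (col1, col2, col3, payload)")
con.executemany("insert into t values (?,?,?,?)",
                [("k", 1, "x", "old"), ("k", 1, "x", "new"), ("m", 2, "y", "only")])

# Build the clean copy in one scan, keeping the highest-rowid row per key;
# afterwards the original table would be dropped and t2 renamed.
con.execute("""
    create table t2 as
    select * from t
     where rowid in
       (select rid from
          (select rowid as rid,
                  row_number() over
                    (partition by col1, col2, col3 order by rowid desc) rn
             from t)
         where rn = 1)""")

kept = dict(con.execute("select col1, payload from t2").fetchall())
print(kept)  # -> {'k': 'new', 'm': 'only'}
```

Swapping the `order by` for a business column (e.g. a timestamp) is exactly the "more flexible retention" the summary below credits to the one-pass method.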

-- ② Create the indexes and the primary-key constraint

-- ③ Drop (or rename) the old table, then rename the new table into place

Strategy 2: delete in place; heavy churn on the undo and index segments makes this slow

delete test_t a
where a.rowid > (select min(b.rowid) from test_t b where a.name = b.name);

delete from test1
 where rowid not in (select min(rowid) from test1 group by col1, col2, col3);

/* force the rule-based optimizer, bypassing the CBO */

delete /*+RULE*/ from t
 where rowid in
       (select rid
          from (select rowid rid,
                       row_number() over (partition by col1, col2, col3 order by rowid) rn
                  from t)
         where rn <> 1);

/* the least efficient, row-by-row example */

delete from t
where rowid <> ( select min(rowid) from t t2
where t2.primary_key_column1 = t.primary_key_column1
and t2.primary_key_column2 = t.primary_key_column2
… …
and t2.primary_key_columnN = t.primary_key_columnN );

Strategy 3: a PL/SQL stored procedure; the code is wordier, but committing in batches keeps transactions small, avoids massive rollbacks, and performs well

create or replace procedure pr_del_duplicate_rows AUTHID current_user
is
  ls_sql varchar2(512); -- dynamic SQL for the backup, delete and restore steps
  li_cnt number(5);     -- duplicate count: has this key already been deduplicated?
begin
  -- ① create the backup staging table and its index (run once, then leave commented out)
  /* ls_sql := 'create table t_bigtab_back nologging as select * from t_bigtab where 1=2';
  execute immediate ls_sql;
  ls_sql := 'create index ix_bigtab on t_bigtab_back(patient_id,visit_id,item_no) nologging';
  execute immediate ls_sql; */

  for dd in (select 'K0692040' patient_id, 1 visit_id, 1 item_no from dual) loop
    -- first run: a single-sample test

    -- check whether this key has already been deduplicated
    select count(1) into li_cnt
      from t_bigtab
     where patient_id = dd.patient_id
       and visit_id   = dd.visit_id
       and item_no    = dd.item_no;

    if li_cnt > 1 then
      -- ② back up the duplicate rows first, then delete them
      ls_sql := 'insert into t_bigtab_back select * from t_bigtab where patient_id='''
                || dd.patient_id || ''' and visit_id =' || dd.visit_id
                || ' and item_no=' || dd.item_no;
      -- dbms_output.put_line(ls_sql);
      execute immediate ls_sql;

      ls_sql := 'delete from t_bigtab where patient_id=''' || dd.patient_id
                || ''' and visit_id =' || dd.visit_id
                || ' and item_no=' || dd.item_no;
      -- dbms_output.put_line(ls_sql);
      execute immediate ls_sql;

      -- ③ restore only the most recent change
      ls_sql := 'insert into t_bigtab select * from (select * from t_bigtab_back where patient_id='''
                || dd.patient_id || ''' and visit_id =' || dd.visit_id
                || ' and item_no=' || dd.item_no
                || ' order by enter_date_time desc, amount desc) where rownum=1';
      -- dbms_output.put_line(ls_sql);
      execute immediate ls_sql;

      -- commit this batch: the key is now deduplicated
      commit;
    end if;
  end loop;
-- exception handling
exception
  when others then
    rollback;
    dbms_output.put_line('SQL:' || ls_sql);
    dbms_output.put_line('SQLCODE:' || sqlcode || ' ERROR MSG:' || sqlerrm);
end;
/
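The batching idea of Strategy 3 (per duplicated key: back up, delete, re-insert the survivor, then commit) can also be sketched outside PL/SQL. The following Python/SQLite version mirrors the procedure's schema with invented data and commits one small transaction per key, so an interruption loses at most the key in flight:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t_bigtab "
            "(patient_id, visit_id, item_no, enter_date_time, amount)")
# Empty backup table with the same shape, like the commented-out CTAS above.
con.execute("create table t_bigtab_back as select * from t_bigtab where 1=0")
con.executemany("insert into t_bigtab values (?,?,?,?,?)", [
    ("K0692040", 1, 1, "2022-01-01", 10),
    ("K0692040", 1, 1, "2022-06-01", 20),   # most recent change: keep this one
    ("K0700001", 2, 5, "2022-03-01", 7),
])

# Drive the loop from the keys that actually have duplicates.
dup_keys = con.execute("""
    select patient_id, visit_id, item_no from t_bigtab
     group by patient_id, visit_id, item_no having count(*) > 1""").fetchall()

for key in dup_keys:
    where = "patient_id=? and visit_id=? and item_no=?"
    con.execute(f"insert into t_bigtab_back select * from t_bigtab where {where}", key)
    con.execute(f"delete from t_bigtab where {where}", key)
    con.execute(f"""insert into t_bigtab
                    select * from t_bigtab_back where {where}
                     order by enter_date_time desc, amount desc limit 1""", key)
    con.commit()  # one small transaction per key

amounts = sorted(r[0] for r in con.execute("select amount from t_bigtab"))
print(amounts)  # -> [7, 20]
```

Because deduplicated keys no longer match the duplicate test, a rerun after an interruption skips finished work, the restartability property the summary below relies on.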

IV. Summary
    The above are the large-table deduplication examples the author used in one concrete case.
    For the several medium-sized tables in this job, each holding a few million rows, the author first chose Strategy 2 to save effort, deduplicating directly on the original tables. Because the columns used to detect duplicates were mostly indexed, deduplication became a dynamic churn of checking, deleting, and re-indexing all at once, and efficiency was dismal. In particular, two or three "narrow" tables with heavily duplicated rows ran for more than ten hours without finishing. The sessions had to be killed by hand, forcing a switch to Strategy 1. Testing showed that Strategy 1's "one-pass" method beats the "two-pass" method, and also offers more flexible choices about which data to keep: the two-pass method assumes the smallest or largest rowid marks the first or last change, while the one-pass method can apply richer ordering rules to decide which row counts as the "last change".

    A special warning: when such an enormous deduplication transaction is killed by hand, Oracle typically has hundreds of megabytes or even gigabytes of UNDO to process before recovery completes, so the database inevitably becomes unavailable, suspended for a long time; in effect the instance hangs indefinitely. This is required for recovery, yet it is a very dangerous operation. On a production database it can be irreversible, even fatal, so readers are urged to keep this firmly in mind!

    For the one or two tables in this job with over a hundred million rows each, the author skipped the first two strategies and wrote the Strategy 3 code directly, for two main reasons: first, the data in these tables was important, and accidental loss or corruption was unacceptable; second, quite apart from the poor efficiency of deduplicating the original table in place, the UNDO segment would very likely raise ORA-01555 "snapshot too old". The design of Strategy 3 therefore prioritized safely staging the data, keeping transactions as small as possible, committing in batches, and ensuring that after an interruption and restart the work already completed does not cost time again. Experiments showed Strategy 3's efficiency to be relatively stable, with none of the uncontrollable, extremely long hangs seen during Oracle instance recovery after a forced interruption.

    Finally, large-table deduplication is a long-running transaction: time-consuming, light on CPU, and heavy on disk reads and writes. If hardware permits, Strategy 1 is strongly recommended: first, the data stays safe and reliable, trading space for speed; second, it replaces DML with DDL, so even a failure and rollback leaves no irreversible impact on the database instance.

曹挚 iyft_hws150@sina.com
京弘达(上海)医疗科技有限公司
2023-01-05

References
   1. To remove duplicate records. AskTOM. https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:15258974323143
   2. Summary of common methods for removing duplicate data in Oracle (in Chinese). https://www.jb51.net/article/248738.htm
   3. Delete duplicate rows from Oracle tables. http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm
   4. Methods for deleting Oracle duplicates while keeping one row (in Chinese). https://www.jb51.net/article/37613.htm
   5. How to select first value in a group by bunch of rows. AskTOM. https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:122801500346829407
