Introduction:
The following is a collection of plain SQL and PL/SQL examples that I gathered, wrote, and used to deduplicate large Oracle tables holding tens of millions to hundreds of millions of rows (hereafter "large-table deduplication"). For base tables of different sizes I used several deduplication approaches; some are efficient, some less so, some are relatively safe, and some trade resources for speed. Given my limited skills and incomplete references, errors are inevitable, and I welcome readers' corrections.
Keywords: Oracle, data deduplication, plain SQL, PL/SQL stored procedure, Row_Number, count over
I. General syntax patterns
- Find duplicates on a single column
Find the redundant duplicate rows in a table, where duplicates are identified by a single column (Id):
select * from table_name where Id in (select Id from table_name group by Id having count(Id) > 1);
- Delete duplicates on a single column
Delete the redundant duplicate rows in a table, where duplicates are identified by a single column (id), keeping only the row with the smallest rowid:
DELETE FROM table_name WHERE id IN (SELECT id FROM table_name GROUP BY id HAVING COUNT(id) > 1)
AND ROWID NOT IN (SELECT MIN(ROWID) FROM table_name GROUP BY id HAVING COUNT(*) > 1);
- Find duplicates on multiple columns
Find the redundant duplicate rows in a table (multiple columns):
select * from table_name a
where (a.Id, a.seq) in (select Id, seq from table_name group by Id, seq having count(*) > 1);
- Delete duplicates on multiple columns
Delete the redundant duplicate rows in a table (multiple columns), keeping only the row with the smallest rowid:
delete from table_name a
where (a.Id, a.seq) in (select Id, seq from table_name group by Id, seq having count(*) > 1)
and rowid not in (select min(rowid) from table_name group by Id, seq having count(*) > 1);
- Find duplicates on multiple columns, excluding the kept row
Find the redundant duplicate rows in a table (multiple columns), excluding the row with the smallest rowid:
select * from table_name a
where (a.Id, a.seq) in (select Id, seq from table_name group by Id, seq having count(*) > 1)
and rowid not in (select min(rowid) from table_name group by Id, seq having count(*) > 1);
II. Specific syntax examples
- Delete duplicates with a subquery
DELETE FROM table_name A
WHERE a.rowid > ANY (
SELECT B.rowid FROM table_name B WHERE A.col1 = B.col1 AND A.col2 = B.col2
);
- Delete duplicates with RANK
delete from $table_name where rowid in
( select "rowid" from
    (select "rowid", rank_n from
       (select rank() over (partition by $primary_key order by rowid) rank_n, rowid as "rowid"
          from $table_name
         where $primary_key in (select $primary_key from $table_name
                                 group by $all_columns
                                having count(*) > 1)
       )
    )
  where rank_n > 1
);
- Delete with a self-join
delete from table_name a
where
a.rowid > any (
select b.rowid from table_name b where
a.col1 = b.col1 and a.col2 = b.col2
);
- Delete with an analytic (window) function, keeping only one row per group
delete from
tablename
where rowid in
(select rowid from
(select
rowid,
row_number()
over
(partition by col1,col2,col3 order by col1,col2,col3) dup
from tablename)
where dup > 1);
- Delete with GROUP BY, keeping only the last row (largest rowid)
delete from tablename t
where t.rowid not in
(select max(rowid) from tablename t1 group by t1.col_2, t1.col_3);
- Delete all duplicated rows without keeping any; a row is a duplicate when its COUNT() OVER is greater than 1
delete from tablename t
where t.rowid in (
select rid
from (select t1.rowid rid,
count(1) over(partition by t1.col_2, t1.col_3) rn
from tablename t1) t1
where t1.rn > 1);
- Find all duplicated records; a row duplicated n-1 times shows up n times
select *
from (select t1.*,
count(1) over(partition by t1.col_2, t1.col_3) rn
from tablename t1) t1
where t1.rn > 1;
- Find all duplicated rows with the two-pass (read-the-table-twice) method
select *
from tablename t
where (t.col_2, t.col_3) in (select t1.col_2, t1.col_3
from tablename t1
group by t1.col_2, t1.col_3
having count(1) > 1);
III. Three deduplication approaches for large tables
Plan 1: Build an intermediate table plus constraints and indexes, drop or alter the original table, and rename the new table (better performance)
-- Step 1: create the intermediate table from the deduplicated result
SQL 1: two-pass method; each row drives a correlated subquery scan and comparison, so it is inefficient
create table temp as
select *
from test_t a
where a.rowid = (select min(b.rowid)
from test_t b
where a.name = b.name);
SQL 2: one-pass method; the first record per group is filtered from the full set in a single pass, so it is efficient
create table t2 nologging as
select *
  from t
 where rowid in (select rid
                   from (select t.rowid rid,
                                row_number() over (partition by col1,col2,col3 order by rowid desc) rn
                           from t)
                  where rn = 1);
-- Step 2: create indexes and the primary key constraint
-- Step 3: drop (or rename) the old table, then rename the new table
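Steps 2 and 3 are only described above; a minimal sketch is given below, assuming the deduplicated copy is the table t2 created by SQL 2, the original table is t, and the key columns are col1, col2, col3 (all of these names are placeholders to be replaced with the real schema):
-- Step 2 (sketch): index and primary key on the deduplicated copy
create index ix_t2_dedup on t2 (col1, col2, col3) nologging;
alter table t2 add constraint pk_t2 primary key (col1, col2, col3);
-- Step 3 (sketch): keep the old table under a backup name, then move the new table into place
alter table t rename to t_bak;
alter table t2 rename to t;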
Plan 2: Delete directly on the original table; the undo and index segments are churned heavily, so it is relatively inefficient
delete test_t a
where a.rowid > (select min(b.rowid) from test_t b where a.name = b.name);
delete from test1
 where rowid not in (select min(rowid) from test1 group by col1,col2,col3);
/* Force rule-based optimization, bypassing the CBO */
delete /*+RULE*/ from t
where rowid in ( select rid
from ( select rowid rid,row_number() over (partition by col1,col2,col3 order by rowid) rn
from t) where rn <> 1 );
/* Least efficient row-by-row example */
delete from t
where rowid <> ( select min(rowid) from t t2
where t2.primary_key_column1 = t.primary_key_column1
and t2.primary_key_column2 = t.primary_key_column2
… …
and t2.primary_key_columnN = t.primary_key_columnN );
Plan 3: A PL/SQL stored procedure; the code is more verbose, but it commits in batches to avoid a huge transaction rollback and performs well
create or replace procedure pr_del_duplicate_rows AUTHID current_user
is
  ls_sql varchar2(512); -- dynamic SQL for backup, delete, and restore
  li_cnt number(5);     -- row count used to check whether the key still has duplicates
begin
  -- Step 1: create the backup (staging) table and its index
  /* ls_sql := 'create table t_bigtab_back nologging as select * from t_bigtab where 1=2';
  execute immediate ls_sql;
  ls_sql := 'create index ix_bigtab on t_bigtab_back(patient_id,visit_id,item_no) nologging';
  execute immediate ls_sql; */
  for dd in (select 'K0692040' patient_id, 1 visit_id, 1 item_no from dual) loop
    -- first run against a single sample key
    -- check whether this key has already been deduplicated
    select count(1) into li_cnt from t_bigtab
     where patient_id = dd.patient_id and visit_id = dd.visit_id and item_no = dd.item_no;
    if li_cnt > 1 then
      -- Step 2: back up, then delete, the duplicate records
      ls_sql := 'insert into t_bigtab_back select * from t_bigtab where patient_id=''' || dd.patient_id ||
                ''' and visit_id =' || dd.visit_id || ' and item_no=' || dd.item_no;
      -- dbms_output.put_line(ls_sql);
      execute immediate ls_sql;
      ls_sql := 'delete from t_bigtab where patient_id=''' || dd.patient_id ||
                ''' and visit_id =' || dd.visit_id || ' and item_no=' || dd.item_no;
      -- dbms_output.put_line(ls_sql);
      execute immediate ls_sql;
      -- Step 3: keep only the latest change
      ls_sql := 'insert into t_bigtab select * from (select * from t_bigtab_back where patient_id=''' ||
                dd.patient_id || ''' and visit_id =' || dd.visit_id || ' and item_no=' || dd.item_no ||
                ' order by enter_date_time desc, amount desc) where rownum=1';
      -- dbms_output.put_line(ls_sql);
      execute immediate ls_sql;
      -- commit the transaction; this key is now deduplicated
      commit;
    end if;
  end loop;
exception
  -- error handling
  when others then
    rollback;
    dbms_output.put_line('SQL:' || ls_sql);
    dbms_output.put_line('SQLCODE:' || sqlcode || ' ERROR MSG:' || sqlerrm);
end;
/
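The procedure can then be executed from SQL*Plus (assuming the backup table and index from Step 1 have been created beforehand):
exec pr_del_duplicate_rows;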
IV. Summary
The examples above are the large-table deduplication code I used in one concrete case.
For the several medium-sized tables in this job, each with a few million rows, I first tried Plan 2 to save effort and deduplicated directly on the original tables. Because the columns used to detect duplicates were almost all indexed, deduplication turned into a dynamic reorganization that had to evaluate, delete, and maintain indexes all at once, and it was extremely slow. In particular, a couple of highly duplicated "narrow" tables ran for more than ten hours without finishing. I had to kill the session by hand and fall back to Plan 1. The tests showed that the one-pass method of Plan 1 is more efficient than the two-pass method, and it also offers more flexible choices of which row to keep: the two-pass method assumes the smallest or largest rowid marks the first or last change, whereas the one-pass method can use richer ordering rules to decide what counts as the "last change".
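For example, here is a minimal sketch of the one-pass method that keeps the most recently changed row per key; the table name t, the key columns col1 and col2, and the timestamp column last_update_time are hypothetical and must be replaced with the actual schema:
create table t_dedup nologging as
select *
  from t
 where rowid in (select rid
                   from (select t.rowid rid,
                                row_number() over (partition by col1, col2
                                                   order by last_update_time desc, rowid desc) rn
                           from t)
                  where rn = 1);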
It must be stressed that when such an enormous deduplication transaction is killed by hand, Oracle typically has hundreds of megabytes or even gigabytes of UNDO to apply in order to roll the transaction back (or to complete instance recovery after a restart), so the database inevitably stays "suspended" and unavailable for a long time, with the instance effectively hung for an indefinite period. This rollback work cannot be skipped, which is what makes the operation so dangerous. On a production database it can be irrecoverable, even fatal; readers, please keep this firmly in mind.
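If you are stuck waiting on such a rollback, one way to gauge the remaining work (a sketch, assuming DBA access to the v$ views) is to watch the undo still held by the transaction being rolled back:
-- undo blocks/records still held by active (including rolling-back) transactions
select xidusn, xidslot, xidsqn, status, used_ublk, used_urec from v$transaction;
-- progress of transactions recovered by SMON after an instance restart
select usn, state, undoblocksdone, undoblockstotal from v$fast_start_transactions;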
For the one or two truly large tables in this job, each with over a hundred million rows, I did not consider the first two plans at all and went straight to the Plan 3 code, for two main reasons. First, the data in these tables is important, and accidental loss or corruption would be unacceptable. Second, quite apart from how slow deduplicating on the original table would be, the UNDO segments would very likely raise ORA-01555 "snapshot too old" errors. The Plan 3 design therefore gives priority to staging the data safely, keeping each transaction as small as possible, committing in batches, and making sure that after an interruption and restart the work already completed is not redone. The tests showed that Plan 3's performance is relatively stable, and a forced interruption no longer causes the uncontrollable, extremely long hang seen during Oracle instance recovery.
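The Plan 3 procedure above loops over a single sample key; generalized to every duplicated key with a periodic commit, the batching pattern looks roughly like the sketch below. The table t_bigtab and its key columns are taken from the example, the batch size of 1000 is an arbitrary assumption, and for simplicity the sketch keeps the row with the smallest rowid instead of the latest change:
declare
  li_batch pls_integer := 0;  -- keys handled since the last commit
begin
  for dd in (select patient_id, visit_id, item_no
               from t_bigtab
              group by patient_id, visit_id, item_no
             having count(*) > 1) loop
    -- keep the row with the smallest rowid for this key and delete the rest
    delete from t_bigtab t
     where t.patient_id = dd.patient_id
       and t.visit_id   = dd.visit_id
       and t.item_no    = dd.item_no
       and t.rowid     <> (select min(t2.rowid)
                             from t_bigtab t2
                            where t2.patient_id = dd.patient_id
                              and t2.visit_id   = dd.visit_id
                              and t2.item_no    = dd.item_no);
    li_batch := li_batch + 1;
    if li_batch >= 1000 then  -- commit every 1000 keys to keep transactions small
      commit;
      li_batch := 0;
    end if;
  end loop;
  commit;  -- commit the final partial batch
end;
/
Because keys that have already been deduplicated no longer satisfy having count(*) > 1, re-running the block after an interruption skips the finished keys, which is the restart behavior described above.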
Finally, large-table deduplication is a long-running transaction with low CPU usage but intense disk reads and writes. If the hardware allows it, I strongly recommend Plan 1: the data stays safe and space is traded for speed, and because DDL replaces DML, even a failed and rolled-back run has no irreversible impact on the database instance.
曹挚  iyft_hws150@sina.com
京弘达(上海)医疗科技有限公司
2023-01-05
References:
1. To remove duplicate records, https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:15258974323143
2. Summary of common methods for removing duplicate data in Oracle (oracle数据库去除重复数据常用的方法总结), https://www.jb51.net/article/248738.htm
3. Delete duplicate rows from Oracle tables, http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm
4. How to delete duplicate data in Oracle while keeping only one row (解决Oracle删除重复数据只留一条的方法详解), https://www.jb51.net/article/37613.htm
5. How to select first value in a group by bunch of rows, https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:122801500346829407




