去重,非常重要。这里只是在ORACLE环境里测试,应该移植到其它DB上问题也不大。
单纯去重,对保留记录无要求
用PLSQL
利用记录ROWID唯一的属性,编一个可递归调用的PROCEDURE,还利用了SQL%ROWCOUNT属性进行判断和输出,实现此功能。
SQL> create table t (id number,name2 varchar2(10));
Table created.
运行以下语句数遍,产生重复记录。
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1) from dual connect by level<=100;
commit;
create or replace procedure pr_deduplicate1 (p_count in number) is
begin
dbms_output.enable(50000);
delete from t where rowid in (select max(rowid) from t group by id,name2 having count(*)>1);
–you must read SQL%ROWCOUNT befroe commit, otherwise its value is 0
dbms_output.put_line(‘Invokation Frequency ‘||p_count||’, ‘||SQL%ROWCOUNT||’ record(s) deleted.’);
if SQL%ROWCOUNT>0 then
commit;
pr_deduplicate(p_count+1);
end if;
end ;
/
传入参数P_COUNT用于输出运行次数,最后一次运行显示0 record deleted, 因为每次直接delete from t where rowid in (select…,那最后一次必然是DELETE了0条记录,也说明表里没有重复记录了。
exec pr_deduplicate(1);
Invokation Frequency 1, 212 record(s) deleted.
Invokation Frequency 2, 155 record(s) deleted.
Invokation Frequency 3, 87 record(s) deleted.
Invokation Frequency 4, 46 record(s) deleted.
Invokation Frequency 5, 22 record(s) deleted.
Invokation Frequency 6, 8 record(s) deleted.
Invokation Frequency 7, 4 record(s) deleted.
Invokation Frequency 8, 2 record(s) deleted.
Invokation Frequency 9, 0 record(s) deleted.
用SQL
当然了,如果表比较小的话,可以运行一条语句,删除所有重复记录,如下:
delete from tb_weather
where rowid in
(select rid
from
(select rowid rid, row_number over (partition by sampled_date order by rowid) as dup_ct from tb_weather
)
where dup_ct>1
);
这里比如tb_weather里有重复的sampled_date, 那么dup_ct>1的记录,必然是重复的,把这些rowid的记录一次删除。当然这个写法,也有弊端,如果表很大,那么在INNER QUERY里,还会查出所有本已经不重复的记录,即DUP_CT=1,但不可能有DUP_CT=2的记录,所以比较费资源。有没有可能一次找出GROUP BY HAVING COUNT(*)>1的记录,然后刨除ROWID最小的那条,其余都删除掉呢?有待试验。
不仅去重,还对所保留记录有要求
以上两种方法,都是将重复记录里ROWID最小的一条保留,其余删除。但在实际业务中,往往要求是这样的:将重复记录里,WORK_DATE最新的那条保留,其余删除。比如:
SQL> create table t (id number,name2 varchar2(10),work_date date);
Table created.
运行以下语句数遍,产生重复记录。
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1),sysdate + numtodsinterval(round(DBMS_RANDOM.value(0,800)),‘second’) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1),sysdate + numtodsinterval(round(DBMS_RANDOM.value(0,800)),‘second’) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1),sysdate + numtodsinterval(round(DBMS_RANDOM.value(0,800)),‘second’) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1),sysdate + numtodsinterval(round(DBMS_RANDOM.value(0,800)),‘second’) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1),sysdate + numtodsinterval(round(DBMS_RANDOM.value(0,800)),‘second’) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1),sysdate + numtodsinterval(round(DBMS_RANDOM.value(0,800)),‘second’) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1),sysdate + numtodsinterval(round(DBMS_RANDOM.value(0,800)),‘second’) from dual connect by level<=100;
insert into t select round(DBMS_RANDOM.value(0,10)),dbms_random.string(‘U’,1),sysdate + numtodsinterval(round(DBMS_RANDOM.value(0,800)),‘second’) from dual connect by level<=100;
commit;
800条记录里,按id和name2进行distinct, 只有270条:
SQL> select count(*) from (select distinct id,name2 from t);
COUNT(*)
270
对于重复记录,需要保留work_date最新的。
首先要对相关group字段(id和name2)和筛选字段(work_date)建立复合索引:
create index idx1 on t(id,name2,work_date);
保留记录少且可以空窗处理,用SQL
鉴于最后保留的记录数约为总记录数的三分之一,可以考虑用CTAS的方式,另外建表,将要保留的记录INSERT到新表里,之后再DROP旧表,并对新表做RENAME。当然,如果t表有空窗处理时间,即处理期间,业务系统可以不用t表,可考虑此方案:
SQL> create table t_new as select * from (select t.*,row_number() over(partition by id,name2 order by work_date desc) as rn from t) where rn=1;
Table created.
SQL> alter table t_new drop column rn;
Table altered.
记下表t的相关DDL语句。
SQL> drop table t;
Table dropped.
SQL> alter table t_new rename to t;
Table altered.
运行表t有关的DDL语句,比如建立索引,约束等。
保留记录多,且表没有空窗期
如果distinct记录占总记录数比重较大,且没有空窗期,那只好在表t上处理了。
重建表t, 这次插入1600条记录。记得在id,name2,work_date上建符合索引。运行以下SQL:
SQL> select cc,count() from (select id,name2,count() as cc from t group by id,name2) group by cc order by 1;
CC COUNT(*)
1 17
2 20
3 20
4 41
5 40
6 56
7 24
8 30
9 18
10 7
11 6
12 3
13 1
14 2
可以看到,distinct后,count()为1的有17条,为2的有20条。。。count()最大为14,有两条。如何去重呢?有两个方案:
方案一:
一条SQL解决问题:
delete from t where rowid in
(select rowid from (select t.*,rowid,row_number() over(partition by id,name2 order by work_date desc) as rn from t) where rn>1);
方案二:
写PROCEDURE,以细粒度化(fine granularity)的方式处理,一来可以使得每条DML占用时间较少,从而尽快释放资源,二来也避免了大事务导致UNDO表空间消耗过大。尽管方案一看上去更简单直接,但也有可能因为一次性删除记录过多导致锁表时间长,UNDO消耗过大乃至语句被HANG住。所以原则上推荐方案二,具体如下:
create or replace procedure pr_deduplicate2 (p_count in number,max_count in number) is
begin
dbms_output.enable(50000);
delete from t where (id,name2,work_date) in
(select id,name2,min(work_date) from t group by id, name2 having count(*)=p_count);
dbms_output.put_line(‘Invokation Frequency ‘||to_char(max_count -p_count+1)||’, ‘||SQL%ROWCOUNT||’ record(s) deleted.’);
if p_count>2 then
commit;
pr_deduplicate2(p_count-1,max_count);
end if;
end;
/
不过,在运行pr_deduplicate2之前,要运行以下这句,剔除哪些ID,NAME2和WORK_DATE三列均相同的记录,例如表里id=4,name2=Z的只有这两条记录,如不事先处理,在运行之后的PROCEDURE时,会将两条都删除,因为它们的WORK_DATE也一样:
ID NAME2 WORK_DATE
4 Z 2021-11-10 13:44:05
4 Z 2021-11-10 13:44:05
反复运行以下SQL,当显示0 rows deleted时,即可终止运行:
delete from t where rowid in
(select min_rowid from (select id,name2,work_date,min(rowid) as min_rowid from t group by id, name2, work_date having count(*)>1));
4 rows deleted.
SQL> commit;
Commit complete.
delete from t where rowid in
2 (select min_rowid from (select id,name2,work_date,min(rowid) as min_rowid from t group by id, name2, work_date having count(*)>1));
0 rows deleted.
最后执行procedure, 完成去重。
SQL> exec pr_deduplicate2(14,14);
Invokation Frequency 1, 2 record(s) deleted.
Invokation Frequency 2, 3 record(s) deleted.
Invokation Frequency 3, 6 record(s) deleted.
Invokation Frequency 4, 12 record(s) deleted.
Invokation Frequency 5, 19 record(s) deleted.
Invokation Frequency 6, 34 record(s) deleted.
Invokation Frequency 7, 67 record(s) deleted.
Invokation Frequency 8, 91 record(s) deleted.
Invokation Frequency 9, 147 record(s) deleted.
Invokation Frequency 10, 187 record(s) deleted.
Invokation Frequency 11, 227 record(s) deleted.
Invokation Frequency 12, 248 record(s) deleted.
Invokation Frequency 13, 268 record(s) deleted.




