暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
19-Efficient Similarity Join and Search on Multi-Attribute Data.pdf
131
15页
2次
2022-03-02
免费下载
Efficient Similarity Join and Search on Multi-Attribute Data
Guoliang Li
Jian He
Dong Deng
Jian Li
Department of Computer Science, Tsinghua University, Beijing, China
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
{liguoliang, lijian83}@tsinghua.edu.cn; {hejian13, dd11}@mails.tsinghua.edu.cn
ABSTRACT
In this paper we study similarity join and search on multi-
attribute data. Traditional methods on single-attribute data
have pruning power only on single attributes and cannot
eciently support multi-attribute data. To address this
problem, we propose a prefix tree index which has holis-
tic pruning ability on multiple attributes. We propose a
cost model to quantify the prefix tree which can guide the
prefix tree construction. Based on the prefix tree, we devise
a filter-verification framework to support similarity search
and join on multi-attribute data. The filter step prunes a
large number of dissimilar results and identifies some candi-
dates using the prefix tree and the verification step verifies
the candidates to generate the final answer. For similar-
ity join, we prove that constructing an optimal prefix tree
is NP-complete and develop a greedy algorithm to achieve
high performance. For similarity search, since one prefix
tree cannot support all possible search queries, we extend
the cost model to support similarity search and devise a
budget-based algorithm to construct multiple high-quality
prefix trees. We also devise a hybrid verification algorithm
to improve the verification step. Experimental results show
our metho d significantly outperforms baseline approaches.
Categories and Subject Descriptors
H.2 [Database Management]: Database applications;
H.3.3 [Information Search and Retrieval]: Search process
Keywords
Similarity Search; Similarity Join; Multi-Attribute Data
1. INTRODUCTION
Real-world data is rather dirty due to typographical errors
and dierent representations of the same entity [11]. Tra-
ditional exact-matching join and search op erations cannot
tolerate the dirty data and thus similarity join and search
are recently proposed to tolerate the dirty data. Given two
multi-attribute tables (e.g., products and movies), similar-
ity join finds all similar pairs from the two tables, where the
similarity can be quantified by similarity functions, e.g., edit
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
SIGMOD’15, May 31–June 4, 2015, Melbourne, Victoria, Australia.
Copyright
c
2015 ACM 978-1-4503-2758-9/15/05 ...$15.00.
http://dx.doi.org/10.1145/2723372.2723733.
distance and Jaccard. Given a table with multiple attributes
and a query, similarity search finds the records from the ta-
ble that are similar to the query. Similarity join and search
are two important operations in data cleaning and integra-
tion and have many real-world applications, e.g., duplicate
detection, data fusion, and spell checking [11]. For exam-
ple, in the event of plane crash like Malaysia airlines flight
MH17, there are many data sources about passengers on
MH17, e.g., tables from Malaysia airlines, departure coun-
try, destination county, and passengers’ countries. We re-
quire to integrate them to generate high-quality data. As
another example, a user wants to search movies directed
by James Cameron and acted by Arnold Schwarzenegger
from a movie database with actors, directors, and movie
names. However the user does not know the exact spelling
of Arnold Schwarzenegger and inputs a query with ty-
p os. Obvi ously exact-matching search cannot find any re-
sults while similarity search can alleviate this problem and
help the user to find the relevant answers.
Existing algorithms on string similarity join [20,26,28,29]
and search [9,14,15,22,30] have pruning power only on single
attributes. For similarity join and search with constraints on
multiple attributes, they have to first use a single attribute
to identify candidates and then check whether the candi-
dates satisfy the constraints on other attributes. Obviously
these algorithms are rather expensive because they have no
pruning power on other attributes and generate many in-
termediate results. It calls for eective methods to support
holistic similarity join and search on multi-attribute data.
To address this problem, we propose a prefix tree index
which has holistic pruning power on multiple attributes.
Based on the prefix tree, we devise a filter-verification frame-
work. The filter step prunes a large number of dissimilar
results and identifies some candidates using the prefix tree
and the verification step verifies the candidates to gener-
ate the final answer. For similarity join, we propose a cost
mo del to quantify the prefix tree. We prove that construct-
ing an optimal prefix tree is NP-complete and we develop
a greedy algorithm to achieve high performance. Dierent
from similarity join, one prefix tree cannot support all pos-
sible search queries with dierent similarity functions and
thresholds. To address this issue, we extend the cost model
to support similarity search. We develop a budge-based al-
gorithm to build high-quality prefix trees to achieve high
search performance. Since the similarity join and search
queries contain multiple attributes, the verification order on
attributes has a significant eect on the verification perfor-
mance. In addition, many filtering algorithms can be used
to verify the candidates which also have a great eect on
the performance. To this end, we devise a hybrid algorithm
to improve the verification performance. To summarize, we
make the following contributions. (1) We propose a prefix
tree index which has holistic pruning power on multiple at-
tributes and can be utilized to support both similarity join
and search queries. (2) For similarity join, we develop a cost
mo del to quantify the prefix tree. We prove that construct-
ing an optimal prefix tree is NP-complete and we develop a
greedy algorithm to construct a high-quality prefix tree (see
Section 3). (3) For similarity search, we devise a budget-
based method to construct multiple high-quality prefix trees
to support similarity search queries (see Section 4). (4) We
prop ose a hybrid verification algorithm to improve the ver-
ification performance (see Section 5). (5) Experimental re-
sults on real-world datasets show our method significantly
outperform s baseline approaches (see Section 6).
2. PRELIMINARIES
We first formally define the problem in Section 2. 1 and
then briefly introduce a well-known technique prefix filter
in Section 2.2. Lastly, we review related works in Section 2.3.
2.1 Problem Definition
Consider two multi-attribute tables R and S. Table R has
x attributes R
1
, R
2
, ··· , R
x
and m rows r
1
,r
2
, ··· ,r
m
.Ta-
ble S has y attributes S
1
, S
2
, ··· , S
y
and n rows s
1
,s
2
, ··· ,s
n
.
Let r
p
=[r
1
p
,r
2
p
, ··· ,r
x
p
] where r
i
p
denotes the value of r
p
on
attribute R
i
, and s
q
=[s
1
q
,s
2
q
, ··· ,s
y
q
] where s
j
q
denotes the
value of s
q
on attribute S
j
.
An atomic similarity join operation R
i
S
j
returns all
similar pairs {(r
p
2R,s
q
2S)} such that r
i
p
s
j
q
, where
is a similarity operator which will be defined later.
A complex similarity join operation is composed by more
than one atomic similarity join operation, e.g., = R
i
1
S
j
1
^R
i
2
S
j
2
^···^R
i
k
S
j
k
, which finds all similar
pairs {(r
p
2R,s
q
2S)} such that r
i
1
p
s
j
1
q
^ r
i
2
p
s
j
2
q
^···^
r
i
k
p
s
j
k
q
where 8t2[1,k], i
t
2[1,x] and j
t
2[1,y]. We call
t
=R
i
t
S
j
t
a predicate or attribute if the context i s clear.
We utilize similarity measures to define the similarity op-
erator , which can be broadly classified into two categories,
character-based similarity and token-based similarity.
Character-based Similarity. It defines the similarity b e-
tween two strings based on character transformations. The
well-known character-based similarity is edit distance. Con-
sider two string values r
i
p
and s
j
q
. The edit distance ED(r
i
p
,s
j
q
)
is the minimum number of single-character edit ope rations
(including insertion, deletion and substitution) needed to
transform r
i
p
to s
j
q
. The edit similarity extends edit distance
by taking into account the string length, which is defined
as ES(r
i
p
,s
j
q
)=1
ED(r
i
p
,s
j
q
)
max(|r
i
p
|,|s
j
q
|)
, where |r
i
p
| is the length of
r
i
p
. Two strings r
i
p
and s
j
q
are similar with respect to edit
similarity if their similarity exceeds a given threshold ,i.e.,
r
i
p
ES,
s
j
q
i. ES(r
i
p
,s
j
q
) .
For example, given the Name attribute of table R in Fig-
ure 1, r
1
1
=‘Jeffery Ullman’, r
1
4
=‘Jeffer Ullman’. We have
ED(r
1
1
,r
1
4
) = 1 and ES(r
1
1
,r
1
4
)=1
1
14
=
13
14
.
Token-based Similarity. It first splits each string into a
set of tokens by white spaces and then utilizes the set-based
similarity functions to quantify the similarity. The well-
R : Customer records of BOA
Id Name (R
1
) Address (R
2
) Country (R
3
)
r
1
Jeery Ullman EE UCLA USA
r
2
Jennifer Widom CS Stanford CA USA
r
3
Jerry Wang CS Berkeley ENG
r
4
Jeer Ullman CS Stanford CA USA
r
5
Don Antenio CS Stanford CA USA
S : Customer records of Wells Fargo
Id Name (S
1
) Address (S
2
) Country (S
3
)
s
1
Jeery Ullman Stanford CA USA
s
2
Herry Wang Berkeley CA ENG
s
3
Don Ertenio CS Berkeley ENG
Figure 1: Two example tables: R and S.
known similarity functions include Overlap and Jaccard, de-
fined as be low (see Appendix B for more functions).
Overlap similarity: OLP(r
i
p
,s
j
q
)=|r
i
p
\ s
j
q
|
Jaccard similarity: JAC(r
i
p
,s
j
q
)=
|r
i
p
\s
j
q
|
|r
i
p
[s
j
q
|
where |r
i
p
| denotes the size of set r
i
p
.
r
i
p
and s
j
q
are similar with respect to the token-based sim-
ilarity if their similarity exceeds a threshold ,e.g.,
r
i
p
JAC,
s
j
q
i. JAC(r
i
p
,s
j
q
) .
For example, given the Address attribute of table R in
Figure 1, r
2
3
=‘CS Berkeley and r
2
4
=‘CS Stanford CA’.
OLP(r
2
3
,r
2
4
) = 1 and JAC(r
2
3
,r
2
4
)=
1
4
.
Traditional similarity join and search methods focus on
atomic operations, while in this paper we emphasize on com-
plex operations. Next, we formalize the problems of string
similarly join and search on multi-attribute data.
Definition 1. (Similarity Join on Multi-Attribute
Da t a) Given two multi-attribute tables R and S and a com-
plex similarity operation =R
i
1
S
j
1
^R
i
2
S
j
2
^···^
R
i
k
S
j
k
,itfindsallsimilarpairs{(r
p
2R,s
q
2S)} such
that r
i
1
p
s
j
1
q
^ r
i
2
p
s
j
2
q
^···^r
i
k
p
s
j
k
q
.
Definition 2. (Similarity Search on Multi-Attribute
Da t a) Given a multi-attribute table R and a query Q =
(R
i
1
Q
i
1
, R
i
2
Q
i
2
, ··· , R
i
k
Q
i
k
),whereQ
i
t
denotes
the query value on attribute R
i
t
for t 2 [1,k],itfindssimilar
strings r
p
2Rsuch that r
i
1
p
Q
i
1
^ r
i
2
p
Q
i
2
^···^r
i
k
p
Q
i
k
.
For example, given two multi-attribute tables R and S in
Figure 1 and a complex operation R
1
ES,0.8
S
1
^R
2
JAC,0.5
S
2
. The first predicate is ES(r
1
p
,s
1
q
) 0.8 and the second
is JAC(r
2
p
,s
2
q
) 0.5. Similarity join on multi-attribute data
returns {(r
4
,s
1
)} as ES(r
1
4
,s
1
1
)=0.93 0.8 and JAC(r
2
4
,s
2
1
)
=0.67 0.5. Although r
1
1
= s
1
1
,(r
1
,s
1
) is not a similar pair,
as their Address values are not similar (JAC(r
2
1
,s
2
1
)=00.5).
Given a multi-attribute table R in Figure 1 and a query
Q =(R
1
ES,0.8
Jenifer Widom’, R
2
OLP,2
CS Stanford’).
The first predicate is ES(r
1
p
, Q
1
) 0.8 and the second is
OLP(r
2
p
, Q
2
) 2. Similarity search on multi-attribute data
returns {r
2
} as ES(r
1
2
, Q
1
)=0.930.8 and OLP(r
2
2
, Q
2
)=22.
2.2 Prefix Filter
We briefly introduce the prefix filter [1,2]. It first trans-
forms each string to a set. For edit distance, the set contains
all q-grams of t he string where a q-gram is a substring with
length q. For token-based similarity, the set contains all to-
kens in the string. Then it converts character-based and
token-based similarities to the overlap similarity: if string s
is similar to string r, the overlap of their corresponding sets
must exceed o, where o can be deduced as below.
If OLP(r, s) then o = .
of 15
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜