19-Efficient Similarity Join and Search on Multi-Attribute Data.pdf

虎头虎脑

131

15页

2次

2022-03-02

免费下载

Efﬁcient Similarity Join and Search on Multi-Attribute Data
Guoliang Li
†
Jian He
†
Dong Deng
†
Jian Li
‡
†
Department of Computer Science, Tsinghua University, Beijing, China
‡
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
{liguoliang, lijian83}@tsinghua.edu.cn; {hejian13, dd11}@mails.tsinghua.edu.cn
ABSTRACT
In this paper we study similarity join and search on multi-
attribute data. Traditional methods on single-attribute data
have pruning power only on single attributes and cannot
eﬃciently support multi-attribute data. To address this
problem, we propose a preﬁx tree index which has holis-
tic pruning ability on multiple attributes. We propose a
cost model to quantify the preﬁx tree which can guide the
preﬁx tree construction. Based on the preﬁx tree, we devise
a ﬁlter-veriﬁcation framework to support similarity search
and join on multi-attribute data. The ﬁlter step prunes a
large number of dissimilar results and identiﬁes some candi-
dates using the preﬁx tree and the veriﬁcation step veriﬁes
the candidates to generate the ﬁnal answer. For similar-
ity join, we prove that constructing an optimal preﬁx tree
is NP-complete and develop a greedy algorithm to achieve
high performance. For similarity search, since one preﬁx
tree cannot support all possible search queries, we extend
the cost model to support similarity search and devise a
budget-based algorithm to construct multiple high-quality
preﬁx trees. We also devise a hybrid veriﬁcation algorithm
to improve the veriﬁcation step. Experimental results show
our metho d signiﬁcantly outperforms baseline approaches.
Categories and Subject Descriptors
H.2 [Database Management]: Database applications;
H.3.3 [Information Search and Retrieval]: Search process
Keywords
Similarity Search; Similarity Join; Multi-Attribute Data
1. INTRODUCTION
Real-world data is rather dirty due to typographical errors
and di↵erent representations of the same entity [11]. Tra-
ditional exact-matching join and search op erations cannot
tolerate the dirty data and thus similarity join and search
are recently proposed to tolerate the dirty data. Given two
multi-attribute tables (e.g., products and movies), similar-
ity join ﬁnds all similar pairs from the two tables, where the
similarity can be quantiﬁed by similarity functions, e.g., edit
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for proﬁt or commercial advantage and that copies bear this notice and the full cita-
tion on the ﬁrst page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior speciﬁc permission
and/or a fee. Request permissions from permissions@acm.org.
SIGMOD’15, May 31–June 4, 2015, Melbourne, Victoria, Australia.
Copyright
c
 2015 ACM 978-1-4503-2758-9/15/05 ...$15.00.
http://dx.doi.org/10.1145/2723372.2723733.
distance and Jaccard. Given a table with multiple attributes
and a query, similarity search ﬁnds the records from the ta-
ble that are similar to the query. Similarity join and search
are two important operations in data cleaning and integra-
tion and have many real-world applications, e.g., duplicate
detection, data fusion, and spell checking [11]. For exam-
ple, in the event of plane crash like Malaysia airlines ﬂight
MH17, there are many data sources about passengers on
MH17, e.g., tables from Malaysia airlines, departure coun-
try, destination county, and passengers’ countries. We re-
quire to integrate them to generate high-quality data. As
another example, a user wants to search movies directed
by “James Cameron” and acted by “Arnold Schwarzenegger”
from a movie database with actors, directors, and movie
names. However the user does not know the exact spelling
of “Arnold Schwarzenegger” and inputs a query with ty-
p os. Obvi ously exact-matching search cannot ﬁnd any re-
sults while similarity search can alleviate this problem and
help the user to ﬁnd the relevant answers.
Existing algorithms on string similarity join [20,26,28,29]
and search [9,14,15,22,30] have pruning power only on single
attributes. For similarity join and search with constraints on
multiple attributes, they have to ﬁrst use a single attribute
to identify candidates and then check whether the candi-
dates satisfy the constraints on other attributes. Obviously
these algorithms are rather expensive because they have no
pruning power on other attributes and generate many in-
termediate results. It calls for e↵ective methods to support
holistic similarity join and search on multi-attribute data.
To address this problem, we propose a preﬁx tree index
which has holistic pruning power on multiple attributes.
Based on the preﬁx tree, we devise a ﬁlter-veriﬁcation frame-
work. The ﬁlter step prunes a large number of dissimilar
results and identiﬁes some candidates using the preﬁx tree
and the veriﬁcation step veriﬁes the candidates to gener-
ate the ﬁnal answer. For similarity join, we propose a cost
mo del to quantify the preﬁx tree. We prove that construct-
ing an optimal preﬁx tree is NP-complete and we develop
a greedy algorithm to achieve high performance. Di↵erent
from similarity join, one preﬁx tree cannot support all pos-
sible search queries with di↵erent similarity functions and
thresholds. To address this issue, we extend the cost model
to support similarity search. We develop a budge-based al-
gorithm to build high-quality preﬁx trees to achieve high
search performance. Since the similarity join and search
queries contain multiple attributes, the veriﬁcation order on
attributes has a signiﬁcant e↵ect on the veriﬁcation perfor-
mance. In addition, many ﬁltering algorithms can be used

to verify the candidates which also have a great e↵ect on
the performance. To this end, we devise a hybrid algorithm
to improve the veriﬁcation performance. To summarize, we
make the following contributions. (1) We propose a preﬁx
tree index which has holistic pruning power on multiple at-
tributes and can be utilized to support both similarity join
and search queries. (2) For similarity join, we develop a cost
mo del to quantify the preﬁx tree. We prove that construct-
ing an optimal preﬁx tree is NP-complete and we develop a
greedy algorithm to construct a high-quality preﬁx tree (see
Section 3). (3) For similarity search, we devise a budget-
based method to construct multiple high-quality preﬁx trees
to support similarity search queries (see Section 4). (4) We
prop ose a hybrid veriﬁcation algorithm to improve the ver-
iﬁcation performance (see Section 5). (5) Experimental re-
sults on real-world datasets show our method signiﬁcantly
outperform s baseline approaches (see Section 6).
2. PRELIMINARIES
We ﬁrst formally deﬁne the problem in Section 2. 1 and
then brieﬂy introduce a well-known technique – preﬁx ﬁlter
in Section 2.2. Lastly, we review related works in Section 2.3.
2.1 Problem Deﬁnition
Consider two multi-attribute tables R and S. Table R has
x attributes R
1
, R
2
, ··· , R
x
and m rows r
1
,r
2
, ··· ,r
m
.Ta-
ble S has y attributes S
1
, S
2
, ··· , S
y
and n rows s
1
,s
2
, ··· ,s
n
.
Let r
p
=[r
1
p
,r
2
p
, ··· ,r
x
p
] where r
i
p
denotes the value of r
p
on
attribute R
i
, and s
q
=[s
1
q
,s
2
q
, ··· ,s
y
q
] where s
j
q
denotes the
value of s
q
on attribute S
j
.
An atomic similarity join operation R
i
⇠S
j
returns all
similar pairs {(r
p
2R,s
q
2S)} such that r
i
p
⇠ s
j
q
, where ⇠
is a similarity operator which will be deﬁned later.
A complex similarity join operation is composed by more
than one atomic similarity join operation, e.g.,  = R
i
1
⇠
S
j
1
^R
i
2
⇠S
j
2
^···^R
i
k
⇠S
j
k
, which ﬁnds all similar
pairs {(r
p
2R,s
q
2S)} such that r
i
1
p
⇠ s
j
1
q
^ r
i
2
p
⇠ s
j
2
q
^···^
r
i
k
p
⇠ s
j
k
q
where 8t2[1,k], i
t
2[1,x] and j
t
2[1,y]. We call

t
=R
i
t
⇠S
j
t
a predicate or attribute if the context i s clear.
We utilize similarity measures to deﬁne the similarity op-
erator ⇠, which can be broadly classiﬁed into two categories,
character-based similarity and token-based similarity.
Character-based Similarity. It deﬁnes the similarity b e-
tween two strings based on character transformations. The
well-known character-based similarity is edit distance. Con-
sider two string values r
i
p
and s
j
q
. The edit distance ED(r
i
p
,s
j
q
)
is the minimum number of single-character edit ope rations
(including insertion, deletion and substitution) needed to
transform r
i
p
to s
j
q
. The edit similarity extends edit distance
by taking into account the string length, which is deﬁned
as ES(r
i
p
,s
j
q
)=1
ED(r
i
p
,s
j
q
)
max(|r
i
p
|,|s
j
q
|)
, where |r
i
p
| is the length of
r
i
p
. Two strings r
i
p
and s
j
q
are similar with respect to edit
similarity if their similarity exceeds a given threshold ⌧,i.e.,
r
i
p
ES,⌧
⇠ s
j
q
i↵. ES(r
i
p
,s
j
q
)  ⌧.
For example, given the Name attribute of table R in Fig-
ure 1, r
1
1
=‘Jeffery Ullman’, r
1
4
=‘Jeffer Ullman’. We have
ED(r
1
1
,r
1
4
) = 1 and ES(r
1
1
,r
1
4
)=1
1
14
=
13
14
.
Token-based Similarity. It ﬁrst splits each string into a
set of tokens by white spaces and then utilizes the set-based
similarity functions to quantify the similarity. The well-
R : Customer records of BOA
Id Name (R
1
) Address (R
2
) Country (R
3
)
r
1
Je↵ery Ullman EE UCLA USA
r
2
Jennifer Widom CS Stanford CA USA
r
3
Jerry Wang CS Berkeley ENG
r
4
Je↵er Ullman CS Stanford CA USA
r
5
Don Antenio CS Stanford CA USA
S : Customer records of Wells Fargo
Id Name (S
1
) Address (S
2
) Country (S
3
)
s
1
Je↵ery Ullman Stanford CA USA
s
2
Herry Wang Berkeley CA ENG
s
3
Don Ertenio CS Berkeley ENG
Figure 1: Two example tables: R and S.
known similarity functions include Overlap and Jaccard, de-
ﬁned as be low (see Appendix B for more functions).
• Overlap similarity: OLP(r
i
p
,s
j
q
)=|r
i
p
\ s
j
q
|
• Jaccard similarity: JAC(r
i
p
,s
j
q
)=
|r
i
p
\s
j
q
|
|r
i
p
[s
j
q
|
where |r
i
p
| denotes the size of set r
i
p
.
r
i
p
and s
j
q
are similar with respect to the token-based sim-
ilarity if their similarity exceeds a threshold ⌧ ,e.g.,
r
i
p
JAC,⌧
⇠ s
j
q
i↵. JAC(r
i
p
,s
j
q
)  ⌧.
For example, given the Address attribute of table R in
Figure 1, r
2
3
=‘CS Berkeley’ and r
2
4
=‘CS Stanford CA’.
OLP(r
2
3
,r
2
4
) = 1 and JAC(r
2
3
,r
2
4
)=
1
4
.
Traditional similarity join and search methods focus on
atomic operations, while in this paper we emphasize on com-
plex operations. Next, we formalize the problems of string
similarly join and search on multi-attribute data.
Definition 1. (Similarity Join on Multi-Attribute
Da t a) Given two multi-attribute tables R and S and a com-
plex similarity operation =R
i
1
⇠S
j
1
^R
i
2
⇠S
j
2
^···^
R
i
k
⇠S
j
k
,itﬁndsallsimilarpairs{(r
p
2R,s
q
2S)} such
that r
i
1
p
⇠ s
j
1
q
^ r
i
2
p
⇠ s
j
2
q
^···^r
i
k
p
⇠ s
j
k
q
.
Definition 2. (Similarity Search on Multi-Attribute
Da t a) Given a multi-attribute table R and a query Q =
(R
i
1
⇠Q
i
1
, R
i
2
⇠Q
i
2
, ··· , R
i
k
⇠Q
i
k
),whereQ
i
t
denotes
the query value on attribute R
i
t
for t 2 [1,k],itﬁndssimilar
strings r
p
2Rsuch that r
i
1
p
⇠Q
i
1
^ r
i
2
p
⇠Q
i
2
^···^r
i
k
p
⇠Q
i
k
.
For example, given two multi-attribute tables R and S in
Figure 1 and a complex operation R
1
ES,0.8
⇠S
1
^R
2
JAC,0.5
⇠
S
2
. The ﬁrst predicate is ES(r
1
p
,s
1
q
)  0.8 and the second
is JAC(r
2
p
,s
2
q
)  0.5. Similarity join on multi-attribute data
returns {(r
4
,s
1
)} as ES(r
1
4
,s
1
1
)=0.93  0.8 and JAC(r
2
4
,s
2
1
)
=0.67  0.5. Although r
1
1
= s
1
1
,(r
1
,s
1
) is not a similar pair,
as their Address values are not similar (JAC(r
2
1
,s
2
1
)=00.5).
Given a multi-attribute table R in Figure 1 and a query
Q =(R
1
ES,0.8
⇠ ‘Jenifer Widom’, R
2
OLP,2
⇠ ‘CS Stanford’).
The ﬁrst predicate is ES(r
1
p
, Q
1
)  0.8 and the second is
OLP(r
2
p
, Q
2
)  2. Similarity search on multi-attribute data
returns {r
2
} as ES(r
1
2
, Q
1
)=0.930.8 and OLP(r
2
2
, Q
2
)=22.
2.2 Preﬁx Filter
We brieﬂy introduce the preﬁx ﬁlter [1,2]. It ﬁrst trans-
forms each string to a set. For edit distance, the set contains
all q-grams of t he string where a q-gram is a substring with
length q. For token-based similarity, the set contains all to-
kens in the string. Then it converts character-based and
token-based similarities to the overlap similarity: if string s
is similar to string r, the overlap of their corresponding sets
must exceed o, where o can be deduced as below.
• If OLP(r, s)  ⌧ then o = ⌧.

of 15

免费下载

paper 李国良

关注

评论