Efficient Similarity Join and Search on Multi-Attribute Data
Guoliang Li
†
Jian He
†
Dong Deng
†
Jian Li
‡
†
Department of Computer Science, Tsinghua University, Beijing, China
‡
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
{liguoliang, lijian83}@tsinghua.edu.cn; {hejian13, dd11}@mails.tsinghua.edu.cn
ABSTRACT
In this paper we study similarity join and search on multi-
attribute data. Traditional methods on single-attribute data
have pruning power only on single attributes and cannot
efficiently support multi-attribute data. To address this
problem, we propose a prefix tree index which has holis-
tic pruning ability on multiple attributes. We propose a
cost model to quantify the prefix tree which can guide the
prefix tree construction. Based on the prefix tree, we devise
a filter-verification framework to support similarity search
and join on multi-attribute data. The filter step prunes a
large number of dissimilar results and identifies some candi-
dates using the prefix tree and the verification step verifies
the candidates to generate the final answer. For similar-
ity join, we prove that constructing an optimal prefix tree
is NP-complete and develop a greedy algorithm to achieve
high performance. For similarity search, since one prefix
tree cannot support all possible search queries, we extend
the cost model to support similarity search and devise a
budget-based algorithm to construct multiple high-quality
prefix trees. We also devise a hybrid verification algorithm
to improve the verification step. Experimental results show
our metho d significantly outperforms baseline approaches.
Categories and Subject Descriptors
H.2 [Database Management]: Database applications;
H.3.3 [Information Search and Retrieval]: Search process
Keywords
Similarity Search; Similarity Join; Multi-Attribute Data
1. INTRODUCTION
Real-world data is rather dirty due to typographical errors
and di↵erent representations of the same entity [11]. Tra-
ditional exact-matching join and search op erations cannot
tolerate the dirty data and thus similarity join and search
are recently proposed to tolerate the dirty data. Given two
multi-attribute tables (e.g., products and movies), similar-
ity join finds all similar pairs from the two tables, where the
similarity can be quantified by similarity functions, e.g., edit
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
SIGMOD’15, May 31–June 4, 2015, Melbourne, Victoria, Australia.
Copyright
c
2015 ACM 978-1-4503-2758-9/15/05 ...$15.00.
http://dx.doi.org/10.1145/2723372.2723733.
distance and Jaccard. Given a table with multiple attributes
and a query, similarity search finds the records from the ta-
ble that are similar to the query. Similarity join and search
are two important operations in data cleaning and integra-
tion and have many real-world applications, e.g., duplicate
detection, data fusion, and spell checking [11]. For exam-
ple, in the event of plane crash like Malaysia airlines flight
MH17, there are many data sources about passengers on
MH17, e.g., tables from Malaysia airlines, departure coun-
try, destination county, and passengers’ countries. We re-
quire to integrate them to generate high-quality data. As
another example, a user wants to search movies directed
by “James Cameron” and acted by “Arnold Schwarzenegger”
from a movie database with actors, directors, and movie
names. However the user does not know the exact spelling
of “Arnold Schwarzenegger” and inputs a query with ty-
p os. Obvi ously exact-matching search cannot find any re-
sults while similarity search can alleviate this problem and
help the user to find the relevant answers.
Existing algorithms on string similarity join [20,26,28,29]
and search [9,14,15,22,30] have pruning power only on single
attributes. For similarity join and search with constraints on
multiple attributes, they have to first use a single attribute
to identify candidates and then check whether the candi-
dates satisfy the constraints on other attributes. Obviously
these algorithms are rather expensive because they have no
pruning power on other attributes and generate many in-
termediate results. It calls for e↵ective methods to support
holistic similarity join and search on multi-attribute data.
To address this problem, we propose a prefix tree index
which has holistic pruning power on multiple attributes.
Based on the prefix tree, we devise a filter-verification frame-
work. The filter step prunes a large number of dissimilar
results and identifies some candidates using the prefix tree
and the verification step verifies the candidates to gener-
ate the final answer. For similarity join, we propose a cost
mo del to quantify the prefix tree. We prove that construct-
ing an optimal prefix tree is NP-complete and we develop
a greedy algorithm to achieve high performance. Di↵erent
from similarity join, one prefix tree cannot support all pos-
sible search queries with di↵erent similarity functions and
thresholds. To address this issue, we extend the cost model
to support similarity search. We develop a budge-based al-
gorithm to build high-quality prefix trees to achieve high
search performance. Since the similarity join and search
queries contain multiple attributes, the verification order on
attributes has a significant e↵ect on the verification perfor-
mance. In addition, many filtering algorithms can be used
评论