
subsets of all-pair SimRank computation. Both queries are more
practically useful because we are only interested in the node pairs
with non-negligible similarities [48].
Definition 2 (Threshold-based Similarity Join). Given a
graph
𝐺 = (𝑉, 𝐸)
and a threshold
𝜃 >
0, return the set of node pairs
R(𝜃) with 𝑢 ≠ 𝑣 and SimRank value 𝑠(𝑢, 𝑣) ≥ 𝜃.
Definition 3 (Top-
𝑘
Similarity Join). Given a graph
𝐺 = (𝑉, 𝐸)
and a parameter
𝑘 >
0, return a
𝑘
-sized set
R(𝑘)
containing the
node pairs with the top-
𝑘
largest SimRank values among all pairs
{(𝑢, 𝑣)|𝑢, 𝑣 ∈ 𝑉 , 𝑢 ≠ 𝑣}.
Analogous to all-pair SimRank computation that an error pa-
rameter
𝜀
is allowed, we introduce the approximation bound
𝜌
for
similarity joins. As we shall see in Section 4, without
𝜌
(i.e., setting
𝜌 =
1), in worst cases no algorithm can be more ecient than
computing the exact SimRank values
1
.
Definition 4 (Approximation bound). Given an approximation
parameter
𝜌 ∈ (
0
,
1
)
, we say an algorithm
A
has approximation
bound for threshold-based (resp. top-
𝑘
) similarity join of SimRank, if
for any input graph
𝐺
, the returned set of node pairs contains at least
𝜌 fraction of the ground truth answer.
1.1 Motivations
We note that existing algorithms for threshold-based and top-
𝑘
similarity joins signicantly fall short of both query eciency and
approximation guarantee. We identify the following issues.
Motivation 1. Lack of experimental analysis of eciency and accuracy,
especially on medium and large graphs. Even for state-of-the-art
algorithms [
39
,
48
] of general all-pair queries, the result accuracy
is not evaluated except for small graphs due to lack of ground
truth. Other representative algorithms [
20
,
46
] even do not evaluate
accuracy on small graphs.
Motivation 2. Lack of approximation guarantee for threshold-based
and top-
𝑘
similarity joins. Surprisingly, there does not exist an algo-
rithm that guarantees the approximation bound for threshold-based
or top-
𝑘
similarity join. Algorithms with absolute error guarantee
cannot be directly extended to achieve this goal.
Motivation 3. Lack of understanding of the computational complexity.
As far as we know, it is unclear how the input parameters and graph
properties impact the algorithmic complexity of both queries.
1.2 Contributions
For both types of all-pair SimRank queries, we analyze state-of-
the-art algorithms in detail, provide improved algorithms with
approximation guarantee and lower theoretical complexity, and
discuss the problem complexity. Our contributions are summarized
as follows.
Detailed analysis of state-of-the-art algorithms. We choose
four representative algorithms, UISim [
48
], FLP & Opt-LP [
39
], H-
go SRJ [
46
], and KSimJoin [
20
] for comparison. UISim and FLP &
Opt-LP are state-of-the-art algorithms for general all-pair queries,
which are directly extended to answer SimRank-based similarity
joins [
48
]. H-go SRJ and KSimJoin are the state-of-the-art algorithms
1
We distinguish between the exact and the ground truth SimRank values, where the for-
mer is the exact solution to Equation 1 and the latter is a very accurate approximation.
for threshold-based and top-
𝑘
similarity joins, respectively. Apart
from their performance, we choose UISim [
48
] and KSimJoin [
20
]
because they adopt the random walk interpretation, whereas Opt-
LP [
39
] and H-go SRJ [
46
] are the best methods based on the Sim-
Rank matrix formation. For each algorithm, we discuss its basic
idea, theoretical guarantee, complexity analysis and empirical study.
We propose additional theoretical results (w.r.t. complexity or error
guarantee) that are incorrect or missed in the original papers.
Improved algorithms for SimRank-based similarity joins. We
rst propose two simple yet eective algorithm frameworks for
both queries respectively with approximation bound. The basic idea
is to invoke an ecient algorithm for all-pair queries that estimates
SimRank values within
𝜀
error, and to gradually shrink
𝜀
until the
approximation bound is theoretically guaranteed. Next, we utilize
the equivalence of all-pair SimRank on the input graph
𝐺
and single-
target Personalized PageRank (PPR) on the SimRank graph
𝐺
𝑠
[
39
],
and devise Randomized Reverse Local Push (R
2
LP) inspired by ran-
domized backward push for PPR [
35
]. It improves the all-pair query
complexity from
𝑂 (
𝑢,𝑣∈𝑉
𝑑
𝑖𝑛
(𝑢)𝑑
𝑖𝑛
(𝑣 )𝑠 (𝑢,𝑣 )
𝜀
)
(the best known algo-
rithm [39]) to 𝑂
˜
(
𝑢,𝑣∈𝑉
√
𝑑
𝑖𝑛
(𝑢)𝑑
𝑖𝑛
(𝑣 )𝑠 (𝑢,𝑣 )
𝜀
). We further propose a
pruning strategy to dramatically enhance practical eciency while
retaining error guarantee.
Complexity analysis by utilizing SimRank distribution. We
nd that it is almost impossible to give a non-trivial complexity
bound for both similarity join queries without any assumption
of SimRank distribution. Our empirical analysis validates that for
most real-world graphs, the ground truth SimRank values follow
some generalized version of power-law distribution [
2
,
6
], which is
consistent with previous studies [
23
,
36
]. We leverage this property
to propose veriable assumptions of SimRank distribution and
present complexity analysis for our algorithms, which serves as
the upper bound of the algorithmic complexity of SimRank-based
similarity joins. Our theoretical results nicely match the empirical
study in Section 6.
Extensive experiments on medium and large graphs. We con-
duct extensive experiments on small, medium, and large sized
graphs and compare our algorithms with state-of-the-art methods.
As far as we know, we rst employ medium and large datasets to
systematically evaluate threshold-based and top-
𝑘
similarity joins
in terms of running time, absolute error and query accuracy, and
to demonstrate the necessities of holding approximation bound
as well as devising more ecient algorithms. Experimental study
demonstrates that our algorithms outperform existing ones in most
cases, achieving better accuracy while being faster by up to an order
of magnitude. More importantly, our empirical ndings validate
that it is the skewness of SimRank distribution that mainly deter-
mines the problem hardness. This results veries the eectiveness
of our theoretical analysis.
2 PRELIMINARIES
The denition in Equation 1 can be interpreted via the node-pairs
graph
𝐺
2
[
12
]. Each node in
𝐺
2
represents a pair of nodes
(𝑢, 𝑣)
in
𝐺
. If there exist edges
(𝑢, 𝑣)
and
(𝑥,𝑦)
in
𝐺
, edge
((𝑢, 𝑥), (𝑣,𝑦))
is
included in
𝐺
2
. Specically, every node
(𝑣, 𝑣)
is called the singleton
node. To this end, SimRank can be thought of as propagating the
similarity among nodes in
𝐺
2
, starting from all the singleton nodes.
618
文档被以下合辑收录
评论