暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
VLDB2024_Efficient and Accurate SimRank-based Similarity Joins Experiments, Analysis, and Improvement_gStore.pdf
667
13页
2次
2024-09-05
免费下载
Eicient and Accurate SimRank-based Similarity Joins:
Experiments, Analysis, and Improvement
Qian Ge
Peking University
geqian@pku.edu.cn
Yu Liu
†§
Yinghao Zhao
Yuetian Sun
Beijing Jiaotong University
Beijing Key Laboratory of Trac Data Analysis and
Mining, Beijing, China
{yul,yinghaozhao,yuetiansun}@bjtu.edu.cn
Lei Zou
Peking University
zoulei@pku.edu.cn
Yuxing Chen
Anqun Pan
Tencent Inc.
{axingguchen,aaronpan}@tencent.com
ABSTRACT
SimRank-based similarity joins, which mainly include threshold-
based and top-
𝑘
similarity joins, are important types of all-pair
SimRank queries. Although a line of related algorithms have been
proposed recently, they still fall short of providing approximation
guarantee and suer from scalability issues on medium and large
graphs. Meanwhile, we also lack an extensive analysis of existing
techniques in terms of accuracy and eciency. Motivated by these
challenges, we rst conduct detailed analysis of state-of-the-art
algorithms and provide additional theoretical results. Second, to
address the limitations of existing techniques, we propose simple
yet eective algorithm frameworks for both queries to theoret-
ically guarantee the approximation bound, and present a more
ecient all-pair algorithm inspired by randomized local push of
Personalized PageRank. Next, we analyze the algorithmic complex-
ity of threshold-based and top-
𝑘
similarity joins by leveraging a
reasonable assumption of SimRank distribution. Through extensive
experiments, we nd that our proposed methods far exceed existing
ones with respect to query eciency, approximation guarantee and
practical accuracy, while our theoretical analysis nicely matches
the empirical study.
PVLDB Reference Format:
Qian Ge, Yu Liu, Yinghao Zhao, Yuetian Sun, Lei Zou, Yuxing Chen, and
Anqun Pan. Ecient and Accurate SimRank-based Similarity Joins:
Experiments, Analysis, and Improvement. PVLDB, 17(4): 617-629, 2023.
doi:10.14778/3636218.3636219
PVLDB Artifact Availability:
This work is licensed under the Creative Commons BY-NC-ND 4.0 International
License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of
this license. For any use beyond those covered by this license, obtain permission by
emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 4 ISSN 2150-8097.
doi:10.14778/3636218.3636219
These authors contributed equally to this work.
§Corresponding author.
The source code, data, and/or other artifacts have been made available at
https://github.com/xinghun0525/R2LP.
1 INTRODUCTION
SimRank [
12
] is one of the most important measures for pairwise
node similarity, which nds its applications in recommender sys-
tems [
19
], link prediction [
27
], and graph embeddings [
34
]. The
SimRank similarity of node
𝑢
and
𝑣
is dened by the following
recursive equation, with the intuition that “two nodes in a graph
are similar if their in-neighbors are similar, and a node is most
similar to itself”:
𝑠 (𝑢, 𝑣) =
1, if 𝑢 = 𝑣
𝑐
|𝐼 (𝑢)||𝐼 (𝑣)|
𝑥 𝐼 (𝑢)
𝑦𝐼 (𝑣)
𝑠 (𝑥,𝑦), otherwise.
(1)
Here,
𝐼 (𝑢)
denote the in-neighbors of node
𝑢
, and
𝑐 (
0
,
1
)
is the
decay factor, which is typically set to 0.6 [
33
,
36
,
39
] or 0.8 [
12
]. Due
to its computational complexity, ecient computation of SimRank
has been extensively studied in the past two decades [
8
,
16
,
18
,
22
,
24
,
26
,
28
,
32
], and remains to be a hot topic in the very recent
years [23, 31, 36, 39, 45, 48].
According to the query inputs and outputs, SimRank queries can
be broadly categorized as single-pair queries, single-source queries,
and all-pair queries with the following denition.
Definition 1 (All-Pair SimRank Computation). Given a graph
𝐺 = (𝑉, 𝐸)
of
𝑛
nodes and an error parameter
𝜀
, return the SimRank
estimation
𝑠
ˆ
(𝑢, 𝑣)
for every node pair
(𝑢, 𝑣)
with absolute error guar-
antee, so that |𝑠
ˆ
(𝑢, 𝑣) 𝑠 (𝑢, 𝑣)| 𝜀 (with high probability).
Among existing work [
20
,
26
,
28
,
29
,
32
,
39
,
45
48
] that tack-
les all-pair queries, state-of-the-art algorithms [
39
,
48
] can hardly
guarantee an additive error of
𝜀 =
0
.
01 on million-node graphs.
Besides, a recent work [
36
] indicates that it is “essentially hopeless”
to compute all-pair queries on large graphs, as non-zero items can
be as large as
𝑂 (𝑛
2
)
. Fortunately, two modied versions of all-pair
queries have been studied, namely, threshold-based [
29
,
46
,
47
] and
top-
𝑘
[
20
,
32
,
48
] similarity joins, of which the outputs are small
617
subsets of all-pair SimRank computation. Both queries are more
practically useful because we are only interested in the node pairs
with non-negligible similarities [48].
Definition 2 (Threshold-based Similarity Join). Given a
graph
𝐺 = (𝑉, 𝐸)
and a threshold
𝜃 >
0, return the set of node pairs
R(𝜃) with 𝑢 𝑣 and SimRank value 𝑠(𝑢, 𝑣) 𝜃.
Definition 3 (Top-
𝑘
Similarity Join). Given a graph
𝐺 = (𝑉, 𝐸)
and a parameter
𝑘 >
0, return a
𝑘
-sized set
R(𝑘)
containing the
node pairs with the top-
𝑘
largest SimRank values among all pairs
{(𝑢, 𝑣)|𝑢, 𝑣 𝑉 , 𝑢 𝑣}.
Analogous to all-pair SimRank computation that an error pa-
rameter
𝜀
is allowed, we introduce the approximation bound
𝜌
for
similarity joins. As we shall see in Section 4, without
𝜌
(i.e., setting
𝜌 =
1), in worst cases no algorithm can be more ecient than
computing the exact SimRank values
1
.
Definition 4 (Approximation bound). Given an approximation
parameter
𝜌 (
0
,
1
)
, we say an algorithm
A
has approximation
bound for threshold-based (resp. top-
𝑘
) similarity join of SimRank, if
for any input graph
𝐺
, the returned set of node pairs contains at least
𝜌 fraction of the ground truth answer.
1.1 Motivations
We note that existing algorithms for threshold-based and top-
𝑘
similarity joins signicantly fall short of both query eciency and
approximation guarantee. We identify the following issues.
Motivation 1. Lack of experimental analysis of eciency and accuracy,
especially on medium and large graphs. Even for state-of-the-art
algorithms [
39
,
48
] of general all-pair queries, the result accuracy
is not evaluated except for small graphs due to lack of ground
truth. Other representative algorithms [
20
,
46
] even do not evaluate
accuracy on small graphs.
Motivation 2. Lack of approximation guarantee for threshold-based
and top-
𝑘
similarity joins. Surprisingly, there does not exist an algo-
rithm that guarantees the approximation bound for threshold-based
or top-
𝑘
similarity join. Algorithms with absolute error guarantee
cannot be directly extended to achieve this goal.
Motivation 3. Lack of understanding of the computational complexity.
As far as we know, it is unclear how the input parameters and graph
properties impact the algorithmic complexity of both queries.
1.2 Contributions
For both types of all-pair SimRank queries, we analyze state-of-
the-art algorithms in detail, provide improved algorithms with
approximation guarantee and lower theoretical complexity, and
discuss the problem complexity. Our contributions are summarized
as follows.
Detailed analysis of state-of-the-art algorithms. We choose
four representative algorithms, UISim [
48
], FLP & Opt-LP [
39
], H-
go SRJ [
46
], and KSimJoin [
20
] for comparison. UISim and FLP &
Opt-LP are state-of-the-art algorithms for general all-pair queries,
which are directly extended to answer SimRank-based similarity
joins [
48
]. H-go SRJ and KSimJoin are the state-of-the-art algorithms
1
We distinguish between the exact and the ground truth SimRank values, where the for-
mer is the exact solution to Equation 1 and the latter is a very accurate approximation.
for threshold-based and top-
𝑘
similarity joins, respectively. Apart
from their performance, we choose UISim [
48
] and KSimJoin [
20
]
because they adopt the random walk interpretation, whereas Opt-
LP [
39
] and H-go SRJ [
46
] are the best methods based on the Sim-
Rank matrix formation. For each algorithm, we discuss its basic
idea, theoretical guarantee, complexity analysis and empirical study.
We propose additional theoretical results (w.r.t. complexity or error
guarantee) that are incorrect or missed in the original papers.
Improved algorithms for SimRank-based similarity joins. We
rst propose two simple yet eective algorithm frameworks for
both queries respectively with approximation bound. The basic idea
is to invoke an ecient algorithm for all-pair queries that estimates
SimRank values within
𝜀
error, and to gradually shrink
𝜀
until the
approximation bound is theoretically guaranteed. Next, we utilize
the equivalence of all-pair SimRank on the input graph
𝐺
and single-
target Personalized PageRank (PPR) on the SimRank graph
𝐺
𝑠
[
39
],
and devise Randomized Reverse Local Push (R
2
LP) inspired by ran-
domized backward push for PPR [
35
]. It improves the all-pair query
complexity from
𝑂 (
𝑢,𝑣𝑉
𝑑
𝑖𝑛
(𝑢)𝑑
𝑖𝑛
(𝑣 )𝑠 (𝑢,𝑣 )
𝜀
)
(the best known algo-
rithm [39]) to 𝑂
˜
(
𝑢,𝑣𝑉
𝑑
𝑖𝑛
(𝑢)𝑑
𝑖𝑛
(𝑣 )𝑠 (𝑢,𝑣 )
𝜀
). We further propose a
pruning strategy to dramatically enhance practical eciency while
retaining error guarantee.
Complexity analysis by utilizing SimRank distribution. We
nd that it is almost impossible to give a non-trivial complexity
bound for both similarity join queries without any assumption
of SimRank distribution. Our empirical analysis validates that for
most real-world graphs, the ground truth SimRank values follow
some generalized version of power-law distribution [
2
,
6
], which is
consistent with previous studies [
23
,
36
]. We leverage this property
to propose veriable assumptions of SimRank distribution and
present complexity analysis for our algorithms, which serves as
the upper bound of the algorithmic complexity of SimRank-based
similarity joins. Our theoretical results nicely match the empirical
study in Section 6.
Extensive experiments on medium and large graphs. We con-
duct extensive experiments on small, medium, and large sized
graphs and compare our algorithms with state-of-the-art methods.
As far as we know, we rst employ medium and large datasets to
systematically evaluate threshold-based and top-
𝑘
similarity joins
in terms of running time, absolute error and query accuracy, and
to demonstrate the necessities of holding approximation bound
as well as devising more ecient algorithms. Experimental study
demonstrates that our algorithms outperform existing ones in most
cases, achieving better accuracy while being faster by up to an order
of magnitude. More importantly, our empirical ndings validate
that it is the skewness of SimRank distribution that mainly deter-
mines the problem hardness. This results veries the eectiveness
of our theoretical analysis.
2 PRELIMINARIES
The denition in Equation 1 can be interpreted via the node-pairs
graph
𝐺
2
[
12
]. Each node in
𝐺
2
represents a pair of nodes
(𝑢, 𝑣)
in
𝐺
. If there exist edges
(𝑢, 𝑣)
and
(𝑥,𝑦)
in
𝐺
, edge
((𝑢, 𝑥), (𝑣,𝑦))
is
included in
𝐺
2
. Specically, every node
(𝑣, 𝑣)
is called the singleton
node. To this end, SimRank can be thought of as propagating the
similarity among nodes in
𝐺
2
, starting from all the singleton nodes.
618
of 13
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜