VLDB2024_Efficient and Accurate SimRank-based Similarity Joins Experiments, Analysis, and Improvement_gStore.pdf

墨天轮福利君

757

13页

2次

2024-09-05

免费下载

Eicient and Accurate SimRank-based Similarity Joins:

Experiments, Analysis, and Improvement

Qian Ge

†

Peking University

geqian@pku.edu.cn

Yu Liu

†§

Yinghao Zhao

Yuetian Sun

Beijing Jiaotong University

Beijing Key Laboratory of Trac Data Analysis and

Mining, Beijing, China

{yul,yinghaozhao,yuetiansun}@bjtu.edu.cn

Lei Zou

Peking University

zoulei@pku.edu.cn

Yuxing Chen

Anqun Pan

Tencent Inc.

{axingguchen,aaronpan}@tencent.com

ABSTRACT

SimRank-based similarity joins, which mainly include threshold-

based and top-

𝑘

similarity joins, are important types of all-pair

SimRank queries. Although a line of related algorithms have been

proposed recently, they still fall short of providing approximation

guarantee and suer from scalability issues on medium and large

graphs. Meanwhile, we also lack an extensive analysis of existing

techniques in terms of accuracy and eciency. Motivated by these

challenges, we rst conduct detailed analysis of state-of-the-art

algorithms and provide additional theoretical results. Second, to

address the limitations of existing techniques, we propose simple

yet eective algorithm frameworks for both queries to theoret-

ically guarantee the approximation bound, and present a more

ecient all-pair algorithm inspired by randomized local push of

Personalized PageRank. Next, we analyze the algorithmic complex-

ity of threshold-based and top-

𝑘

similarity joins by leveraging a

reasonable assumption of SimRank distribution. Through extensive

experiments, we nd that our proposed methods far exceed existing

ones with respect to query eciency, approximation guarantee and

practical accuracy, while our theoretical analysis nicely matches

the empirical study.

PVLDB Reference Format:

Qian Ge, Yu Liu, Yinghao Zhao, Yuetian Sun, Lei Zou, Yuxing Chen, and

Anqun Pan. Ecient and Accurate SimRank-based Similarity Joins:

Experiments, Analysis, and Improvement. PVLDB, 17(4): 617-629, 2023.

doi:10.14778/3636218.3636219

PVLDB Artifact Availability:

This work is licensed under the Creative Commons BY-NC-ND 4.0 International

License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of

this license. For any use beyond those covered by this license, obtain permission by

emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights

licensed to the VLDB Endowment.

Proceedings of the VLDB Endowment, Vol. 17, No. 4 ISSN 2150-8097.

doi:10.14778/3636218.3636219

†These authors contributed equally to this work.

§Corresponding author.

The source code, data, and/or other artifacts have been made available at

https://github.com/xinghun0525/R2LP.

1 INTRODUCTION

SimRank [

] is one of the most important measures for pairwise

node similarity, which nds its applications in recommender sys-

tems [

], link prediction [

], and graph embeddings [

]. The

SimRank similarity of node

𝑢

and

𝑣

is dened by the following

recursive equation, with the intuition that “two nodes in a graph

are similar if their in-neighbors are similar, and a node is most

similar to itself”:

𝑠 (𝑢, 𝑣) =

⎧

⎪

⎪⎨

⎪

⎩

1, if 𝑢 = 𝑣

𝑐

|𝐼 (𝑢)||𝐼 (𝑣)|



𝑥 ∈𝐼 (𝑢)



𝑦∈𝐼 (𝑣)

𝑠 (𝑥,𝑦), otherwise.

(1)

Here,

𝐼 (𝑢)

denote the in-neighbors of node

𝑢

, and

𝑐 ∈ (

)

is the

decay factor, which is typically set to 0.6 [

] or 0.8 [

]. Due

to its computational complexity, ecient computation of SimRank

has been extensively studied in the past two decades [

], and remains to be a hot topic in the very recent

years [23, 31, 36, 39, 45, 48].

According to the query inputs and outputs, SimRank queries can

be broadly categorized as single-pair queries, single-source queries,

and all-pair queries with the following denition.

Definition 1 (All-Pair SimRank Computation). Given a graph

𝐺 = (𝑉, 𝐸)

𝑛

nodes and an error parameter

𝜀

, return the SimRank

estimation

𝑠

(𝑢, 𝑣)

for every node pair

(𝑢, 𝑣)

with absolute error guar-

antee, so that |𝑠

(𝑢, 𝑣) −𝑠 (𝑢, 𝑣)| ≤ 𝜀 (with high probability).

Among existing work [

–

] that tack-

les all-pair queries, state-of-the-art algorithms [

] can hardly

guarantee an additive error of

𝜀 =

01 on million-node graphs.

Besides, a recent work [

] indicates that it is “essentially hopeless”

to compute all-pair queries on large graphs, as non-zero items can

be as large as

𝑂 (𝑛

)

. Fortunately, two modied versions of all-pair

queries have been studied, namely, threshold-based [

] and

top-

𝑘

[

] similarity joins, of which the outputs are small

617

subsets of all-pair SimRank computation. Both queries are more

practically useful because we are only interested in the node pairs

with non-negligible similarities [48].

Definition 2 (Threshold-based Similarity Join). Given a

graph

𝐺 = (𝑉, 𝐸)

and a threshold

𝜃 >

0, return the set of node pairs

R(𝜃) with 𝑢 ≠ 𝑣 and SimRank value 𝑠(𝑢, 𝑣) ≥ 𝜃.

Definition 3 (Top-

𝑘

Similarity Join). Given a graph

𝐺 = (𝑉, 𝐸)

and a parameter

𝑘 >

0, return a

𝑘

-sized set

R(𝑘)

containing the

node pairs with the top-

𝑘

largest SimRank values among all pairs

{(𝑢, 𝑣)|𝑢, 𝑣 ∈ 𝑉 , 𝑢 ≠ 𝑣}.

Analogous to all-pair SimRank computation that an error pa-

rameter

𝜀

is allowed, we introduce the approximation bound

𝜌

for

similarity joins. As we shall see in Section 4, without

𝜌

(i.e., setting

𝜌 =

1), in worst cases no algorithm can be more ecient than

computing the exact SimRank values

Definition 4 (Approximation bound). Given an approximation

parameter

𝜌 ∈ (

)

, we say an algorithm

has approximation

bound for threshold-based (resp. top-

𝑘

) similarity join of SimRank, if

for any input graph

𝐺

, the returned set of node pairs contains at least

𝜌 fraction of the ground truth answer.

1.1 Motivations

We note that existing algorithms for threshold-based and top-

𝑘

similarity joins signicantly fall short of both query eciency and

approximation guarantee. We identify the following issues.

Motivation 1. Lack of experimental analysis of eciency and accuracy,

especially on medium and large graphs. Even for state-of-the-art

algorithms [

] of general all-pair queries, the result accuracy

is not evaluated except for small graphs due to lack of ground

truth. Other representative algorithms [

] even do not evaluate

accuracy on small graphs.

Motivation 2. Lack of approximation guarantee for threshold-based

and top-

𝑘

similarity joins. Surprisingly, there does not exist an algo-

rithm that guarantees the approximation bound for threshold-based

or top-

𝑘

similarity join. Algorithms with absolute error guarantee

cannot be directly extended to achieve this goal.

Motivation 3. Lack of understanding of the computational complexity.

As far as we know, it is unclear how the input parameters and graph

properties impact the algorithmic complexity of both queries.

1.2 Contributions

For both types of all-pair SimRank queries, we analyze state-of-

the-art algorithms in detail, provide improved algorithms with

approximation guarantee and lower theoretical complexity, and

discuss the problem complexity. Our contributions are summarized

as follows.

Detailed analysis of state-of-the-art algorithms. We choose

four representative algorithms, UISim [

], FLP & Opt-LP [

], H-

go SRJ [

], and KSimJoin [

] for comparison. UISim and FLP &

Opt-LP are state-of-the-art algorithms for general all-pair queries,

which are directly extended to answer SimRank-based similarity

joins [

]. H-go SRJ and KSimJoin are the state-of-the-art algorithms

We distinguish between the exact and the ground truth SimRank values, where the for-

mer is the exact solution to Equation 1 and the latter is a very accurate approximation.

for threshold-based and top-

𝑘

similarity joins, respectively. Apart

from their performance, we choose UISim [

] and KSimJoin [

]

because they adopt the random walk interpretation, whereas Opt-

LP [

] and H-go SRJ [

] are the best methods based on the Sim-

Rank matrix formation. For each algorithm, we discuss its basic

idea, theoretical guarantee, complexity analysis and empirical study.

We propose additional theoretical results (w.r.t. complexity or error

guarantee) that are incorrect or missed in the original papers.

Improved algorithms for SimRank-based similarity joins. We

rst propose two simple yet eective algorithm frameworks for

both queries respectively with approximation bound. The basic idea

is to invoke an ecient algorithm for all-pair queries that estimates

SimRank values within

𝜀

error, and to gradually shrink

𝜀

until the

approximation bound is theoretically guaranteed. Next, we utilize

the equivalence of all-pair SimRank on the input graph

𝐺

and single-

target Personalized PageRank (PPR) on the SimRank graph

𝐺

𝑠

[

and devise Randomized Reverse Local Push (R

LP) inspired by ran-

domized backward push for PPR [

]. It improves the all-pair query

complexity from

𝑂 (



𝑢,𝑣∈𝑉

𝑑

𝑖𝑛

(𝑢)𝑑

𝑖𝑛

(𝑣 )𝑠 (𝑢,𝑣 )

𝜀

)

(the best known algo-

rithm [39]) to 𝑂

(



𝑢,𝑣∈𝑉

√

𝑑

𝑖𝑛

(𝑢)𝑑

𝑖𝑛

(𝑣 )𝑠 (𝑢,𝑣 )

𝜀

). We further propose a

pruning strategy to dramatically enhance practical eciency while

retaining error guarantee.

Complexity analysis by utilizing SimRank distribution. We

nd that it is almost impossible to give a non-trivial complexity

bound for both similarity join queries without any assumption

of SimRank distribution. Our empirical analysis validates that for

most real-world graphs, the ground truth SimRank values follow

some generalized version of power-law distribution [

], which is

consistent with previous studies [

]. We leverage this property

to propose veriable assumptions of SimRank distribution and

present complexity analysis for our algorithms, which serves as

the upper bound of the algorithmic complexity of SimRank-based

similarity joins. Our theoretical results nicely match the empirical

study in Section 6.

Extensive experiments on medium and large graphs. We con-

duct extensive experiments on small, medium, and large sized

graphs and compare our algorithms with state-of-the-art methods.

As far as we know, we rst employ medium and large datasets to

systematically evaluate threshold-based and top-

𝑘

similarity joins

in terms of running time, absolute error and query accuracy, and

to demonstrate the necessities of holding approximation bound

as well as devising more ecient algorithms. Experimental study

demonstrates that our algorithms outperform existing ones in most

cases, achieving better accuracy while being faster by up to an order

of magnitude. More importantly, our empirical ndings validate

that it is the skewness of SimRank distribution that mainly deter-

mines the problem hardness. This results veries the eectiveness

of our theoretical analysis.

2 PRELIMINARIES

The denition in Equation 1 can be interpreted via the node-pairs

graph

𝐺

[

]. Each node in

𝐺

represents a pair of nodes

(𝑢, 𝑣)

𝐺

. If there exist edges

(𝑢, 𝑣)

and

(𝑥,𝑦)

𝐺

, edge

((𝑢, 𝑥), (𝑣,𝑦))

included in

𝐺

. Specically, every node

(𝑣, 𝑣)

is called the singleton

node. To this end, SimRank can be thought of as propagating the

similarity among nodes in

𝐺

, starting from all the singleton nodes.

618

of 13

免费下载

文档被以下合辑收录

VLDB2024 数据库顶会论文（共31篇）

本合辑收录了VLDB2024 数据库顶会论文。

关注

文档被以下合辑收录

评论