数据库顶会VLDB 2021最佳研究论文奖《Scaling Attributed Network Embedding to Massive Graphs》- Renchi Yang.pdf

小小亮

795

13页

0次

2021-08-19

免费下载

Scaling Aributed Network Embedding to Massive Graphs

Renchi Yang

Nanyang Technological University

yang0461@ntu.edu.sg

Jieming Shi

Hong Kong Polytechnic University

jieming.shi@polyu.edu.hk

Xiaokui Xiao

National University of Singapore

xkxiao@nus.edu.sg

Yin Yang

Hamad bin Khalifa University

yyang@hbku.edu.qa

Juncheng Liu

National University of Singapore

juncheng@comp.nus.edu.sg

Sourav S. Bhowmick

Nanyang Technological University

assourav@ntu.edu.sg

ABSTRACT

Given a graph

𝐺

where each node is associated with a set of at-

tributes, attributed network embedding (ANE) maps each node

𝑣 ∈ 𝐺

to a compact vector

𝑋

𝑣

, which can be used in downstream machine

learning tasks. Ideally,

𝑋

𝑣

should capture node

𝑣

’s anity to each at-

tribute, which considers not only

𝑣

’s own attribute associations, but

also those of its connected nodes along edges in

𝐺

. It is challenging

to obtain high-utility embeddings that enable accurate predictions;

scaling eective ANE computation to massive graphs with millions

of nodes pushes the diculty of the problem to a whole new level.

Existing solutions largely fail on such graphs, leading to prohibitive

costs, low-quality embeddings, or both.

This paper proposes

PANE

, an eective and scalable approach to

ANE computation for massive graphs that achieves state-of-the-art

result quality on multiple benchmark datasets, measured by the

accuracy of three common prediction tasks: attribute inference,

link prediction, and node classication. In particular, for the large

MAG data with over 59 million nodes, 0.98 billion edges, and 2000

attributes,

PANE

is the only known viable solution that obtains

eective embeddings on a single server, within 12 hours.

PANE

obtains high scalability and eectiveness through three

main algorithmic designs. First, it formulates the learning objec-

tive based on a novel random walk model for attributed networks.

The resulting optimization task is still challenging on large graphs.

Second,

PANE

includes a highly ecient solver for the above opti-

mization problem, whose key module is a carefully designed initial-

ization of the embeddings, which drastically reduces the number of

iterations required to converge. Finally,

PANE

utilizes multi-core

CPUs through non-trivial parallelization of the above solver, which

achieves scalability while retaining the high quality of the result-

ing embeddings. Extensive experiments, comparing 10 existing

approaches on 8 real datasets, demonstrate that

PANE

consistently

outperforms all existing methods in terms of result quality, while

being orders of magnitude faster.

PVLDB Reference Format:

Renchi Yang, Jieming Shi, Xiaokui Xiao, Yin Yang, Juncheng Liu,

and Sourav S. Bhowmick. Scaling Attributed Network Embedding to

Massive Graphs. PVLDB, 14(1): 37 - 49, 2021.

doi:10.14778/3421424.3421430

This work is licensed under the Creative Commons BY-NC-ND 4.0 International

License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of

this license. For any use beyond those covered by this license, obtain permission by

emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights

licensed to the VLDB Endowment.

Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.

doi:10.14778/3421424.3421430

1 INTRODUCTION

Network embedding is a fundamental task for graph analytics,

which has attracted much attention from both academia (e.g., [

]) and industry (e.g., [

]). Given an input graph

𝐺

, network

embedding converts each node

𝑣 ∈ 𝐺

to a compact, xed-length

vector

𝑋

𝑣

, which captures the topological features of the graph

around node

𝑣

. In practice, however, graph data often comes with

attributes associated to nodes. While we could treat graph topology

and attributes as separate features, doing so loses the important

information of node-attribute anity [

], i.e., attributes that can

be reached by a node through one or more hops along the edges in

𝐺

. For instance, consider a graph containing companies and board

members. An important type of insights that can be gained from

such a network is that one company (e.g., Tesla Motors) can reach

attributes of another related company (e.g., SpaceX) connected

via a common board member (Elon Musk). To incorporate such

information, attributed network embedding maps both topological

and attribute information surrounding a node to an embedding

vector, which facilitates accurate predictions, either through the

embeddings themselves or in downstream machine learning tasks.

Eective ANE computation is a highly challenging task, espe-

cially for massive graphs, e.g., with millions of nodes and billions

of edges. In particular, each node

𝑣 ∈ 𝐺

could be associated with

a large number of attributes, which corresponds to a high dimen-

sional space; further, each attribute of

𝑣

could inuence not only

𝑣

’s own embedding, but also those of

𝑣

’s neighbors, neighbors’

neighbors, and far-reaching connections via multiple hops along

the edges in

𝐺

. Existing ANE solutions are immensely expensive

and largely fail on massive graphs. Specically, as reviewed in Sec-

tion 6, one class of previous methods e.g., [

], explicitly

construct and factorize an

𝑛 × 𝑛

matrix, where

𝑛

is the number of

nodes in

𝐺

. For a graph with 50 million nodes, storing such a matrix

of double-precision values would take over 20 petabytes of mem-

ory, which is clearly infeasible. Another category of methods, e.g.,

[

], employ deep neural networks to extract higher-level

features from nodes’ connections and attributes. For a large dataset,

training such a neural network incurs vast computational costs;

further, the training process is usually done on GPUs with limited

graphics memory, e.g., 32GB on Nvidia’s agship Tesla V100 cards.

Thus, for massive graphs, currently the only option is to compute

ANE with a large cluster, e.g., [

], which is nancially costly and

out of the reach of most researchers.

In addition, to our knowledge, all existing ANE solutions are

designed for undirected graphs, and it is unclear how to incorporate

edge direction information (e.g., asymmetric transitivity [

]) into

their resulting embeddings. In practice, many graphs are directed

(e.g., one paper citing another), and existing methods yield subopti-

mal result quality on such graphs, as shown in our experiments. Can

we compute eective ANE embeddings on a massive, attributed,

directed graph on a single server?

This paper provides a positive answer to the above question with

PANE

, a novel solution that signicantly advances the state of the

art in ANE computation. Specically, as demonstrated in our exper-

iments, the embeddings obtained by

PANE

simultaneously achieve

the highest prediction accuracy compared to previous methods for

3 common graph analytics tasks: attribute inference, link prediction,

and node classication, on common benchmark graph datasets. On

the largest Microsoft Academic Knowledge Graph (MAG) dataset,

PANE

is the only viable solution on a single server, whose result-

ing embeddings lead to 0.88 average precision (AP) for attribute

inference, 0.965 AP for link prediction, and 0.57 micro-F1 for node

classication.

PANE

obtains such results using 10 CPU cores, 1TB

memory, and 12 hours running time.

PANE

achieves eective and scalable ANE computation through

three main contributions: a well-thought-out problem formulation

based on a novel random walk model, a highly ecient solver,

and non-trivial parallelization of the algorithm. Specically, As

presented in Section 2.2,

PANE

formulates the ANE task as an

optimization problem with the objective of approximating nor-

malized multi-hop node-attribute anity using node-attribute co-

projections [

], guided by a shifted pairewise mutual informa-

tion (SPMI) metric that is inspired by natural language processing

techniques. The anity between a given node-attribute pair is de-

ned via a random walk model specically adapted to attributed

networks. Further, we incorporate edge direction information by

dening separate forward and backward anity, embeddings, and

SPMI metrics. Solving this optimization problem is still immensely

expensive with o-the-shelf algorithms, as it involves the joint

factorization of two

𝑂 (𝑛 · 𝑑)

-sized matrices, where

𝑛

and

𝑑

are

the numbers of nodes and attributes in the input data, respectively.

Thus,

PANE

includes a novel solver with a key module that seeds the

optimizer through a highly eective greedy algorithm, which dras-

tically reduces the number of iterations till convergence. Finally, we

devise non-trivial parallelization of the

PANE

algorithm, to utilize

modern multi-core CPUs without signicantly compromising result

utility. Extensive experiments, using 8 real datasets and comparing

against 10 existing solutions, demonstrate that

PANE

consistently

obtains high-utility embeddings with superior prediction accuracy

for attribute inference, link prediction and node classication, at a

fraction of the cost compared to existing methods.

Summing up, our contributions in this paper are as follows:

•

We formulate the ANE task as an optimization problem with the

objective of approximating multi-hop node-attribute anity.

•

We further consider edge direction in our objective by dening

forward and backward anity matrices using the SPMI metric.

• We propose several techniques to eciently solve the optimiza-

tion problem, including ecient approximation of the anity

matrices, fast joint factorization of the anity matrices, and a key

module to greedily seed the optimizer, which drastically reduces

the number of iterations till convergence.

•

We develop non-trivial parallelization techniques of

PANE

further boost eciency.

Table 1: Frequently used notations.

Notation Description

𝐺=(𝑉

𝐸

𝑉

𝑅, 𝐸

𝑅

)

A graph

𝐺

with node set

𝑉

, edge set

𝐸

𝑉

, attribute

set 𝑅, and node-attribute association set 𝐸

𝑅

𝑛

,𝑚

, 𝑑

The numbers of nodes, edges, and attributes in

𝐺, respectively.

𝑘 The

space budget

of embedding vectors.

A, D, P, R

The adjacency, out-degree, random walk and at-

tribute matrices of 𝐺.

𝑟

, R

𝑐

The row-normalized and column-normalized at-

tribute matrices. See Equation (1).

F, B

The forward and backward anity matrices. See

Equations (2) and (3).

′

, B

′

The approximate forward and backward anity

matrices. See Equation (7).

𝑓

, X

𝑏

, Y

The forward and backward embedding vectors,

and attribute embedding vectors.

𝛼 The

random walk

stopping probability.

𝑛

𝑏

The

number

of threads.

•

The superior performance of

PANE

, in terms of eciency and ef-

fectiveness, is evaluated against 10 competitors on 8 real datasets.

The rest of the paper is organized as follows. In Section 2, we

formally formulate our ANE objective by dening node-attribute

anity. We present single-thread

PANE

with several speedup tech-

niques in Section 3, and further develop non-trivial parallel PANE

in Section 4. The eectiveness and eciency of our solutions are

evaluated in Section 5. Related work is reviewed in Section 6. Fi-

nally, Section 7 concludes the paper. Note that proofs of lemmas

and additional experiments are given in our technical report [

2 PROBLEM FORMULATION

2.1 Preliminaries

Let

𝐺 = (𝑉, 𝐸

𝑉

, 𝑅, 𝐸

𝑅

)

be an attributed network, consisting of (i)

a node set

𝑉

with cardinality

𝑛

, (ii) a set of edges

𝐸

𝑉

of size

𝑚

each connecting two nodes in

𝑉

, (iii) a set of attributes

𝑅

with

cardinality

𝑑

, and (iv) a set of node-attribute associations

𝐸

𝑅

, where

each element is a tuple

(𝑣

𝑖

, 𝑟

𝑗

, 𝑤

𝑖,𝑗

)

signifying that node

𝑣

𝑖

∈ 𝑉

directly associated with attribute

𝑟

𝑗

∈ 𝑅

with a weight

𝑤

𝑖,𝑗

(i.e.,

the attribute value). Note that for a categorical attribute such as

marital status, we rst apply a pre-processing step that transforms

the attribute into a set of binary ones through one-hot encoding.

Without loss of generality, we assume that

𝐺

is a directed graph; if

𝐺

is undirected, then we treat each edge

(𝑣

𝑖

, 𝑣

𝑗

)

𝐺

as a pair of

directed edges with opposing directions: (𝑣

𝑖

, 𝑣

𝑗

) and (𝑣

𝑗

, 𝑣

𝑖

Given a space budget

𝑘 ≪ 𝑛

, a node embedding maps a node

𝑣 ∈

𝑉

to a length-

𝑘

vector. The general, hand-waving goal of attributed

network embedding (ANE) is to compute such an embedding

𝑋

𝑣

for each node

𝑣

in the input graph, such that

𝑋

𝑣

captures the graph

structure and attribute information surrounding node

𝑣

. In addition,

following previous work [

], we also allocate a space budget

𝑘

(explained later in Section 2.3) for each attribute

𝑟 ∈ 𝑅

, and aim to

compute an attribute embedding vector for 𝑟 of length

𝑘

Notations.

We denote matrices in bold uppercase, e.g.,

. We use

M[𝑣

𝑖

]

to denote the

𝑣

𝑖

-th row vector of

, and

, 𝑟

𝑗

]

to denote

of 13

免费下载

数据库 vldb2021

关注

评论