暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
数据库顶会VLDB 2021最佳研究论文奖《Scaling Attributed Network Embedding to Massive Graphs》- Renchi Yang.pdf
770
13页
0次
2021-08-19
免费下载
Scaling Aributed Network Embedding to Massive Graphs
Renchi Yang
Nanyang Technological University
yang0461@ntu.edu.sg
Jieming Shi
Hong Kong Polytechnic University
jieming.shi@polyu.edu.hk
Xiaokui Xiao
National University of Singapore
xkxiao@nus.edu.sg
Yin Yang
Hamad bin Khalifa University
yyang@hbku.edu.qa
Juncheng Liu
National University of Singapore
juncheng@comp.nus.edu.sg
Sourav S. Bhowmick
Nanyang Technological University
assourav@ntu.edu.sg
ABSTRACT
Given a graph
𝐺
where each node is associated with a set of at-
tributes, attributed network embedding (ANE) maps each node
𝑣 𝐺
to a compact vector
𝑋
𝑣
, which can be used in downstream machine
learning tasks. Ideally,
𝑋
𝑣
should capture node
𝑣
’s anity to each at-
tribute, which considers not only
𝑣
’s own attribute associations, but
also those of its connected nodes along edges in
𝐺
. It is challenging
to obtain high-utility embeddings that enable accurate predictions;
scaling eective ANE computation to massive graphs with millions
of nodes pushes the diculty of the problem to a whole new level.
Existing solutions largely fail on such graphs, leading to prohibitive
costs, low-quality embeddings, or both.
This paper proposes
PANE
, an eective and scalable approach to
ANE computation for massive graphs that achieves state-of-the-art
result quality on multiple benchmark datasets, measured by the
accuracy of three common prediction tasks: attribute inference,
link prediction, and node classication. In particular, for the large
MAG data with over 59 million nodes, 0.98 billion edges, and 2000
attributes,
PANE
is the only known viable solution that obtains
eective embeddings on a single server, within 12 hours.
PANE
obtains high scalability and eectiveness through three
main algorithmic designs. First, it formulates the learning objec-
tive based on a novel random walk model for attributed networks.
The resulting optimization task is still challenging on large graphs.
Second,
PANE
includes a highly ecient solver for the above opti-
mization problem, whose key module is a carefully designed initial-
ization of the embeddings, which drastically reduces the number of
iterations required to converge. Finally,
PANE
utilizes multi-core
CPUs through non-trivial parallelization of the above solver, which
achieves scalability while retaining the high quality of the result-
ing embeddings. Extensive experiments, comparing 10 existing
approaches on 8 real datasets, demonstrate that
PANE
consistently
outperforms all existing methods in terms of result quality, while
being orders of magnitude faster.
PVLDB Reference Format:
Renchi Yang, Jieming Shi, Xiaokui Xiao, Yin Yang, Juncheng Liu,
and Sourav S. Bhowmick. Scaling Attributed Network Embedding to
Massive Graphs. PVLDB, 14(1): 37 - 49, 2021.
doi:10.14778/3421424.3421430
This work is licensed under the Creative Commons BY-NC-ND 4.0 International
License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of
this license. For any use beyond those covered by this license, obtain permission by
emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:10.14778/3421424.3421430
1 INTRODUCTION
Network embedding is a fundamental task for graph analytics,
which has attracted much attention from both academia (e.g., [
14
,
31
,
35
]) and industry (e.g., [
23
,
52
]). Given an input graph
𝐺
, network
embedding converts each node
𝑣 𝐺
to a compact, xed-length
vector
𝑋
𝑣
, which captures the topological features of the graph
around node
𝑣
. In practice, however, graph data often comes with
attributes associated to nodes. While we could treat graph topology
and attributes as separate features, doing so loses the important
information of node-attribute anity [
27
], i.e., attributes that can
be reached by a node through one or more hops along the edges in
𝐺
. For instance, consider a graph containing companies and board
members. An important type of insights that can be gained from
such a network is that one company (e.g., Tesla Motors) can reach
attributes of another related company (e.g., SpaceX) connected
via a common board member (Elon Musk). To incorporate such
information, attributed network embedding maps both topological
and attribute information surrounding a node to an embedding
vector, which facilitates accurate predictions, either through the
embeddings themselves or in downstream machine learning tasks.
Eective ANE computation is a highly challenging task, espe-
cially for massive graphs, e.g., with millions of nodes and billions
of edges. In particular, each node
𝑣 𝐺
could be associated with
a large number of attributes, which corresponds to a high dimen-
sional space; further, each attribute of
𝑣
could inuence not only
𝑣
’s own embedding, but also those of
𝑣
’s neighbors, neighbors’
neighbors, and far-reaching connections via multiple hops along
the edges in
𝐺
. Existing ANE solutions are immensely expensive
and largely fail on massive graphs. Specically, as reviewed in Sec-
tion 6, one class of previous methods e.g., [
18
,
41
,
44
,
48
], explicitly
construct and factorize an
𝑛 × 𝑛
matrix, where
𝑛
is the number of
nodes in
𝐺
. For a graph with 50 million nodes, storing such a matrix
of double-precision values would take over 20 petabytes of mem-
ory, which is clearly infeasible. Another category of methods, e.g.,
[
8
,
25
,
30
,
49
], employ deep neural networks to extract higher-level
features from nodes’ connections and attributes. For a large dataset,
training such a neural network incurs vast computational costs;
further, the training process is usually done on GPUs with limited
graphics memory, e.g., 32GB on Nvidia’s agship Tesla V100 cards.
Thus, for massive graphs, currently the only option is to compute
ANE with a large cluster, e.g., [
52
], which is nancially costly and
out of the reach of most researchers.
In addition, to our knowledge, all existing ANE solutions are
designed for undirected graphs, and it is unclear how to incorporate
edge direction information (e.g., asymmetric transitivity [
50
]) into
37
their resulting embeddings. In practice, many graphs are directed
(e.g., one paper citing another), and existing methods yield subopti-
mal result quality on such graphs, as shown in our experiments. Can
we compute eective ANE embeddings on a massive, attributed,
directed graph on a single server?
This paper provides a positive answer to the above question with
PANE
, a novel solution that signicantly advances the state of the
art in ANE computation. Specically, as demonstrated in our exper-
iments, the embeddings obtained by
PANE
simultaneously achieve
the highest prediction accuracy compared to previous methods for
3 common graph analytics tasks: attribute inference, link prediction,
and node classication, on common benchmark graph datasets. On
the largest Microsoft Academic Knowledge Graph (MAG) dataset,
PANE
is the only viable solution on a single server, whose result-
ing embeddings lead to 0.88 average precision (AP) for attribute
inference, 0.965 AP for link prediction, and 0.57 micro-F1 for node
classication.
PANE
obtains such results using 10 CPU cores, 1TB
memory, and 12 hours running time.
PANE
achieves eective and scalable ANE computation through
three main contributions: a well-thought-out problem formulation
based on a novel random walk model, a highly ecient solver,
and non-trivial parallelization of the algorithm. Specically, As
presented in Section 2.2,
PANE
formulates the ANE task as an
optimization problem with the objective of approximating nor-
malized multi-hop node-attribute anity using node-attribute co-
projections [
27
,
28
], guided by a shifted pairewise mutual informa-
tion (SPMI) metric that is inspired by natural language processing
techniques. The anity between a given node-attribute pair is de-
ned via a random walk model specically adapted to attributed
networks. Further, we incorporate edge direction information by
dening separate forward and backward anity, embeddings, and
SPMI metrics. Solving this optimization problem is still immensely
expensive with o-the-shelf algorithms, as it involves the joint
factorization of two
𝑂 (𝑛 · 𝑑)
-sized matrices, where
𝑛
and
𝑑
are
the numbers of nodes and attributes in the input data, respectively.
Thus,
PANE
includes a novel solver with a key module that seeds the
optimizer through a highly eective greedy algorithm, which dras-
tically reduces the number of iterations till convergence. Finally, we
devise non-trivial parallelization of the
PANE
algorithm, to utilize
modern multi-core CPUs without signicantly compromising result
utility. Extensive experiments, using 8 real datasets and comparing
against 10 existing solutions, demonstrate that
PANE
consistently
obtains high-utility embeddings with superior prediction accuracy
for attribute inference, link prediction and node classication, at a
fraction of the cost compared to existing methods.
Summing up, our contributions in this paper are as follows:
We formulate the ANE task as an optimization problem with the
objective of approximating multi-hop node-attribute anity.
We further consider edge direction in our objective by dening
forward and backward anity matrices using the SPMI metric.
We propose several techniques to eciently solve the optimiza-
tion problem, including ecient approximation of the anity
matrices, fast joint factorization of the anity matrices, and a key
module to greedily seed the optimizer, which drastically reduces
the number of iterations till convergence.
We develop non-trivial parallelization techniques of
PANE
to
further boost eciency.
Table 1: Frequently used notations.
Notation Description
𝐺=(𝑉
,
𝐸
𝑉
,
𝑅, 𝐸
𝑅
)
A graph
𝐺
with node set
𝑉
, edge set
𝐸
𝑉
, attribute
set 𝑅, and node-attribute association set 𝐸
𝑅
.
𝑛
,𝑚
, 𝑑
The numbers of nodes, edges, and attributes in
𝐺, respectively.
𝑘 The
space budget
of embedding vectors.
A, D, P, R
The adjacency, out-degree, random walk and at-
tribute matrices of 𝐺.
R
𝑟
, R
𝑐
The row-normalized and column-normalized at-
tribute matrices. See Equation (1).
F, B
The forward and backward anity matrices. See
Equations (2) and (3).
F
, B
The approximate forward and backward anity
matrices. See Equation (7).
X
𝑓
, X
𝑏
, Y
The forward and backward embedding vectors,
and attribute embedding vectors.
𝛼 The
random walk
stopping probability.
𝑛
𝑏
The
number
of threads.
The superior performance of
PANE
, in terms of eciency and ef-
fectiveness, is evaluated against 10 competitors on 8 real datasets.
The rest of the paper is organized as follows. In Section 2, we
formally formulate our ANE objective by dening node-attribute
anity. We present single-thread
PANE
with several speedup tech-
niques in Section 3, and further develop non-trivial parallel PANE
in Section 4. The eectiveness and eciency of our solutions are
evaluated in Section 5. Related work is reviewed in Section 6. Fi-
nally, Section 7 concludes the paper. Note that proofs of lemmas
and additional experiments are given in our technical report [
47
].
2 PROBLEM FORMULATION
2.1 Preliminaries
Let
𝐺 = (𝑉, 𝐸
𝑉
, 𝑅, 𝐸
𝑅
)
be an attributed network, consisting of (i)
a node set
𝑉
with cardinality
𝑛
, (ii) a set of edges
𝐸
𝑉
of size
𝑚
,
each connecting two nodes in
𝑉
, (iii) a set of attributes
𝑅
with
cardinality
𝑑
, and (iv) a set of node-attribute associations
𝐸
𝑅
, where
each element is a tuple
(𝑣
𝑖
, 𝑟
𝑗
, 𝑤
𝑖,𝑗
)
signifying that node
𝑣
𝑖
𝑉
is
directly associated with attribute
𝑟
𝑗
𝑅
with a weight
𝑤
𝑖,𝑗
(i.e.,
the attribute value). Note that for a categorical attribute such as
marital status, we rst apply a pre-processing step that transforms
the attribute into a set of binary ones through one-hot encoding.
Without loss of generality, we assume that
𝐺
is a directed graph; if
𝐺
is undirected, then we treat each edge
(𝑣
𝑖
, 𝑣
𝑗
)
in
𝐺
as a pair of
directed edges with opposing directions: (𝑣
𝑖
, 𝑣
𝑗
) and (𝑣
𝑗
, 𝑣
𝑖
).
Given a space budget
𝑘 𝑛
, a node embedding maps a node
𝑣
𝑉
to a length-
𝑘
vector. The general, hand-waving goal of attributed
network embedding (ANE) is to compute such an embedding
𝑋
𝑣
for each node
𝑣
in the input graph, such that
𝑋
𝑣
captures the graph
structure and attribute information surrounding node
𝑣
. In addition,
following previous work [
27
], we also allocate a space budget
𝑘
2
(explained later in Section 2.3) for each attribute
𝑟 𝑅
, and aim to
compute an attribute embedding vector for 𝑟 of length
𝑘
2
.
Notations.
We denote matrices in bold uppercase, e.g.,
M
. We use
M[𝑣
𝑖
]
to denote the
𝑣
𝑖
-th row vector of
M
, and
M[
:
, 𝑟
𝑗
]
to denote
38
of 13
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜