
Scaling Aributed Network Embedding to Massive Graphs
Renchi Yang
Nanyang Technological University
yang0461@ntu.edu.sg
Jieming Shi
Hong Kong Polytechnic University
jieming.shi@polyu.edu.hk
Xiaokui Xiao
National University of Singapore
xkxiao@nus.edu.sg
Yin Yang
Hamad bin Khalifa University
yyang@hbku.edu.qa
Juncheng Liu
National University of Singapore
juncheng@comp.nus.edu.sg
Sourav S. Bhowmick
Nanyang Technological University
assourav@ntu.edu.sg
ABSTRACT
Given a graph
𝐺
where each node is associated with a set of at-
tributes, attributed network embedding (ANE) maps each node
𝑣 ∈ 𝐺
to a compact vector
𝑋
𝑣
, which can be used in downstream machine
learning tasks. Ideally,
𝑋
𝑣
should capture node
𝑣
’s anity to each at-
tribute, which considers not only
𝑣
’s own attribute associations, but
also those of its connected nodes along edges in
𝐺
. It is challenging
to obtain high-utility embeddings that enable accurate predictions;
scaling eective ANE computation to massive graphs with millions
of nodes pushes the diculty of the problem to a whole new level.
Existing solutions largely fail on such graphs, leading to prohibitive
costs, low-quality embeddings, or both.
This paper proposes
PANE
, an eective and scalable approach to
ANE computation for massive graphs that achieves state-of-the-art
result quality on multiple benchmark datasets, measured by the
accuracy of three common prediction tasks: attribute inference,
link prediction, and node classication. In particular, for the large
MAG data with over 59 million nodes, 0.98 billion edges, and 2000
attributes,
PANE
is the only known viable solution that obtains
eective embeddings on a single server, within 12 hours.
PANE
obtains high scalability and eectiveness through three
main algorithmic designs. First, it formulates the learning objec-
tive based on a novel random walk model for attributed networks.
The resulting optimization task is still challenging on large graphs.
Second,
PANE
includes a highly ecient solver for the above opti-
mization problem, whose key module is a carefully designed initial-
ization of the embeddings, which drastically reduces the number of
iterations required to converge. Finally,
PANE
utilizes multi-core
CPUs through non-trivial parallelization of the above solver, which
achieves scalability while retaining the high quality of the result-
ing embeddings. Extensive experiments, comparing 10 existing
approaches on 8 real datasets, demonstrate that
PANE
consistently
outperforms all existing methods in terms of result quality, while
being orders of magnitude faster.
PVLDB Reference Format:
Renchi Yang, Jieming Shi, Xiaokui Xiao, Yin Yang, Juncheng Liu,
and Sourav S. Bhowmick. Scaling Attributed Network Embedding to
Massive Graphs. PVLDB, 14(1): 37 - 49, 2021.
doi:10.14778/3421424.3421430
This work is licensed under the Creative Commons BY-NC-ND 4.0 International
License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of
this license. For any use beyond those covered by this license, obtain permission by
emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:10.14778/3421424.3421430
1 INTRODUCTION
Network embedding is a fundamental task for graph analytics,
which has attracted much attention from both academia (e.g., [
14
,
31
,
35
]) and industry (e.g., [
23
,
52
]). Given an input graph
𝐺
, network
embedding converts each node
𝑣 ∈ 𝐺
to a compact, xed-length
vector
𝑋
𝑣
, which captures the topological features of the graph
around node
𝑣
. In practice, however, graph data often comes with
attributes associated to nodes. While we could treat graph topology
and attributes as separate features, doing so loses the important
information of node-attribute anity [
27
], i.e., attributes that can
be reached by a node through one or more hops along the edges in
𝐺
. For instance, consider a graph containing companies and board
members. An important type of insights that can be gained from
such a network is that one company (e.g., Tesla Motors) can reach
attributes of another related company (e.g., SpaceX) connected
via a common board member (Elon Musk). To incorporate such
information, attributed network embedding maps both topological
and attribute information surrounding a node to an embedding
vector, which facilitates accurate predictions, either through the
embeddings themselves or in downstream machine learning tasks.
Eective ANE computation is a highly challenging task, espe-
cially for massive graphs, e.g., with millions of nodes and billions
of edges. In particular, each node
𝑣 ∈ 𝐺
could be associated with
a large number of attributes, which corresponds to a high dimen-
sional space; further, each attribute of
𝑣
could inuence not only
𝑣
’s own embedding, but also those of
𝑣
’s neighbors, neighbors’
neighbors, and far-reaching connections via multiple hops along
the edges in
𝐺
. Existing ANE solutions are immensely expensive
and largely fail on massive graphs. Specically, as reviewed in Sec-
tion 6, one class of previous methods e.g., [
18
,
41
,
44
,
48
], explicitly
construct and factorize an
𝑛 × 𝑛
matrix, where
𝑛
is the number of
nodes in
𝐺
. For a graph with 50 million nodes, storing such a matrix
of double-precision values would take over 20 petabytes of mem-
ory, which is clearly infeasible. Another category of methods, e.g.,
[
8
,
25
,
30
,
49
], employ deep neural networks to extract higher-level
features from nodes’ connections and attributes. For a large dataset,
training such a neural network incurs vast computational costs;
further, the training process is usually done on GPUs with limited
graphics memory, e.g., 32GB on Nvidia’s agship Tesla V100 cards.
Thus, for massive graphs, currently the only option is to compute
ANE with a large cluster, e.g., [
52
], which is nancially costly and
out of the reach of most researchers.
In addition, to our knowledge, all existing ANE solutions are
designed for undirected graphs, and it is unclear how to incorporate
edge direction information (e.g., asymmetric transitivity [
50
]) into
37
评论