
GeaBase: A High-Performance Distributed Graph
Database for Industry-Scale Applications
Zhisong Fu, Zhengwei Wu, Houyi Li, Yize Li, Min Wu, Xiaojie Chen, Xiaomeng Ye, Benquan Yu, Xi Hu
Ant Financial, Inc.
Abstract—Graph analytics has been gaining traction rapidly
in the past few years. It has a wide array of application areas
in industry, ranging from e-commerce, social networks, and
recommendation systems to fraud detection and virtually any
problem that requires insight into data connections, not just
the data itself. In this paper, we present GeaBase, a new distributed
graph database that provides the capability to store and analyze
graph-structured data in real time at massive scale. We describe
the details of the system and the implementation, including a
novel update architecture, called Update Center (UC), and a
new language that is suitable for both graph traversal and
analytics. We also compare the performance of GeaBase to
Titan, a widely used open-source graph database. Experiments show
that GeaBase is up to 182x faster than Titan in our testing
scenarios. GeaBase also achieves 22x higher throughput on social
network workloads in the comparison.
I. INTRODUCTION
We are in the age of big data. Connections between data
are as important as the data itself. Together, they record
information reflecting the real world. A graph, defined as
<V, E>, is a natural way to represent data and their connections.
Here V represents the data, namely nodes, and E
represents the connections between data, namely edges.
Graph databases were introduced to store and query graphs
efficiently. Graphs stored in a graph database usually follow
the property graph model (nodes, edges, and properties) (see
Fig. 1).
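The property graph model can be illustrated with a minimal in-memory sketch. This is not GeaBase's storage layout; the class names (Node, Edge, PropertyGraph), the property values, and the edge directions (user to item, user to bankcard) are assumptions made only to mirror the node and edge labels of Fig. 1.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                      # e.g. "user", "item", "bankcard"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    label: str                      # e.g. "like", "buy", "own"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory property graph: nodes, directed edges, properties."""

    def __init__(self):
        self.nodes = {}             # node_id -> Node
        self.out_edges = {}         # node_id -> list of outgoing Edge objects

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.out_edges.setdefault(node.node_id, [])

    def add_edge(self, edge):
        if edge.src not in self.nodes or edge.dst not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.out_edges[edge.src].append(edge)

    def neighbors(self, node_id, label=None):
        # Follow outgoing edges, optionally restricted to one edge label.
        return [e.dst for e in self.out_edges[node_id]
                if label is None or e.label == label]


def build_example():
    # Assumed directions and property values; Fig. 1 only names the labels.
    g = PropertyGraph()
    g.add_node(Node("u1", "user", {"name": "alice"}))
    g.add_node(Node("i1", "item", {"title": "book"}))
    g.add_node(Node("c1", "bankcard", {"bank": "example"}))
    g.add_edge(Edge("u1", "i1", "like"))
    g.add_edge(Edge("u1", "i1", "buy", {"amount": 30}))
    g.add_edge(Edge("u1", "c1", "own"))
    return g
```

Calling build_example().neighbors("u1", "own") follows only the "own" edges of the user node, which is the kind of label-filtered lookup a property graph makes direct.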
Fig. 1. An example of a property graph. The user, item, and bankcard are three
nodes, and their connections (like, buy, own) are modeled as directed edges.
A key feature of a graph database is that edges (or
connections) are treated as a core component of the model,
along with vertices. Hence, complex topological structures
can be retrieved efficiently. In contrast, with conventional
relational databases, connections between data are stored in
separate tables, and queries searching for connections require
join operations, which are usually very expensive.
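The cost difference between joins and edge traversal can be sketched with a two-hop query. The example below is only an illustration under simplifying assumptions: the relational side is modeled as a naive nested-loop self-join over an edge table (real relational engines use indexes and better join algorithms), and the graph side as a plain adjacency-list lookup. The table contents and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge table, as a relational store might keep it: (src, dst) rows.
EDGE_TABLE = [
    ("u1", "u2"),
    ("u1", "u3"),
    ("u2", "u4"),
    ("u3", "u4"),
    ("u4", "u5"),
]


def two_hop_by_join(edge_table, start):
    """Self-join on e1.dst == e2.src, done as a naive nested-loop scan."""
    result = set()
    for s1, d1 in edge_table:
        if s1 != start:
            continue
        for s2, d2 in edge_table:       # rescans the whole table per match
            if s2 == d1:
                result.add(d2)
    return sorted(result)


def build_adjacency(edge_table):
    """Index edges by source node so neighbors are reached by direct lookup."""
    adj = defaultdict(list)
    for src, dst in edge_table:
        adj[src].append(dst)
    return adj


def two_hop_by_traversal(adj, start):
    """Follow adjacency lists twice; only edges of visited nodes are touched."""
    result = set()
    for d1 in adj.get(start, []):
        for d2 in adj.get(d1, []):
            result.add(d2)
    return sorted(result)
```

Both functions return the same nodes; the traversal version touches only the edges attached to the nodes it visits, while the naive join rescans the table for every intermediate match.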
However, it is challenging to design a high-performance
graph database for industry-scale applications. First, the irregular
data structure of a graph usually leads to a random access pattern
on the storage system, and hence results in poor data locality.
Second, in order to store a large-scale graph, the data
is usually partitioned, which leads to high communication
cost and imbalanced workloads. Finally, maintaining data consistency in
a fast-changing, distributed graph database is also very
challenging.
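The partitioning problem can be made concrete with a small sketch. It assumes simple hash partitioning by node identifier, with each edge stored in the partition of its source node; this is an illustrative scheme, not the partitioning strategy of GeaBase. The function names are hypothetical.

```python
import zlib
from collections import Counter


def partition_of(node_id, num_partitions):
    # Deterministic hash partitioning by node id (assumed scheme).
    return zlib.crc32(node_id.encode("utf-8")) % num_partitions


def partition_stats(edges, num_partitions):
    """Return (cut_edges, load), where load maps partition -> stored edge count.

    An edge is stored with its source node. It counts as "cut" when its two
    endpoints land in different partitions, so traversing it needs a
    cross-partition message.
    """
    load = Counter()
    cut_edges = 0
    for src, dst in edges:
        p_src = partition_of(src, num_partitions)
        p_dst = partition_of(dst, num_partitions)
        load[p_src] += 1
        if p_src != p_dst:
            cut_edges += 1
    return cut_edges, dict(load)
```

Under this scheme, a single high-degree node (for example, a popular item) places all of its outgoing edges in one partition, which is one source of the workload imbalance described above, while every cut edge contributes to communication cost.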
In this paper, we introduce GeaBase (Graph Exploration
and Analytics Database), which provides real-time graph traversal
and analytics capabilities for industry-scale applications. We
describe the architecture and implementation of GeaBase in
detail. GeaBase employs techniques such as moving
computation to where the data is, a double-queue update pipeline,
and user-stickiness, among others, to achieve high performance and
data consistency.
The rest of this paper is structured as follows. In Section II,
we describe the related work in the literature. In Section III, we
discuss implementation details and data structures of GeaBase.
In Section IV we discuss the performance of GeaBase and
compare the results with Titan, an open-source distributed
graph database. In Section V we summarize the results and
discuss future research directions related to this work.
II. RELATED WORK
Several graph databases and graph analytics systems have
been introduced in the literature for unlocking the value of
data connections. Neo4j [1] is the best-known graph database
according to db-engines.com [2]. Its initial release was more
than ten years ago, and Neo4j has built a large developer
community. However, Neo4j offers limited support for scalability
(scale-up only), and hence cannot handle the very large
datasets of big companies like Alibaba, Inc. The state-of-the-art
distributed databases that have scale-out capability [1],
[3], [4] typically employ standard graph query languages like
Gremlin [5] or SPARQL [6]. These query languages have
limited support for graph analytics. Offline graph analytics
systems, such as those proposed in [7]–[10], are not able to
apply updates while processing queries or to respond in real time.
2017 Fifth International Conference on Advanced Cloud and Big Data
978-1-5386-1072-5/17 $31.00 © 2017 IEEE
DOI 10.1109/CBD.2017.37
170