暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
GeaBase A High-Performance Distributed Graph Database for Industry-Scale Applications (1).pdf
109
6页
0次
2023-05-31
免费下载
GeaBase: A High-Performance Distributed Graph
Database for Industry-Scale Applications
Zhisong Fu, Zhengwei Wu, Houyi Li, Yize Li, Min Wu, Xiaojie Chen, Xiaomeng Ye, Benquan Yu, Xi Hu
Ant Finacial, Inc.
Abstract—Graph analytics hav e been gaining tractions rapidly
in the past few years. It has a wide array of application areas
in the industry, ranging from e-commerce, social network and
recommendation systems to fraud detection and virtually any
problem that requires insights into data connections, not just
data itself. In this paper, we present GeaBase, a new distributed
graph database that provides the capability to store and analyze
graph-structured data in real-time at massiv e scale. We describe
the details of the system and the implementation, including a
novel update architecture, called Update Center (UC), and a
new language that is suitable for both graph tra versal and
analytics. We also compare the performance of GeaBase to a
widely used open-source graph database Titan. Experiments show
that GeaBase is up to 182x faster than Titan in our testing
scenarios. We also achieves 22x higher throughput on social
network workloads in the comparison.
I. INTRODUCTION
We are in the age of big data. Connections between data
are of same importance as data itself. Together, they record
information reflecting the real world. A Graph, defined as
<V, E>, is a natural way to represent data and their con-
nections. Here V represents data, or namely nodes, and E
represents connections between data, namely edges.
Graph databases are introduced to efficiently store and query
the graph. Graphs stored in the graph database are usually
property graph models (nodes, edges and properties) (see
Fig. 1).
A key feature of the graph database is that edges (or
connections) are treated as the core component of the model,
along with vertexes. Hence, complex topological structures
Fig. 1. An example of property graph. The user, item and bankcard are three
nodes, and their connections (like, buy, own) are modeled as directed edges.
can be retrieved efficiently. In contrast, with conventional
relational databases, connections between data are stored in
separate tables, and queries searching for connections require
join operations, which is usually very expensive.
However, it is challenging to design a high-performance
graph database for industry-scale applications. First, irregular
data structure o f a graph usually leads to random access pattern
to the storage system, and hence results in poor data locality;
Second, in order to store a large scale of graph, the data
is usually partitioned, which leads to high communication
cost and imbalanced workloads; Finally, data consistency in
a fast changing and distributed graph database is also very
challenging.
In this paper, we introduce GeaBase (Graph Exploration
and Analytics Database) that provides real-time g raph traversal
and analytics capabilities for industry-scale applications. We
will describe the full detail o f GeaBase architecture and
implementation. GeaBase employs techniques, such as moving
computation to where data is, double-queue update pipeline
and user-stickiness et al. , to achieve high-performance and
data consistency.
The rest of this paper is structured as follows. In Section II,
we describe the related work in the literature. In Section III, we
discuss implementation details and data structures of GeaBase.
In Section IV we discuss the performance of GeaBase and
compare the results with Titan, an open-source distributed
graph database. In Section V we summarize the results and
discuss future research directions related to this work.
II. R
ELATED WORK
Several graph databases and graph analytics systems have
been introduced in the literature, for unlo cking the value of
data connections. Neo4j [1] is the best-known graph database
according to db-engines.com [2]. Its initial release was more
than ten years ago, and Neo4j has built a large develper
community. However, Neo4j offers limited support for scal-
ability (scale-up only), and hence cannot handle very large
dataset for big companies like Alibaba, Inc. The state-of-the-
art distributed databases that h ave scale-ou t capability [1],
[3], [4] typically employ standard graph query language like
Gremlins [5] o r SparQL [6]. These query languages have
limited support of graph analytics. Offline graph analytics
systems, such as those proposed in [7]–[10], are not able to
update while processing queries or respond in real-time.
2017 Fifth International Conference on Advanced Cloud and Big Data
978-1-5386-1072-5/17 $31.00 © 2017 IEEE
DOI 10.1109/CBD.2017.37
170
2017 Fifth International Conference on Advanced Cloud and Big Data
978-1-5386-1072-5/17 $31.00 © 2017 IEEE
DOI 10.1109/CBD.2017.37
170
Fig. 2. System Architecture
III. SYSTEM OVERVIEW
In this section, we present the GeaBase system architecture
in detail. As seen in Fig 2, GeaBase is composed of four
modules: high availability module, query engine, update mod-
ule and graph storage. We will describe these four modules
respectively in the following subsections.
A. Graph Storage
In GeaBase, nodes and edges are all stored as key-value
pairs. Properties are serialized and stored in the value part
according a the user-defined schema. The records are stored
in different shards according to a hash-based sharding strategy.
Usually, a GeaBase instance contains tens or hundreds of
shards that are maintained by the shard m anager. We next
describe the data model and storage model respectively in
more detail.
1) Data Model: The basic data model of graph databases is
to describe object-link-property (OLP). GeaBase data model
designing aims to support a massive scale, multi-dimensional
graph which contain typed-nodes, directed-typed-edges with
properties on them. Cause multi-dimensional grap h is com-
posed by various types of nodes and edges, which represent
various class of objects and complicated relationships between
objects.
The data model of a GeaBase instance is described by a
Graph Schema as mentioned above, which contain a node
type list and a edge type list. A type of the nodes in Graph
Schema contains one filed declaring its type, and a list key-
value pairs reprenting the property fileds of the node. Source
node type, destination node type, edge type, property list and
orientation(directed or undirected) are all needed for a valid
node in the schema. Besides the basic reqirements mentioned
above, users can add optional fields to nodes and edges(e.g.
time to live field). In a word, relenvent topology structure and
serialization format can be known from this Graph Schema
when a specific edge or node type is given.
2) Storage Model: TableIII-A3 gives the structure of the
key values in the GeaBase storage. Specifically, a node is
stored as a single key-value pair, which is composed of node
id, node type, and node properties. A directed edge is as
two key-value pairs which are called as OutEdge and InEdge.
TABLE I
STRUCTURE OF NODE KEY-VAULE RECORD
key value
id (int64) type (int32) encoded_properties
TABLE II
STRUCTURE OF EDGE KEY-VAULE RECORD
key value
int64 int32 int64 int64
OutEdge srcid type timestamp dstid encoded_properties
InEdge dstid -type timestamp srcid encoded_properties
Both of the two are composed by source id, edge type,
timestamp, destination id and edge properties. The OutEdge
can be considered as forward index which is used for searching
the trace to destination when source node is given (called as
out-edge-navigation). The InEdge can be considered as reverse
index which is used for searching the trace to source node
when destination is given (called as in-edge-navigation). All
of the edge types, node types and properties format of a given
edge or node type are defined in GraphSchema mentiened
above.
To save storage, the edge properties can be stored in
one direction. For example, a certain type of edges stored
properties on OutEdge, that means storing the forward index
with real properties and reverse index with empty properties.
Assuming that the capacity rate of key and vaule is 1:n,
storing properties in both side need 2n +2 times capacity,
storing properties in single side need n +1 times capacity.
3) Sharding Strategy: Based on the data model and the
storage model above, sharding is performed as follows. All
records are sharded by the first 8-bytes (int64) of the key. For
example, Node record would be scatterred b y the node ID, an
OutEdge record would be scatterred by the source ID, whereas
InEdge is by destination ID.
The motivation of this sharding strategy is to avoid trans-
ferring massive data over network. The source node and all
OutEdges originating from it are located in the same shard
under the strategy. The same applys to the destination node and
the InEdges. H ence, the properties of edges and destination
node could be fetched in the same shard in case of forward
traversal.
B. High Availability Module
The availability of GeaBase service is implemented using
a ZooKeeper service, by keeping a heartbeat between the
zookeeper service and all GeaBase clusters. Each GeaBase
cluster has a full copy of graph data, and hence is named as a
GeaBase replica. If one GeaBase server, or one replica crashes,
the heartbeat between zookeeper and this server times out, and
then this server is removed from ZooKeeper. When clients and
other GeaBase servers detect this change, they redirect the
computations to other servers, who have the corresponding
data.
171171
of 6
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜