GeaBase A High-Performance Distributed Graph Database for Industry-Scale Applications (1).pdf

波风水门

109

6页

0次

2023-05-31

免费下载

GeaBase: A High-Performance Distributed Graph

Database for Industry-Scale Applications

Zhisong Fu, Zhengwei Wu, Houyi Li, Yize Li, Min Wu, Xiaojie Chen, Xiaomeng Ye, Benquan Yu, Xi Hu

Ant Finacial, Inc.

Abstract—Graph analytics hav e been gaining tractions rapidly

in the past few years. It has a wide array of application areas

in the industry, ranging from e-commerce, social network and

recommendation systems to fraud detection and virtually any

problem that requires insights into data connections, not just

data itself. In this paper, we present GeaBase, a new distributed

graph database that provides the capability to store and analyze

graph-structured data in real-time at massiv e scale. We describe

the details of the system and the implementation, including a

novel update architecture, called Update Center (UC), and a

new language that is suitable for both graph tra versal and

analytics. We also compare the performance of GeaBase to a

widely used open-source graph database Titan. Experiments show

that GeaBase is up to 182x faster than Titan in our testing

scenarios. We also achieves 22x higher throughput on social

network workloads in the comparison.

I. INTRODUCTION

We are in the age of big data. Connections between data

are of same importance as data itself. Together, they record

information reﬂecting the real world. A Graph, deﬁned as

<V, E>, is a natural way to represent data and their con-

nections. Here V represents data, or namely nodes, and E

represents connections between data, namely edges.

Graph databases are introduced to efﬁciently store and query

the graph. Graphs stored in the graph database are usually

property graph models (nodes, edges and properties) (see

Fig. 1).

A key feature of the graph database is that edges (or

connections) are treated as the core component of the model,

along with vertexes. Hence, complex topological structures

Fig. 1. An example of property graph. The user, item and bankcard are three

nodes, and their connections (like, buy, own) are modeled as directed edges.

can be retrieved efﬁciently. In contrast, with conventional

relational databases, connections between data are stored in

separate tables, and queries searching for connections require

join operations, which is usually very expensive.

However, it is challenging to design a high-performance

graph database for industry-scale applications. First, irregular

data structure o f a graph usually leads to random access pattern

to the storage system, and hence results in poor data locality;

Second, in order to store a large scale of graph, the data

is usually partitioned, which leads to high communication

cost and imbalanced workloads; Finally, data consistency in

a fast changing and distributed graph database is also very

challenging.

In this paper, we introduce GeaBase (Graph Exploration

and Analytics Database) that provides real-time g raph traversal

and analytics capabilities for industry-scale applications. We

will describe the full detail o f GeaBase architecture and

implementation. GeaBase employs techniques, such as moving

computation to where data is, double-queue update pipeline

and user-stickiness et al. , to achieve high-performance and

data consistency.

The rest of this paper is structured as follows. In Section II,

we describe the related work in the literature. In Section III, we

discuss implementation details and data structures of GeaBase.

In Section IV we discuss the performance of GeaBase and

compare the results with Titan, an open-source distributed

graph database. In Section V we summarize the results and

discuss future research directions related to this work.

II. R

ELATED WORK

Several graph databases and graph analytics systems have

been introduced in the literature, for unlo cking the value of

data connections. Neo4j [1] is the best-known graph database

according to db-engines.com [2]. Its initial release was more

than ten years ago, and Neo4j has built a large develper

community. However, Neo4j offers limited support for scal-

ability (scale-up only), and hence cannot handle very large

dataset for big companies like Alibaba, Inc. The state-of-the-

art distributed databases that h ave scale-ou t capability [1],

[3], [4] typically employ standard graph query language like

Gremlins [5] o r SparQL [6]. These query languages have

limited support of graph analytics. Ofﬂine graph analytics

systems, such as those proposed in [7]–[10], are not able to

update while processing queries or respond in real-time.

2017 Fifth International Conference on Advanced Cloud and Big Data

DOI 10.1109/CBD.2017.37

170

2017 Fifth International Conference on Advanced Cloud and Big Data

DOI 10.1109/CBD.2017.37

170

Fig. 2. System Architecture

III. SYSTEM OVERVIEW

In this section, we present the GeaBase system architecture

in detail. As seen in Fig 2, GeaBase is composed of four

modules: high availability module, query engine, update mod-

ule and graph storage. We will describe these four modules

respectively in the following subsections.

A. Graph Storage

In GeaBase, nodes and edges are all stored as key-value

pairs. Properties are serialized and stored in the value part

according a the user-deﬁned schema. The records are stored

in different shards according to a hash-based sharding strategy.

Usually, a GeaBase instance contains tens or hundreds of

shards that are maintained by the shard m anager. We next

describe the data model and storage model respectively in

more detail.

1) Data Model: The basic data model of graph databases is

to describe object-link-property (OLP). GeaBase data model

designing aims to support a massive scale, multi-dimensional

graph which contain typed-nodes, directed-typed-edges with

properties on them. Cause multi-dimensional grap h is com-

posed by various types of nodes and edges, which represent

various class of objects and complicated relationships between

objects.

The data model of a GeaBase instance is described by a

Graph Schema as mentioned above, which contain a node

type list and a edge type list. A type of the nodes in Graph

Schema contains one ﬁled declaring its type, and a list key-

value pairs reprenting the property ﬁleds of the node. Source

node type, destination node type, edge type, property list and

orientation(directed or undirected) are all needed for a valid

node in the schema. Besides the basic reqirements mentioned

above, users can add optional ﬁelds to nodes and edges(e.g.

time to live ﬁeld). In a word, relenvent topology structure and

serialization format can be known from this Graph Schema

when a speciﬁc edge or node type is given.

2) Storage Model: TableIII-A3 gives the structure of the

key values in the GeaBase storage. Speciﬁcally, a node is

stored as a single key-value pair, which is composed of node

id, node type, and node properties. A directed edge is as

two key-value pairs which are called as OutEdge and InEdge.

TABLE I

STRUCTURE OF NODE KEY-VAULE RECORD

key value

id (int64) type (int32) encoded_properties

TABLE II

STRUCTURE OF EDGE KEY-VAULE RECORD

key value

int64 int32 int64 int64

OutEdge srcid type timestamp dstid encoded_properties

InEdge dstid -type timestamp srcid encoded_properties

Both of the two are composed by source id, edge type,

timestamp, destination id and edge properties. The OutEdge

can be considered as forward index which is used for searching

the trace to destination when source node is given (called as

out-edge-navigation). The InEdge can be considered as reverse

index which is used for searching the trace to source node

when destination is given (called as in-edge-navigation). All

of the edge types, node types and properties format of a given

edge or node type are deﬁned in GraphSchema mentiened

above.

To save storage, the edge properties can be stored in

one direction. For example, a certain type of edges stored

properties on OutEdge, that means storing the forward index

with real properties and reverse index with empty properties.

Assuming that the capacity rate of key and vaule is 1:n,

storing properties in both side need 2n +2 times capacity,

storing properties in single side need n +1 times capacity.

3) Sharding Strategy: Based on the data model and the

storage model above, sharding is performed as follows. All

records are sharded by the ﬁrst 8-bytes (int64) of the key. For

example, Node record would be scatterred b y the node ID, an

OutEdge record would be scatterred by the source ID, whereas

InEdge is by destination ID.

The motivation of this sharding strategy is to avoid trans-

ferring massive data over network. The source node and all

OutEdges originating from it are located in the same shard

under the strategy. The same applys to the destination node and

the InEdges. H ence, the properties of edges and destination

node could be fetched in the same shard in case of forward

traversal.

B. High Availability Module

The availability of GeaBase service is implemented using

a ZooKeeper service, by keeping a heartbeat between the

zookeeper service and all GeaBase clusters. Each GeaBase

cluster has a full copy of graph data, and hence is named as a

GeaBase replica. If one GeaBase server, or one replica crashes,

the heartbeat between zookeeper and this server times out, and

then this server is removed from ZooKeeper. When clients and

other GeaBase servers detect this change, they redirect the

computations to other servers, who have the corresponding

data.

171171

GeaBase: A High-Performance Distributed Graph

Database for Industry-Scale Applications

Zhisong Fu, Zhengwei Wu, Houyi Li, Yize Li, Min Wu, Xiaojie Chen, Xiaomeng Ye, Benquan Yu, Xi Hu

Ant Finacial, Inc.

Abstract—Graph analytics hav e been gaining tractions rapidly

in the past few years. It has a wide array of application areas

in the industry, ranging from e-commerce, social network and

recommendation systems to fraud detection and virtually any

problem that requires insights into data connections, not just

data itself. In this paper, we present GeaBase, a new distributed

graph database that provides the capability to store and analyze

graph-structured data in real-time at massiv e scale. We describe

the details of the system and the implementation, including a

novel update architecture, called Update Center (UC), and a

new language that is suitable for both graph tra versal and

analytics. We also compare the performance of GeaBase to a

widely used open-source graph database Titan. Experiments show

that GeaBase is up to 182x faster than Titan in our testing

scenarios. We also achieves 22x higher throughput on social

network workloads in the comparison.

I. INTRODUCTION

We are in the age of big data. Connections between data

are of same importance as data itself. Together, they record

information reﬂecting the real world. A Graph, deﬁned as

<V, E>, is a natural way to represent data and their con-

nections. Here V represents data, or namely nodes, and E

represents connections between data, namely edges.

Graph databases are introduced to efﬁciently store and query

the graph. Graphs stored in the graph database are usually

property graph models (nodes, edges and properties) (see

Fig. 1).

A key feature of the graph database is that edges (or

connections) are treated as the core component of the model,

along with vertexes. Hence, complex topological structures

Fig. 1. An example of property graph. The user, item and bankcard are three

nodes, and their connections (like, buy, own) are modeled as directed edges.

can be retrieved efﬁciently. In contrast, with conventional

relational databases, connections between data are stored in

separate tables, and queries searching for connections require

join operations, which is usually very expensive.

However, it is challenging to design a high-performance

graph database for industry-scale applications. First, irregular

data structure o f a graph usually leads to random access pattern

to the storage system, and hence results in poor data locality;

Second, in order to store a large scale of graph, the data

is usually partitioned, which leads to high communication

cost and imbalanced workloads; Finally, data consistency in

a fast changing and distributed graph database is also very

challenging.

In this paper, we introduce GeaBase (Graph Exploration

and Analytics Database) that provides real-time g raph traversal

and analytics capabilities for industry-scale applications. We

will describe the full detail o f GeaBase architecture and

implementation. GeaBase employs techniques, such as moving

computation to where data is, double-queue update pipeline

and user-stickiness et al. , to achieve high-performance and

data consistency.

The rest of this paper is structured as follows. In Section II,

we describe the related work in the literature. In Section III, we

discuss implementation details and data structures of GeaBase.

In Section IV we discuss the performance of GeaBase and

compare the results with Titan, an open-source distributed

graph database. In Section V we summarize the results and

discuss future research directions related to this work.

II. R

ELATED WORK

Several graph databases and graph analytics systems have

been introduced in the literature, for unlo cking the value of

data connections. Neo4j [1] is the best-known graph database

according to db-engines.com [2]. Its initial release was more

than ten years ago, and Neo4j has built a large develper

community. However, Neo4j offers limited support for scal-

ability (scale-up only), and hence cannot handle very large

dataset for big companies like Alibaba, Inc. The state-of-the-

art distributed databases that h ave scale-ou t capability [1],

[3], [4] typically employ standard graph query language like

Gremlins [5] o r SparQL [6]. These query languages have

limited support of graph analytics. Ofﬂine graph analytics

systems, such as those proposed in [7]–[10], are not able to

update while processing queries or respond in real-time.

2017 Fifth International Conference on Advanced Cloud and Big Data

DOI 10.1109/CBD.2017.37

170

2017 Fifth International Conference on Advanced Cloud and Big Data

DOI 10.1109/CBD.2017.37

170

Fig. 2. System Architecture

III. SYSTEM OVERVIEW

In this section, we present the GeaBase system architecture

in detail. As seen in Fig 2, GeaBase is composed of four

modules: high availability module, query engine, update mod-

ule and graph storage. We will describe these four modules

respectively in the following subsections.

A. Graph Storage

In GeaBase, nodes and edges are all stored as key-value

pairs. Properties are serialized and stored in the value part

according a the user-deﬁned schema. The records are stored

in different shards according to a hash-based sharding strategy.

Usually, a GeaBase instance contains tens or hundreds of

shards that are maintained by the shard m anager. We next

describe the data model and storage model respectively in

more detail.

1) Data Model: The basic data model of graph databases is

to describe object-link-property (OLP). GeaBase data model

designing aims to support a massive scale, multi-dimensional

graph which contain typed-nodes, directed-typed-edges with

properties on them. Cause multi-dimensional grap h is com-

posed by various types of nodes and edges, which represent

various class of objects and complicated relationships between

objects.

The data model of a GeaBase instance is described by a

Graph Schema as mentioned above, which contain a node

type list and a edge type list. A type of the nodes in Graph

Schema contains one ﬁled declaring its type, and a list key-

value pairs reprenting the property ﬁleds of the node. Source

node type, destination node type, edge type, property list and

orientation(directed or undirected) are all needed for a valid

node in the schema. Besides the basic reqirements mentioned

above, users can add optional ﬁelds to nodes and edges(e.g.

time to live ﬁeld). In a word, relenvent topology structure and

serialization format can be known from this Graph Schema

when a speciﬁc edge or node type is given.

2) Storage Model: TableIII-A3 gives the structure of the

key values in the GeaBase storage. Speciﬁcally, a node is

stored as a single key-value pair, which is composed of node

id, node type, and node properties. A directed edge is as

two key-value pairs which are called as OutEdge and InEdge.

TABLE I

STRUCTURE OF NODE KEY-VAULE RECORD

key value

id (int64) type (int32) encoded_properties

TABLE II

STRUCTURE OF EDGE KEY-VAULE RECORD

key value

int64 int32 int64 int64

OutEdge srcid type timestamp dstid encoded_properties

InEdge dstid -type timestamp srcid encoded_properties

Both of the two are composed by source id, edge type,

timestamp, destination id and edge properties. The OutEdge

can be considered as forward index which is used for searching

the trace to destination when source node is given (called as

out-edge-navigation). The InEdge can be considered as reverse

index which is used for searching the trace to source node

when destination is given (called as in-edge-navigation). All

of the edge types, node types and properties format of a given

edge or node type are deﬁned in GraphSchema mentiened

above.

To save storage, the edge properties can be stored in

one direction. For example, a certain type of edges stored

properties on OutEdge, that means storing the forward index

with real properties and reverse index with empty properties.

Assuming that the capacity rate of key and vaule is 1:n,

storing properties in both side need 2n +2 times capacity,

storing properties in single side need n +1 times capacity.

3) Sharding Strategy: Based on the data model and the

storage model above, sharding is performed as follows. All

records are sharded by the ﬁrst 8-bytes (int64) of the key. For

example, Node record would be scatterred b y the node ID, an

OutEdge record would be scatterred by the source ID, whereas

InEdge is by destination ID.

The motivation of this sharding strategy is to avoid trans-

ferring massive data over network. The source node and all

OutEdges originating from it are located in the same shard

under the strategy. The same applys to the destination node and

the InEdges. H ence, the properties of edges and destination

node could be fetched in the same shard in case of forward

traversal.

B. High Availability Module

The availability of GeaBase service is implemented using

a ZooKeeper service, by keeping a heartbeat between the

zookeeper service and all GeaBase clusters. Each GeaBase

cluster has a full copy of graph data, and hence is named as a

GeaBase replica. If one GeaBase server, or one replica crashes,

the heartbeat between zookeeper and this server times out, and

then this server is removed from ZooKeeper. When clients and

other GeaBase servers detect this change, they redirect the

computations to other servers, who have the corresponding

data.

171171

of 6

免费下载

tugraph paper

关注

评论