Galaxybase: A High Performance Native Distributed Graph Database for HTAP

Bing Tong

CreateLink & HKUST(GZ)

tongbing@createlink.com

btong799@connect.hkust-gz.edu.cn

Yan Zhou*

CreateLink

zhouyan@createlink.com

Chen Zhang

CreateLink

zhangchen@createlink.com

Jianheng Tang

HKUST(GZ)

jtangbf@connect.ust.hk

Jing Tang

HKUST(GZ)

jingtang@ust.hk

Leihong Yang

CreateLink

yangleihong@createlink.com

Qiye Li

CreateLink

liqiye@createlink.com

Manwu Lin

CreateLink

linmanwu@createlink.com

Zhongxin Bao

CreateLink

baozhongxin@createlink.com

Jia Li*

HKUST(GZ)

jialee@ust.hk

Lei Chen

HKUST(GZ)

leichen@ust.hk

ABSTRACT

We introduce Galaxybase, a native distributed graph database that addresses the increasing demands for processing large volumes of graph data in diverse industries like finance, manufacturing, and government. Designed to handle the requirements of both transactional and analytical workloads, Galaxybase stands out with its novel data storage and transaction mechanisms. At its core, Galaxybase utilizes a Log-Structured Adjacency List coupled with an Edge Page structure, optimizing read-write operations across a spectrum of tasks such as graph traversals and single edge queries. A notable aspect of Galaxybase is its execution of custom distributed transaction modes tailored for HTAP transactions, allowing for the facilitation of bidirectional and interactive transactions. It ensures data integrity and minimal latency while enabling simultaneous processing of OLTP and OLAP workloads without blocking. Experimental results show that Galaxybase achieves high throughput and low latency in both OLTP and OLAP workloads, across various graph query scenarios and resource conditions. Galaxybase has been deployed in leading banks, education, telecommunication and energy sectors in China, consistently maintaining robust performance for HTAP workloads over the years.

PVLDB Reference Format:

Bing Tong, Yan Zhou, Chen Zhang, Jianheng Tang, Jing Tang, Leihong Yang, Qiye Li, Manwu Lin, Zhongxin Bao, Jia Li, and Lei Chen. Galaxybase: A High Performance Native Distributed Graph Database for HTAP. PVLDB, 17(12): 3893-3905, 2024.

doi:10.14778/3685800.3685814

* Corresponding Authors.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.

Proceedings of the VLDB Endowment, Vol. 17, No. 12 ISSN 2150-8097.

doi:10.14778/3685800.3685814

1 INTRODUCTION

A graph database is a type of database management system specifically designed to store, manage, and query complex relationships between data entities. Unlike conventional relational databases, graph databases employ vertices, edges, and properties to model data entities and their relationships. This model allows for enhanced flexibility and performance in handling structured and highly interconnected data, making graph databases particularly well suited for applications in fields such as social networking, energy network optimization, financial fraud detection [23, 34, 40], and knowledge graphs [7, 41].

Many graph databases encounter performance challenges in processing graph queries and transactions due to their design or functional limitations. Non-native databases often use established non-graph backends. For example, A1 utilizes Key-Value stores. Titan and its successor, JanusGraph, are based on the wide-column store, while ArangoDB and OrientDB employ document stores for graph representation. Although non-native graph storages rely on mature non-graph backends like HBase, which are well understood operationally, they typically struggle to handle graph-specific queries efficiently, particularly in graph traversal scenarios. Conversely, native graph databases with index-free adjacency, such as Neo4j and TigerGraph, significantly enhance traversal performance. However, Neo4j exhibits poor scalability and struggles to meet high throughput and low latency requirements on trillion-scale graphs. TigerGraph, focusing on in-memory architectures, encounters difficulties with large graphs in low-memory environments. Additionally, native graph databases fall short on single edge queries, as they rely on traversal to locate a specific edge.

Beyond handling graph-specific queries, another vital feature of graph databases is their ability to preserve integrity and correctness during concurrent operations in various scenarios. A key capability is the dual support for Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). Our usage statistics show that 70% of tasks involve Hybrid Transaction/Analytical Processing (HTAP), 20% are dedicated to OLTP, and the remaining 10% to OLAP. Among existing systems, G-Tran is notably adept at OLTP tasks and prioritizes transactional integrity, while Grasper excels in managing OLAP transactions. However, using separate systems for distinct OLTP and OLAP tasks can double costs in terms of development, deployment, and maintenance.

Faced with the unique challenges of processing graph queries and transactions, we developed Galaxybase, a new native distributed graph database. Galaxybase features two distinct storage structures, optimized for read and write performance. The first is a Log-Structured Adjacency List, which employs adjacency lists for sequential data scanning and batch writing to reduce read/write amplifications. The second structure, Edge Page, co-locates edges for the same vertex and maintains local order within each page by type and direction while ensuring global order across all edges. This design supports efficient graph traversal in various directions and types, as well as quick and accurate single edge queries.
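The ordering idea behind the Edge Page can be illustrated with a small in-memory sketch. This is not Galaxybase's on-disk format: the class name `EdgePage`, the constants `OUT`/`IN`, and the sort key `(type, direction, neighbor)` are assumptions made for illustration, and neighbor ids are assumed to be non-negative integers. The sketch shows only why keeping one vertex's edges sorted lets a traversal read a contiguous run and lets a single edge query use binary search instead of a scan.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

# Hypothetical direction codes; the real encoding is not given in the text.
OUT, IN = 0, 1


@dataclass
class EdgePage:
    """Edges of one vertex, kept sorted by (edge type, direction, neighbor id)."""
    vertex: int
    edges: list = field(default_factory=list)  # sorted list of (etype, direction, neighbor)

    def add(self, etype, direction, neighbor):
        # Insert at the sorted position, skipping exact duplicates.
        key = (etype, direction, neighbor)
        i = bisect_left(self.edges, key)
        if i == len(self.edges) or self.edges[i] != key:
            self.edges.insert(i, key)

    def scan(self, etype, direction):
        # Traversal: all neighbors for one type and direction form one contiguous run.
        lo = bisect_left(self.edges, (etype, direction, -1))
        hi = bisect_left(self.edges, (etype, direction + 1, -1))
        return [n for (_, _, n) in self.edges[lo:hi]]

    def has_edge(self, etype, direction, neighbor):
        # Single edge query: binary search on the full key, no traversal needed.
        key = (etype, direction, neighbor)
        i = bisect_left(self.edges, key)
        return i < len(self.edges) and self.edges[i] == key


page = EdgePage(vertex=1)
page.add("follows", OUT, 3)
page.add("follows", OUT, 2)
page.add("follows", IN, 4)
page.add("locatedIn", OUT, 9)
print(page.scan("follows", OUT))
print(page.has_edge("follows", OUT, 3))
```

A real page would also need a fixed size, splitting, and the global order across pages that the text mentions; those parts are omitted here.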

As a distributed graph database deployed in production-grade environments, Galaxybase is designed to handle a variety of scenarios and data scales effectively. It supports transactions using Two-Phase Commit (2PC) and Raft protocols to ensure atomicity and durability. The system maintains isolation levels from read-committed to serializable for OLTP workloads using Two-Phase Locking (2PL). Galaxybase integrates bidirectional and interactive transactions, aligning with the unique storage structures and user demands of graph databases. For OLAP workloads, it employs Multi-Version Concurrency Control (MVCC) visibility checks with lock-free mechanisms to maintain serializable snapshot isolation.
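To make the notion of an MVCC visibility check concrete, here is a minimal textbook-style sketch. It is not taken from Galaxybase: the `Version` record, the timestamp fields `begin_ts`/`end_ts`, and the helper `read` are hypothetical names. The point it illustrates is that an analytical reader holding a snapshot timestamp can pick the right version of a record by comparing timestamps alone, without taking locks that would block concurrent writers.

```python
from dataclasses import dataclass

INF = float("inf")


@dataclass(frozen=True)
class Version:
    value: object
    begin_ts: int        # commit timestamp of the transaction that wrote this version
    end_ts: float = INF  # commit timestamp of the transaction that superseded it


def visible(version: Version, snapshot_ts: int) -> bool:
    # A version is visible if it was committed at or before the snapshot
    # and had not yet been superseded at that time.
    return version.begin_ts <= snapshot_ts < version.end_ts


def read(chain, snapshot_ts):
    # Return the value of the version visible to the snapshot, or None if
    # the record did not exist yet. No lock is acquired.
    for version in chain:
        if visible(version, snapshot_ts):
            return version.value
    return None


# Version chain of one property: written at ts=10, overwritten at ts=20.
chain = [Version("a", 10, 20), Version("b", 20)]
print(read(chain, 15), read(chain, 20), read(chain, 5))
```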

Our experiments with OLTP and OLAP workloads demonstrate that Galaxybase delivers strong performance in both single-machine and distributed setups. It achieves throughput of up to 50,000 queries per second (q/s) in single-machine mode and 85,000 q/s in distributed mode, significantly surpassing baseline graph databases. In terms of scalability, Galaxybase achieves throughput that is up to an order of magnitude higher than that of baseline graph databases. It also shows efficiency in edge queries, operating three times faster than its closest competitor. Furthermore, Galaxybase handles queries effectively in low-memory environments, enabling large graph loading and complex query execution without out-of-memory issues. Additionally, we processed a trillion-scale dataset that includes 5 billion account vertices and 5 trillion transaction edges using only 50 machines, each equipped with 12 CPUs and 128 GB of memory, achieving multi-hop query results in seconds.
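A quick back-of-the-envelope calculation puts the trillion-scale figures in perspective. The derived numbers below are not reported in the paper; they assume edges are spread evenly across machines and that GB means 10^9 bytes.

```python
# Figures stated in the text.
vertices = 5 * 10**9          # account vertices
edges = 5 * 10**12            # transaction edges
machines = 50
mem_per_machine_gb = 128

# Derived values (assumptions: uniform partitioning, 1 GB = 10**9 bytes).
edges_per_machine = edges // machines                      # edges each machine would hold
avg_edges_per_vertex = edges / vertices                    # average edges per vertex
total_mem_bytes = machines * mem_per_machine_gb * 10**9    # aggregate cluster memory
mem_bytes_per_edge = total_mem_bytes / edges               # memory available per edge

print(edges_per_machine, avg_edges_per_vertex, mem_bytes_per_edge)
```

Under these assumptions each machine would hold about 100 billion edges with roughly 1.28 bytes of aggregate memory per edge, which suggests that most of the graph has to be served from disk rather than from memory.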

In summary, this paper makes the following contributions:

• We introduce Galaxybase, a high-performance, native distributed graph database designed specifically for HTAP scenarios. It provides an efficient, robust, and scalable solution for managing complex graph data.

• On the storage front, we propose the Log-Structured Adjacency List, an approach for sequential disk reads and writes that dramatically reduces read/write amplifications. Complementing this, our Edge Page design enhances graph traversal efficiency, allowing for the effective handling of edges in various directions and types, while also enabling quick and accurate single edge queries.

• On the transaction front, we implement distributed transactions for OLTP workloads using bidirectional and interactive methods. Additionally, we manage OLAP workloads with lock-free methods, allowing OLTP and OLAP workloads to run concurrently without causing blocks.

• In the distributed mode, Galaxybase achieves a throughput of up to 85,000 queries per second in OLTP workloads, and its performance in OLAP workloads exceeds competitors by an order of magnitude. This high efficiency is sustained even under restricted memory resources, enabling the execution of complex queries in environments with limited capacity.

Galaxybase: https://www.createlink.com

[Figure 1: An example of property graph. Vertices person_1 to person_4 (with NAME and AGE properties) and country_1, country_2 (with NAME properties) are connected by follows and locatedIn edges carrying TIME properties.]

2 BACKGROUND AND DESIGN PRINCIPLE

Reecting on the challenges and limitations of current graph databases

outlined in Section 1, this section delves into the motivation and

key factors in crafting Galaxybase. Our primary objective is to

build a unied system that demonstrates exceptional performance,

availability, scalability, and robust transaction capabilities.

Galaxybase utilizes the property graph model, where vertices and edges can possess a variety of properties. Based on this model, we have developed a Ming Dynasty literature knowledge graph for universities to enhance literary research and teaching, built a power grid knowledge graph for the State Grid to ensure accurate and stable power dispatch strategies, and implemented a financial fraud detection graph for banks to enhance security and more effectively identify fraudulent activities. As illustrated in Figure 1, in a social network using the property graph model, each vertex/edge is assigned a type (e.g., person, country, follows, locatedIn), alongside a set of properties (e.g., NAME:Alice and TIME:20201102).
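The property graph model described above can be sketched in a few lines. The classes `Vertex`, `Edge`, and `PropertyGraph` and the helper `traverse` are hypothetical and are not Galaxybase's API. The type names and property keys follow Figure 1, but which person carries which name and which vertices each edge connects are assumptions chosen for illustration. `traverse` walks outgoing edges of a single type up to a fixed depth.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str
    vtype: str                      # e.g. "person", "country"
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    etype: str                      # e.g. "follows", "locatedIn"
    props: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.vertices = {}
        self.out_edges = defaultdict(list)   # source vertex id -> outgoing edges

    def add_vertex(self, vid, vtype, **props):
        self.vertices[vid] = Vertex(vid, vtype, props)

    def add_edge(self, src, dst, etype, **props):
        self.out_edges[src].append(Edge(src, dst, etype, props))

    def traverse(self, start, etype, depth):
        # Breadth-first walk over edges of one type, at most `depth` hops from start.
        seen, frontier, found = {start}, deque([(start, 0)]), []
        while frontier:
            vid, hops = frontier.popleft()
            if hops == depth:
                continue
            for edge in self.out_edges[vid]:
                if edge.etype == etype and edge.dst not in seen:
                    seen.add(edge.dst)
                    found.append(edge.dst)
                    frontier.append((edge.dst, hops + 1))
        return found


# Illustrative data only: the endpoints and name assignments are assumed.
g = PropertyGraph()
g.add_vertex("person_1", "person", NAME="Alice", AGE=18)
g.add_vertex("person_2", "person", NAME="Bob", AGE=25)
g.add_vertex("person_3", "person", NAME="Cindy", AGE=7)
g.add_vertex("country_1", "country", NAME="China")
g.add_edge("person_1", "person_2", "follows", TIME=20160820)
g.add_edge("person_2", "person_3", "follows")
g.add_edge("person_1", "country_1", "locatedIn", TIME=20201102)
print(g.traverse("person_1", "follows", 1))
print(g.traverse("person_1", "follows", 2))
```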

Graph databases organize data through edges, offering the significant advantage of native and efficient support for graph traversal queries. These queries navigate the graph from a specified vertex to a predetermined depth or target vertex. For example, as depicted in Figure 1, a graph traversal query starting from vertex person_1 with a depth of 1 and a relational constraint of follows would identify all followers of person_1. Relational databases depend
