which is the RocksDB implementation of the commonly used technique of KV separation to improve write throughput for large value sizes, the write throughput still falls significantly short of application requirements. In order to address this challenge, we opted for a distinct KV separation design that decouples Garbage Collection (GC) from compaction, which enables more flexible trade-offs among space usage, read performance and write performance. Compared with the design of BlobDB in RocksDB, which ties GC to compaction, LavaStore can achieve much better write performance with comparable space usage and read performance.
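To make the decoupling concrete, below is a minimal in-memory C++ sketch of the general KV-separation technique (illustrative only; not LavaStore's actual code or API): the LSM-tree keeps a small pointer per key, values live in append-only blob files, and GC simply relocates the live values of a victim blob file, with no compaction involved.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

struct BlobPointer {
  uint64_t file_no;   // which blob file holds the value
  uint64_t offset;    // byte offset within that file
  uint32_t size;      // value length in bytes
};

class KVSeparatedStore {
 public:
  void Put(const std::string& key, const std::string& value) {
    index_[key] = Append(value);  // the LSM-tree stores only the pointer
  }

  std::string Get(const std::string& key) {
    const BlobPointer& p = index_.at(key);
    return files_.at(p.file_no).substr(p.offset, p.size);
  }

  // GC is decoupled from compaction: pick a sealed blob file with a high
  // garbage ratio, relocate its live values, then drop the whole file.
  // No LSM-tree compaction is involved at any point.
  void GarbageCollect(uint64_t victim) {
    if (victim == active_) return;  // never collect the file being written
    for (auto& [key, p] : index_) {
      if (p.file_no == victim) {
        std::string v = files_.at(victim).substr(p.offset, p.size);
        p = Append(v);              // move the live value out of the victim
      }
    }
    files_.erase(victim);           // reclaim the space of all dead values
  }

 private:
  BlobPointer Append(const std::string& value) {
    std::string& f = files_[active_];
    BlobPointer p{active_, f.size(), static_cast<uint32_t>(value.size())};
    f += value;
    return p;
  }
  uint64_t active_ = 0;
  std::map<std::string, BlobPointer> index_;         // stand-in for the LSM-tree
  std::unordered_map<uint64_t, std::string> files_;  // blob files, in memory here
};
```

Because compaction now moves only small pointers, write amplification no longer scales with value size, while the GC schedule (and hence the space/read/write trade-off) can be tuned independently.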
As applications in ByteDance enjoyed continued exponential growth, the resource consumption by the cloud services, in terms of the volume of data stored, CPU usage, etc., gradually grew to a scale for which cost reduction became a major concern. In order to improve resource efficiency, LavaStore introduced various GC optimizations to its KV separation implementation. In particular, for the commonly recurring Write-Ahead-Logging (WAL) workload, we added a specialized local storage engine, LavaLog, to exploit the fact that these data are mostly written and expired in a First-In-First-Out (FIFO) fashion, and rarely read, in order to achieve near-optimal write amplification and GC overhead.
Moreover, some applications would like to enable RocksDB “sync” write to guarantee data durability, but the ensuing throughput and latency penalty is so large that most of them opted for “non-sync” write with increased replication, effectively trading resource efficiency for durability. For such applications, LavaStore employed cross-layer optimization between the KV store and the underlying filesystem to eliminate the “sync write” performance penalty, which helps these cloud services restore resource efficiency while maintaining data durability.
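For context, the “sync” write mentioned above corresponds to RocksDB's WriteOptions::sync flag:

```cpp
#include <string>

#include <rocksdb/db.h>

// With sync = true, Put() returns only after the WAL has been persisted
// to durable media, which is what makes "sync" writes so expensive on a
// conventional in-kernel filesystem such as Ext4.
rocksdb::Status DurablePut(rocksdb::DB* db,
                           const std::string& key,
                           const std::string& value) {
  rocksdb::WriteOptions opts;
  opts.sync = true;  // durable, but each write waits for a disk sync
  return db->Put(opts, key, value);
}
```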
Finally, as cloud services continued to pursue aggressive cost reductions, the previously abundant in-memory caches became increasingly scarce, so read performance optimization began to gain importance. With a careful study of application requirements and read performance metrics, we found that point lookup queries are the main pain point. Specifically, due to their interactive nature, many popular ByteDance applications have stringent Service Level Agreements (SLAs) on both the average and tail (e.g., the 99-percentile) latencies. Although the average read latencies of most cloud services were well below SLA, tail latencies were orders of magnitude above SLA. Since resources must be provisioned to meet SLAs on both average and tail latencies, average resource utilization became very low. Reducing the tail latency of point lookup queries thus became the key to improving resource efficiency. To this end, LavaStore introduced a new index type that is very memory efficient with outstanding point lookup performance. Together with a more refined caching strategy, this new index type enables LavaStore to achieve near-optimal read amplification for point lookup queries, thus drastically closing the gap between average and tail latencies for point lookup queries, and leading to substantial cost savings for such cloud services.
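To illustrate why such an index helps with tail latency, here is a generic, hypothetical sketch (not LavaStore's actual index, which is described in Section 3.4) of how a memory-resident fingerprint index can bound a point lookup to a single data-block read on a cache miss, rather than probing filter, index, and data blocks across multiple LSM levels.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

// Generic illustration only: a compact in-memory map from a key
// fingerprint to the data block holding that key caps the cost of a
// point lookup at one data-block read, i.e. near-optimal read
// amplification.
class PointLookupIndex {
 public:
  void Add(const std::string& key, uint32_t block_id) {
    index_[Fingerprint(key)] = block_id;
  }

  // Returns the single candidate data block for `key`, if any.
  std::optional<uint32_t> Locate(const std::string& key) const {
    auto it = index_.find(Fingerprint(key));
    if (it == index_.end()) return std::nullopt;  // key definitely absent
    return it->second;  // one block read then settles the lookup
  }

 private:
  // Fingerprints keep the index small enough to pin in memory. A real
  // implementation must handle fingerprint collisions (e.g., by falling
  // back to the regular search path); this sketch ignores them.
  static uint64_t Fingerprint(const std::string& key) {
    return std::hash<std::string>{}(key);
  }
  std::unordered_map<uint64_t, uint32_t> index_;
};
```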
As of this writing, LavaStore has been successfully deployed to three widely used cloud services at ByteDance, with more than 100,000 running instances in production, storing more than 100 PB of data and serving an aggregate of over 2 billion Queries Per Second (QPS). Customers of LavaStore have enjoyed great benefits over their original local storage engines. Specifically, ByteNDB has seen its average write latency reduced by 61% and its read latency by 16%. The write QPS of ABase has increased by 87% while keeping the total garbage ratio between 1% and 6%. Flink has also reduced its CPU usage by up to 67% after switching from its previous RocksDB-based state backend to one based on LavaStore.
In summary, our key contributions are as follows:
• We provide insights into local storage engine usage by cloud services at ByteDance, and explain why the combination of highly write-intensive workloads and stringent requirements on cost efficiency and point lookup tail latency may pose challenges to a general-purpose local storage engine such as RocksDB;
• We show how these challenges can be effectively addressed by selectively customizing a few components of a RocksDB-based, general-purpose local storage engine. Specifically,
  – We present LavaKV, with a distinct KV separation design that decouples garbage collection from compaction, which enables more flexible trade-offs among space usage, read performance, and write performance than RocksDB;
  – We present LavaLog, a specialized local storage engine for the commonly recurring Write-Ahead-Logging workload, which significantly outperforms RocksDB in terms of write amplification and garbage collection overhead;
  – We present LavaFS, a user-space append-only filesystem, which provides much lower write amplification and synchronous write latency than the in-kernel filesystem Ext4 when used by LavaKV and LavaLog;
• Using both synthetic and production workloads, we validate LavaStore’s design and implementation by showing that it successfully meets the performance requirements and cost objectives of cloud services at ByteDance;
• We share the lessons we have learned while developing LavaStore and running it in production at scale.
The rest of the paper is organized as follows. Section 2 presents
the background and motivation of this work. Section 3 describes
the design of LavaStore, with Section 3.2 focusing on improving
write performance, Section 3.3 on improving cost-effectiveness, and Section 3.4 on improving read performance. In Section 4, we evaluate LavaStore’s performance and cost-effectiveness using both
synthetic and production workloads. Section 5 discusses our lessons
learned and future work. Section 6 reviews the related work. Finally,
Section 7 concludes our work.
2 BACKGROUND AND MOTIVATION
In this section, we first describe three cloud services that represent typical use cases of persistent KV stores as local storage engines at ByteDance, with a focus on their workload characteristics, performance requirements, and cost objectives. We then explain why the existing designs in RocksDB do not adequately address these use cases, which also motivates the design of LavaStore.
2.1 Local Storage Engine Usage at ByteDance
2.1.1 ByteNDB. ByteNDB, short for ByteDance NewSQL Database, is a cloud-native, distributed OLTP database suite engineered for full compatibility with MySQL [26] and PostgreSQL [57]. Moving away from MySQL’s traditional non-distributed InnoDB storage engine, ByteNDB’s storage layer comprises two key components: LogStore and PageStore [15]. LogStore leverages append-only distributed