LavaStore: ByteDance’s Purpose-built, High-performance,
Cost-effective Local Storage Engine for Cloud Services
Hao Wang
Jiaxin Ou
Ming Zhao
Sheng Qiu
Yizheng Jiao
Yi Wang
Qizhong Mao
Zhengyu Yang
Yang Liu
Jianshun Zhang
Jianyang Hu
Jingwei Zhang
Jinrui Liu
Jiaqiang Chen
Yong Shen
ByteDance
Lixun Cao
Heng Zhang
Hongde Li
Ming Li
Yue Ma
Lei Zhang
Jian Liu
Guanghui Zhang
Fei Liu
Jianjun Chen
ABSTRACT
Persistent key-value (KV) stores are widely used by cloud services
at ByteDance as local storage engines, and RocksDB used to be
the de facto implementation since it can be tailored to a variety of
workloads and requirements. In this paper, we provide key insights
into local storage engine usage at ByteDance, explain why the
combination of highly write-intensive workloads and stringent
requirements on cost efficiency and point lookup tail latency may
pose challenges to a general-purpose local storage engine such as
RocksDB, and present the design and implementation of LavaStore,
a high-performance cost-effective local storage engine purpose-built
to address these challenges.
LavaStore achieves its design goals by selectively customizing a
few components of a RocksDB-based, general-purpose local storage
engine, including a distinct KV separation design that decouples
garbage collection from compaction, a specialized engine type for
the commonly recurring Write-Ahead-Logging workload, and a
customized user-space append-only filesystem. LavaStore has been
deployed to production with hundreds of thousands of running
instances, storing more than 100 PB of data and serving billions
of requests per second, bringing significant performance improvements
and cost reductions to customers over their original local
storage engines. For example, a ByteDance proprietary distributed
OLTP database service has experienced a reduction in average write
and read latency by 61% and 16%, respectively, and a ByteDance
proprietary caching service has gained an 87% increase in write
throughput with no more than 6% space overhead.
PVLDB Reference Format:
Hao Wang, Jiaxin Ou, Ming Zhao, Sheng Qiu, Yizheng Jiao, Yi Wang,
Qizhong Mao, Zhengyu Yang, Yang Liu, Jianshun Zhang, Jianyang Hu,
Jingwei Zhang, Jinrui Liu, Jiaqiang Chen, Yong Shen, Lixun Cao, Heng
Zhang, Hongde Li, Ming Li, Yue Ma, Lei Zhang, Jian Liu, Guanghui Zhang,
Fei Liu, and Jianjun Chen. LavaStore: ByteDance’s Purpose-built,
High-performance, Cost-effective Local Storage Engine for Cloud Services.
PVLDB, 17(12): 3799 - 3812, 2024.
doi:10.14778/3685800.3685807
{ hao.wang, oujiaxin, zhaoming.274, sheng.qiu, yizhengjiao, wangyi.ywq, qizhong.mao,
zhengyu.yang, yangliu1, zhangjianshun, hujianyang, zhangjingwei.831,
liujinrui.yummy, chenjiaqiang.0, shenyong.sy, caolixun, zhangheng.he, lihongde,
liming.1018, mayue.ght, zhanglei.michael, liujian.kv, zhangguanghui, fei.liu, jianjun.chen
}@bytedance.com
This work is licensed under the Creative Commons BY-NC-ND 4.0 International
License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of
this license. For any use beyond those covered by this license, obtain permission by
emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 12 ISSN 2150-8097.
doi:10.14778/3685800.3685807
1 INTRODUCTION
Persistent Key-Value (KV) stores are widely used by cloud services
at ByteDance as local storage engines. For example, ByteNDB, a
proprietary OnLine Transaction Processing (OLTP) system, stores
database page versions in persistent KV stores; ABase, a proprietary
distributed NoSQL database, also implements Redis-compliant data
structures on top of persistent KV stores.
RocksDB [25] is a Log-Structured Merge (LSM) tree [62] based,
high-performance persistent KV store developed for large-scale
distributed systems and optimized for Solid State Drives (SSDs).
Due to its configurability, RocksDB can be tailored to a variety of
workloads and requirements, and became the de facto local storage
engine implementation for many cloud services at ByteDance.
With the tremendous growth of popular ByteDance applications,
however, some cloud services began to encounter performance
and cost issues with their RocksDB-based local storage engines.
After numerous attempts at tuning the RocksDB configuration, we
came to the conclusion that the unique workload characteristics,
performance requirements, and cost objectives of these cloud services
demand a local storage engine design with some distinctly
different trade-offs from that of RocksDB. In 2019, ByteDance acquired
TerarkDB [9], a RocksDB-based general-purpose KV store
with customized indexing and compression algorithms, and set
out to develop LavaStore, a high-performance and cost-effective
local storage engine based on TerarkDB but purpose-built for cloud
services at ByteDance. In this paper, we describe the challenges for
local storage engines at ByteDance, explain why existing designs
in RocksDB fall short, and present the design and implementation
of LavaStore that address these challenges.
Write throughput was one of the first major bottlenecks that
emerged early on due to the unique workload characteristics at
ByteDance. Specifically, ByteDance applications aggressively deployed
in-memory caches (e.g., Redis and Memcached) at multiple
layers of their architecture to reduce read latency, leaving only a
small fraction of read requests for local storage engines to handle.
Furthermore, with applications commonly batching write requests
for higher throughput, the write workload is dominated by large
value writes. Unfortunately, RocksDB’s write throughput under
such a workload is severely limited by the inherently large write
amplification of LSM-tree for large value sizes. Even with BlobDB,
which is the RocksDB implementation of the commonly used technique
of KV separation to improve write throughput for large value
sizes, the write throughput still falls significantly short of application
requirements. In order to address this challenge, we opted
for a distinct KV separation design that decouples Garbage Collection
(GC) from compaction, which enables more flexible trade-offs
among space usage, read performance and write performance.
Compared with the design of BlobDB in RocksDB, which ties GC to
compaction, LavaStore can achieve much better write performance
with comparable space usage and read performance.
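
To make the idea of decoupling GC from compaction concrete, the following is a minimal sketch in Python. It is a hypothetical toy, not LavaStore's or BlobDB's implementation: the class names, the per-file value limit, and the garbage-ratio threshold are all assumptions made for illustration. A plain dictionary stands in for the LSM-tree index of small pointers, values live in append-only blob files, and `gc()` selects files purely by their garbage ratio on its own schedule.

```python
from dataclasses import dataclass, field


@dataclass
class BlobFile:
    """Append-only value file; `live` counts values still referenced by the index."""
    values: list = field(default_factory=list)
    live: int = 0

    def garbage_ratio(self) -> float:
        return 1.0 - self.live / len(self.values) if self.values else 0.0


class ToyKVSeparatedStore:
    """Hypothetical toy: the index maps key -> (file_id, slot) in a blob file."""

    def __init__(self, values_per_file: int = 4):
        self.values_per_file = values_per_file
        self.index = {}              # stands in for the LSM-tree of small pointers
        self.files = {0: BlobFile()}
        self.active = 0              # id of the blob file currently appended to
        self.next_id = 1

    def _append(self, key, value):
        f = self.files[self.active]
        if len(f.values) >= self.values_per_file:
            # Seal the full file and start a new active one.
            self.active = self.next_id
            self.next_id += 1
            f = self.files[self.active] = BlobFile()
        f.values.append((key, value))
        f.live += 1
        self.index[key] = (self.active, len(f.values) - 1)

    def put(self, key, value):
        old = self.index.get(key)
        if old is not None:
            self.files[old[0]].live -= 1   # the old value becomes garbage in place
        self._append(key, value)

    def get(self, key):
        file_id, slot = self.index[key]
        return self.files[file_id].values[slot][1]

    def gc(self, threshold: float = 0.5) -> int:
        """Reclaim sealed files whose garbage ratio is at least `threshold`.

        Nothing here is triggered by index compaction, so the threshold can
        be tuned to trade space usage against extra value rewrites.
        """
        reclaimed = 0
        for file_id in [i for i in self.files if i != self.active]:
            f = self.files[file_id]
            if f.garbage_ratio() < threshold:
                continue
            for slot, (key, value) in enumerate(f.values):
                if self.index.get(key) == (file_id, slot):
                    self._append(key, value)   # relocate the live value
            del self.files[file_id]
            reclaimed += 1
        return reclaimed
```

In this toy, raising `threshold` defers reclamation (more space used, fewer bytes rewritten), while lowering it does the opposite; the point is only that this knob is independent of how the index itself is compacted.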
As applications at ByteDance enjoyed continued exponential
growth, the resource consumption of the cloud services, in terms
of the volume of data stored, CPU usage, etc., gradually grew to a
scale at which cost reduction became a major concern. In order to
improve resource efficiency, LavaStore introduced various GC optimizations
to its KV separation implementation. In particular, for the
commonly recurring Write-Ahead-Logging (WAL) workload, we
added a specialized local storage engine, LavaLog, to exploit the fact
that these data are mostly written and expired in a First-In-First-Out
(FIFO) fashion, and rarely read, in order to achieve near-optimal
write amplification and GC overhead. Moreover, some applications
would like to enable RocksDB “sync” write to guarantee data durability,
but the ensuing throughput and latency penalty is so large
that most of them opted for “non-sync” write with increased replication,
effectively trading resource efficiency for durability. For
such applications, LavaStore employed cross-layer optimization between
the KV store and the underlying filesystem to eliminate the
“sync write” performance penalty, which helps these cloud services
restore resource efficiency while maintaining data durability.
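
The benefit of FIFO expiry for a WAL-style workload can be illustrated with a small Python sketch. This is a hypothetical toy, not LavaLog's design: the segment size, the class name, and the byte counters are assumptions for illustration. Because records expire oldest-first, reclamation can drop whole segments from the head of the log without copying any live record, so the toy's write amplification stays at 1.

```python
from collections import deque


class ToyFifoLog:
    """Hypothetical toy: records get increasing sequence numbers and expire oldest-first."""

    def __init__(self, records_per_segment: int = 3):
        self.records_per_segment = records_per_segment
        self.segments = deque()      # each segment is a list of (seq, payload)
        self.next_seq = 0
        self.bytes_appended = 0      # bytes written on behalf of the user
        self.bytes_rewritten = 0     # bytes copied during reclamation (never incremented here)

    def append(self, payload: bytes) -> int:
        # Open a new segment when there is none or the tail segment is full.
        if not self.segments or len(self.segments[-1]) >= self.records_per_segment:
            self.segments.append([])
        seq = self.next_seq
        self.segments[-1].append((seq, payload))
        self.next_seq += 1
        self.bytes_appended += len(payload)
        return seq

    def truncate_before(self, seq: int) -> int:
        """Drop head segments whose records all have sequence numbers below `seq`.

        Returns the number of segments dropped. No record is ever copied.
        """
        dropped = 0
        while self.segments and self.segments[0][-1][0] < seq:
            self.segments.popleft()
            dropped += 1
        return dropped

    def write_amplification(self) -> float:
        if not self.bytes_appended:
            return 0.0
        return (self.bytes_appended + self.bytes_rewritten) / self.bytes_appended
```

A segment that still holds one unexpired record is kept whole, so the cost of this scheme is some temporarily unreclaimed space rather than extra writes.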
Finally, as cloud services continued to pursue aggressive cost reductions,
the previously abundant in-memory caches became increasingly
scarce, so read performance optimization began to gain importance.
With a careful study of application requirements and read
performance metrics, we found that point lookup queries are the
main pain point. Specifically, due to their interactive nature, many
popular ByteDance applications have stringent Service Level Agreements
(SLAs) on both the average and tail (e.g., the 99th percentile)
latencies. Although the average read latencies of most cloud services
were well below SLA, tail latencies were orders of magnitude
above SLA. Since resources must be provisioned to meet SLAs on
both average and tail latencies, average resource utilization became
very low. Reducing the tail latency of point lookup queries thus
became the key to improving resource efficiency. To this end, LavaStore
introduced a new index type that is very memory efficient
with outstanding point lookup performance. Together with a more
refined caching strategy, this new index type enables LavaStore to
achieve near-optimal read amplification for point lookup queries,
thus drastically closing the gap between average and tail latencies
for point lookup queries, and leading to substantial cost savings for
such cloud services.
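
A small worked example shows how read amplification can leave the average latency almost untouched while dominating the tail. All numbers below are hypothetical and chosen only for illustration (100 µs per I/O, 98% of lookups needing one I/O and 2% needing five); they are not measurements from the paper.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100   # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


IO_LATENCY_US = 100  # assumed cost of one storage I/O

# Before: a small fraction of point lookups touch several files or blocks.
ios_before = [1] * 980 + [5] * 20
latencies_before = [IO_LATENCY_US * n for n in ios_before]
mean_us = sum(latencies_before) / len(latencies_before)
p99_us = percentile(latencies_before, 99)

# After: every point lookup needs exactly one I/O (read amplification of 1).
latencies_after = [IO_LATENCY_US] * 1000
mean_after_us = sum(latencies_after) / len(latencies_after)
p99_after_us = percentile(latencies_after, 99)

print(f"before: mean={mean_us} us, p99={p99_us} us")
print(f"after:  mean={mean_after_us} us, p99={p99_after_us} us")
```

Under these assumed numbers the mean moves only from 108 µs to 100 µs, while the 99th percentile drops from 500 µs to 100 µs, which is the kind of gap that forces over-provisioning when SLAs cover both metrics.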
As of this writing, LavaStore has been successfully deployed to
three widely used cloud services at ByteDance, with more than
100,000 running instances in production, storing more than 100
PB of data and serving an aggregate of over 2 billion Queries Per
Second (QPS). Customers of LavaStore have enjoyed great benefits
over their original local storage engines. Specifically, ByteNDB has
seen its average write latency reduced by 61% and its read latency
by 16%. The write QPS of ABase has increased by 87% while keeping
the total garbage ratio between 1% and 6%. Flink has also reduced its
CPU usage by up to 67% after switching from its previous RocksDB-based
state backend to one based on LavaStore.
In summary, our key contributions are as follows:
• We provide insights into local storage engine usage by cloud
  services at ByteDance, and explain why the combination of
  highly write-intensive workloads and stringent requirements on
  cost efficiency and point lookup tail latency may pose challenges
  to a general-purpose local storage engine such as RocksDB;
• We show how these challenges can be effectively addressed by
  selectively customizing a few components of a RocksDB-based,
  general-purpose local storage engine. Specifically,
  - We present LavaKV, with a distinct KV separation design
    that decouples garbage collection from compaction, which
    enables more flexible trade-offs among space usage, read
    performance, and write performance than RocksDB;
  - We present LavaLog, a specialized local storage engine for
    the commonly recurring Write-Ahead-Logging workload,
    which significantly outperforms RocksDB in terms of write
    amplification and garbage collection overhead;
  - We present LavaFS, a user-space append-only filesystem,
    which provides much lower write amplification and synchronous
    write latency than the in-kernel filesystem Ext4
    when used by LavaKV and LavaLog;
• Using both synthetic and production workloads, we validate
  LavaStore’s design and implementation by showing that it successfully
  meets the performance requirements and cost objectives
  of cloud services at ByteDance;
• We share the lessons we have learned while developing LavaStore
  and running it in production at scale.
The rest of the paper is organized as follows. Section 2 presents
the background and motivation of this work. Section 3 describes
the design of LavaStore, with Section 3.2 focusing on improving
write performance, Section 3.3 on improving cost-effectiveness,
and Section 3.4 on improving read performance. In Section 4, we
evaluate LavaStore’s performance and cost-effectiveness using both
synthetic and production workloads. Section 5 discusses our lessons
learned and future work. Section 6 reviews the related work. Finally,
Section 7 concludes our work.
2 BACKGROUND AND MOTIVATION
In this section, we first describe three cloud services that represent
typical use cases of persistent KV stores as local storage engines at
ByteDance, with a focus on their workload characteristics, performance
requirements, and cost objectives. We then explain why the
existing designs in RocksDB do not adequately address these use
cases, which also motivates the design of LavaStore.
2.1 Local Storage Engine Usage at ByteDance
2.1.1 ByteNDB. ByteNDB, short for ByteDance NewSQL Database,
is a cloud-native, distributed OLTP database suite engineered for full
compatibility with MySQL [26] and PostgreSQL [57]. Moving away
from MySQL’s traditional non-distributed InnoDB storage engine,
ByteNDB’s storage layer comprises two key components: LogStore
and PageStore [15]. LogStore leverages append-only distributed