which is the RocksDB implementation of the commonly used technique of KV separation to improve write throughput for large value sizes, the write throughput still falls significantly short of application requirements. In order to address this challenge, we opted for a distinct KV separation design that decouples Garbage Collection (GC) from compaction, which enables more flexible trade-offs among space usage, read performance and write performance. Compared with the design of BlobDB in RocksDB, which ties GC to compaction, LavaStore can achieve much better write performance with comparable space usage and read performance.
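To make the decoupling concrete, below is a minimal in-memory C++ sketch of the general KV-separation technique (illustrative only; not LavaStore's actual code or API): the LSM-tree keeps a small pointer per key, values live in append-only blob files, and GC simply relocates the live values of a victim blob file, with no compaction involved.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

struct BlobPointer {
  uint64_t file_no;   // which blob file holds the value
  uint64_t offset;    // byte offset within that file
  uint32_t size;      // value length in bytes
};

class KVSeparatedStore {
 public:
  void Put(const std::string& key, const std::string& value) {
    index_[key] = Append(value);  // the LSM-tree stores only the pointer
  }

  std::string Get(const std::string& key) {
    const BlobPointer& p = index_.at(key);
    return files_.at(p.file_no).substr(p.offset, p.size);
  }

  // GC is decoupled from compaction: pick a sealed blob file with a high
  // garbage ratio, relocate its live values, then drop the whole file.
  // No LSM-tree compaction is involved at any point.
  void GarbageCollect(uint64_t victim) {
    if (victim == active_) return;  // never collect the file being written
    for (auto& [key, p] : index_) {
      if (p.file_no == victim) {
        std::string v = files_.at(victim).substr(p.offset, p.size);
        p = Append(v);              // move the live value out of the victim
      }
    }
    files_.erase(victim);           // reclaim the space of all dead values
  }

 private:
  BlobPointer Append(const std::string& value) {
    std::string& f = files_[active_];
    BlobPointer p{active_, f.size(), static_cast<uint32_t>(value.size())};
    f += value;
    return p;
  }
  uint64_t active_ = 0;
  std::map<std::string, BlobPointer> index_;         // stand-in for the LSM-tree
  std::unordered_map<uint64_t, std::string> files_;  // blob files, in memory here
};
```

Because compaction now moves only small pointers, write amplification no longer scales with value size, while the GC schedule (and hence the space/read/write trade-off) can be tuned independently.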
As applications in ByteDance enjoyed continued exponential growth, the resource consumption by the cloud services, in terms of the volume of data stored, CPU usage, etc., gradually grew to a scale for which cost reduction became a major concern. In order to improve resource efficiency, LavaStore introduced various GC optimizations to its KV separation implementation. In particular, for the commonly recurring Write-Ahead-Logging (WAL) workload, we added a specialized local storage engine, LavaLog, to exploit the fact that these data are mostly written and expired in a First-In-First-Out (FIFO) fashion, and rarely read, in order to achieve near-optimal write amplification and GC overhead.
Moreover, some applications would like to enable RocksDB “sync” write to guarantee data durability, but the ensuing throughput and latency penalty is so large that most of them opted for “non-sync” write with increased replication, effectively trading resource efficiency for durability. For such applications, LavaStore employed cross-layer optimization between the KV store and the underlying filesystem to eliminate the “sync write” performance penalty, which helps these cloud services restore resource efficiency while maintaining data durability.
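For context, the “sync” write mentioned above corresponds to RocksDB's WriteOptions::sync flag:

```cpp
#include <string>

#include <rocksdb/db.h>

// With sync = true, Put() returns only after the WAL has been persisted
// to durable media, which is what makes "sync" writes so expensive on a
// conventional in-kernel filesystem such as Ext4.
rocksdb::Status DurablePut(rocksdb::DB* db,
                           const std::string& key,
                           const std::string& value) {
  rocksdb::WriteOptions opts;
  opts.sync = true;  // durable, but each write waits for a disk sync
  return db->Put(opts, key, value);
}
```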
Finally, as cloud services continued to pursue aggressive cost reductions, the previously abundant in-memory caches became increasingly scarce, so read performance optimization began to gain importance. With a careful study of application requirements and read performance metrics, we found that point lookup queries are the main pain point. Specifically, due to their interactive nature, many popular ByteDance applications have stringent Service Level Agreements (SLAs) on both the average and tail (e.g., the 99-percentile) latencies. Although the average read latencies of most cloud services were well below SLA, tail latencies were orders of magnitude above SLA. Since resources must be provisioned to meet SLAs on both average and tail latencies, average resource utilization became very low. Reducing the tail latency of point lookup queries thus became the key to improving resource efficiency. To this end, LavaStore introduced a new index type that is very memory efficient with outstanding point lookup performance. Together with a more refined caching strategy, this new index type enables LavaStore to achieve near-optimal read amplification for point lookup queries, thus drastically closing the gap between average and tail latencies for point lookup queries, and leading to substantial cost savings for such cloud services.
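To illustrate why such an index helps with tail latency, here is a generic, hypothetical sketch (not LavaStore's actual index, which is described in Section 3.4) of how a memory-resident fingerprint index can bound a point lookup to a single data-block read on a cache miss, rather than probing filter, index, and data blocks across multiple LSM levels.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

// Generic illustration only: a compact in-memory map from a key
// fingerprint to the data block holding that key caps the cost of a
// point lookup at one data-block read, i.e. near-optimal read
// amplification.
class PointLookupIndex {
 public:
  void Add(const std::string& key, uint32_t block_id) {
    index_[Fingerprint(key)] = block_id;
  }

  // Returns the single candidate data block for `key`, if any.
  std::optional<uint32_t> Locate(const std::string& key) const {
    auto it = index_.find(Fingerprint(key));
    if (it == index_.end()) return std::nullopt;  // key definitely absent
    return it->second;  // one block read then settles the lookup
  }

 private:
  // Fingerprints keep the index small enough to pin in memory. A real
  // implementation must handle fingerprint collisions (e.g., by falling
  // back to the regular search path); this sketch ignores them.
  static uint64_t Fingerprint(const std::string& key) {
    return std::hash<std::string>{}(key);
  }
  std::unordered_map<uint64_t, uint32_t> index_;
};
```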
As of this writing, LavaStore has been successfully deployed to three widely used cloud services at ByteDance, with more than 100,000 running instances in production, storing more than 100 PB of data and serving an aggregate of over 2 billion Queries Per Second (QPS). Customers of LavaStore have enjoyed great benefits over their original local storage engines. Specifically, ByteNDB has seen its average write latency reduced by 61% and its read latency by 16%. The write QPS of ABase has increased by 87% while keeping the total garbage ratio between 1% and 6%. Flink has also reduced its CPU usage by up to 67% after switching from its previous RocksDB-based state backend to one based on LavaStore.
In summary, our key contributions are as follows:
• We provide insights into local storage engine usage by cloud services at ByteDance, and explain why the combination of highly write-intensive workloads and stringent requirements on cost efficiency and point lookup tail latency may pose challenges to a general-purpose local storage engine such as RocksDB;
• We show how these challenges can be effectively addressed by selectively customizing a few components of a RocksDB-based, general-purpose local storage engine. Specifically,
  – We present LavaKV, with a distinct KV separation design that decouples garbage collection from compaction, which enables more flexible trade-offs among space usage, read performance, and write performance than RocksDB;
  – We present LavaLog, a specialized local storage engine for the commonly recurring Write-Ahead-Logging workload, which significantly outperforms RocksDB in terms of write amplification and garbage collection overhead;
  – We present LavaFS, a user-space append-only filesystem, which provides much lower write amplification and synchronous write latency than the in-kernel filesystem Ext4 when used by LavaKV and LavaLog;
• Using both synthetic and production workloads, we validate LavaStore’s design and implementation by showing that it successfully meets the performance requirements and cost objectives of cloud services at ByteDance;
• We share the lessons we have learned while developing LavaStore and running it in production at scale.
The rest of the paper is organized as follows. Section 2 presents
the background and motivation of this work. Section 3 describes
the design of LavaStore, with Section 3.2 focusing on improving
write performance, Section 3.3 on improving cost-effectiveness, and Section 3.4 on improving read performance. In Section 4, we evaluate LavaStore’s performance and cost-effectiveness using both
synthetic and production workloads. Section 5 discusses our lessons
learned and future work. Section 6 reviews the related work. Finally,
Section 7 concludes our work.
2 BACKGROUND AND MOTIVATION
In this section, we first describe three cloud services that represent typical use cases of persistent KV stores as local storage engines at ByteDance, with a focus on their workload characteristics, performance requirements, and cost objectives. We then explain why the existing designs in RocksDB do not adequately address these use cases, which also motivates the design of LavaStore.
2.1 Local Storage Engine Usage at ByteDance
2.1.1 ByteNDB. ByteNDB, short for ByteDance NewSQL Database, is a cloud-native, distributed OLTP database suite engineered for full compatibility with MySQL [26] and PostgreSQL [57]. Moving away from MySQL’s traditional non-distributed InnoDB storage engine, ByteNDB’s storage layer comprises two key components: LogStore and PageStore [15]. LogStore leverages append-only distributed