SIGMOD-Companion ’25, June 22–27, 2025, Berlin, Germany. Chenguang Fang et al.
Additionally, long-running transactions occupy fragmented resources
for extended periods. This can also cause problems when releasing
those resources and hinders the efficiency of database recovery
processes. Several modern RDBMSs (e.g., TiDB [18] and CockroachDB [28])
also employ LSM-tree as their storage engines. However, this
design faces several specific challenges:
Transaction Size Limitation. While existing LSM-tree based
databases support moderately large transactions, they still face
limitations due to either memory or log size constraints. Some
LSM-tree based databases support transactions following the Percolator
model [24], which relies on memory to manage all uncommitted
data. Consequently, the transaction size in these systems cannot
exceed the available memory. On the other hand, databases like
RocksDB [2] support transactions that are larger than RAM by
temporarily writing uncommitted changes to disk. Unfortunately,
their implementation must retain the entire write-ahead log
(WAL) of a transaction to track its state until it commits.
This results in large logs, so the transaction size is bounded
by the log capacity. Therefore, it is challenging to support
arbitrarily large transactions in LSM-tree based databases.
Inefficient Recovery. Large transactions also incur high recovery
costs [13]. When the system encounters an exceptional scenario,
the recovery process comprises redo and undo phases [22]. However,
recovering these transactions takes a long time, because all of
their operations must be undone while locks are held throughout the
process. While existing methods (e.g., [13, 19]) propose to eliminate
the issue of large transaction recovery in B+-tree based databases,
they are not applicable to LSM-tree based systems, since the LSM-tree
prevents in-place updates to SSTables. Moreover, both the redo and
undo phases rely on the states of the active and terminated
transactions, i.e., it is essential to manage the persistence of the
transaction states.
Limited Utilization of LSM-tree Features. Existing systems
often abstract the LSM-tree storage engine as a key-value store,
treating it as a black box. Such a design limits the potential for
leveraging LSM-tree features to better support large transactions. To
support large transactions, the most common implementation is
to write the commit version back (i.e., backfill) or the rollback state
into the LSM-tree as a key-value pair upon commit or abort [24] to
invalidate the old states. This is easy in a B+-tree based database,
which can rewrite the corresponding data pages. Unfortunately, owing to
the append-only nature of the LSM-tree, the SSTables on disk are
immutable, and thus the rewrites incur additional I/O proportional to
the size of the transaction. In addition to this overhead during
commit and rollback, such a black-box implementation also fails to
optimize reads and writes.
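
The cost of eager backfill on an append-only store can be illustrated
with a minimal sketch. The class AppendOnlyStore and the function
eager_commit are hypothetical names introduced here for illustration;
they do not come from any of the cited systems. The sketch only shows
that, when records cannot be updated in place, writing the commit
version back costs one extra record per key the transaction wrote.

```python
class AppendOnlyStore:
    """Toy append-only key-value log standing in for an LSM-tree (hypothetical)."""

    def __init__(self):
        self.log = []  # every write is appended; nothing is updated in place

    def put(self, key, value, version):
        self.log.append((key, value, version))


def eager_commit(store, write_set, commit_version):
    """Backfill the commit version by re-appending every key the transaction wrote."""
    for key, value in write_set.items():
        store.put(key, value, commit_version)
    return len(write_set)  # number of extra records written at commit time


store = AppendOnlyStore()
write_set = {f"k{i}": i for i in range(1000)}

# Uncommitted writes: the commit version is not known yet.
for key, value in write_set.items():
    store.put(key, value, version=None)

extra_writes = eager_commit(store, write_set, commit_version=42)
```

In this toy model a transaction that wrote 1000 keys appends another
1000 records at commit, so the commit cost grows with the transaction
size.
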
1.2 Solutions
To address the aforementioned challenges, we devise MaLT for
Managing Large Transactions in OceanBase. It highlights the fol-
lowing features.
Larger-than-RAM Transactions Support. To facilitate the
execution of larger-than-RAM transactions in LSM-tree based
databases, it is crucial to write uncommitted changes to disk, i.e.,
applying the steal policy [16, 22]. To eliminate the dependency on
reserving the entire WAL for a transaction, MaLT introduces an
external structure to store the transaction states, including active,
committed and aborted. To this end, we devise the Transaction Context
Table (TCT) and the Transaction Data Table (TDT), tailored for the
LSM-tree architecture. Specifically, TCT records the in-memory
context of active transactions, while TDT records the states
of committed/aborted transactions. By combining TCT and TDT,
MaLT efficiently updates the transaction states of uncommitted
changes upon commit. Hence, MaLT effectively supports
larger-than-RAM transactions without the memory or log size limits
described above.
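
The division of labor between TCT and TDT can be sketched as follows.
This is a simplified illustration under stated assumptions, not MaLT's
implementation: the class TransactionTables, its methods, and the
function visible_value are hypothetical, both tables are plain
in-memory dictionaries, and each row is assumed to carry the id of the
transaction that wrote it.

```python
from enum import Enum


class TxState(Enum):
    ACTIVE = "active"
    COMMITTED = "committed"
    ABORTED = "aborted"


class TransactionTables:
    """Toy stand-ins for TCT (active contexts) and TDT (terminated states)."""

    def __init__(self):
        self.tct = {}  # tx_id -> context of an active transaction
        self.tdt = {}  # tx_id -> (TxState, commit_version or None)

    def begin(self, tx_id):
        self.tct[tx_id] = {"writes": 0}

    def record_write(self, tx_id):
        self.tct[tx_id]["writes"] += 1

    def commit(self, tx_id, commit_version):
        # One table update, independent of how many rows were written.
        del self.tct[tx_id]
        self.tdt[tx_id] = (TxState.COMMITTED, commit_version)

    def abort(self, tx_id):
        del self.tct[tx_id]
        self.tdt[tx_id] = (TxState.ABORTED, None)

    def state_of(self, tx_id):
        if tx_id in self.tct:
            return (TxState.ACTIVE, None)
        return self.tdt[tx_id]


def visible_value(rows, tables, read_version):
    """Return the newest committed value visible at read_version, else None."""
    best = None
    for value, tx_id in rows:
        state, version = tables.state_of(tx_id)
        if state is TxState.COMMITTED and version <= read_version:
            if best is None or version > best[0]:
                best = (version, value)
    return None if best is None else best[1]


tables = TransactionTables()
for tx_id in (1, 2, 3):
    tables.begin(tx_id)
rows = [("a", 1), ("b", 2), ("c", 3)]  # versions of one key, tagged by writer
tables.commit(1, commit_version=10)
tables.abort(2)
```

Here the rows themselves are never rewritten at commit or abort; a
reader resolves each row's state through the tables, so only the value
written by transaction 1 is visible to a read at version 20.
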
Effective Persistence of Transaction States and Efficient
Recovery. We further leverage TCT and TDT for recovery and
devise dedicated persistence strategies for them. In particular, MaLT
ensures full preservation of TCT on disk, which enables the recovery
of the active transactions during the redo phase. TDT, instead,
persists the terminated transaction states in a structure similar to
the LSM-tree for more efficient storage. MaLT then skips the undo phase
by utilizing TDT. This eliminates the immediate latency typically
associated with recovery processes and also enables constant-time
recovery regardless of transaction size.
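
One way to picture recovery without a per-row undo pass is sketched
below. The text above does not spell out the exact mechanism, so this
is an illustration under loud assumptions: the function recover, the
shape of the redo log, and the string-valued states are all
hypothetical, and the sketch assumes, purely for illustration, that
every transaction still active after redo is rolled back by recording
a single aborted entry in TDT.

```python
def recover(persisted_tct, persisted_tdt, redo_log):
    """Rebuild toy TCT/TDT tables after a crash without undoing individual rows.

    redo_log is a list of (tx_id, event, commit_version) tuples, where event is
    one of "begin", "commit", or "abort" (a hypothetical log format).
    """
    tct = dict(persisted_tct)
    tdt = dict(persisted_tdt)

    # Redo phase: replay state changes logged after the tables were persisted.
    for tx_id, event, version in redo_log:
        if event == "begin":
            tct[tx_id] = {}
        elif event == "commit":
            tct.pop(tx_id, None)
            tdt[tx_id] = ("committed", version)
        elif event == "abort":
            tct.pop(tx_id, None)
            tdt[tx_id] = ("aborted", None)

    # No undo phase: each transaction still active gets one "aborted" entry.
    # The cost is one entry per transaction, not one operation per written row.
    for tx_id in list(tct):
        tdt[tx_id] = ("aborted", None)
        del tct[tx_id]

    return tct, tdt


tct, tdt = recover(
    persisted_tct={7: {}},
    persisted_tdt={1: ("committed", 5)},
    redo_log=[(8, "begin", None), (8, "commit", 12), (9, "begin", None)],
)
```

In this model the rows written by transactions 7 and 9 stay where they
are; readers ignore them because TDT reports their writers as aborted.
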
Efficient Transaction Executions with Optimizations in
LSM-tree. Unlike traditional RDBMSs that often abstract storage
engines as key-value stores, MaLT embeds transaction information
directly within the LSM-tree storage engine. This enables a range
of optimizations. First, as TCT and TDT manage
transaction states, there is no immediate need for backfill or rollback
operations upon commit or abort. This allows MaLT to integrate
the backfill and rollback of transaction states into the LSM-tree
compaction stage seamlessly. Hence, MaLT can perform both
transaction commit and rollback in constant time. Additionally,
the MemTable and SSTable in MaLT store transaction-specific
metadata, allowing for efficient data filtering and retrieval. This
optimization significantly enhances read and write performance.
Overall, MaLT achieves very efficient commit, abort, and execution
operations through these optimization strategies based on the
LSM-tree architecture.
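
Deferring backfill and rollback to compaction can be sketched as
follows. This is a minimal illustration, not MaLT's compaction code:
the function compact, the row layout, and the string-valued states are
hypothetical. It assumes each row carries its writer's transaction id
and a version that stays empty until backfilled.

```python
def compact(rows, tx_states):
    """Rewrite rows during a toy compaction, resolving transaction state lazily.

    rows: list of (key, value, tx_id, version); version is None until backfilled.
    tx_states: tx_id -> ("committed", v) | ("aborted", None) | ("active", None)
    """
    out = []
    for key, value, tx_id, version in rows:
        if version is not None:
            out.append((key, value, tx_id, version))  # already backfilled
            continue
        state, commit_version = tx_states[tx_id]
        if state == "committed":
            # Backfill: attach the commit version while the row is rewritten anyway.
            out.append((key, value, tx_id, commit_version))
        elif state == "aborted":
            continue  # rollback: drop the row instead of undoing it earlier
        else:
            out.append((key, value, tx_id, None))  # still active: keep unresolved
    return out


rows = [
    ("k1", "v1", 1, None),
    ("k2", "v2", 2, None),
    ("k3", "v3", 3, None),
    ("k4", "v4", 4, 7),
]
tx_states = {1: ("committed", 10), 2: ("aborted", None), 3: ("active", None)}
compacted = compact(rows, tx_states)
```

Because compaction rewrites the rows in any case, resolving their
state at that point adds no separate rewrite at commit or abort time.
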
1.3 Use Case
The aforementioned challenges in §1.1 highlight a critical gap
between LSM-tree storage engines and the demands of modern
enterprise workloads regarding large transactions. This subsection
reports a use case from one of our customers who implemented a
financial platform that required robust support for large transactions
in their database.
The financial platform exhibited specific requirements for the
database system: it experiences a business peak in the morning, with
backups scheduled in the afternoon. Subsequently, while there is no
business data activity, batch processing involving large transactions
takes place. In the evening, the platform handles large transactions
involving 2.5 million entries per commit for data import, during
which DDL synchronization is also required. The key requirements
from the platform include ensuring peak performance during
business hours and stability during large transaction batch processing.
The previous versions of OceanBase often suffered from memory
overloads or log disk failures that necessitated manual recovery
interventions. Even after earlier versions achieved support for large
transactions, the commit and rollback speeds were still a concern
for transaction execution. Therefore, we implemented MaLT in