(5) OLTP engines on modern distributed storage (55 min).
(6) Discussion on future challenges and opportunities (10 min).
Related Tutorials. The tutorial "Data management in non-volatile memory" [47], presented at the SIGMOD 2015 conference, provided insights into how persistent memory can be seamlessly integrated into data management systems. In 2017, the tutorial "How to Build a Non-Volatile Memory Database Management System" [9] extended persistent memory to the entire internal database management system stack. More recently, from 2022 to 2023, tutorials [27, 37, 38, 40, 48] focused on recovery strategies, disaggregated databases, cloud databases, and databases on modern networks. Unlike previous work, this tutorial specifically focuses on OLTP engines on modern storage architectures, emphasizing designs based on various storage hardware and on protocols such as RDMA and CXL.
2 Background
With advances in storage technologies and interconnect protocols, numerous cutting-edge products, such as NVMe SSDs with PCIe 5.0, Intel Optane DCPMM, and CXL, are poised to supersede traditional storage systems. These innovations offer remarkable improvements in speed, efficiency, and data management capabilities, catering to the growing demands of modern computing environments. In this section, we explore modern storage and interconnect technologies to better understand their impact on performance and scalability, which will be further discussed in Sections 3 through 5.
2.1 Persistent Memory
Persistent memory (PMem) combines memory-like speed with non-
volatility, ensuring data persistence even during power loss. It is
distinguished by high performance, persistence, byte-addressability
and high density, bridging the gap between DRAM and block de-
vices in terms of both capacity and performance.
In 2019, Intel introduced the Optane DCPMM series 100, based on 3D XPoint technology, as the first commercially available persistent memory, and it became a focal point for research. The DCPMM offers a per-DIMM capacity ranging from 128 to 512 GB, with write/read latencies in the tens or hundreds of nanoseconds. It shares the memory bus with DRAM and supports load/store instructions. After data is flushed out of the CPU cache and reaches the asynchronous DRAM refresh (ADR) region, it is guaranteed to be durable. For persistence and consistency, programmers must explicitly flush the CPU cache to ensure data is persisted and use memory fence instructions to prevent the CPU from reordering store operations.
The second-generation DCPMM, which supports extended asynchronous DRAM refresh (eADR), expands the persistence domain to include the CPU cache. This extension makes the CPU cache a transient persistence domain by ensuring that data buffered in the CPU cache is flushed to persistent memory during a power outage. Despite this advancement, memory fence instructions are still necessary to maintain data consistency.
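To make this flush-plus-fence discipline concrete, the following sketch persists a buffer that lives in a PMem-mapped region using x86 intrinsics; the helper name and the 64-byte cache-line assumption are ours for illustration (PMDK's pmem_persist wraps the same pattern):

/* Minimal sketch of cache-line flushing for ADR platforms; on eADR
 * platforms the flush loop can be skipped, but the fence is still
 * needed to order stores. Compile with -mclflushopt. */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static void persist_range(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;   /* align down to cache line */
    for (; p < (uintptr_t)addr + len; p += 64)
        _mm_clflushopt((void *)p);                    /* push the line toward PMem */
    _mm_sfence();                                     /* order flushes before later stores */
}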
The emergence of persistent memory has introduced a new storage architecture, presenting both opportunities and challenges for OLTP systems with data persistence requirements. The study [31] employs various database engines and benchmarks to evaluate and compare the performance impact of PMem. It highlights the need for fine-tuning and redesign of optimizations to fully leverage the capabilities of PMem.
For programmers, the programming model of PMem (PMDK [6]) provides a transactional object store, including memory allocation, transactions, and general facilities for persistent memory programming. It also provides low-level persistent memory support, such as primitives for data copy and persistence.
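A minimal sketch of PMDK's transactional object store (libpmemobj) is shown below; the layout name, root struct, and deposit function are hypothetical examples rather than code from any system covered in the tutorial:

/* Hypothetical example of a durable, atomic update with libpmemobj. */
#include <libpmemobj.h>
#include <stdint.h>

struct account { uint64_t balance; };
POBJ_LAYOUT_BEGIN(bank);
POBJ_LAYOUT_ROOT(bank, struct account);
POBJ_LAYOUT_END(bank);

void deposit(PMEMobjpool *pop, uint64_t amount)
{
    TOID(struct account) root = POBJ_ROOT(pop, struct account);
    TX_BEGIN(pop) {
        TX_ADD(root);                   /* undo-log the object before modifying it */
        D_RW(root)->balance += amount;  /* becomes durable atomically at commit */
    } TX_END
}

A pool would be created beforehand with pmemobj_create(path, POBJ_LAYOUT_NAME(bank), pool_size, 0666); on commit, the library issues the cache-line flushes and fences described above on the application's behalf.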
2.2 NVMe SSD
Non-volatile memory express solid-state drives (NVMe SSDs) feature block-addressability and deliver high performance. Recent advancements have made SSDs both faster and more cost-effective, with the NVMe/PCIe interface enhancing interconnect speeds from 4 GB/s (PCIe 3.0) to 16 GB/s (PCIe 5.0) [35]. An array of PCIe 5.0 NVMe SSDs can achieve more than 100 GB/s read throughput [32]. Modern commodity servers, equipped with up to 128 PCIe lanes per socket, can effortlessly host 8 or more SSDs at full bandwidth [20]. As a result, a server can achieve tens of millions of I/O operations per second [35]. However, the rise of high-throughput NVMe SSDs also challenges current database engines: pure in-memory engines are costly and cannot leverage cheaper SSDs, while out-of-memory systems, originally designed for SATA disks, cannot fully utilize the capabilities of NVMe SSDs [20, 23, 32].
For programmers, there are three mainstream programming models for NVMe SSDs: libaio [3], io_uring [2], and SPDK [56]. SPDK is a user-space I/O library that bypasses the kernel, enabling direct access to NVMe SSDs with zero-copy and high-performance features. io_uring is a Linux API that utilizes shared-memory, lock-free queues between the kernel and the application. It supports different polling mechanisms, allowing for reduced syscall and interrupt overhead and enhanced asynchronous I/O performance. libaio is an asynchronous I/O library that offers an interface for applications to issue asynchronous I/O requests.
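To make the io_uring model concrete, the sketch below issues a single asynchronous read through liburing; the device path, queue depth, and buffer size are illustrative assumptions, and error handling is omitted for brevity:

/* One asynchronous read via liburing. */
#include <liburing.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int read_block(const char *dev, off_t offset, size_t len)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);               /* submission/completion queues with 8 entries */

    int fd = open(dev, O_RDONLY | O_DIRECT);
    void *buf = aligned_alloc(4096, len);           /* O_DIRECT requires aligned buffers */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, len, offset);  /* describe the read */
    io_uring_submit(&ring);                         /* one syscall submits it */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                 /* block until completion */
    int res = cqe->res;                             /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);

    free(buf);
    close(fd);
    io_uring_queue_exit(&ring);
    return res;
}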
2.3 RDMA and CXL
Remote Direct Memory Access (RDMA) is a technology that en-
ables nodes within a cluster to directly access each other’s memory
regions, bypassing the operating system kernel. This eliminates
traditional TCP/IP protocol stack overhead, such as unnecessary
data copying and context switching between user and kernel spaces.
RDMA relies on Remote Network Interface Controllers (RNICs) for
direct memory access within network adapters, facilitating data
transfers between nodes’ memory. In fast datacenter networks, a
basic RDMA operation takes approximately 2 microseconds [
11
],
and the bandwidth can reach tens of gigabytes per second. RDMA
has been widely used in datacenter.
For RDMA programming models, ibverbs [5] is a key component of InfiniBand technology, providing a high-speed communication interface. It enables efficient data transfer and low-latency communication between nodes. Libfabric [4] is a more generalized fabric API that provides a unified abstraction over high-performance network devices.
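As an illustration of the ibverbs interface, the sketch below posts a one-sided RDMA WRITE; it assumes a queue pair already connected to the remote node and a registered local memory region, with remote_addr and rkey exchanged out of band, and all names are illustrative rather than taken from the tutorial:

/* One-sided RDMA WRITE with ibverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
               uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: remote CPU is not involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success; poll the CQ afterwards */
}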
Compute Express Link (CXL) is a promising interconnect standard [1], which enables cacheable load/store accesses to pooled memory. CXL consists of three sub-protocols: CXL.IO, CXL.Cache, and CXL.Mem. CXL.IO is an enhanced version of PCIe and forms the basis for device discovery, configuration, and standard I/O.