
X-SSD: A Storage System with Native Support for Database Logging and Replication

Sangjin Lee∗, Hanyang University, Republic of Korea

Alberto Lerner, University of Fribourg, Switzerland

André Ryser, University of Fribourg, Switzerland

Kibin Park, Hanyang University, Republic of Korea

Chanyoung Jeon, Hanyang University, Republic of Korea

Jinsub Park, Hanyang University, Republic of Korea

Yong Ho Song, Hanyang University & Samsung Electronics, Republic of Korea

Philippe Cudré-Mauroux, University of Fribourg, Switzerland

ABSTRACT

Transaction logging and log shipping are standard techniques to provide recoverability and high availability in data management systems. They entail an update to a local log file and to a remote site at every transaction. Modern databases have leveraged technologies such as Persistent Memory (PM) and RDMA-enabled networking to perform these updates as fast as possible. This mix of technologies, however, has several drawbacks: lack of portability, a complex data path, and poor interoperability.

To address these issues, this paper introduces the X-SSD, a new SSD architecture that combines NAND Flash and PM. An X-SSD device can accept transaction log writes on a fast, PM-backed data path and is responsible for propagating them to remote sites and, eventually, to NAND Flash storage. We design and implement an actual reference X-SSD device, called Villars, to validate this new architecture. Our experiments show that the Villars device can offer a more straightforward and robust way to manage PM on behalf of the database while achieving equally fast results.

CCS CONCEPTS

• Information systems → Database management system engines; Storage architectures.

KEYWORDS

database-storage codesign, write-ahead log, database replication

ACM Reference Format:

Sangjin Lee, Alberto Lerner, André Ryser, Kibin Park, Chanyoung Jeon, Jinsub Park, Yong Ho Song, and Philippe Cudré-Mauroux. 2022. X-SSD: A Storage System with Native Support for Database Logging and Replication. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD ’22), June 12–17, 2022, Philadelphia, PA, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3514221.3526188

∗ The author performed most of the work while visiting the University of Fribourg.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA

ACM ISBN 978-1-4503-9249-5/22/06... $15.00

https://doi.org/10.1145/3514221.3526188

1 INTRODUCTION

Database replication is often performed by copying transactions’ changes to a secondary site before committing these changes to local storage [ ]. This mechanism is called (transaction) log shipping and is present in virtually every database that offers replication (e.g., [ ]). If the primary database site goes down, the secondary one can serve as a hot backup, as it has caught up with all of the primary’s changes. Achieving this level of robustness, however, comes at a cost. Transaction logging and log shipping require writing to storage and exchanging data over the network, both relatively expensive operations.
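The ordering that log shipping relies on can be sketched in a few lines of Python. This is a toy, in-memory model under our own assumptions: `LogSite` and `commit` are hypothetical names, and plain list appends stand in for durable log writes and network transfers.

```python
class LogSite:
    """Toy database site that persists only an append-only transaction log."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log = []  # list of log records (strings)

    def append(self, record: str) -> None:
        # Stand-in for a durable log write (local storage or remote site).
        self.log.append(record)


def commit(primary: LogSite, secondary: LogSite, txn_id: int, changes: str) -> str:
    """Ship a transaction's log record to the secondary, then persist it locally."""
    record = f"T{txn_id}:{changes}"
    secondary.append(record)  # log shipping: the replica receives the change first
    primary.append(record)    # only then is the change committed to local storage
    return record


primary = LogSite("primary")
backup = LogSite("secondary")
commit(primary, backup, 1, "x=1")
commit(primary, backup, 2, "y=2")
# If the primary fails here, the secondary already holds every committed change.
print(backup.log)  # ['T1:x=1', 'T2:y=2']
```

Both sites end up with identical logs, which is what lets the secondary act as a hot backup. In a real system, the shipping step costs a network transfer on every commit, which is the expense the paragraph above refers to.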

Two technologies that have recently reached maturity are relevant in this scenario. The first one is Persistent Memory (PM), more specifically PM in a DIMM form factor that replaces server memory and can be accessed by an application via load and store instructions. PM comes in many flavors, such as Intel Optane [ ] or battery-backed DRAM [ ]. Optane-class PM has, for instance, proved useful in mixed memory indices [ ] and can offer alternative ways to build a database system [ ]. Battery-backed PM behaves as regular DRAM but is not volatile. The second technology is RDMA-enabled networking [ ]. These networks transport data with negligible overhead and have been useful, for instance, in query execution [ ]. Just as with PM, RDMA-enabled networks have also fostered new database designs [11].

PM and RDMA-enabled networks can also help to record and propagate transaction log updates [ ]. In particular, we consider the case of Main-Memory Databases [ ]. They can reach unprecedented performance levels because they keep all their data in DRAM and persist only the transaction log, which therefore becomes their main bottleneck [ ]. Figure 1 (left) depicts how a typical system can perform log writing and shipping with PM and RDMA. As the figure shows, the database system is responsible for coordinating several different steps, targeting local PM, remote PM or memory via an RDMA-enabled NIC, and, lastly, fast SSD devices.

Each of these technologies oers a specic API and presents

some restrictions. The combination of these restrictions creates a

number of issues, including:

• The interaction of RDMA and PM is complex and poorly understood. For example, using RDMA to update a PM-backed address on a remote machine may make the update visible, but it does not guarantee that the update is persistent [ ]. If a machine crashes, the correctness of a replication operation can be compromised.

The JEDEC standard, which supports DRAM interoperability, calls these NVDIMM-P and NVDIMM-F types of persistent memory, respectively [23].

Session 14: Modern Hardware and In-memory DBMS

SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA

988

[Figure 1 diagram: two panels (left: conventional SSD; right: X-SSD), each showing a local and a remote host with CPU cores/caches, memory controller, memory, PCIe system, NIC, and storage (NVM controller and SSD, or X-SSD), annotated with steps (1), (2), (3), (4a), and (4b).]

Figure 1: (Left) Logging and replication path using PM and RDMA. (1) The database writes log data into PM. (2) It then ships the data to remote PM via RDMA. (3) It uses a second RDMA operation to apply the change the log describes to the remote host’s memory (e.g., using Active-Memory techniques [ ]). Eventually, both hosts need to make space on PM. (4a/b) They do so by copying some of their PM contents into an SSD. (Right) Logging and replication path using an X-SSD device. The sequence of steps is the same, but the X-SSD device takes responsibility for propagating data in steps (2) and (4a/b), while the remote database updates the remote memory (3).

• While PM can be accessed with simple load/store memory instructions, programming correct persistent data structures is a daunting endeavor. A software crash can leave a structure in an arbitrary state, from which the database then needs to recover.

• Every DIMM slot used for PM is a slot that cannot hold DRAM. This forces system designers to trade off DRAM capacity against PM capacity.

• Optane and battery-backed DRAM require specific server support and cannot be ported to servers that lack it. Optane, in particular, is not supported on AMD platforms.
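The first issue, an update that is visible but not yet persistent, can be made concrete with a toy model written under our own simplifying assumptions: an incoming RDMA write is placed in the remote CPU cache (as with DDIO), where loads see it immediately, but it only becomes persistent once it is written back to PM. `RemoteHost` and its methods are hypothetical and do not correspond to any RDMA or PM library API.

```python
class RemoteHost:
    """Toy remote host: a volatile CPU cache in front of persistent memory (PM)."""

    def __init__(self) -> None:
        self.cache = {}  # volatile contents, lost on a crash
        self.pm = {}     # persistent contents, survive a crash

    def rdma_write(self, addr: int, value: str) -> None:
        # Modeled after DDIO: the NIC deposits incoming data in the cache.
        self.cache[addr] = value

    def load(self, addr: int):
        # Loads hit the cache first, so the write is already *visible*.
        return self.cache.get(addr, self.pm.get(addr))

    def write_back(self, addr: int) -> None:
        # Only an explicit write-back makes the cached value *persistent*.
        if addr in self.cache:
            self.pm[addr] = self.cache[addr]

    def crash(self) -> None:
        self.cache.clear()


host = RemoteHost()
host.rdma_write(0x10, "commit T1")
print(host.load(0x10))  # commit T1  (visible)
host.crash()
print(host.load(0x10))  # None       (never reached PM)

host.rdma_write(0x20, "commit T2")
host.write_back(0x20)
host.crash()
print(host.load(0x20))  # commit T2  (persistent)
```

In this model, the sender cannot tell the two outcomes apart at the time of the write, which is why a replication protocol built directly on RDMA and PM must add its own persistence step.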

To address these issues, this paper presents a new SSD design that allows database logging and replication to benefit from PM and fast networking without the above drawbacks. In summary, our design is based on a deceptively simple decision: it moves PM out of the CPU path and into the SSD, and it lets the SSD manage access to PM, locally or remotely, on behalf of the database. Specifically, we devise a new storage architecture that contains PM- and NAND Flash-based storage and is natively networked. The architecture provides a separate, fast data path and interface fully dedicated to transaction log writes, and it offers Data Propagation Services, including across servers, upon which database replication can be built. We call our storage architecture the X-SSD. Figure 1 (right) shows how the logging and replication data paths can be simplified with an X-SSD device.

Moving PM into an X-SSD device frees DIMM slots for DRAM and makes it possible to deploy PM on server-class machines from any vendor, without special-purpose DIMM slots or battery-backing features. One can then use PM on an AMD server simply by plugging in this new NVMe device. The device also avoids interoperability problems between RDMA and remote PM. These problems occur because RDMA writes may be routed to the CPU caches, through a mechanism called Data Direct I/O (DDIO) [ ], before they reach the PM. Our system can resort to low-level mechanisms to request that the NIC deliver messages directly to storage, which would be impractical at the application level.

The ‘X’ stands for “cross,” for reasons that will become apparent shortly.

We design and implement an X-SSD device using an actual SSD prototyping platform [ ]. We call this device Villars. Even with our extensions, the Villars device remains fully compatible with the NVMe standard [ ], the de facto standard for fast SSDs. Villars is the first in what we expect to be a series of X-SSD devices with increasing application functionality. We also provide a set of drop-in system call replacements, e.g., for pwrite(), that make it easy to convert existing code to use our device. These syscall replacements can detect, with low overhead, whether a previous write is already persistent or still in flight within a local or a remote device, allowing the database to implement different replication flavors.

In summary, we make the following contributions:

• We propose a new SSD architecture that mixes PM and NAND Flash storage in a way that naturally matches transaction logging and replication behavior (§ 3).

• We describe a reference design of our architecture that presents precise durability semantics to the application (§ 4).

• We describe how to integrate data propagation services into an existing database and suggest how a new database can explore alternative designs (§ 5).

• We quantify the benefits of shifting data movement (across memory types and between local and remote servers) to the storage, freeing application cycles in the process (§ 6).

• We present several use cases that can benefit from X-SSD devices’ existing and potential features (§ 7).

• We position our solution relative to the state of the art (§ 8).

Before describing our architecture in detail, we start below by

presenting the background necessary to follow our discussions.

2 BACKGROUND

The X-SSD device extends traditional SSDs with data propagation services exposed through a new interface. To understand the implications of such an extension, we revisit how data flows between a database and a storage device (§ 2.1) and, once it reaches the latter, the path the data follows within a conventional SSD (§ 2.2). Lastly, we introduce two standard but little-known technologies upon which our architecture is based: the Controller Memory Buffer (CMB) and the Non-Transparent Bridge (NTB) (§ 2.3).
