NVM has a tuple ID field and a Tx-CTS (Transaction Commit Timestamp) field. Tx-CTS identifies the transaction that produces the version of the tuple. At commit time, Zen persists the tuples modified by a transaction from the Met-Cache to the relevant base tables in NVM. It writes to newly allocated or garbage collected space without overwriting the previous versions of the tuples. The most significant bit in Tx-CTS serves as an LP (Last Persisted) bit. After persisting the set of modified tuples in a transaction, Zen sets the LP bit and persists the Tx-CTS for the last tuple in the set. Upon failure recovery, Zen can identify whether the modifications of a transaction are fully persisted by checking if the LP bit is set for one of its tuples. If yes, then the new tuple versions become the current versions; if no, then the transaction is considered aborted, and the previous tuple versions are used.
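To make the mechanism concrete, the following is a minimal sketch of the commit-time path under the LP-bit scheme described above; NvmTuple, persist(), commit_writes, and the 48-byte payload are hypothetical names and layouts, not Zen's actual interface.

```cpp
// Minimal sketch of log-free commit, assuming a 64-bit Tx-CTS whose
// most significant bit is the LP (Last Persisted) flag.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t LP_BIT = 1ULL << 63;

struct NvmTuple {
    uint64_t tuple_id;
    uint64_t tx_cts;      // commit timestamp; MSB is the LP bit
    char     payload[48];
};

// Write back the covering cache lines, then fence (see the sketch in §2.1).
void persist(const void* addr, size_t len);

// Persist the write set of a committing transaction to NVM.
void commit_writes(const std::vector<NvmTuple*>& write_set, uint64_t cts) {
    for (size_t i = 0; i < write_set.size(); ++i) {
        NvmTuple* t = write_set[i];
        bool last = (i + 1 == write_set.size());
        // Only the final tuple carries the LP bit; because it is
        // persisted after all the others, finding it at recovery
        // proves the entire write set reached NVM.
        t->tx_cts = last ? (cts | LP_BIT) : cts;
        persist(t, sizeof(*t));
    }
}

// Recovery-time check: the transaction with timestamp cts committed
// iff some tuple carries (cts | LP_BIT); otherwise treat it as
// aborted and keep the previous tuple versions.
bool fully_persisted(const std::vector<const NvmTuple*>& scanned, uint64_t cts) {
    for (const NvmTuple* t : scanned)
        if (t->tx_cts == (cts | LP_BIT)) return true;
    return false;
}
```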
Lightweight NVM Space Management: We aim to reduce the persistence operations for NVM space management as much as possible. First, we allocate large (2MB sized) chunks of NVM memory from the underlying system, and initialize the NVM memory so that Tx-CTS=0. Second, we manage tuple allocation and deallocation without performing any persistence operations. This is possible because, with the log-free persistence mechanism, Zen can identify the most recently committed tuple versions upon recovery; the old tuple versions are then put into the free lists. Third, the allocation structures are maintained in DRAM during normal processing. Zen garbage collects old tuple versions and puts them into free lists for tuple allocations. Each thread has its own allocation structures to avoid thread synchronization overhead.
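A minimal sketch of such a per-thread, DRAM-resident allocator is shown below; ThreadAllocator and its helpers are illustrative names, with calloc standing in for the real NVM chunk allocation.

```cpp
// Sketch of per-thread NVM space management kept entirely in DRAM;
// no persistence operation is issued on alloc/free, since recovery
// re-derives the live versions from Tx-CTS. Names are illustrative.
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct NvmTuple {           // fixed-size tuple slot, as sketched earlier
    uint64_t tuple_id;
    uint64_t tx_cts;        // 0 in freshly initialized chunks
    char     payload[48];
};

constexpr size_t CHUNK_BYTES = 2u << 20;  // 2MB chunks from the system

struct ThreadAllocator {
    std::vector<NvmTuple*> free_list;  // volatile; rebuilt on recovery

    // A real system would map a 2MB NVM chunk (e.g., from a DAX file)
    // and zero it so that every Tx-CTS = 0; calloc stands in here.
    void refill_from_new_chunk() {
        size_t n = CHUNK_BYTES / sizeof(NvmTuple);
        auto* chunk = static_cast<NvmTuple*>(calloc(n, sizeof(NvmTuple)));
        for (size_t i = 0; i < n; ++i) free_list.push_back(&chunk[i]);
    }

    NvmTuple* alloc() {
        if (free_list.empty()) refill_from_new_chunk();
        NvmTuple* t = free_list.back();
        free_list.pop_back();
        return t;                      // no clwb/sfence on this path
    }

    // Garbage collection returns superseded versions for reuse.
    void release(NvmTuple* t) { free_list.push_back(t); }
};
```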
The contributions of this paper are fourfold. First, we identify the main design principles for NVM based OLTP engines by examining the strengths and weaknesses of three state-of-the-art NVM based OLTP designs (§2). Second, we propose Zen, which reduces NVM overhead with three novel techniques, namely the Met-Cache, log-free persistent transactions, and lightweight NVM space management (§3 and §5). The three techniques push NVM write minimization to the extreme: for every tuple write, the only NVM write is for the modified tuple itself. Third, we evaluate the runtime and recovery performance of Zen using the YCSB and TPC-C benchmarks on a real machine equipped with Intel Optane DC Persistent Memory. Experimental results show that Zen achieves up to 10.1x improvement over MMDB with NVM capacity, WBL, and FOEDUS, while obtaining almost instant recovery (§4). Finally, we demonstrate the wide applicability of Zen by supporting 10 different concurrency control methods (§3 and §4).
2 BACKGROUND AND MOTIVATION
We provide background on NVM and OLTP, examine existing OLTP
engine designs for NVM, then discuss the design challenges.
2.1 NVM Characteristics
There are several competing NVM technologies, including PCM [29], STT-RAM [39], Memristor [3], and 3DXPoint [1, 18]. They share similar characteristics: (i) NVM is byte-addressable like DRAM; (ii) NVM is modestly (e.g., 2–3x) slower than DRAM, but orders of magnitude faster than HDDs and SSDs; (iii) NVM provides non-volatile main memory that can be much larger (e.g., up to 6TB in a dual-socket server) than DRAM; (iv) NVM writes have lower bandwidth than NVM reads; (v) to ensure that data is consistent in NVM upon power failure, special persistence operations using cache line flush and memory fence instructions (e.g., clwb and sfence) are required to persist data from the volatile CPU cache to NVM, incurring significantly higher overhead than normal writes; and (vi) NVM cells may wear out after a limited number (e.g., 10^8) of writes.
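As an illustration of characteristic (v), a generic persist primitive on x86 can be built from clwb and sfence via the standard compiler intrinsics; this sketch is not specific to any system.

```cpp
// Generic x86 persist primitive: write back every cache line covering
// [addr, addr+len), then fence so the write-backs are ordered before
// subsequent stores. Requires a CPU with clwb (compile with -mclwb).
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

void persist(const void* addr, size_t len) {
    constexpr uintptr_t kLine = 64;  // cache line size on current x86
    uintptr_t p = reinterpret_cast<uintptr_t>(addr) & ~(kLine - 1);
    const uintptr_t end = reinterpret_cast<uintptr_t>(addr) + len;
    for (; p < end; p += kLine)
        _mm_clwb(reinterpret_cast<void*>(p));  // flush without evicting
    _mm_sfence();
}
```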
From previous work on NVM based data structures and systems [4–6, 9–12, 17, 19, 20, 25, 26, 28, 33–35, 37, 38], we obtain
three common design principles: (i) Put frequently accessed data
structures in DRAM if they are either transient or can be recon-
structed upon recovery; (ii) Reduce NVM writes as much as possible;
(iii) Reduce persistence operations as much as possible. We would
like to apply these design principles to the OLTP engine design.
2.2 OLTP in Main Memory Databases
Main memory OLTP systems are the starting point for designing an
OLTP engine for NVM. We consider concurrency control and crash
recovery mechanisms for achieving ACID transaction support.
Recent work has investigated concurrency control methods for high-throughput main memory transactions [15, 24, 27, 32, 36, 41]. Instead of using two-phase locking (2PL) [7, 16], which is the standard method in traditional disk-oriented databases, main memory OLTP designs exploit optimistic concurrency control (OCC) [21] and multi-version concurrency control (MVCC) [7] for higher performance. Silo [32] enhances OCC with epoch-based batch timestamp generation and group commit. MOCC [36] is an OCC based method that exploits locking mechanisms to deal with high conflict on hot tuples. TicToc [41] removes the bottleneck of centralized timestamp allocation in OCC and computes transaction timestamps lazily at commit time. Hekaton [15] employs latch-free data structures and MVCC for transactions in memory. HyPer [27] improves MVCC for read-heavy transactions in column stores by performing in-place updates and storing before-image deltas in undo buffers. Cicada [24] reduces the overhead and contention of MVCC with multiple loosely synchronized clocks for generating timestamps, best-effort inlining to decrease cache misses, and optimized multi-version validation. One common feature of the above methods is that they extend every tuple or every version of a tuple with metadata, such as read/write timestamps, pointers to different tuple versions, and lock bits for validation and commit processing. These methods have achieved transaction throughputs of over one million transactions per second (TPS) without persistence.
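As an illustration of this common feature, the per-version metadata typically resembles the following; the exact field set varies across systems, so this struct is a generic composite rather than any one design's layout.

```cpp
// Generic per-version metadata combining the fields mentioned above:
// read/write timestamps, a version-chain pointer, and a lock bit.
#include <atomic>
#include <cstdint>

struct VersionMeta {
    uint64_t begin_ts;              // write timestamp: version created
    uint64_t end_ts;                // timestamp at which it is superseded
    std::atomic<uint64_t> read_ts;  // latest reader, used in validation
    std::atomic<bool> locked;       // lock bit taken during commit
    VersionMeta* older;             // pointer to the previous version
};
```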
Similar to traditional databases, main memory databases (MMDBs) store logs and checkpoints on durable storage (e.g., HDDs, SSDs) in order to achieve durability [8, 14, 22, 23, 30, 43]. The main difference is that all the data fits into main memory in MMDBs. Hence, only committed states and redo logs need to be written to disks. After a crash, an MMDB recovers by loading the most recent checkpoint from durable storage into main memory, then reading and applying the redo log up to the crash point.
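A minimal sketch of this recovery path is shown below; load_checkpoint and next_redo_record are hypothetical helpers standing in for a concrete MMDB's checkpoint loader and log scanner.

```cpp
// Sketch of MMDB crash recovery: reload the latest checkpoint, then
// replay committed redo records up to the crash point.
#include <cstdint>
#include <optional>

struct RedoRecord { uint64_t tuple_id; uint64_t cts; /* new value */ };

void load_checkpoint();                        // bulk-load into DRAM
std::optional<RedoRecord> next_redo_record();  // scan log past checkpoint
void apply(const RedoRecord& r);               // reinstall committed write

void recover() {
    load_checkpoint();
    while (auto rec = next_redo_record())  // stops at end of valid log
        apply(*rec);
}
```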
2.3 Existing OLTP Engine Designs for NVM
In this paper, we focus on the case where all data and structures of
the OLTP engine can t into NVM memory. We assume that the
computer system contains both NVM and DRAM memory, which
are mapped to dierent address ranges in the virtual memory of