[Figure 1: monolithic machine — CPU, memory, and storage coupled in one server, with replication between servers.]
[Figure 2: separation of compute and storage — (a) virtual machine with remote disk; (b) shared storage, with compute nodes coordinating over the network.]
[Figure 3: disaggregation — compute nodes (CPU with local memory), a memory pool, and a storage pool connected by the network.]
problems like bin-packing of CPU and memory, and the lack of flexible and scalable memory resources, remain unsolved. Furthermore, each read replica keeps a redundant in-memory data copy, leading to high memory costs.
In this paper, we propose a novel cloud database design paradigm, the disaggregation architecture (Figure 3). It goes one step further than the shared storage architecture to address the aforementioned problems. The disaggregation architecture runs in disaggregated data centers (DDC), in which CPU, memory, and storage resources are no longer tightly coupled as in a monolithic machine. Resources are located in different nodes connected through a high-speed network. As a result, each resource type improves its utilization rate and expands its volume independently. This also eliminates fate sharing, allowing each resource to be recovered from failure and upgraded independently. Moreover, data pages in the remote memory pool can be shared among multiple database processes, analogous to the storage pool being shared in the shared storage architecture. Adding a read replica no longer increases the cost of memory resources, except for consuming a small piece of local memory.
A trend in recent years is that cloud-native database vendors are launching serverless variants [3, 4]. The main feature of serverless databases is on-demand resource provisioning (such as auto-scaling and auto-pause), which should be transparent and seamless, without interrupting customer workloads. Most cloud-native databases are implemented on the shared storage architecture, where CPU and memory resources are coupled and must be scaled at the same time. In addition, auto-pause has to release both resources, resulting in long resumption times. We show that the disaggregation architecture can overcome these limitations.
PolarDB Serverless is a cloud-native database implementation that follows the disaggregation architecture. Similar to major cloud-native database products like Aurora, HyperScale, and PolarDB¹, it includes one primary (RW node) and multiple read replicas (RO nodes) in the database node layer. With the disaggregation architecture, it is possible to support multiple primaries (RW nodes), but this is not within the scope of this paper.
The design of a multi-tenant scale-out memory pool is introduced in PolarDB Serverless, including page allocation and life cycle management. The first challenge is to ensure that the system executes transactions correctly after adding remote memory to the system. For example, a read after a write should not miss any updates, even across nodes. We realize this using cache invalidation. When the RW node is splitting or merging a B+Tree index, RO nodes should not see an inconsistent B+Tree structure mid-operation. We protect against this with global page latches. When an RO node performs read-only transactions, it must avoid reading anything written by uncommitted transactions. We achieve this through the synchronization of read views between database nodes.

¹ PolarDB Serverless is developed on a fork of PolarDB's codebase.
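The read-after-write guarantee via cache invalidation can be illustrated with a minimal, single-process sketch. All names here (`RemotePool`, `Node`) are hypothetical; the real system invalidates caches over RDMA between separate machines, with versioning and asynchrony this simulation omits.

```python
class RemotePool:
    """Stands in for the shared remote memory pool of pages."""
    def __init__(self):
        self.pages = {}       # page_id -> bytes
        self.cached_by = {}   # page_id -> set of nodes holding a local copy

    def read(self, page_id, node):
        # Track which nodes cache this page so writers can invalidate them.
        self.cached_by.setdefault(page_id, set()).add(node)
        return self.pages.get(page_id)

    def write(self, page_id, data, writer):
        self.pages[page_id] = data
        # Invalidate every other node's stale local copy (simplified:
        # done synchronously here, unlike a real RDMA-based protocol).
        for node in self.cached_by.get(page_id, set()):
            if node is not writer:
                node.local_cache.pop(page_id, None)
        self.cached_by[page_id] = {writer}

class Node:
    """A database node with a small local cache over the remote pool."""
    def __init__(self, pool):
        self.pool = pool
        self.local_cache = {}

    def get_page(self, page_id):
        if page_id not in self.local_cache:   # miss: fetch from remote pool
            self.local_cache[page_id] = self.pool.read(page_id, self)
        return self.local_cache[page_id]

    def put_page(self, page_id, data):
        self.local_cache[page_id] = data
        self.pool.write(page_id, data, self)

pool = RemotePool()
rw, ro = Node(pool), Node(pool)
ro.get_page("p1")          # RO caches the page locally
rw.put_page("p1", b"v2")   # RW writes; RO's stale copy is invalidated
assert ro.get_page("p1") == b"v2"   # read-after-write sees the update
```

Without the invalidation step in `write`, the final read would return the stale cached copy, which is exactly the cross-node anomaly the paper's cache invalidation prevents.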
The evolution toward the disaggregation architecture could have a negative impact on database performance, because data is likely to be accessed remotely, which introduces significant network latency. The second challenge is therefore to execute transactions efficiently. We exploit RDMA optimizations extensively, especially one-sided RDMA verbs, including using RDMA CAS [42] to optimize the acquisition of global latches. To improve concurrency, both RW and RO nodes use optimistic locking techniques to avoid unnecessary global latches. On the storage side, page materialization offloading allows dirty pages to be evicted from remote memory without flushing them to storage, while index-aware prefetching improves query performance.
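The idea behind acquiring a global latch with compare-and-swap can be sketched as follows. This is an illustrative emulation, not PolarDB's actual code: the latch word would live in the remote memory pool and be swapped by a one-sided RDMA CAS verb, whereas here it sits in a local dictionary; the names and retry policy are assumptions.

```python
FREE = 0  # latch word value meaning "unheld"

class GlobalLatchTable:
    """Emulates latch words in remote memory, one per page."""
    def __init__(self):
        self.words = {}   # page_id -> owner node id (FREE = unheld)

    def cas(self, page_id, expected, new):
        """Atomically swap the word if it matches `expected`,
        mimicking the semantics of an RDMA CAS on remote memory."""
        cur = self.words.get(page_id, FREE)
        if cur == expected:
            self.words[page_id] = new
            return True
        return False

def acquire(table, page_id, node_id, max_retries=100):
    """Spin with CAS until the latch word records us as owner."""
    for _ in range(max_retries):
        if table.cas(page_id, FREE, node_id):
            return True
    return False   # caller falls back, e.g. to a queued wait

def release(table, page_id, node_id):
    ok = table.cas(page_id, node_id, FREE)
    assert ok, "only the current owner may release the latch"

table = GlobalLatchTable()
assert acquire(table, "page7", node_id=1)
assert not acquire(table, "page7", node_id=2, max_retries=3)  # held by 1
release(table, "page7", node_id=1)
assert acquire(table, "page7", node_id=2)
```

The appeal of one-sided CAS is that the acquiring node never involves a remote CPU: a single round trip either takes the latch or reveals the current holder.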
The disaggregation architecture complicates the system and hence increases the variety and probability of system failures. As a cloud database service, the third challenge is to build a reliable system. We summarize our strategies for handling single-node crashes of different node types, which guarantee that there is no single point of failure in the system. Because the states in memory and storage are decoupled from the database node, crash recovery of the RW node becomes 5.3 times faster than in the monolithic machine architecture.
We summarize our main contributions as follows:
• We propose the disaggregation architecture and present the design of PolarDB Serverless, the first cloud database implementation following this architecture. We demonstrate that this architecture provides new opportunities for the design of new cloud-native and serverless databases.
• We provide design details and optimizations that make the system work correctly and efficiently, overcoming the performance drawbacks brought by the disaggregation architecture.
• We describe our fault tolerance strategies, including the handling of single-point failures and cluster failures.
The remainder of this paper is organized as follows. Section 2 introduces the background of PolarDB and DDC. Section 3 explains the design of PolarDB Serverless. Section 4 presents our performance optimizations. Section 5 discusses our fault tolerance and recovery strategies. Section 6 gives the experimental results. Section 7 reviews the related work, and Section 8 concludes the paper.