at the granularity of individual entries, causing substantial network
overhead and becoming a performance bottleneck. Simultaneously,
in the context of variable-length KV payloads, transferring each
entry requires accessing the original KV, resulting in a significant
increase in read amplification.
2) Concurrency control overhead. Concurrent access to the
index requires robust mechanisms for ensuring data consistency.
When multiple clients access an index concurrently, they may read
or write the same key at the same time. To avoid issues such as
insert-miss and duplicate key [9, 30, 65], current approaches generally
adopt lock-based or reread-based concurrency control strategies.
However, both methods incur additional round-trip time (RTT)
and increase the latency of insert operations.
3) Sacrificing read performance. Achieving optimal write
performance often requires compromising the orderliness of the data
layout, which can negatively impact read performance. For instance,
in a leveling hash structure, the read operation must search through
multiple levels, causing severe read amplification. In addition, memory
constraints on compute nodes and communication overhead
make optimizing read performance challenging.
In this paper, we propose SepHash, a write-optimized hash index
for disaggregated memory. To address the above challenges, we
have introduced innovative techniques, including the separate segment
structure for efficient resizing, an RTT-reduced concurrency
control mechanism for reducing write latency, and efficient cache
and filter structures for enhancing read performance. With these
techniques, SepHash delivers high read and write performance.
Separate segment structure. To reduce the resize overhead,
SepHash proposes a two-level separate segment structure that combines
the benefits of extendible hash and leveling hash. The data
flow between segments is completely batch-oriented, avoiding individual
entry movements. By storing depth information (part of the
hash) and state information in each entry, SepHash reduces access
to original KVs and quickly empties old entries during resize.
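The role of the per-entry depth and state bits can be shown with a minimal sketch. The field widths and names below (`state`, `depth`, `fingerprint`, `ptr`) are illustrative assumptions, not SepHash's actual entry layout:

```python
# Sketch: pack a segment entry into one 64-bit word holding a state
# flag, depth bits (part of the hash), a fingerprint, and a KV
# pointer. Field widths here are assumptions for illustration only.

STATE_BITS, DEPTH_BITS, FP_BITS, PTR_BITS = 1, 4, 11, 48

def pack_entry(state, depth_bits, fingerprint, ptr):
    assert ptr < (1 << PTR_BITS)
    word = state
    word = (word << DEPTH_BITS) | depth_bits
    word = (word << FP_BITS) | fingerprint
    word = (word << PTR_BITS) | ptr
    return word

def unpack_entry(word):
    ptr = word & ((1 << PTR_BITS) - 1)
    word >>= PTR_BITS
    fingerprint = word & ((1 << FP_BITS) - 1)
    word >>= FP_BITS
    depth_bits = word & ((1 << DEPTH_BITS) - 1)
    state = word >> DEPTH_BITS
    return state, depth_bits, fingerprint, ptr

# During a resize, the stored depth/fingerprint bits decide an
# entry's destination without dereferencing the KV pointer, so the
# original KV need not be read.
entry = pack_entry(1, 0b1010, 0x3FF, 0xDEADBEEF)
```

The point of the sketch is that one small read of the entry word is enough to route it during a resize, which is how storing hash bits inline avoids extra accesses to the original KVs.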
RTT-reduced concurrency control. SepHash designs an RTT-
reduced concurrency control strategy that uses append write, coroutines,
and sliding windows. This policy allows highly concurrent
access to the index without additional locking or rereading operations.
All write operations can return immediately after inserting
KV pointers, without waiting for the completion of subsequent
meta-data updates, significantly reducing write latency.
Efficient cache and filter. To improve read performance, SepHash
introduces a filter for the separate segment structure and
maintains a client-side cache. These components effectively
reduce the access granularity of RDMA operations and avoid
unnecessary index accesses. SepHash designs a space-efficient
cache structure and proposes an update-efficient filter structure
that requires only one RDMA operation for each update.
We implement a prototype of SepHash and evaluate performance
using micro-benchmarks and YCSB workloads [10, 50]. Our experiments
demonstrate that, under write-intensive workloads, SepHash
improves write throughput by 3.3× and reduces write latency by
30% compared to state-of-the-art hash indexes. Furthermore, under
read-intensive workloads, SepHash achieves 7.1× higher read
throughput compared to other leveling index structures. In summary,
we make the following contributions:
• We perform an analysis of existing hash indexes on disaggregated
memory and identify three factors contributing to
poor write performance: entry-based resize strategies, read
amplification caused by out-of-place KVs, and additional
RTTs caused by concurrency control strategies (§3).
• We design SepHash, a write-optimized hash index on disaggregated
memory. SepHash employs a two-level structure,
RTT-reduced concurrency control, and efficient cache and
filter structures to achieve excellent performance (§4).
• We evaluate SepHash and compare it with state-of-the-art
distributed hash indexes. Evaluations demonstrate that
SepHash outperforms other indexes in terms of write performance
while achieving balanced read performance (§5).
2 BACKGROUND
2.1 Disaggregated Memory
Disaggregated memory segregates compute and memory resources
into separate pools [21, 22, 53], which are interconnected using
high-performance RDMA networks. The compute pool has multiple
CPUs (e.g., 100s) with a few GBs of DRAM. In contrast, the
memory pool has a large number of memory resources (e.g., 100s–
1000s of GBs) with weak computing capabilities (e.g., 1–2 wimpy
cores) for handling communication and memory management.
RDMA networks provide RDMA READ, WRITE, and ATOMIC interfaces,
e.g., compare-and-swap (CAS) and fetch-and-add (FAA) operations.
RDMA operations are based on the post-poll mechanism.
Users initiate data transfer by posting work requests to the send
queue. Requests in the same send queue are executed sequentially.
By using doorbell batching [34, 43, 47], multiple RDMA operations
can be combined into a single request. These requests are then
read by the RDMA NIC, which asynchronously writes/reads data
to/from remote memory. During data transfer, the completion queue
is polled to check if the operation is complete. Waiting for network
transmissions accounts for most of RDMA's latency overhead [7],
making it suitable for optimization using coroutine techniques.
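The post-poll mechanism and doorbell batching described above can be sketched with a toy simulation; plain Python queues stand in for the verbs layer, and every class and method name here is an assumption for illustration:

```python
# Sketch of the RDMA post-poll pattern with doorbell batching,
# simulated with in-memory queues (no real NIC involved).

from collections import deque

class SimulatedQP:
    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_batch(self, work_requests):
        # Doorbell batching: hand a list of work requests to the
        # "NIC" in one call instead of one doorbell ring per request.
        self.send_queue.extend(work_requests)

    def nic_progress(self):
        # The NIC drains the send queue in order (requests in the
        # same send queue execute sequentially) and posts completions.
        while self.send_queue:
            wr = self.send_queue.popleft()
            self.completion_queue.append(("done", wr))

    def poll_cq(self):
        # The user polls the completion queue to learn what finished.
        return list(self.completion_queue)

qp = SimulatedQP()
qp.post_batch(["READ addr=0x10", "WRITE addr=0x20", "CAS addr=0x30"])
qp.nic_progress()
completions = qp.poll_cq()
```

The asynchrony between posting and polling is what leaves CPU time free while the NIC works, which is what the coroutine techniques below exploit.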
Coroutines are lightweight threads that are not managed by the
operating system but by the program. Coroutines allow a program
to pause execution at any location and resume execution later
without interrupting the entire program. This makes coroutines
well suited for asynchronous tasks like RDMA, where I/O waits and
computation can be overlapped within a single thread.
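As a rough illustration of why this helps, the following asyncio sketch overlaps two simulated RDMA reads in one thread; `asyncio.sleep` stands in for waiting on the completion queue, and all names are assumptions:

```python
# Sketch: coroutines overlap RDMA wait time within a single thread.
# Two simulated reads issued together take roughly one network
# latency of wall time instead of two.

import asyncio
import time

async def rdma_read(addr, delay=0.05):
    await asyncio.sleep(delay)        # models NIC/network latency
    return f"data@{addr:#x}"

async def main():
    start = time.monotonic()
    # Issue both reads, then await both; while one coroutine waits,
    # the other makes progress on the same thread.
    results = await asyncio.gather(rdma_read(0x1000), rdma_read(0x2000))
    elapsed = time.monotonic() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
```

Run serially, the two reads would take about twice the simulated latency; gathered, the elapsed time stays close to one latency, mirroring how coroutine scheduling hides RDMA wait time.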
2.2 Write-Optimized Hash
Current write-optimized hash indexes include extendible hash [25,
30, 62] and leveling hash [9, 41, 55, 63, 64], as shown in Figure 1.
Extendible hash uses a three-level structure comprising a directory,
segments, and buckets. The directory contains a global depth
variable, which represents the length of the hash suffix used to
index segments in the directory. Each segment holds a local depth
variable, which represents the hash suffix length shared by all KVs
in that segment. A segment includes multiple buckets stored contiguously,
and the corresponding bucket is selected using another part
of the hash value to insert KVs. Each bucket has multiple entries to
hold KVs with the same hash value. When storing variable-length
KVs, each entry holds a pointer to the KV data. When a bucket is
full, a resize operation is triggered for the entire segment. During a