SepHash: A Write-Optimized Hash Index On Disaggregated Memory via Separate Segment Structure
Xinhao Min
Kai Lu
Wuhan National Laboratory for
Optoelectronics, Huazhong
University of Science and Technology
minxinhao@hust.edu.cn
kailu@hust.edu.cn
Pengyu Liu
Jiguang Wan
Changsheng Xie
Wuhan National Laboratory for
Optoelectronics, Huazhong
University of Science and Technology
pyliu@hust.edu.cn
jgwan@hust.edu.cn
cs_xie@hust.edu.cn
Daohui Wang
Ting Yao
Huatao Wu
Huawei Cloud
wangdaogui@huawei.com
yaoting17@huawei.com
wuhuatao@huawei.com
ABSTRACT
Disaggregated memory separates compute and memory resources
into independent pools connected by fast RDMA (Remote Direct
Memory Access) networks, which can improve memory utilization,
reduce cost, and enable elastic scaling of compute and memory
resources. Hash indexes provide high-performance single-point op-
erations and are widely used in distributed systems and databases.
However, under disaggregated memory, existing hash indexes suffer from write performance degradation due to high resize overhead and concurrency control overhead. Traditional write-optimized hash indexes are not efficient for disaggregated memory and sacrifice read performance.
In this paper, we propose SepHash, a write-optimized hash index for disaggregated memory. First, SepHash proposes a two-level separate segment structure that significantly reduces the bandwidth consumption of resize operations. Second, SepHash employs a low-latency concurrency control strategy to eliminate unnecessary mutual exclusion and check overhead during insert operations. Finally, SepHash designs an efficient cache and filter to accelerate read operations. The evaluation results show that, compared to state-of-the-art distributed hash indexes, SepHash achieves a 3.3× higher write performance while maintaining comparable read performance.
PVLDB Reference Format:
Xinhao Min, Kai Lu, Pengyu Liu, Jiguang Wan, Changsheng Xie, Daohui
Wang, Ting Yao, and Huatao Wu. SepHash: A Write-Optimized Hash Index
On Disaggregated Memory via Separate Segment Structure. PVLDB, 17(5):
1091 - 1104, 2024.
doi:10.14778/3641204.3641218
PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at
https://github.com/minxinhao/SepHash.
Corresponding author
This work is licensed under the Creative Commons BY-NC-ND 4.0 International
License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of
this license. For any use beyond those covered by this license, obtain permission by
emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 5 ISSN 2150-8097.
doi:10.14778/3641204.3641218
1 INTRODUCTION
Recently, disaggregated memory architecture has received wide-
spread attention from both academia [20, 26, 36, 37, 45, 57, 58] and industry (e.g., Microsoft [21], Alibaba [5, 58], IBM [17]) due
to its high resource utilization, scalability, and failure isolation ad-
vantages. Disaggregated memory decouples compute (CPU) and
memory resources from traditional monolithic servers to form in-
dependent resource pools. The compute pool contains rich CPU
resources but minimal memory resources, whereas the memory
pool contains large amounts of memory but near-zero computa-
tion power. The compute pool accesses the memory pool through
RDMA-capable networks such as InfiniBand, RoCE, and Omni-Path [15, 38, 54], which offer salient features including remote CPU bypass, low latency, and high bandwidth [14, 40, 59].
Distributed hash indexes are widely used for high-performance
data indexing services such as databases, key-value (KV) stores,
memory pool systems, and le systems [
1
,
8
,
24
,
27
,
46
]. However,
due to the near-zero computation power of memory pools, tradi-
tional solutions [14, 28, 46] based on two-sided RDMA operations cannot be applied efficiently to disaggregated memory architectures. Prior work, e.g., RACE [65], focused on designing extendible hash structures [19, 30] that are friendly to one-sided RDMA access. However, when faced with write-intensive workloads, RACE suffers from severe write performance degradation due to inefficient resize operations and high concurrency control overhead. One common way to improve the write performance of hash indexes is to introduce leveling index structures that optimize data movement, such as CLevel [9] and Plush [41]. However, these designs do not cater to disaggregated memory, leading to frequent RDMA communication induced by a large number of small reads and writes. Furthermore, the multi-level index structure significantly compromises the read
performance. Direct transplanting of these solutions is inadequate
to attain the desired performance. In summary, it is imperative to
redesign high-performance hash indexes for disaggregated memory,
yet doing so is confronted with the following challenges:
1) Signicant resize overhead. Resize operations of hash in-
dexes need to be transferred to clients due to the limited computa-
tion power of the memory nodes, resulting in significant bandwidth
consumption. Existing methods concentrate on minimizing the ag-
gregate data volume moved during each resize operation. Nonethe-
less, they utilize an entry-based transfer strategy that relocates data
at the granularity of individual entries, causing substantial network
overhead and becoming a performance bottleneck. Simultaneously,
in the context of variable-length KV payloads, transferring each
entry requires accessing the original KV, resulting in a significant increase in read amplification.
2) Concurrency control overhead. Concurrent access to the
index requires robust mechanisms for ensuring data consistency.
When multiple clients access an index concurrently, they may read
or write the same key at the same time. To avoid issues such as
insert-miss and duplicate key [9, 30, 65], current approaches gener-
ally adopt lock-based or reread-based concurrency control strate-
gies. However, both methods incur additional round-trip time (RTT)
and increase the latency of insert operations.
3) Sacricing read performance. Achieving optimal write per-
formance often requires compromising the orderliness of the data
layout, which can negatively impact read performance. For instance,
in a leveling hash structure, the read operation must search through
multiple levels, causing severe read amplification. In addition, mem-
ory constraints on compute nodes and communication overhead
make optimizing read performance challenging.
In this paper, we propose SepHash, a write-optimized hash index
for disaggregated memory. To address the above challenges, we
have introduced innovative techniques, including the separate segment structure for efficient resizing, an RTT-reduced concurrency control mechanism for reducing write latency, and efficient cache and filter structures for enhancing read performance. With these
techniques, SepHash delivers high read and write performance.
Separate segment structure. To reduce the resize overhead,
SepHash proposes a two-level separate segment structure that com-
bines the benets of extendible hash and leveling hash. The data
ow between segments is completely batch-oriented, avoiding indi-
vidual entry movements. By storing depth information (part of the
hash) and state information in each entry, SepHash reduces access
to original KVs and quickly empties old entries during resize.
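As a rough illustration of this idea (the field names and bit widths below are our assumptions for the sketch, not SepHash's actual entry format), an 8-byte slot entry can pack the KV pointer together with a few hash bits and depth/state bits, so that a resize can redistribute or invalidate entries without dereferencing the original KVs:

```cpp
#include <cstdint>

// Hypothetical slot layout: packing a KV pointer with a fingerprint (part of
// the hash) and depth/state bits lets a resize decide where an entry belongs,
// and whether it is stale, without reading the variable-length KV it points to.
// Field widths are illustrative only.
struct SlotEntry {
    uint64_t kv_ptr      : 48;  // remote address of the variable-length KV
    uint64_t fingerprint : 8;   // extra hash bits, avoids re-reading the KV on resize
    uint64_t depth       : 7;   // local depth the entry was written under
    uint64_t valid       : 1;   // state bit: cleared in bulk when a segment is emptied
};
static_assert(sizeof(SlotEntry) == 8, "entry fits one 8-byte word");
```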
RTT-reduced concurrency control. SepHash designs an RTT-
reduced concurrency control strategy that uses append write, corou-
tine, and sliding windows. This policy allows highly concurrent
access to the index without additional locking or rereading opera-
tions. All write operations can return immediately after inserting
KV pointers, without waiting for the completion of subsequent
meta-data updates, signicantly reducing write latency.
Ecient cache and lter. To improve read performance, Sep-
Hash introduces a lter for the separate segment structure and
maintains a client-side cache. These components can eectively
reduce the access granularity of RDMA operations and reduce
unnecessary access to indexes. SepHash designs a space-ecient
cache structure and proposes an update-ecient lter structure
that requires only one RDMA operation for each update.
We implement a prototype of SepHash and evaluate performance
using Micro-Benchmarks and YCSB workloads [10, 50]. Our experiments demonstrate that, under write-intensive workloads, SepHash improves write throughput by 3.3× and reduces write latency by 30% compared to state-of-the-art hash indexes. Furthermore, under read-intensive workloads, SepHash achieves 7.1× higher read
throughput compared to other leveling index structures. In sum-
mary, we have the following contributions:
• We perform an analysis of existing hash indexes on disaggregated memory and identify three factors contributing to poor write performance: entry-based resize strategies, read amplification caused by out-of-place KV, and additional RTTs caused by concurrency control strategies (§3).
• We design SepHash, a write-optimized hash index on disaggregated memory. SepHash employs a two-level structure, RTT-reduced concurrency control, and efficient cache and filter to achieve excellent performance (§4).
• We evaluate SepHash and compare it with state-of-the-art distributed hash indexes. Evaluations demonstrate that SepHash outperforms other indexes in terms of write performance while achieving balanced read performance (§5).
2 BACKGROUND
2.1 Disaggregated Memory
Disaggregated memory segregates compute and memory resources
into separate pools [21, 22, 53], which are interconnected using
high-performance RDMA networks. The compute pool has mul-
tiple CPUs (e.g., 100s) with a few GBs of DRAM. In contrast, the
memory pool has a large amount of memory (e.g., 100s–
1000s of GBs) with weak computing capabilities (e.g., 1–2 wimpy
cores) for handling communication and memory management.
RDMA networks provide RDMA READ, WRITE, and ATOMIC interfaces, e.g., compare-and-swap (CAS) and fetch-and-add (FAA) operations. RDMA operations are based on the post-poll mechanism.
Users initiate data transfer by posting work requests to the send
queue. Requests in the same send queue are executed sequentially.
By using doorbell batching [34, 43, 47], multiple RDMA operations
can be combined into a single request. These requests are then
read by the RDMA NIC, which asynchronously writes/reads data
to/from remote memory. During data transfer, the completion queue
is polled to check if the operation is complete. Waiting for network
transmissions accounts for most of RDMA's latency overhead [7],
making it suitable for optimization using coroutine techniques.
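To make the post-poll flow and doorbell batching concrete, the following is a minimal sketch using the libibverbs API. The queue pair, completion queue, keys, and remote addresses are placeholders assumed to come from connection setup, and the two-READ batch is only an illustration, not code from SepHash:

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Post two RDMA READs as one doorbell-batched request: the work requests are
// chained via wr.next, so a single ibv_post_send() call rings the doorbell once.
bool read_two_buckets(ibv_qp *qp, ibv_cq *cq,
                      char *local_buf, uint32_t lkey,
                      uint64_t remote_addr1, uint64_t remote_addr2,
                      uint32_t rkey, uint32_t bucket_size) {
    ibv_sge sge[2] = {};
    ibv_send_wr wr[2] = {}, *bad_wr = nullptr;
    for (int i = 0; i < 2; i++) {
        sge[i].addr   = reinterpret_cast<uint64_t>(local_buf) + i * bucket_size;
        sge[i].length = bucket_size;
        sge[i].lkey   = lkey;
        wr[i].wr_id   = i;
        wr[i].sg_list = &sge[i];
        wr[i].num_sge = 1;
        wr[i].opcode  = IBV_WR_RDMA_READ;
        wr[i].wr.rdma.rkey = rkey;
    }
    wr[0].wr.rdma.remote_addr = remote_addr1;
    wr[1].wr.rdma.remote_addr = remote_addr2;
    wr[0].next       = &wr[1];             // chain the two reads into one batch
    wr[1].send_flags = IBV_SEND_SIGNALED;  // only the last WR generates a completion
    if (ibv_post_send(qp, &wr[0], &bad_wr) != 0) return false;

    // Poll the completion queue until the batched request finishes.
    ibv_wc wc;
    int n;
    do { n = ibv_poll_cq(cq, 1, &wc); } while (n == 0);
    return n > 0 && wc.status == IBV_WC_SUCCESS;
}
```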
Coroutines are lightweight threads that are not managed by the
operating system but by the program. Coroutines allow a program
to pause execution at any location and resume execution later
without interrupting the entire program. This makes coroutines
well suited for asynchronous tasks like RDMA where I/O wait and
compute can be executed under the same thread.
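Below is a minimal, self-contained C++20 sketch of this pattern: a hand-rolled Task type and a round-robin scheduler stand in for a real coroutine framework, and completion_arrived is a stub for polling the RDMA completion queue. None of these names come from SepHash.

```cpp
#include <coroutine>
#include <deque>
#include <cstdio>

// Minimal coroutine task: it suspends at every co_await and is resumed by a
// user-level scheduler, so one OS thread can interleave many in-flight requests.
struct Task {
    struct promise_type {
        Task get_return_object() {
            return {std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
    std::coroutine_handle<promise_type> handle;
};

// Stub for "the RDMA completion for this request has arrived".
bool completion_arrived(int /*req_id*/) { return true; /* would poll the CQ here */ }

Task lookup(int req_id) {
    // 1) post the one-sided RDMA READ (omitted), then yield instead of blocking
    while (!completion_arrived(req_id))
        co_await std::suspend_always{};
    // 2) completion arrived: parse the bucket and finish
    std::printf("request %d done\n", req_id);
}

int main() {
    // Round-robin scheduler: resume each coroutine until all have finished,
    // overlapping the network wait of one request with the work of another.
    std::deque<std::coroutine_handle<>> ready;
    for (int i = 0; i < 4; i++) ready.push_back(lookup(i).handle);
    while (!ready.empty()) {
        auto h = ready.front(); ready.pop_front();
        h.resume();
        if (h.done()) h.destroy(); else ready.push_back(h);
    }
}
```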
2.2 Write Optimized Hash
Current write-optimized hash indexes include extendible hash [25, 30, 62] and leveling hash [9, 41, 55, 63, 64], as shown in Figure 1.
Extendible hash uses a three-level structure comprising a direc-
tory, segments, and buckets. The directory contains a global depth
variable, which represents the length of the hash suffix used to index segments in the directory. Each segment holds a local depth variable, which represents the hash suffix length shared by all KVs
in that segment. A segment includes multiple buckets stored contin-
uously, and the corresponding bucket is selected using another part
of the hash value to insert KVs. Each bucket has multiple entries to
hold KVs with the same hash value. When storing variable-length
KVs, each entry holds a pointer to the KV data. When a bucket is
full, a resize operation is triggered for the entire segment. During a