SepHash: A Write-Optimized Hash Index On Disaggregated Memory via Separate Segment Structure
Xinhao Min
Kai Lu
Wuhan National Laboratory for
Optoelectronics, Huazhong
University of Science and Technology
minxinhao@hust.edu.cn
kailu@hust.edu.cn
Pengyu Liu
Jiguang Wan
Changsheng Xie
Wuhan National Laboratory for
Optoelectronics, Huazhong
University of Science and Technology
pyliu@hust.edu.cn
jgwan@hust.edu.cn
cs_xie@hust.edu.cn
Daohui Wang
Ting Yao
Huatao Wu
Huawei Cloud
wangdaogui@huawei.com
yaoting17@huawei.com
wuhuatao@huawei.com
ABSTRACT
Disaggregated memory separates compute and memory resources
into independent pools connected by fast RDMA (Remote Direct
Memory Access) networks, which can improve memory utilization,
reduce cost, and enable elastic scaling of compute and memory
resources. Hash indexes provide high-performance single-point op-
erations and are widely used in distributed systems and databases.
However, under disaggregated memory, existing hash indexes suffer from write performance degradation due to high resize overhead and concurrency control overhead. Traditional write-optimized hash indexes are not efficient for disaggregated memory and sacrifice read performance.
In this paper, we propose SepHash, a write-optimized hash index for disaggregated memory. First, SepHash proposes a two-level separate segment structure that significantly reduces the bandwidth consumption of resize operations. Second, SepHash employs a low-latency concurrency control strategy to eliminate unnecessary mutual exclusion and check overhead during insert operations. Finally, SepHash designs an efficient cache and filter to accelerate read operations. The evaluation results show that, compared to state-of-the-art distributed hash indexes, SepHash achieves a 3.3× higher write performance while maintaining comparable read performance.
PVLDB Reference Format:
Xinhao Min, Kai Lu, Pengyu Liu, Jiguang Wan, Changsheng Xie, Daohui
Wang, Ting Yao, and Huatao Wu. SepHash: A Write-Optimized Hash Index
On Disaggregated Memory via Separate Segment Structure. PVLDB, 17(5):
1091 - 1104, 2024.
doi:10.14778/3641204.3641218
PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at
https://github.com/minxinhao/SepHash.
Corresponding author
This work is licensed under the Creative Commons BY-NC-ND 4.0 International
License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of
this license. For any use beyond those covered by this license, obtain permission by
emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 5 ISSN 2150-8097.
doi:10.14778/3641204.3641218
1 INTRODUCTION
Recently, disaggregated memory architecture has received wide-
spread attention from both academia [20, 26, 36, 37, 45, 57, 58] and industry (e.g., Microsoft [21], Alibaba [5, 58], IBM [17]) due
to its high resource utilization, scalability, and failure isolation ad-
vantages. Disaggregated memory decouples compute (CPU) and
memory resources from traditional monolithic servers to form in-
dependent resource pools. The compute pool contains rich CPU
resources but minimal memory resources, whereas the memory
pool contains large amounts of memory but near-zero computa-
tion power. The compute pool accesses the memory pool through
RDMA-capable networks such as InfiniBand, RoCE, and Omni-Path [15, 38, 54], which offer salient features including remote CPU bypass, low latency, and high bandwidth [14, 40, 59].
Distributed hash indexes are widely used for high-performance
data indexing services such as databases, key-value (KV) stores,
memory pool systems, and le systems [
1
,
8
,
24
,
27
,
46
]. However,
due to the near-zero computation power of memory pools, tradi-
tional solutions [14, 28, 46] based on two-sided RDMA operations cannot be applied efficiently to disaggregated memory architectures. Prior work, e.g., RACE [65], focused on designing extendible hash structures [19, 30] that are friendly to one-sided RDMA access. However, when faced with write-intensive workloads, RACE suffers from severe write performance degradation due to inefficient resize operations and high concurrency control overhead. One common way to improve the write performance of hash indexes is to introduce leveling index structures that optimize data movement, such as CLevel [9] and Plush [41]. However, these designs do not cater to disaggregated memory, leading to frequent RDMA communication induced by a large number of small reads and writes. Furthermore, the multi-level index structure significantly compromises the read
performance. Direct transplanting of these solutions is inadequate
to attain the desired performance. In summary, it is imperative to
redesign high-performance hash indexes for disaggregated memory,
yet doing so is confronted with the following challenges:
1) Signicant resize overhead. Resize operations of hash in-
dexes need to be transferred to clients due to the limited computa-
tion power of the memory nodes, resulting in significant bandwidth
consumption. Existing methods concentrate on minimizing the ag-
gregate data volume moved during each resize operation. Nonethe-
less, they utilize an entry-based transfer strategy that relocates data
at the granularity of individual entries, causing substantial network
overhead and becoming a performance bottleneck. Simultaneously,
in the context of variable-length KV payloads, transferring each
entry requires accessing the original KV, resulting in a significant increase in read amplification.
2) Concurrency control overhead. Concurrent access to the
index requires robust mechanisms for ensuring data consistency.
When multiple clients access an index concurrently, they may read
or write the same key at the same time. To avoid issues such as
insert-miss and duplicate key [9, 30, 65], current approaches gener-
ally adopt lock-based or reread-based concurrency control strate-
gies. However, both methods incur additional round-trip time (RTT)
and increase the latency of insert operations.
3) Sacricing read performance. Achieving optimal write per-
formance often requires compromising the orderliness of the data
layout, which can negatively impact read performance. For instance,
in a leveling hash structure, the read operation must search through
multiple levels, causing severe read amplification. In addition, mem-
ory constraints on compute nodes and communication overhead
make optimizing read performance challenging.
In this paper, we propose SepHash, a write-optimized hash index
for disaggregated memory. To address the above challenges, we
have introduced innovative techniques, including the separate segment structure for efficient resizing, an RTT-reduced concurrency control mechanism for reducing write latency, and efficient cache and filter structures for enhancing read performance. With these
techniques, SepHash delivers high read and write performance.
Separate segment structure. To reduce the resize overhead,
SepHash proposes a two-level separate segment structure that com-
bines the benets of extendible hash and leveling hash. The data
ow between segments is completely batch-oriented, avoiding indi-
vidual entry movements. By storing depth information (part of the
hash) and state information in each entry, SepHash reduces access
to original KVs and quickly empties old entries during resize.
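As a rough illustration of this idea (the field names and bit widths below are our assumptions for the sketch, not SepHash's actual entry format), an 8-byte slot entry can pack the KV pointer together with a few hash bits and depth/state bits, so that a resize can redistribute or invalidate entries without dereferencing the original KVs:

```cpp
#include <cstdint>

// Hypothetical slot layout: packing a KV pointer with a fingerprint (part of
// the hash) and depth/state bits lets a resize decide where an entry belongs,
// and whether it is stale, without reading the variable-length KV it points to.
// Field widths are illustrative only.
struct SlotEntry {
    uint64_t kv_ptr      : 48;  // remote address of the variable-length KV
    uint64_t fingerprint : 8;   // extra hash bits, avoids re-reading the KV on resize
    uint64_t depth       : 7;   // local depth the entry was written under
    uint64_t valid       : 1;   // state bit: cleared in bulk when a segment is emptied
};
static_assert(sizeof(SlotEntry) == 8, "entry fits one 8-byte word");
```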
RTT-reduced concurrency control. SepHash designs an RTT-
reduced concurrency control strategy that uses append write, corou-
tine, and sliding windows. This policy allows highly concurrent
access to the index without additional locking or rereading opera-
tions. All write operations can return immediately after inserting
KV pointers, without waiting for the completion of subsequent
meta-data updates, signicantly reducing write latency.
Ecient cache and lter. To improve read performance, Sep-
Hash introduces a lter for the separate segment structure and
maintains a client-side cache. These components can eectively
reduce the access granularity of RDMA operations and reduce
unnecessary access to indexes. SepHash designs a space-ecient
cache structure and proposes an update-ecient lter structure
that requires only one RDMA operation for each update.
We implement a prototype of SepHash and evaluate performance
using Micro-Benchmarks and YCSB workloads [10, 50]. Our experiments demonstrate that, under write-intensive workloads, SepHash improves write throughput by 3.3× and reduces write latency by 30% compared to state-of-the-art hash indexes. Furthermore, under read-intensive workloads, SepHash achieves 7.1× higher read
throughput compared to other leveling index structures. In sum-
mary, we have the following contributions:
• We perform an analysis of existing hash indexes on disaggregated memory and identify three factors contributing to poor write performance: entry-based resize strategies, read amplification caused by out-of-place KV, and additional RTTs caused by concurrency control strategies (§3).
• We design SepHash, a write-optimized hash index on disaggregated memory. SepHash employs a two-level structure, RTT-reduced concurrency control, and efficient cache and filter to achieve excellent performance (§4).
• We evaluate SepHash and compare it with state-of-the-art distributed hash indexes. Evaluations demonstrate that SepHash outperforms other indexes in terms of write performance while achieving balanced read performance (§5).
2 BACKGROUND
2.1 Disaggregated Memory
Disaggregated memory segregates compute and memory resources
into separate pools [21, 22, 53], which are interconnected using
high-performance RDMA networks. The compute pool has mul-
tiple CPUs (e.g., 100s) with a few GBs of DRAM. In contrast, the
memory pool has a large amount of memory (e.g., 100s–
1000s of GBs) with weak computing capabilities (e.g., 1–2 wimpy
cores) for handling communication and memory management.
RDMA networks provide RDMA READ, WRITE, and ATOMIC interfaces, e.g., compare-and-swap (CAS) and fetch-and-add (FAA) operations. RDMA operations are based on the post-poll mechanism.
Users initiate data transfer by posting work requests to the send
queue. Requests in the same send queue are executed sequentially.
By using doorbell batching [34, 43, 47], multiple RDMA operations
can be combined into a single request. These requests are then
read by the RDMA NIC, which asynchronously writes/reads data
to/from remote memory. During data transfer, the completion queue
is polled to check if the operation is complete. Waiting for network
transmissions accounts for most of RDMA's latency overhead [7],
making it suitable for optimization using coroutine techniques.
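To make the post-poll flow and doorbell batching concrete, the following is a minimal sketch using the libibverbs API. The queue pair, completion queue, keys, and remote addresses are placeholders assumed to come from connection setup, and the two-READ batch is only an illustration, not code from SepHash:

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Post two RDMA READs as one doorbell-batched request: the work requests are
// chained via wr.next, so a single ibv_post_send() call rings the doorbell once.
bool read_two_buckets(ibv_qp *qp, ibv_cq *cq,
                      char *local_buf, uint32_t lkey,
                      uint64_t remote_addr1, uint64_t remote_addr2,
                      uint32_t rkey, uint32_t bucket_size) {
    ibv_sge sge[2] = {};
    ibv_send_wr wr[2] = {}, *bad_wr = nullptr;
    for (int i = 0; i < 2; i++) {
        sge[i].addr   = reinterpret_cast<uint64_t>(local_buf) + i * bucket_size;
        sge[i].length = bucket_size;
        sge[i].lkey   = lkey;
        wr[i].wr_id   = i;
        wr[i].sg_list = &sge[i];
        wr[i].num_sge = 1;
        wr[i].opcode  = IBV_WR_RDMA_READ;
        wr[i].wr.rdma.rkey = rkey;
    }
    wr[0].wr.rdma.remote_addr = remote_addr1;
    wr[1].wr.rdma.remote_addr = remote_addr2;
    wr[0].next       = &wr[1];             // chain the two reads into one batch
    wr[1].send_flags = IBV_SEND_SIGNALED;  // only the last WR generates a completion
    if (ibv_post_send(qp, &wr[0], &bad_wr) != 0) return false;

    // Poll the completion queue until the batched request finishes.
    ibv_wc wc;
    int n;
    do { n = ibv_poll_cq(cq, 1, &wc); } while (n == 0);
    return n > 0 && wc.status == IBV_WC_SUCCESS;
}
```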
Coroutines are lightweight threads that are not managed by the
operating system but by the program. Coroutines allow a program
to pause execution at any location and resume execution later
without interrupting the entire program. This makes coroutines
well suited for asynchronous tasks like RDMA where I/O wait and
compute can be executed under the same thread.
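Below is a minimal, self-contained C++20 sketch of this pattern: a hand-rolled Task type and a round-robin scheduler stand in for a real coroutine framework, and completion_arrived is a stub for polling the RDMA completion queue. None of these names come from SepHash.

```cpp
#include <coroutine>
#include <deque>
#include <cstdio>

// Minimal coroutine task: it suspends at every co_await and is resumed by a
// user-level scheduler, so one OS thread can interleave many in-flight requests.
struct Task {
    struct promise_type {
        Task get_return_object() {
            return {std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
    std::coroutine_handle<promise_type> handle;
};

// Stub for "the RDMA completion for this request has arrived".
bool completion_arrived(int /*req_id*/) { return true; /* would poll the CQ here */ }

Task lookup(int req_id) {
    // 1) post the one-sided RDMA READ (omitted), then yield instead of blocking
    while (!completion_arrived(req_id))
        co_await std::suspend_always{};
    // 2) completion arrived: parse the bucket and finish
    std::printf("request %d done\n", req_id);
}

int main() {
    // Round-robin scheduler: resume each coroutine until all have finished,
    // overlapping the network wait of one request with the work of another.
    std::deque<std::coroutine_handle<>> ready;
    for (int i = 0; i < 4; i++) ready.push_back(lookup(i).handle);
    while (!ready.empty()) {
        auto h = ready.front(); ready.pop_front();
        h.resume();
        if (h.done()) h.destroy(); else ready.push_back(h);
    }
}
```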
2.2 Write Optimized Hash
Current write-optimized hash indexes include extendible hash [25, 30, 62] and leveling hash [9, 41, 55, 63, 64], as shown in Figure 1.
Extendible hash uses a three-level structure comprising a direc-
tory, segments, and buckets. The directory contains a global depth
variable, which represents the length of the hash suffix used to index segments in the directory. Each segment holds a local depth variable, which represents the hash suffix length shared by all KVs
in that segment. A segment includes multiple buckets stored contin-
uously, and the corresponding bucket is selected using another part
of the hash value to insert KVs. Each bucket has multiple entries to
hold KVs with the same hash value. When storing variable-length
KVs, each entry holds a pointer to the KV data. When a bucket is
full, a resize operation is triggered for the entire segment. During a