
Since each update is conducted along an edge (i.e., from frontier to target vertex), work efficiency is also quantified by the number of processed edges [33]. Existing systems (e.g., Galois) achieve work efficiency with a vertex-centric execution model. The observation is that frontiers make different contributions to algorithm progress, and thus work-efficient graph algorithms (e.g., SSSP, k-core, and WCC) determine a priority value for each frontier to prioritize promising frontiers that accelerate algorithm progress. As shown in Figure 1(a), the vertex-centric model uses a queue to manage the frontiers, and the highest-priority frontier (e.g., 𝐴) is obtained from the queue each time. The neighbors of the dequeued frontier (e.g., 𝐵 and 𝐸) are updated, and if a neighbor meets the condition specified by the algorithm after the update, it is inserted into the frontier queue, and its lower-priority instances will be pruned.
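To make this flow concrete, below is a minimal C++ sketch of priority-driven, vertex-centric SSSP in the spirit of Figure 1(a). The CSR arrays, the `INF` constant, and the lazy-deletion pruning are our illustrative choices, not the actual implementation of Galois or any specific system.

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

constexpr uint32_t INF = std::numeric_limits<uint32_t>::max();

// Vertex-centric SSSP: the queue orders frontiers by tentative distance,
// so the most promising frontier (smallest distance) is dequeued first.
void sssp_vertex_centric(const std::vector<uint32_t>& offsets,    // CSR row offsets (n + 1 entries)
                         const std::vector<uint32_t>& neighbors,  // CSR column indices
                         const std::vector<uint32_t>& weights,    // per-edge weights
                         std::vector<uint32_t>& dist, uint32_t source) {
  using Entry = std::pair<uint32_t, uint32_t>;  // (priority = distance, vertex)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
  dist.assign(offsets.size() - 1, INF);
  dist[source] = 0;
  pq.push({0, source});
  while (!pq.empty()) {
    auto [d, u] = pq.top();
    pq.pop();
    if (d != dist[u]) continue;  // prune stale, lower-priority instances lazily
    for (uint32_t e = offsets[u]; e < offsets[u + 1]; ++e) {
      uint32_t v = neighbors[e];
      uint32_t nd = d + weights[e];
      if (nd < dist[v]) {  // the algorithm-specified update condition
        dist[v] = nd;      // random access into dist[] -- the cache-efficiency problem
        pq.push({nd, v});  // re-insert; the stale entry is skipped above
      }
    }
  }
}
```

The `if (d != dist[u])` check realizes the pruning of lower-priority instances: stale queue entries are skipped when dequeued rather than removed eagerly.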
Cache efficiency is quantified by the ratio of the CPU stall time caused by cache misses over the algorithm execution time (called memory bound [10]) [46, 47]. Since the neighbors of a frontier are usually not consecutive in memory, the neighbor update operation of graph algorithms essentially conducts random memory access. Random access causes both read amplification and cache misses, which result in a high memory bound: a float or integer vertex state is fetched via a 64-byte cache-line load (read amplification), while cache misses happen because the automatic hardware prefetcher of the CPU only loads consecutive data. Existing systems (e.g., GPOP) improve cache efficiency with a partition-centric execution model. In particular, the vertex states are organized into partitions that fit in the L2 or L3 cache, and processing is conducted at partition granularity. As shown in Figure 1(b), the partition-centric model involves a scatter phase and a gather phase. In the scatter phase, the frontiers of each partition are copied to the partitions where their neighboring vertices reside. For instance, the neighbors of frontier 𝐴 span two partitions, so 𝐴 is copied to both partitions. In the gather phase, each partition collects the states from the scatter phase, updates the vertices within the partition, and gets the frontier for the next iteration. By ensuring that the working partition resides in cache, the partition-centric model avoids memory stall and achieves high cache efficiency.
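The scatter/gather flow of Figure 1(b) can be sketched as follows. The fixed partition width, the message struct, and the min-based merge are assumptions for illustration and do not reproduce GPOP's actual data layout.

```cpp
#include <cstdint>
#include <vector>

struct Msg { uint32_t dst; uint32_t val; };  // update message produced by the scatter phase

// One iteration of partition-centric processing (SSSP-style min updates).
// Vertex states are split into fixed-width partitions sized to fit in the
// L2/L3 cache, so the gather phase touches one cache-resident partition at a time.
void partition_centric_iter(const std::vector<uint32_t>& offsets,
                            const std::vector<uint32_t>& neighbors,
                            const std::vector<uint32_t>& weights,
                            std::vector<uint32_t>& dist,
                            std::vector<uint32_t>& frontier,
                            uint32_t verts_per_part) {
  size_t nparts = (dist.size() + verts_per_part - 1) / verts_per_part;
  std::vector<std::vector<Msg>> bins(nparts);  // per-partition message bins
  // Scatter: each frontier vertex sends a message to the bin of every partition
  // holding one of its neighbors (frontier A spans two partitions, so two bins).
  for (uint32_t u : frontier)
    for (uint32_t e = offsets[u]; e < offsets[u + 1]; ++e) {
      uint32_t v = neighbors[e];
      bins[v / verts_per_part].push_back({v, dist[u] + weights[e]});
    }
  // Gather: drain each bin while its slice of dist[] stays in cache, and
  // collect the frontier for the next iteration (duplicates left in for brevity).
  std::vector<uint32_t> next;
  for (auto& bin : bins)
    for (const Msg& m : bin)
      if (m.val < dist[m.dst]) {
        dist[m.dst] = m.val;
        next.push_back(m.dst);
      }
  frontier.swap(next);
}
```

Note that the gather loop processes every buffered message regardless of priority, which is precisely why this model forfeits work efficiency.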
Table 1 shows the trade-off between cache efficiency and work efficiency in existing graph frameworks. In particular, GPOP has low memory stall but poor work efficiency. This is because its partition-centric model batches computation by partition to share data access and cannot follow the priority order to process individual frontiers. Galois conducts less work but has high memory stall. This is because its vertex-centric model processes each frontier individually and cannot share data access among the frontiers as in the partition-centric model. This fundamental trade-off leads us to the following question: is it possible to architect a graph algorithm framework that achieves both cache and work efficiency?

Table 1: Execution statistics of representative graph frameworks and our CoroGraph. The algorithm is SSSP, and the graph is LiveJournal. GPOP targets cache efficiency while Galois targets work efficiency. For both memory bound and processed edges (i.e., # of edges), smaller is better.

System       Memory bound   # of edges (M)   Time (ms)
Ligra        65.3%          569              1270
Gemini       55.2%          573              954
GraphIt      60.5%          145              782
GPOP         32.9%          604              594
Galois       70.2%          89               513
CoroGraph    28.3%          68               262
1.2 Our Solution: CoroGraph
Our initial design is to reduce memory stall for the vertex-centric model with a coroutine-based prefetch pipeline. In particular, software prefetch allows programs to specify the data to load [9], and coroutines are lightweight functions that can yield and resume with low overhead [19]. In our pipeline, processing tasks are conducted
by coroutines, which issue prefetch instructions, switch to the computation tasks of other coroutines, and resume for computation after the required data are fetched to cache. This design overlaps memory access and computation, but the performance is unsatisfactory for two reasons. First, memory stall dominates execution time for the vertex-centric model (over 70% for Galois in Table 1), and thus the gain is limited even if memory access and computation are perfectly overlapped (e.g., the max speedup for Galois is 100/70 ≈ 1.43). Second, the threads have data conflicts when processing different frontiers in parallel because the frontiers share common neighbors, and these data conflicts invalidate prefetched data.
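The pipeline can be pictured with the hand-rolled stand-in below: real coroutines are replaced by a fixed set of in-flight "streams" that each issue a prefetch, yield to the others, and resume once the line should be cached. The stream count and the write hint are illustrative tuning assumptions, not values from the paper.

```cpp
#include <cstdint>
#include <vector>

// Hand-rolled stand-in for the coroutine pipeline: kStreams in-flight "streams"
// each issue a prefetch for the vertex state they need next, yield to the other
// streams, and resume once the cache line has (hopefully) arrived.
constexpr int kStreams = 8;  // illustrative depth: roughly memory latency / work per edge

void relax_edges_pipelined(const std::vector<uint32_t>& dst,     // target vertex per update
                           const std::vector<uint32_t>& newval,  // proposed value per update
                           std::vector<uint32_t>& dist) {
  size_t n = dst.size(), head[kStreams];
  for (int s = 0; s < kStreams; ++s) {
    head[s] = s;
    if (head[s] < n) __builtin_prefetch(&dist[dst[head[s]]], 1);  // issue prefetch, then "yield"
  }
  for (bool active = true; active;) {
    active = false;
    for (int s = 0; s < kStreams; ++s) {  // round-robin "scheduler" over the streams
      size_t i = head[s];
      if (i >= n) continue;
      active = true;
      if (newval[i] < dist[dst[i]]) dist[dst[i]] = newval[i];  // "resume": data is now cached
      head[s] = i + kStreams;                                  // advance this stream
      if (head[s] < n) __builtin_prefetch(&dist[dst[head[s]]], 1);
    }
  }
}
```

A real coroutine version lets arbitrary per-vertex code yield at the prefetch point with low switch overhead; the round-robin above only mimics that scheduling for a flat update array.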
To tackle the problems above, CoroGraph adopts a novel hybrid execution model that combines the benefits of partition-centric and vertex-centric execution. In particular, like vertex-centric execution, the frontiers are managed by a priority queue and processed in the order of their contributions to algorithm progress, resulting in good work efficiency. Meanwhile, like partition-centric execution, updates to vertex states are committed at partition granularity. To accommodate the two mismatched patterns, the frontiers are processed in a scatter phase that generates update messages, while the actual updates are conducted in a gather phase. A single thread is assigned to process all updates to a partition, which results in good cache efficiency by sharing data access among update operations. The thread-to-partition update pattern also benefits prefetch by eliminating the data conflicts that invalidate prefetched data.
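Reusing the CSR layout and SSSP example from the earlier sketches, the hybrid model can be condensed as follows. In CoroGraph the gather side runs on per-partition owner threads; we fold it into a sequential loop here for brevity, so this is a sketch of the control flow, not the system's implementation.

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

constexpr uint32_t INF = std::numeric_limits<uint32_t>::max();

// Hybrid sketch: frontiers leave a priority queue in contribution order
// (work efficiency, as in vertex-centric execution), but their updates are
// only buffered as messages; each partition's messages are then committed
// together, which the real system does on the partition's single owner
// thread (cache efficiency, and no conflicts that invalidate prefetched data).
void hybrid_sssp(const std::vector<uint32_t>& offsets, const std::vector<uint32_t>& neighbors,
                 const std::vector<uint32_t>& weights, std::vector<uint32_t>& dist,
                 uint32_t source, uint32_t verts_per_part) {
  using Entry = std::pair<uint32_t, uint32_t>;  // (distance, vertex)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
  dist.assign(offsets.size() - 1, INF);
  size_t nparts = (dist.size() + verts_per_part - 1) / verts_per_part;
  std::vector<std::vector<Entry>> msgs(nparts);  // (value, target) bins per partition
  dist[source] = 0;
  pq.push({0, source});
  while (!pq.empty()) {
    auto [d, u] = pq.top();  // scatter: pop the most promising frontier
    pq.pop();
    if (d != dist[u]) continue;  // prune stale instances, as before
    for (uint32_t e = offsets[u]; e < offsets[u + 1]; ++e) {
      uint32_t v = neighbors[e];
      msgs[v / verts_per_part].push_back({d + weights[e], v});
    }
    // Gather: commit each partition's buffered messages as one batch
    // (a real implementation batches many scatters before draining).
    for (auto& bin : msgs) {
      for (auto [val, v] : bin)
        if (val < dist[v]) { dist[v] = val; pq.push({val, v}); }
      bin.clear();
    }
  }
}
```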
Besides the hybrid execution model, we tailor a suite of optimizations in CoroGraph. First, we use the aforementioned coroutine-based prefetch pipeline to overlap memory access with computation. Second, we propose a lightweight but effective strategy to switch between prefetch and direct data access, because prefetch becomes slower than direct access as a thread gradually warms the cache by processing a partition. Third, since the threads are blocked when waiting to access shared data structures, we adopt a flexible synchronization strategy, which allows the threads to proceed with other tasks instead of blocking. In addition, we adjust the graph storage format to reduce cache misses and consider the NUMA architecture of modern processors when scheduling the tasks and configuring the prefetch pipeline.
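The switching rule itself is not spelled out in this section, so the sketch below is only one plausible, lightweight realization: treat the number of updates already applied to the current partition as a proxy for cache warmth and disable prefetching past a threshold. The struct name and the threshold are our assumptions.

```cpp
#include <cstdint>

// Purely illustrative prefetch/direct-access switch: use the number of updates
// already applied to the current partition as a proxy for cache warmth, and
// stop issuing prefetches once the partition is likely cache-resident.
struct PrefetchSwitch {
  uint32_t processed = 0;
  uint32_t warm_threshold;  // assumed tuning knob, e.g. on the order of the partition size
  explicit PrefetchSwitch(uint32_t threshold) : warm_threshold(threshold) {}
  bool use_prefetch() const { return processed < warm_threshold; }
  void on_update() { ++processed; }  // call once per committed update
};

// Usage inside a gather loop (sketch):
//   if (sw.use_prefetch()) __builtin_prefetch(&dist[v], 1);
//   ...apply the update to dist[v]...
//   sw.on_update();
```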
We experiment with 4 popular graph algorithms (i.e., SSSP, k-core, PR, and WCC) and compare our CoroGraph with 5 state-of-the-art graph algorithm frameworks (i.e., Ligra, Gemini, GraphIt, GPOP, and Galois).