
to the varying computational demands of different graph
regions, to support high-performance graph computing work-
loads. Furthermore, as a DB technique for AI acceleration,
we integrate the hybrid SpMM kernel into the GNN training
pipeline to enhance the training efficiency and demonstrate its
effectiveness in complex graph computing scenarios.
However, crafting a hybrid SpMM kernel using CUDA
cores and Tensor cores based on the characteristics of graph
data and systematically integrating it into the GNN training
pipeline poses non-trivial challenges.
Challenges of designing hybrid SpMM kernel. (1) Ef-
fectively partitioning the adjacency matrix into sub-regions
with distinct sparsity characteristics to leverage the different
cores for collaborative computation presents a considerable
challenge. (2) The computational characteristics of CUDA and Tensor cores differ significantly and are further influenced by graph features, making the selection of the optimal core for matrices with different sparsity levels crucial for enhancing
efficiency. (3) Accurately modeling the computational perfor-
mance of CUDA and Tensor cores for SpMM to facilitate
precise core selection is another major difficulty. (4) Last but
not least, inefficiencies inherent in the SpMM kernel, such as
suboptimal memory access patterns and underutilized threads,
limit the overall performance. To address these issues, we
first propose a fine-grained partition strategy, which divides
the adjacency matrix into equal-sized submatrices along the
horizontal axis, allocating each submatrix to the appropriate
GPU cores for efficient computation (§ IV-A). This allows
CUDA and Tensor cores to perform calculations indepen-
dently, eliminating the need to merge results between cores.
Second, comprehensive quantitative experiments reveal that CUDA cores are memory-efficient while Tensor cores are compute-efficient (§ IV-B). These experiments identify two pivotal factors for characterizing a submatrix: its sparsity and its number of non-zero columns, which respectively dominate the most expensive part of the workload on CUDA cores (computation) and on Tensor cores (memory access). Third, leveraging these factors,
we develop an algorithm tailored for the selection of appro-
priate GPU cores for submatrices, optimizing computational
capability (§ IV-B). Finally, we conduct in-depth optimiza-
tions of the SpMM kernel, considering thread collaboration
mode (§ IV-D1) and memory access patterns (§ IV-D2).
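To illustrate these ideas, the following sketch is a simplified stand-in rather than the actual HC-SpMM implementation (the window height, the sparsity threshold, and all function names are illustrative assumptions): it partitions a CSR adjacency matrix into equal-height row windows, measures each window's sparsity and number of non-zero columns, and picks a core type with a single threshold rule, whereas the selection algorithm of § IV-B makes this decision based on both factors.

import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

def partition_row_windows(adj: csr_matrix, window_height: int = 16):
    # Split the adjacency matrix into equal-height row windows (submatrices).
    return [adj[r:r + window_height] for r in range(0, adj.shape[0], window_height)]

def window_features(window: csr_matrix):
    # Statistics that characterize a submatrix: sparsity and #non-zero columns.
    sparsity = 1.0 - window.nnz / (window.shape[0] * window.shape[1])
    nz_cols = np.unique(window.indices).size  # distinct columns holding non-zeros
    return sparsity, nz_cols

def select_core(window: csr_matrix, sparsity_threshold: float = 0.995) -> str:
    # Illustrative rule only: extremely sparse windows go to CUDA cores,
    # denser windows go to Tensor cores.
    sparsity, _ = window_features(window)
    return "CUDA" if sparsity > sparsity_threshold else "Tensor"

# Toy example on a random sparse matrix standing in for a graph adjacency matrix.
adj = sparse_random(128, 128, density=0.02, format="csr", random_state=0)
plan = [(i, select_core(w)) for i, w in enumerate(partition_row_windows(adj))]
print(plan[:4])
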
Challenges of integrating the SpMM kernel into GNNs.
When integrating our proposed hybrid SpMM kernel into the
GNN training pipeline, new challenges arise. (1) The
isolation among GPU kernels within a GNN layer in prevalent
GNN training frameworks [43], [47] impedes data reuse,
leads to additional memory access overhead, and introduces
significant kernel launch overhead. (2) Tensor cores deliver significantly higher throughput than CUDA cores, potentially offering substantial efficiency gains. However, real-world graph
layouts inherently exhibit irregularity and sparsity, resulting in
a majority of segments partitioned from the adjacency matrix
being sparse and less amenable to processing via Tensor cores.
Consequently, the performance gains achievable with Tensor
cores are often negligible [47]. To tackle the first problem,
we discover opportunities to reuse data in the shared memory
of GPU and present a kernel fusion strategy to mitigate
kernel launch costs and global memory access (§ V-A). To
address the second problem, we first introduce a metric termed computation intensity to estimate the computation workload of a submatrix multiplication (§ V-B), defined as the ratio of the number of non-zero elements to the number of non-zero columns. A submatrix with more non-zero elements and fewer non-zero columns therefore has higher computation intensity. Subsequently, we propose an efficient algorithm to reconstruct submatrices, adjusting the graph layout to obtain denser segments that are better suited to Tensor cores, thereby realizing significant efficiency gains from them (§ V-B).
This adjustment has a relatively small cost compared with
GNN training but renders the graph data more compatible
with hybrid GPU cores, unlocking the full computational
potential of Tensor cores and thereby significantly enhancing
the efficiency of GNN training (§ VI-C3).
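To ground this metric, the short sketch below (again our own illustrative code, not the reconstruction algorithm of § V-B) computes the computation intensity of two toy row windows that hold the same number of non-zero elements; the layout reconstruction aims to move submatrices toward the first, column-concentrated case, which Tensor cores handle efficiently.

import numpy as np
from scipy.sparse import csr_matrix

def computation_intensity(window: csr_matrix) -> float:
    # Computation intensity = #non-zero elements / #non-zero columns.
    # Packing more non-zeros into fewer columns yields more work per loaded column.
    nz_cols = np.unique(window.indices).size
    return window.nnz / max(nz_cols, 1)

# Two toy 4x8 row windows, each holding 8 non-zero elements.
# `concentrated` packs them into 2 columns; `scattered` spreads them over 8 columns.
concentrated = csr_matrix(np.array([[1, 1, 0, 0, 0, 0, 0, 0]] * 4))
scattered = csr_matrix(np.eye(4, 8) + np.eye(4, 8, k=4))

print(computation_intensity(concentrated))  # 8 / 2 = 4.0
print(computation_intensity(scattered))     # 8 / 8 = 1.0
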
Contributions. In this work, we propose HC-SpMM, a novel
approach for accelerating SpMM using hybrid GPU cores. Our
key contributions are summarized as follows:
• We quantify the difference between CUDA and Tensor
cores in SpMM, and propose a hybrid SpMM kernel,
which partitions the graph adjacency matrix and intelli-
gently selects appropriate GPU cores for the computation
of each submatrix based on its characteristics. We further
optimize the SpMM kernel considering thread collabora-
tion mode and memory access patterns (§ IV).
• We propose a kernel fusion strategy for integrating HC-
SpMM into the GNN training pipeline, eliminating kernel
launch time and enabling efficient data reuse, and present
a lightweight graph layout optimization algorithm to
enhance irregular graph layouts and better align with the
characteristics of both GPU cores (§ V).
• We conduct comprehensive evaluations demonstrating
that HC-SpMM outperforms existing methods, achieving
1.33× speedup in SpMM and 1.23× speedup in GNN
training on average (§ VI).
The rest of this paper is organized as follows. Section II
reviews the related work. Section III gives the preliminaries.
Section IV presents the design of the hybrid SpMM kernel.
Section V describes the optimization of integrating the SpMM
kernel into GNNs. Section VI presents the experimental results.
We conclude the paper in Section VII.
II. RELATED WORK
SpMM Using CUDA Cores. Optimization of SpMM has
been a subject of extensive study [3], [20], [50], [56], [65].
In recent years, Yang et al. [56] leveraged merge-based load
balancing and row-major coalesced memory access strategies
to accelerate SpMM. Hong et al. [20] designed an adap-
tive tiling strategy to enhance the performance of SpMM.
cuSPARSE [32] offers high-performance SpMM kernels and
has been integrated into numerous GNN training frameworks
such as DGL [43]. Gale et al. point out that cuSPARSE
is efficient only for matrices with sparsity exceeding 98%.