
Lightweight and Accurate Cardinality Estimation
by Neural Network Gaussian Process
Kangfei Zhao, Jerey Xu Yu, Zongyan He, Rui Li, Hao Zhang
The Chinese University of Hong Kong
Hong Kong S. A. R., China
{kfzhao,yu,zyhe,lirui,hzhang}@se.cuhk.edu.hk
ABSTRACT
Deep Learning (DL) has achieved great success in many real appli-
cations. Despite its success, there are some main problems when
deploying advanced DL models in database systems, such as hyper-
parameters tuning, the risk of overtting, and lack of prediction
uncertainty. In this paper, we study a lightweight and accurate
cardinality estimation for SQL queries, which is also uncertainty-
aware. By lightweight, we mean that we can train a DL model in
a few seconds. With uncertainty ensured, it becomes possible to
update the estimator to improve its prediction in areas with high
uncertainty. The approach we explore is dierent from the direction
of deploying sophisticated DL models as cardinality estimators in
database systems. We employ Bayesian deep learning (BDL), which
serves as a bridge between Bayesian inference and deep learning.
The prediction distribution by BDL provides principled uncertainty
calibration for the prediction. In addition, when the network width
of a BDL model goes to innity, the model performs equivalent to
Gaussian Process (GP). This special class of BDL, known as Neu-
ral Network Gaussian Process (NNGP), inherits the advantages
of Bayesian approach while keeping universal approximation of
neural networks, and can utilize a much larger model space to
model distribution-free data as a nonparametric model. We show
our NNGP estimator achieves high accuracy, is built fast, and is ro-
bust to query workload shift, in our extensive performance studies
by comparing with existing learned estimators. We also conrm
the eectiveness of NNGP by integrating it into PostgreSQL.
CCS CONCEPTS
• Information systems → Query optimization.
KEYWORDS
Cardinality Estimation; Machine Learning; Gaussian Process
ACM Reference Format:
Kangfei Zhao, Jerey Xu Yu, Zongyan He, Rui Li, Hao Zhang. 2022. Light-
weight and Accurate Cardinality Estimation by Neural Network Gaussian
Process. In Proceedings of the 2022 International Conference on Management
of Data (SIGMOD ’22), June 12–17, 2022, Philadelphia, PA, USA. ACM, New
York, NY, USA, 15 pages. https://doi.org/10.1145/3514221.3526156
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for prot or commercial advantage and that copies bear this notice and the full citation
on the rst page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specic permission and/or a
fee. Request permissions from permissions@acm.org.
SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9249-5/22/06.. .$15.00
https://doi.org/10.1145/3514221.3526156
1 INTRODUCTION
The learning approaches are shifting from the traditional ML (Ma-
chine Learning) models (e.g., KDE, GBDT) to DL (Deep Learning)
models (e.g., NN, MSCN, VAE, MADE, Transformer, SPN, LSTM).
The shifting is motivated by the powerful approximation capability
of neural networks in end-to-end applications and the high e-
ciency of DL frameworks. Compared with traditional ML, DL has
achieved a great improvement in estimation accuracy, and more
and more advanced DL architectures are devised to improve the per-
formance of the learning tasks with deeper layers to pursue more
powerful modeling capability. In database systems, the DL models
have been extensively studied for query optimization [
49
], index
recommendation [
11
], view materialization [
44
], and cardinality
estimation [
15
,
16
,
27
,
32
,
33
,
60
,
71
,
72
]. Despite the success of DL
approaches, complex DL approaches incur some main problems in
general.
The rst is hyper-parameters on which the performance of DL
models highly rely, including the training hyper-parameters and
network architecture congurations. Note that onerous eort of
parameter tuning must be paid to pursuing satisfying performance,
which is labor-intensive. Although autoML tools [
31
] have been
developed to avoid human-in-the-loop, deploying such tools in
database systems needs great eort as searching for a suitable
hyper-parameter combination needs a huge number of trials with
high computation cost. It is worth mentioning that a well-tuned
model for one database is dicult to transfer to other databases,
which means that the retraining/tuning needs repeating when the
underlying data changes signicantly.
The second is a high risk of overtting that DL models are ex-
posed to. Subject to a particular family of functions designed, gen-
eral DL models are indexed by a large number of parameters, fully
tted by the training data. It is assumed testing data is from the
same underlying distribution as the training data, otherwise, the
prediction may have a large variance. Many regularization tech-
niques can be used, but they can only alleviate the problem to
some degree. Collecting more data to enhance training helps to
reduce the variance. But this means a higher cost for acquiring the
data/ground truth and training.
The third is prediction belief. The issue is that DL models cannot
capture and convey their prediction belief, namely, how probably
their prediction is accurate or how much is the prediction uncer-
tainty. The uncertainty comes from dierent sources. For classi-
cation tasks, DL models predict the probability distribution that
one data point is associated with the candidate classes by a
somax
function, which is shown to be over-condent on the most likely
class [
25
]. For regression tasks (e.g., cardinality estimation), DL
models can only output a scalar value without any uncertainty
Session 13: ML for Data Management and Query Processing
SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA
评论