Fauce: Fast and Accurate Deep Ensembles with Uncertainty for Cardinality Estimation
Jie Liu, Wenqian Dong, Qingqing Zhou, Dong Li
University of California, Merced; Tencent
{jliu279,wdong5,dli35}@ucmerced.edu, hewanzhou@tencent.com
ABSTRACT
Cardinality estimation is a fundamental and critical problem in
databases. Recently, many estimators based on deep learning have
been proposed to solve this problem and they have achieved promis-
ing results. However, these estimators struggle to provide accurate
results for complex queries, due to not capturing real inter-column
and inter-table correlations. Furthermore, none of these estimators
contain the uncertainty information about their estimations. In
this paper, we present a join cardinality estimator called Fauce.
Fauce learns the correlations across all columns and all tables in
the database. It also contains the uncertainty information of each
estimation. Among all studied learned estimators, our results are
promising: (1) Fauce is a lightweight estimator; it has 10× faster inference speed than the state-of-the-art estimator; (2) Fauce is robust to complex queries; it provides 1.3×-6.7× smaller estimation errors for complex queries compared with the state-of-the-art estimator; (3) To the best of our knowledge, Fauce is the first estimator that incorporates uncertainty information for cardinality estimation into a deep learning model.
PVLDB Reference Format:
Jie Liu, Wenqian Dong, Qingqing Zhou, Dong Li. Fauce: Fast and Accurate Deep Ensembles with Uncertainty for Cardinality Estimation.
PVLDB, 14(11): 1950-1963, 2021.
doi:10.14778/3476249.3476254
1 INTRODUCTION
Cardinality estimation is fundamental and critical in databases. It is
widely applied to query optimization, query processing approxima-
tion, database tuning, etc. For example, the query optimizer uses
the results of the cardinality estimation to determine the best execu-
tion plans. However, cardinality estimation can be challenging: for complex queries with correlated columns or a large number of joins, the accuracy of the cardinality estimation drops dramatically.
Recently, researchers have been actively using machine learning techniques to estimate cardinality [11, 17-19, 21, 24, 56-58]. These approaches can be mainly classified into two types: data-driven and query-driven estimators. Both of them have limitations. The data-driven estimators such as Naru [57], NeuroCard [56], and MADE [18] leverage deep autoregressive (AR) models [9, 14]
to approximate the data distribution of a table or joint tables. Deep
AR models capture the data distribution of a table by multiplying
the estimated data distribution of each column, based on an implicit
assumption that each column is dependent on all the previous
columns. However, such an assumption is oversimplified. In a DBMS, the dependency relationships between columns can be complex: some columns are independent of each other, while others are correlated. As a result, deep AR models produce large errors for queries on correlated columns. Furthermore, a recent study [52] reveals that data-driven estimators tend to output large errors when the data are skewed.
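As a concrete illustration of the autoregressive factorization these data-driven estimators rely on, the following sketch (toy data, not Naru's actual implementation) shows the chain-rule product over columns, and why ignoring the correlation would hurt:

```python
import numpy as np

# Toy table with two perfectly correlated columns: col_b == col_a.
rows = np.array([[0, 0], [1, 1], [0, 0], [1, 1]])

# Autoregressive chain rule: P(a, b) = P(a) * P(b | a).
def joint_prob(a, b):
    p_a = np.mean(rows[:, 0] == a)
    mask = rows[:, 0] == a
    p_b_given_a = np.mean(rows[mask, 1] == b) if mask.any() else 0.0
    return p_a * p_b_given_a

# Estimated cardinality of the predicate (a == 1 AND b == 1):
est = joint_prob(1, 1) * len(rows)   # exact here: 2 matching rows

# A naive independence assumption P(a) * P(b) would undercount:
naive = np.mean(rows[:, 0] == 1) * np.mean(rows[:, 1] == 1) * len(rows)
```

The chain-rule factorization is exact when the learned conditionals are exact; the errors discussed above arise because the model's conditionals only approximate the true ones.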
The query-driven estimators [10, 11, 17, 24, 25, 39] rely on regression models to learn the function mapping between queries and cardinalities. Since the input of a regression model is a real-valued vector, a query-driven estimator must use a query featurization method to convert queries into feature vectors. Those vectors should contain informative features of the queries. A good query featurization method is critical, because it generates highly informative feature vectors, which improve the accuracy of the regression models. The existing query featurization methods [11, 18, 24, 46] apply techniques like one-hot encoding, binary encoding [43], basic statistics, or bitmaps [5] to convert queries into feature vectors. While those methods are simple to use, they cannot capture the fine-grained correlations between columns and between tables. As a result, the feature vectors generated by the existing query featurization methods are not informative enough, and using such feature vectors for cardinality estimation can be erroneous. Furthermore, the existing query featurization methods focus on static data [52]. In a dynamic scenario where the data are continuously updated, the existing methods cannot adapt to the new data, which degrades the accuracy of cardinality estimation significantly.
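For reference, the kind of coarse encoding these existing methods produce can be sketched as follows (hypothetical table names, not the exact encoding of any specific prior system):

```python
# One-hot encoding of a query's tables: one fixed slot per table in the schema.
SCHEMA_TABLES = ["title", "cast_info", "movie_info"]  # hypothetical schema

def one_hot_tables(query_tables):
    return [1 if t in query_tables else 0 for t in SCHEMA_TABLES]

vec = one_hot_tables({"title", "cast_info"})
# The vector records which tables appear, but nothing about how their
# columns correlate -- the limitation discussed above.
```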
Furthermore, neither query-driven nor data-driven estimators give any quantification of the uncertainty or confidence level of their estimations. The estimation is used in the database under an implicit assumption that it is always safe to use. However, this assumption is not always valid, and using an erroneous estimation can be problematic. For example, an erroneous estimation, when used by the query optimizer, can lead to bad execution plans. A better approach is to output the estimated cardinality together with the corresponding uncertainty. Based on the uncertainty, the DBMS can decide when to actually trust the estimator and use its estimations. However, how to quantify an estimator's uncertainty for various queries, and how to leverage that uncertainty to boost model accuracy, remain open problems.
To address the above limitations, we propose a new cardinality estimator, Fauce. Fauce includes a new query featurization method (§3) that leverages semantic information contained in the database and captures the real dependency relationships between table columns to encode queries into more informative feature vectors. Furthermore, we mathematically define the uncertainty of the estimator and introduce a new model that incorporates uncertainty estimation into Fauce (§5). Fauce also includes a new learning paradigm that leverages the uncertainties to boost the estimation results and make Fauce robust enough to be applied in dynamic databases (§5.3).
A query consists of four components: tables, joins, columns, and predicate values. We leverage the semantic information contained in the database (e.g., the relationships between two tables) to featurize the tables and joins of a query (§3.1). Fauce captures the join relationships among database tables. Those relationships are represented as a join graph, where vertices are tables and each edge connects two joinable tables. By analyzing this graph, we make the representations of a query's tables and joins contain more informative features.
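A join graph of this kind can be sketched with a plain adjacency map (hypothetical tables and join keys; a sketch, not Fauce's implementation):

```python
from collections import defaultdict

# Vertices are tables; an edge connects two joinable tables.
join_graph = defaultdict(set)

def add_join(t1, t2):
    join_graph[t1].add(t2)
    join_graph[t2].add(t1)

# Hypothetical schema: two fact tables both join the "title" table.
add_join("title", "cast_info")
add_join("title", "movie_info")

# Graph-derived features, such as a table's degree (number of joinable
# neighbors), can then be read off and used in the featurization.
degree = {t: len(n) for t, n in join_graph.items()}
```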
To capture the real correlations across all the table columns in a database (§3.2), we introduce dependency graphs that capture the dependency relationships across columns, and based on these graphs we embed the columns into vectors to boost the estimation accuracy. Capturing dependencies with a data structure requires capturing the implicit dependency relationships between columns across tables, so we introduce a hierarchical dependency graph. In particular, we first build a local columns-dependency graph for each table. Then we build a global columns-dependency graph for all the columns in the database based on the local columns-dependency graphs developed in the first step. Finally, we use an embedding technique [16] to represent each column as a vector based on the global columns-dependency graph. Such vectors convey the real correlations among the columns.
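The local-then-global construction can be sketched as follows. The dependency test here is a simple Pearson-correlation threshold, which is an assumption for illustration only (the paper's actual dependency criterion is defined in §3.2), and the data are synthetic:

```python
import numpy as np
from itertools import combinations

# Toy database: {table: {column: values}}; columns a and b are correlated,
# c is independent of both.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
tables = {
    "t1": {"a": x,
           "b": 2 * x + rng.normal(scale=0.1, size=200),
           "c": rng.normal(size=200)},
}

def local_dependency_graph(cols, threshold=0.5):
    """Step 1: edge between columns whose |Pearson r| exceeds threshold."""
    edges = set()
    for u, v in combinations(cols, 2):
        r = np.corrcoef(cols[u], cols[v])[0, 1]
        if abs(r) > threshold:
            edges.add((u, v))
    return edges

# Step 2: union the local graphs (plus cross-table edges, omitted here)
# into the global columns-dependency graph.
global_edges = set()
for name, cols in tables.items():
    local = local_dependency_graph(cols)
    global_edges |= {(f"{name}.{u}", f"{name}.{v}") for u, v in local}
```

Step 3, embedding each vertex of the global graph into a vector, would then run a graph-embedding technique such as the one cited as [16] over `global_edges`.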
To include the uncertainties of the cardinality estimator in Fauce, we design a model based on deep ensembles (§5.3) to comprehensively quantify the uncertainty. The uncertainty of the cardinality estimator comes from multiple sources. First, we are uncertain about whether the learned model parameters can best describe the distribution of the queries in the query space; this is referred to as model uncertainty. Second, query-based estimators train the model on a generated training dataset, but the training dataset cannot fully reflect the features of all possible queries. That is to say, there is always a data shift between the training dataset and the inference queries, and this shift can be large, especially for dynamic databases. Thus, we are also uncertain about whether the data used to train the model can represent the features of inference queries well; this is referred to as data uncertainty. These two types of uncertainty together constitute the uncertainty of the learned estimator.
The two types of uncertainty are difficult to quantify. To address this problem, we design a model called deep ensembles with uncertainty to estimate the cardinality and the corresponding uncertainty. We use the ensemble technique because it generally produces the best results among all neural network-based approaches. Furthermore, it provides the benefit of being able to separately determine model and data uncertainty.
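Under a common deep-ensemble formulation (a sketch of the general technique, not necessarily Fauce's exact equations in §5), each of K ensemble members predicts a mean and a variance, and the two uncertainty types separate as follows:

```python
import numpy as np

# Hypothetical outputs of K = 3 ensemble members for one query:
means = np.array([5.1, 4.9, 5.3])       # per-member predicted value
variances = np.array([0.2, 0.3, 0.25])  # per-member predicted noise variance

prediction = means.mean()        # ensemble estimate
data_unc = variances.mean()      # data (aleatoric): avg. predicted variance
model_unc = means.var()          # model (epistemic): member disagreement
total_unc = data_unc + model_unc # total predictive variance
```

The separation is what the text refers to: member disagreement shrinks with more/better training data, while the averaged predicted variance reflects noise inherent in the query-to-cardinality mapping.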
We conducted an extensive set of experiments on IMDB, a real-world dataset that exhibits complex correlations and conditional independence between table columns and has been extensively used in prior work [21, 24, 56-58]. On the created JOB-base benchmark, a schema that contains 6 tables and correlated filters, Fauce achieves 1.16-4.5× higher accuracy than the state-of-the-art estimator. To check whether Fauce is robust to complicated queries with a large number of filters, we create a more difficult benchmark, JOB-more-filters. On this benchmark, Fauce achieves 1.31-45.9× higher accuracy than previous estimators, including IBJS [30], MSCN [24], DeepDB [21], and NeuroCard [56]. Lastly, to test Fauce's ability to handle queries with more complex join relations, we created JOB-complex-joins, which has 15 tables and complex joins. Experimental results show that Fauce scales well to this benchmark; it has at least 1.28× higher accuracy than the baselines. The contributions of this paper are summarized below:
• We design and implement Fauce, the first learned cardinality estimator that reports the uncertainties of its results. It is also lightweight, with the fastest inference time and leading accuracy among the learned methods we studied in this paper.
• Fauce includes a new query featurization method (§3) that can encode queries into more informative feature vectors by leveraging the join schema of the database and capturing the real correlations across the table columns.
• Fauce mathematically defines the uncertainty of the estimator and designs a model called deep ensembles with uncertainty (§5.3) to estimate the cardinality.
• Fauce includes an uncertainty management module (§5.3). We also show how uncertainty management can be leveraged to further boost Fauce's accuracy.
2 PROBLEM DESCRIPTION
In this section, we introduce some notation and describe why cardinality estimation can be solved as a regression problem.
2.1 Notations
Consider a database D containing m tables, D = {T_i}_{i=1}^m. Each table T_i has a number of numeric columns, represented as T_i = {Col_i^1, ..., Col_i^{c_i}}, where c_i is the total number of columns in table T_i. The total number of columns in D is denoted as C, where C = Σ_{i=1}^m c_i. We define the actual cardinality of a query q as the number of rows in the joint tables that satisfy all predicates in q, and denote it as Act(q). Similarly, we use Card(q) to represent the estimated cardinality for the query q. Each query q can be represented as a collection of four sets: Tables, Joins, Columns, and Values, with each set defined as below.
• Tables: the set of tables in q's FROM clause.
• Joins: the set of join relations in q's WHERE clause.
• Columns: the set of columns involved in q's WHERE clause.
• Values: the set of predicate values in q's WHERE clause.
These four sets together depict the features of a query.
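The four-set representation can be written down directly (hypothetical query; a sketch of the notation, not Fauce's internal data structure):

```python
from dataclasses import dataclass

# A query q represented by its four component sets (Section 2.1).
@dataclass(frozen=True)
class Query:
    tables: frozenset   # tables in the FROM clause
    joins: frozenset    # join relations in the WHERE clause
    columns: frozenset  # columns used in predicates
    values: tuple       # predicate values, aligned with the columns

# Hypothetical query:
# SELECT COUNT(*) FROM t1, t2 WHERE t1.id = t2.id AND t1.a < 10
q = Query(
    tables=frozenset({"t1", "t2"}),
    joins=frozenset({("t1.id", "t2.id")}),
    columns=frozenset({"t1.a"}),
    values=(10,),
)
```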
2.2 Formulation as a Regression Problem
As the cardinality of a query is a real-valued number, we develop a regression model M such that, for any range query q on joint tables, the estimated cardinality Card(q) produced by M matches or is close to the actual cardinality Act(q).
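A minimal version of this regression setup can be sketched as follows. The log transform of the labels and the linear least-squares model are illustrative assumptions (a common choice for cardinality regression, not the model M defined later in the paper), and the feature vectors are hypothetical:

```python
import numpy as np

# Training pairs: query feature vector -> log(actual cardinality Act(q)).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.log(np.array([100.0, 10.0, 1000.0]))

# Least-squares linear fit as a stand-in for the learned model M.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def card(q_vec):
    """Card(q): estimated cardinality for feature vector q_vec."""
    return float(np.exp(q_vec @ w))
```

Since log(100) + log(10) = log(1000), this toy system is exactly consistent, so the fitted model recovers the training cardinalities; a real model M only approximates Act(q), which is what motivates the error and uncertainty analysis in the rest of the paper.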