(§3) that leverages semantic information contained in the database
and captures real dependent relationships between table columns to
encode the queries into more informative feature vectors. Further-
more, we mathematically dene the uncertainty of the estimator
and introduce a new model that incorporates the uncertainty esti-
mation into Fauce (§5). Fauce also includes a new learning paradigm
that leverages the uncertainties to boost the estimation results and
make Fauce robust to be applied in dynamic databases (§5.3).
A query consists of four components, tables, joins, columns, and
predicate values. We leverage the semantic information contained
in the database (e.g., the relationships between two tables) to fea-
turize the tables and joins of a query (§3.1). Fauce captures the
join relationships among database tables. Those relationships are
represented as join graphs, where vertices are tables and each edge
connects two joinable tables. We make the representations of the
tables and joins of a query contain more informative features by
analyzing this graph.
To capture the real correlations across all the table columns in
a database (§3.2), we introduce dependency graphs to capture de-
pendent relationships across columns, and based on the graphs
we embed the columns into a vector to boost the estimation accu-
racy. Using a data structure to capture dependency requires the
capture of implicit dependency relationships in columns across
tables. We introduce a hierarchical dependency graph. In particular,
we rst build a local columns-dependency graph for each table.
Then we build the global columns-dependency graph for all the
columns in the database based on the local columns-dependency
graphs developed in the rst step. Finally, we use an embedding
technique [
16
] to represent each column into a vector based on the
global columns-dependency graph. Such vectors can convey real
correlations among the columns.
To include the uncertainties of the cardinality estimator into
Fauce, we design a model based on deep ensembles (§5.3) to compre-
hensively quantify the uncertainty. The uncertainty of the cardinal-
ity estimator comes from multiple sources. First, we are uncertain
about whether the learned model parameters can best describe the
distribution of the queries in the query space. This is referred to
as model uncertainty. Second, the query-based estimators train the
model based on the generated training dataset. But the training
dataset can not well reect the features for all the queries. That
is to say, there is always a data shift between the training dataset
and the inference queries. This data shift can be large especially for
dynamic databases. Thus, we are also uncertain about whether the
data used to train model can well represent the features for infer-
ence queries, this is referred as data uncertainty. These two types
of uncertainty consist the uncertainty of the learned estimator.
The two types of uncertainty are dicult to quantify. To address
the above problems, we design a model called deep ensembles with
uncertainty to estimate the cardinality and the corresponding un-
certainty. We use the ensemble technique, because it generally pro-
duces the best results among all neural network-based approaches.
Furthermore, it provides the benet of being able to separately
determine model and data uncertainties.
We conducted an extensive set of experiments over IMDB, a real-
world dataset that exhibits complex correlation and conditional
independence between table columns and have been extensively
used in prior work [
21
,
24
,
56
–
58
]. On the created JOB-base bench-
mark, a schema that contains 6 tables and correlated lters. Fauce
achieves 1
.
16-4
.
5
×
higher accuracy over the state of the art estima-
tor. To check whether Fauce is robust to complicated queries with
large number of lters, we create a more dicult benchmark, JOB-
more-lters. On this benchmark, Fauce achieves 1
.
31-45
.
9
×
higher
accuracy than previous estimators, including IBJS [
30
], MSCN [
24
],
DeepDB [
21
],and NeuroCard [
56
]. Lastly, to test Fauce ’s ability to
handle queries with more complex join relations, we created JOB-
complex-joins which has 15 tables and complex joins. Experimental
results show that Fauce scales well to this benchmark, it has at
least 1
.
28
×
higher accuracy than baselines. The contributions in
the paper are summarized as below:
•
We design and implement Fauce, the rst learned cardinality
estimator that contains the uncertainties for its results. It
is also light weight with fastest inference time and leading
accuracy among the learned methods we studied in the paper.
•
Fauce includes a new query featurization (§3) method that
can encode the queries into more informative feature vectors
by leveraging the join schema of the database and capturing
the real correlations across the table columns.
•
Fauce mathematically denes the uncertainty of the estima-
tor and designs a model called deep ensembles with uncer-
tainty (§5.3) to estimate the cardinality.
•
Fauce includes an uncertainty management module (§5.3).
We also show that how the uncertainty management can be
leveraged to further boost Fauce’s accuracy.
2 PROBLEM DESCRIPTION
In this section, we introduce some notations and describe why the
cardinality estimation can be solved as a regression problem.
2.1 Notations
Consider a database
D
contains
𝑚
tables,
D = {𝑇
𝑖
}
𝑚
𝑖=1
. Each ta-
ble
𝑇
𝑖
has a number of numeric columns, represented as
𝑇
𝑖
=
{𝐶𝑜𝑙
1
𝑖
, ..., 𝐶𝑜𝑙
𝑐
𝑘
𝑖
}, where
𝑐
𝑘
is the total number of columns in Ta-
ble
𝑇
𝑖
. The total number of columns in
D
is denoted as
𝐶
, where
𝐶 =
Í
𝑚
𝑘=1
𝑐
𝑘
. We dene the actual cardinality of a query
𝑞
as the
number of rows in joint tables that satisfy all predicates in
𝑞
, and
denote it as
𝐴𝑐𝑡 (𝑞)
. Similarly, we use
𝐶𝑎𝑟𝑑(𝑞)
to represent the esti-
mated cardinality for the query
𝑞
. Each query
𝑞
can be represented
as a collection of four sets:
⟨𝑇 𝑎𝑏𝑙𝑒𝑠⟩
,
⟨𝐽𝑜𝑖𝑛𝑠⟩
,
⟨𝐶𝑜𝑙𝑢𝑚𝑛𝑠⟩
,
⟨𝑉 𝑎𝑙𝑢𝑒𝑠⟩
,
and each set is dened as below.
• ⟨𝑇𝑎𝑏𝑙𝑒𝑠⟩: the set of the tables in 𝑞’s FROM clause.
• ⟨𝐽𝑜𝑖𝑛𝑠⟩: the set of the join relations in 𝑞’s WHERE clause.
• ⟨𝐶𝑜𝑙𝑢𝑚𝑛𝑠⟩
: the set of the columns involved in
𝑞
’s WHERE clause.
• ⟨𝑉 𝑎𝑙𝑢𝑒𝑠⟩: the set of the predicates values in 𝑞’s WHERE clause.
These four sets together depict the features of a query.
2.2 Formulation as a Regression Problem
As the cardinality of a query is a real-valued number, we develop
a regression model
M
, such that for any range query
𝑞
on joint
tables, the estimated cardinality
𝐶𝑎𝑟𝑑(𝑞)
produced by
M
matches
or closes to the actual cardinality 𝐴𝑐𝑡 (𝑞).
1951
评论