Fauce Fast and Accurate Deep Ensembles with Uncertainty for Cardinality Estimation _VLDB 2021.pdf

onboard

200

14页

2次

2021-11-25

免费下载

Fauce: Fast and Accurate Deep Ensembles with Uncertainty for

Cardinality Estimation

Jie Liu

†

, Wenqian Dong

†

, Qingqing Zhou

∗

, Dong Li

†

University of California, Merced

∗

Tencent

{jliu279,wdong5,dli35}@ucmerced.edu,hewanzhou@tencent.com

ABSTRACT

Cardinality estimation is a fundamental and critical problem in

databases. Recently, many estimators based on deep learning have

been proposed to solve this problem and they have achieved promis-

ing results. However, these estimators struggle to provide accurate

results for complex queries, due to not capturing real inter-column

and inter-table correlations. Furthermore, none of these estimators

contain the uncertainty information about their estimations. In

this paper, we present a join cardinality estimator called Fauce.

Fauce learns the correlations across all columns and all tables in

the database. It also contains the uncertainty information of each

estimation. Among all studied learned estimators, our results are

promising: (1) Fauce is a light-weight estimator, it has 10

faster

inference speed than the state of the art estimator; (2) Fauce is

robust to the complex queries, it provides 1

-6

smaller esti-

mation errors for complex queries compared with the state of the

art estimator; (3) To the best of our knowledge, Fauce is the rst

estimator that incorporates uncertainty information for cardinality

estimation into a deep learning model.

PVLDB Reference Format:

Jie Liu

†

, Wenqian Dong

†

, Qingqing Zhou

∗

, Dong Li

†

. Fauce: Fast and

Accurate Deep Ensembles with Uncertainty for Cardinality Estimation.

PVLDB, 14(11): 1950-1963, 2021.

doi:10.14778/3476249.3476254

1 INTRODUCTION

Cardinality estimation is fundamental and critical in databases. It is

widely applied to query optimization, query processing approxima-

tion, database tuning, etc. For example, the query optimizer uses

the results of the cardinality estimation to determine the best execu-

tion plans. However, the cardinality estimation can be challenging.

In some cases with complex queries where there are correlated

columns or large number of joins, the accuracy of the cardinality

estimation drops dramatically.

Recently, the researchers have been actively using the machine

learning technique to estimate the cardinality [

–

]. These approaches can be mainly classied as two types: data-

driven and query-driven estimators. Both of them have limitation.

The data-driven estimators such as Naru [

], NeuroCard [

and MADE [

] leverage the deep autoregressive (AR) models [

]

This work is licensed under the Creative Commons BY-NC-ND 4.0 International

License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of

this license. For any use beyond those covered by this license, obtain permission by

emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights

licensed to the VLDB Endowment.

Proceedings of the VLDB Endowment, Vol. 14, No. 11 ISSN 2150-8097.

doi:10.14778/3476249.3476254

to approximate the data distribution of a table or joint tables. Deep

AR models capture the data distribution of a table by multiplying

the estimated data distribution of each column, based on an implicit

assumption that each column is dependent on all the previous

columns. However, such an assumption is oversimplied. In DBMS,

the dependent relationship between columns can be complex. For

example, some columns are independent from each other, while

others have correlation. As a result, the deep AR models result in

large errors for those queries on correlated columns. Furthermore,

recent study [

] reveals that the data-driven estimators tend to

output large errors when the data are skewed.

The query-driven estimators [

] rely on some

regression models to properly learn function mapping between

queries and cardinalities. Since the input of the regression models

are real-valued vectors, the query-driven estimator must use a query

featurization method to convert the queries into feature vectors.

Those vectors should contain informative features of the queries. A

good query featurization method is critical, because it can generate

highly informative feature vectors, which are useful to improve the

accuracy of the regression models. The existing query featurization

methods [

] apply techniques like one-hot encoding, bi-

nary encoding [

], basic statistics, or bitmap [

] to convert queries

into feature vectors. While those methods are simple to use, they

cannot capture the ne-grained correlations between columns and

between tables. As a result, the feature vectors generated by the

existing query featurization methods are not informative enough,

and using such feature vectors for cardinality estimation can be

erroneous. Furthermore, the existing query featurization methods

focus on static data [

]. However, in a dynamic scenario where

the data are dynamically updated, the existing methods cannot

adapt to the new data, hence degrading the accuracy of cardinality

estimation signicantly.

Furthermore, both query-driven and data-driven estimators do

not give any quantication of uncertainty or condence level of the

estimation. The estimation is used in database based on an implicit

assumption that the estimation is always safe to be used. However,

this assumption is not always valid, and using error estimation

can be problematic. For example, an erroneous estimation, when

used by the query optimizer, can lead to bad execution plans. A

better estimation approach is to output the estimated cardinality

together with the corresponding uncertainty. Based on the uncer-

tainty, DBMS can determine when to actually trust the estimator

and use its estimations. However, how to quantify an estimator’s

uncertainties for various queries and leverage the uncertainties to

boost model accuracy remains to be studied.

To address the above limitation, we propose a new cardinality

estimator, Fauce. Fauce includes a new query featurization method

1950

(§3) that leverages semantic information contained in the database

and captures real dependent relationships between table columns to

encode the queries into more informative feature vectors. Further-

more, we mathematically dene the uncertainty of the estimator

and introduce a new model that incorporates the uncertainty esti-

mation into Fauce (§5). Fauce also includes a new learning paradigm

that leverages the uncertainties to boost the estimation results and

make Fauce robust to be applied in dynamic databases (§5.3).

A query consists of four components, tables, joins, columns, and

predicate values. We leverage the semantic information contained

in the database (e.g., the relationships between two tables) to fea-

turize the tables and joins of a query (§3.1). Fauce captures the

join relationships among database tables. Those relationships are

represented as join graphs, where vertices are tables and each edge

connects two joinable tables. We make the representations of the

tables and joins of a query contain more informative features by

analyzing this graph.

To capture the real correlations across all the table columns in

a database (§3.2), we introduce dependency graphs to capture de-

pendent relationships across columns, and based on the graphs

we embed the columns into a vector to boost the estimation accu-

racy. Using a data structure to capture dependency requires the

capture of implicit dependency relationships in columns across

tables. We introduce a hierarchical dependency graph. In particular,

we rst build a local columns-dependency graph for each table.

Then we build the global columns-dependency graph for all the

columns in the database based on the local columns-dependency

graphs developed in the rst step. Finally, we use an embedding

technique [

] to represent each column into a vector based on the

global columns-dependency graph. Such vectors can convey real

correlations among the columns.

To include the uncertainties of the cardinality estimator into

Fauce, we design a model based on deep ensembles (§5.3) to compre-

hensively quantify the uncertainty. The uncertainty of the cardinal-

ity estimator comes from multiple sources. First, we are uncertain

about whether the learned model parameters can best describe the

distribution of the queries in the query space. This is referred to

as model uncertainty. Second, the query-based estimators train the

model based on the generated training dataset. But the training

dataset can not well reect the features for all the queries. That

is to say, there is always a data shift between the training dataset

and the inference queries. This data shift can be large especially for

dynamic databases. Thus, we are also uncertain about whether the

data used to train model can well represent the features for infer-

ence queries, this is referred as data uncertainty. These two types

of uncertainty consist the uncertainty of the learned estimator.

The two types of uncertainty are dicult to quantify. To address

the above problems, we design a model called deep ensembles with

uncertainty to estimate the cardinality and the corresponding un-

certainty. We use the ensemble technique, because it generally pro-

duces the best results among all neural network-based approaches.

Furthermore, it provides the benet of being able to separately

determine model and data uncertainties.

We conducted an extensive set of experiments over IMDB, a real-

world dataset that exhibits complex correlation and conditional

independence between table columns and have been extensively

used in prior work [

–

]. On the created JOB-base bench-

mark, a schema that contains 6 tables and correlated lters. Fauce

achieves 1

16-4

higher accuracy over the state of the art estima-

tor. To check whether Fauce is robust to complicated queries with

large number of lters, we create a more dicult benchmark, JOB-

more-lters. On this benchmark, Fauce achieves 1

31-45

higher

accuracy than previous estimators, including IBJS [

], MSCN [

DeepDB [

],and NeuroCard [

]. Lastly, to test Fauce ’s ability to

handle queries with more complex join relations, we created JOB-

complex-joins which has 15 tables and complex joins. Experimental

results show that Fauce scales well to this benchmark, it has at

least 1

higher accuracy than baselines. The contributions in

the paper are summarized as below:

•

We design and implement Fauce, the rst learned cardinality

estimator that contains the uncertainties for its results. It

is also light weight with fastest inference time and leading

accuracy among the learned methods we studied in the paper.

•

Fauce includes a new query featurization (§3) method that

can encode the queries into more informative feature vectors

by leveraging the join schema of the database and capturing

the real correlations across the table columns.

•

Fauce mathematically denes the uncertainty of the estima-

tor and designs a model called deep ensembles with uncer-

tainty (§5.3) to estimate the cardinality.

•

Fauce includes an uncertainty management module (§5.3).

We also show that how the uncertainty management can be

leveraged to further boost Fauce’s accuracy.

2 PROBLEM DESCRIPTION

In this section, we introduce some notations and describe why the

cardinality estimation can be solved as a regression problem.

2.1 Notations

Consider a database

contains

𝑚

tables,

D = {𝑇

𝑖

}

𝑚

𝑖=1

. Each ta-

ble

𝑇

𝑖

has a number of numeric columns, represented as

𝑇

𝑖

{𝐶𝑜𝑙

𝑖

, ..., 𝐶𝑜𝑙

𝑐

𝑘

𝑖

}, where

𝑐

𝑘

is the total number of columns in Ta-

ble

𝑇

𝑖

. The total number of columns in

is denoted as

𝐶

, where

𝐶 =

𝑚

𝑘=1

𝑐

𝑘

. We dene the actual cardinality of a query

𝑞

as the

number of rows in joint tables that satisfy all predicates in

𝑞

, and

denote it as

𝐴𝑐𝑡 (𝑞)

. Similarly, we use

𝐶𝑎𝑟𝑑(𝑞)

to represent the esti-

mated cardinality for the query

𝑞

. Each query

𝑞

can be represented

as a collection of four sets:

⟨𝑇 𝑎𝑏𝑙𝑒𝑠⟩

⟨𝐽𝑜𝑖𝑛𝑠⟩

⟨𝐶𝑜𝑙𝑢𝑚𝑛𝑠⟩

⟨𝑉 𝑎𝑙𝑢𝑒𝑠⟩

and each set is dened as below.

• ⟨𝑇𝑎𝑏𝑙𝑒𝑠⟩: the set of the tables in 𝑞’s FROM clause.

• ⟨𝐽𝑜𝑖𝑛𝑠⟩: the set of the join relations in 𝑞’s WHERE clause.

• ⟨𝐶𝑜𝑙𝑢𝑚𝑛𝑠⟩

: the set of the columns involved in

𝑞

’s WHERE clause.

• ⟨𝑉 𝑎𝑙𝑢𝑒𝑠⟩: the set of the predicates values in 𝑞’s WHERE clause.

These four sets together depict the features of a query.

2.2 Formulation as a Regression Problem

As the cardinality of a query is a real-valued number, we develop

a regression model

, such that for any range query

𝑞

on joint

tables, the estimated cardinality

𝐶𝑎𝑟𝑑(𝑞)

produced by

matches

or closes to the actual cardinality 𝐴𝑐𝑡 (𝑞).

1951

of 14

免费下载

tdsql paper

关注

评论