
storage required for the full distribution is exponentially
large, making this approach infeasible. Researchers
therefore have proposed storage of selected multivariate
statistics (MVS) that summarize important partial
information about the joint distribution. Proposals have
ranged from multidimensional histograms [PI97] on
selected columns to other, simpler forms of column-group
statistics [IMH+04]. Thus, for predicates p1, p2, ..., pn, the optimizer typically has access to the individual selectivities s1, s2, ..., sn, as well as a limited collection of joint selectivities, such as s1,2, s3,5, and s2,3,4. The independence assumption is then used to “fill in the gaps” in the incomplete information; e.g., we can estimate the unknown selectivity s1,2,3 by s1,2 * s3.
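The gap-filling idea above can be sketched in a few lines of code. This is an illustrative toy, not an actual optimizer component; the predicate numbering, the selectivity values, and the greedy choice of which known statistic to use are all assumptions made for the example.

```python
# Hypothetical selectivities: keys are frozensets of predicate indices.
known = {
    frozenset([1]): 0.10,        # s1
    frozenset([2]): 0.20,        # s2
    frozenset([3]): 0.05,        # s3
    frozenset([1, 2]): 0.08,     # s1,2 (correlated: 0.08 > 0.10 * 0.20)
}

def estimate(predicates, known):
    """Estimate the selectivity of a predicate set by covering it with
    known MVS and multiplying the pieces, i.e., assuming independence
    between groups. Greedily picks the largest known subset first --
    just one of several non-equivalent covers, which is exactly the
    problem discussed in the text."""
    remaining = set(predicates)
    s = 1.0
    while remaining:
        best = max((g for g in known if g <= remaining), key=len)
        s *= known[best]
        remaining -= best
    return s

print(estimate({1, 2, 3}, known))  # s1,2 * s3, approximately 0.004
```

Note that the greedy cover is arbitrary: nothing in the available statistics says that covering {1,2,3} with {1,2} and {3} is more valid than any other decomposition.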
A new and serious problem now arises, however.
There may be multiple, non-equivalent ways of estimating
the selectivity for a given predicate. Figure 1, for example, shows possible QEPs for a query consisting of the conjunctive predicate p1 ∧ p2 ∧ p3. The QEP in Figure 1(a) uses an index-ANDing operation (∧) to apply p1 ∧ p2 and afterwards applies predicate p3 by a FETCH operator, which retrieves rows from a base table according to the row identifiers returned from the index-ANDing operator.
[Figure 1 (schematic): three QEPs for p1 ∧ p2 ∧ p3. (a) index-ANDing of p1 and p2 (selectivity s1,2), then FETCH applying p3; estimate s1,2 * s3. (b) index-ANDing of p1 and p3 (selectivity s1,3), then FETCH applying p2; estimate s1,3 * s2, which differs (≠) from (a). (c) index-ANDing of p1 and p3 estimated as s1 * s3, then FETCH applying p2; estimate s1,2 * s3, equal (=) to (a).]
Figure 1: QEPs and Selectivity Estimation
Suppose that the optimizer knows the selectivities s1, s2, s3 of the BFs p1, p2, p3. Also suppose that it knows about a correlation between p1 and p2 via knowledge of the selectivity s1,2 of p1 ∧ p2. Using independence, the optimizer might then estimate the selectivity of p1 ∧ p2 ∧ p3 as s^a_{1,2,3} = s1,2 * s3.
Figure 1(b) shows an alternative QEP that first applies p1 ∧ p3 and then applies p2. If the optimizer also knows the selectivity s1,3 of p1 ∧ p3, use of the independence assumption might yield a selectivity estimate s^b_{1,2,3} = s1,3 * s2. However, this would result in an inconsistency if, as is likely, s^a_{1,2,3} ≠ s^b_{1,2,3}. There are potentially other choices, such as s1 * s2 * s3 or, if s2,3 is known, s1,2 * s2,3 / s2; the latter estimate amounts to a conditional independence assumption. Any choice of estimate will be arbitrary, since there is no supporting knowledge to justify ignoring a correlation or assuming conditional independence; such a choice will then arbitrarily bias the optimizer toward choosing one plan over the other. Even worse, if the optimizer does not use the same choice of estimate every time that it is required, then different plans will be costed inconsistently, leading to “apples and oranges” comparisons and unreliable plan choices.
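The non-equivalence of these candidate estimates is easy to see numerically. The selectivity values below are made up for illustration; any correlated numbers would show the same effect.

```python
# Hypothetical selectivities for p1, p2, p3 and their correlated pairs.
s1, s2, s3 = 0.10, 0.20, 0.25
s12, s13, s23 = 0.05, 0.04, 0.06

est_a = s12 * s3           # Figure 1(a): use s1,2, independence for p3
est_b = s13 * s2           # Figure 1(b): use s1,3, independence for p2
est_indep = s1 * s2 * s3   # ignore all correlations
est_cond = s12 * s23 / s2  # conditional independence of p1, p3 given p2

# Four non-equivalent estimates of the SAME selectivity s1,2,3:
print(est_a, est_b, est_indep, est_cond)
# e.g. est_a = 0.0125 while est_b = 0.008
```

Whichever estimate a plan happens to use determines its apparent cost, so two plans for the identical query can be compared using incompatible numbers.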
Assuming that the QEP in Figure 1(a) is the first to be evaluated, a modern optimizer would avoid the foregoing consistency problem by recording the fact that s1,2 was applied and then avoiding future application of any other MVS that contain either p1 or p2, but not both. In our example, the selectivities for the QEP in Figure 1(c) would be used and the ones in Figure 1(b) would not. The optimizer would therefore compute the selectivity of p1 ∧ p3 to be s1 * s3 using independence, instead of using the MVS s1,3. Thus the selectivity s1,2,3 would be estimated in a manner consistent with Figure 1(a). Note that, when evaluating the QEP in Figure 1(a), the optimizer used the estimate s^a_{1,2,3} = s1,2 * s3 rather than s1 * s2 * s3, since, intuitively, the former estimate better exploits the available correlation information. In general, there may be many possible choices; the complicated (ad hoc) decision algorithm used by DB2 UDB is described in more detail in the Appendix.
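The bookkeeping rule just described can be sketched as follows. This is a minimal illustration of the consistency check, not DB2's actual algorithm (which the Appendix describes); the rule encoded here is simply "a new MVS may not partially overlap any previously applied one".

```python
applied = []  # predicate groups whose MVS have already been used

def usable(mvs):
    """A new MVS is usable only if it overlaps each previously applied
    group either completely or not at all (no partial overlap)."""
    return all(not (mvs & a) or a <= mvs for a in applied)

def apply_mvs(mvs):
    assert usable(mvs), "would produce inconsistent estimates"
    applied.append(frozenset(mvs))

apply_mvs({1, 2})         # s1,2 was used while costing Figure 1(a)
print(usable({1, 3}))     # False: contains p1 but not p2, so s1,3 is banned
print(usable({3}))        # True: s3 is still fine
print(usable({1, 2, 3}))  # True: contains both p1 and p2
```

Even this tiny sketch shows the cost of the approach: every derived estimate must carry state about how it was produced, and useful knowledge such as s1,3 is discarded outright.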
Although the ad hoc method described above ensures consistency, it ignores valuable knowledge, e.g., of the correlation between p1 and p3. Moreover, this method complicates the logic of the optimizer, because cumbersome bookkeeping is required to keep track of how an estimate was derived initially and to ensure that it will always be computed in the same way when costing other plans. Even worse, ignoring the known correlation between p1 and p3 also introduces bias towards certain QEPs: if, as is often the case with correlation, s1,3 >> s1 * s3 and s1,2 >> s1 * s2, and if s1,2 and s1,3 have comparable values, then the optimizer will be biased towards the plan in Figure 1(c), even though the plan in Figure 1(a) might be cheaper, i.e., the optimizer thinks that the plan in Figure 1(c) will produce fewer rows during index-ANDing, but this might not actually be the case. In general, an optimizer will often be drawn towards those QEPs about which it knows the least, because use of the independence assumption makes these plans seem cheaper due to underestimation. We call this problem “fleeing from knowledge to ignorance”.
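A worked numerical example makes the bias concrete. All numbers below are hypothetical, chosen so that both pairs are strongly correlated and s1,2 and s1,3 are comparable, as the text assumes.

```python
N = 1_000_000                      # hypothetical table cardinality
s1, s2, s3 = 0.10, 0.20, 0.25
s12, s13 = 0.05, 0.06              # both pairs strongly correlated

rows_a = s12 * N          # plan (a)'s estimated index-ANDing output
rows_c = s1 * s3 * N      # plan (c)'s estimate under forced independence
rows_c_true = s13 * N     # what plan (c)'s index-ANDing really produces

print(rows_a, rows_c, rows_c_true)
# Plan (c) looks cheaper (about 25,000 rows vs. 50,000), so the
# optimizer picks it -- yet it truly produces about 60,000 rows,
# more than plan (a).
```

The optimizer "flees" to plan (c) precisely because it was forced to ignore what it knew about s1,3.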
In this paper, we provide a novel method for estimating the selectivity of a conjunctive predicate; the method exploits and combines all of the available MVS in a principled, consistent, and unbiased manner. Our technique rests on the principle of maximum entropy (ME) [GS85], which is a mathematical embodiment of Occam’s Razor and provides the “simplest” possible selectivity estimate that is consistent with all of the available information. (In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions.) Our new approach avoids the problems of inconsistent QEP comparisons and the flight from knowledge to ignorance.
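To give a flavor of the ME idea, the toy sketch below computes a maximum-entropy distribution over the 2^n truth assignments of the predicates, subject to the known selectivities, using iterative proportional fitting (one standard way to fit ME models). It is not the paper's algorithm; the selectivity values are hypothetical.

```python
from itertools import product

def max_entropy(n, known, iters=2000):
    """known maps frozensets of 1-based predicate indices to their
    selectivities. Returns probabilities over the 2^n atoms."""
    atoms = list(product([True, False], repeat=n))
    p = {a: 1.0 / len(atoms) for a in atoms}            # start uniform
    for _ in range(iters):
        for group, target in known.items():
            inside = {a for a in atoms if all(a[i - 1] for i in group)}
            cur = sum(p[a] for a in inside)
            for a in atoms:  # rescale event and complement to hit target
                p[a] *= target / cur if a in inside else (1 - target) / (1 - cur)
    return p

def selectivity(p, group):
    return sum(q for a, q in p.items() if all(a[i - 1] for i in group))

known = {frozenset([1]): 0.10, frozenset([2]): 0.20,
         frozenset([3]): 0.25, frozenset([1, 2]): 0.05}
p = max_entropy(3, known)
# With no knowledge tying p3 to p1 and p2, the ME solution makes p3
# independent of them, so s1,2,3 = s1,2 * s3 = 0.0125:
print(round(selectivity(p, {1, 2, 3}), 6))
```

The same single distribution then yields every selectivity the optimizer needs (s1,3, s2,3, s1,2,3, ...), so all plans for a query are costed from one consistent model.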
We emphasize that, unlike DB2’s ad hoc method or the method proposed in [BC02] (which tries to choose the “best” of the available MVS for estimating a selectivity), the ME method is the first to exploit all of the available MVS and actually refine the optimizer’s cardinality model