CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies

Ihab F. Ilyas¹, Volker Markl², Peter Haas², Paul Brown², Ashraf Aboulnaga²

¹ Purdue University, 250 N. University Street, West Lafayette, Indiana 47906
  ilyas@cs.purdue.edu
² IBM Almaden Research Center, 650 Harry Road, K55/B1, San Jose, CA 95139
  {marklv, phaas, pbrown1, aashraf}@us.ibm.com
ABSTRACT
The rich dependency structure found in the columns of real-world
relational databases can be exploited to great advantage, but can
also cause query optimizers—which usually assume that columns
are statistically independent—to underestimate the selectivities
of conjunctive predicates by orders of magnitude. We introduce
cords, an efficient and scalable tool for automatic discovery of
correlations and soft functional dependencies between columns.
cords searches for column pairs that might have interesting and
useful dependency relations by systematically enumerating can-
didate pairs and simultaneously pruning unpromising candidates
using a flexible set of heuristics. A robust chi-squared analysis is
applied to a sample of column values in order to identify correla-
tions, and the number of distinct values in the sampled columns
is analyzed to detect soft functional dependencies. cords can be
used as a data mining tool, producing dependency graphs that
are of intrinsic interest. We focus primarily on the use of cords
in query optimization. Specifically, cords recommends groups
of columns on which to maintain certain simple joint statistics.
These “column-group” statistics are then used by the optimizer
to avoid naive selectivity estimates based on inappropriate inde-
pendence assumptions. This approach, because of its simplicity
and judicious use of sampling, is relatively easy to implement in
existing commercial systems, has very low overhead, and scales
well to the large numbers of columns and large table sizes found in
real-world databases. Experiments with a prototype implementa-
tion show that the use of cords in query optimization can speed
up query execution times by an order of magnitude. cords can
be used in tandem with query feedback systems such as the leo
learning optimizer, leveraging the infrastructure of such systems
to correct bad selectivity estimates and ameliorating the poor
performance of feedback systems during slow learning phases.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD 2004, June 13–18, 2004, Paris, France.
Copyright 2004 ACM 1-58113-859-8/04/06...$5.00.

1. INTRODUCTION
When a query optimizer in a commercial relational dbms
chooses a horrible query plan, the cause of the disaster is
usually an inappropriate independence assumption that the
optimizer has imposed on two or more columns. Independence
assumptions are used in virtually all query optimizers
because they greatly simplify selectivity estimation: e.g., if $p_1$ and $p_2$ are predicates on respective columns $C_1$ and $C_2$, then the selectivity of the conjunctive predicate $p_1 \wedge p_2$ is estimated by simply multiplying together the individual selectivities of $p_1$ and $p_2$. This approach, however, ignores the
rich dependency structure that is present in most real-world
data. Indeed, our experience indicates that dependency be-
tween columns is the rule, rather than the exception, in the
real world.
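
To make the danger concrete, the following sketch (a toy illustration of ours, not code from the paper; all column names and data are hypothetical) contrasts the independence-based estimate with the true conjunctive selectivity on a strongly correlated column pair:

# Toy illustration (not from the paper): the independence assumption
# underestimates conjunctive selectivity on correlated columns.
# Hypothetical data: each Zip value lies in exactly one City.
import random

random.seed(0)
zips_of = {f"city{i}": [f"zip{i}_{j}" for j in range(10)] for i in range(100)}
rows = []
for _ in range(100_000):
    city = random.choice(list(zips_of))
    rows.append((city, random.choice(zips_of[city])))

def sel(pred):
    return sum(map(pred, rows)) / len(rows)

s_city = sel(lambda r: r[0] == "city0")                       # ~0.01
s_zip = sel(lambda r: r[1] == "zip0_0")                       # ~0.001
s_conj = sel(lambda r: r[0] == "city0" and r[1] == "zip0_0")  # ~0.001

print(f"independence estimate: {s_city * s_zip:.6f}")  # ~0.00001
print(f"true selectivity:      {s_conj:.6f}")          # ~100x larger

Because each hypothetical Zip value occurs in exactly one City, the true conjunctive selectivity equals the Zip selectivity, roughly 100 times the independence estimate.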
This paper introduces cords (CORrelation Detection via
Sampling), an efficient and scalable tool for automatically
discovering statistical correlations and “soft” functional de-
pendencies (fds) between columns. By “correlations,” we
mean general statistical dependencies, not merely approxi-
mate linear relationships as measured, for example, by the
Pearson correlation coefficient [7, p. 265]. By a soft fd between columns $C_1$ and $C_2$, we mean a generalization of the classical notion of a “hard” fd in which the value of $C_1$ completely determines the value of $C_2$. In a soft fd (denoted by $C_1 \Rightarrow C_2$), the value of $C_1$ determines the value of $C_2$ not with certainty, but merely with high probability. An example of a hard fd is given by Country and Continent; the former completely determines the latter. On the other hand, for cars, Make is determined by Model via a soft dependency: given that Model = 323, we know that Make = Mazda with high probability, but there is also a small chance that Make = BMW.
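
One simple way to quantify how soft such a dependency is (our own illustration on hypothetical data; Section 3 describes what cords actually measures) is the fraction of rows that agree with the majority Make for their Model:

# Illustrative strength measure for a soft fd Model => Make: the
# fraction of rows agreeing with the most common Make per Model.
# A value of 1.0 would indicate a hard fd.  Toy data only.
from collections import Counter, defaultdict

rows = [("323", "Mazda")] * 95 + [("323", "BMW")] * 5 + [("528", "BMW")] * 50

makes_per_model = defaultdict(Counter)
for model, make in rows:
    makes_per_model[model][make] += 1

agree = sum(c.most_common(1)[0][1] for c in makes_per_model.values())
print(f"soft-fd strength: {agree / len(rows):.3f}")  # 0.967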
cords both builds upon and significantly modifies the
technology of the b-hunt system as described in [12]. As
with b-hunt, cords searches for column pairs that might
have interesting and useful correlations by systematically
enumerating candidate pairs and simultaneously pruning
unpromising candidates using a flexible set of heuristics.
Also as with b-hunt, cords analyzes a sample of rows in
order to ensure scalability to very large tables. The sim-
ilarities end here, however. Whereas b-hunt uses “bump
hunting” techniques to discover soft algebraic relationships
between numerical attributes, cords employs a robust chi-
squared analysis to identify correlations between both nu-
merical and categorical attributes, and an analysis of the
number of distinct values in the sampled columns to detect
soft fds. The sample size required for the chi-squared anal-
ysis is essentially independent of the database size, so that
the algorithm is highly scalable.
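
The following sketch shows the core of both tests (our simplification in Python; it relies on scipy's stock chi-squared test, whereas the paper's robust analysis additionally buckets high-cardinality columns and adds other safeguards):

# Sketch of the two sampling-based tests: a chi-squared test for
# correlation and a distinct-value comparison for soft fds.  A
# simplification: CORDS buckets high-cardinality columns first.
import random
from collections import Counter
from scipy.stats import chi2_contingency

def correlated(rows, sample_size=2000, alpha=0.01):
    sample = random.sample(rows, min(sample_size, len(rows)))
    counts = Counter(sample)  # (v1, v2) -> frequency in the sample
    v1s = sorted({a for a, _ in counts})
    v2s = sorted({b for _, b in counts})
    table = [[counts.get((a, b), 0) for b in v2s] for a in v1s]
    chi2, p_value, dof, _ = chi2_contingency(table)
    return p_value < alpha  # reject independence => correlated

def soft_fd(rows, sample_size=2000, eps=0.05):
    sample = random.sample(rows, min(sample_size, len(rows)))
    d1 = len({a for a, _ in sample})  # distinct values of C1
    d12 = len(set(sample))            # distinct (C1, C2) pairs
    return d12 <= (1 + eps) * d1      # C1 (almost) determines C2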
cords can serve as a data mining tool, e.g., its output
can be converted to a dependency graph as in Figure 6.
Such graphs are of intrinsic interest. Our primary focus,
however, is on the application of cords in query optimiza-
tion. Our proposed scheme uses the output of cords to
recommend groups of columns on which to maintain cer-
tain simple joint statistics, such as the number of distinct
combinations of values in the columns. An optimizer can
then use such “column-group” (cg) statistics to avoid in-
accurate selectivity estimates caused by naive independence
assumptions. cords can be used in conjunction with a query
feedback system, leveraging the infrastructure of such a sys-
tem to correct bad selectivity estimates and ameliorating the
poor performance of feedback systems during slow learning
phases.
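
For equality predicates the repair is especially simple: under a uniformity assumption, sel(C = v) ≈ 1/d(C), where d(C) is the number of distinct values in C, so a CG distinct-combination count d(C1, C2) directly replaces the independence product d(C1) d(C2). A sketch in that spirit (hypothetical catalog numbers; a production optimizer's formula differs in detail):

# How a column-group (CG) statistic repairs an equality-predicate
# estimate.  Illustrative only.  d1, d2, d12 are hypothetical catalog
# counts of distinct values in C1, C2, and the pair (C1, C2).
def conj_selectivity(d1, d2, d12=None):
    if d12 is None:                # no CG statistic: assume independence
        return 1.0 / (d1 * d2)
    return 1.0 / max(d12, d1, d2)  # CG statistic available

d_city, d_zip, d_pair = 100, 1000, 1000  # the toy City/Zip data above
print(conj_selectivity(d_city, d_zip))          # 1e-05: off by 100x
print(conj_selectivity(d_city, d_zip, d_pair))  # 0.001: matches the data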
cords currently considers only pairs of columns, and not
triples, quadruples, and so forth. This restriction vastly re-
duces the complexity of the algorithm, and our experiments
(see Section 5.3) indicate that most of the benefit in query
performance is obtained by maintaining the cg statistics
of order 2. The cords approach, because of its simplicity,
restriction to column pairs, and judicious use of sampling,
is relatively easy to implement in existing commercial sys-
tems, has very low overhead, and scales well to the large
numbers of columns and large table sizes found in real-world
databases.
The rest of the paper is organized as follows. Section 2
gives an overview of recent technology for relaxing the in-
dependence assumption in query optimization, and relates
the results in the current paper to this body of work. Sec-
tion 3 describes the cords detection algorithm in detail,
including generation of candidate column pairs, pruning of
unpromising candidates, sampling-based chi-squared analy-
sis of correlation, and discovery of soft fds. In Section 4,
we discuss our simple but effective method for using the
output of cords to improve an optimizer’s selectivity esti-
mates and hence its choice of query plans. Section 5 contains
an experimental evaluation of cords in which we validate
the correlation detection algorithm using a database with
known correlations. We also explore the effect of varying
the sampling rate and of using higher-order cg statistics,
and demonstrate the speedup in query execution times that
can result from the use of cords. In Section 6 we describe an
application of cords to several real-world and benchmark
databases. Section 7 contains our conclusions and recom-
mendations for future work.
2. RELATED WORK
Recent years have witnessed a large body of work aimed at
relaxing the independence assumption in selectivity estima-
tion. As in [12], we can distinguish between “data driven”
and “query driven” approaches to detecting and exploiting
dependencies among database columns.
2.1 Query-Driven Approaches
Some query-driven approaches focus on information con-
tained in a query workload, i.e., a list of relevant queries. For
example, Bruno and Chaudhuri [4] use the query workload
together with optimizer estimates of query execution times
to determine a beneficial set of “sits” to retain. sits are
statistics (typically multidimensional histograms) on query
expressions that can be used to avoid large selectivity esti-
mation errors due to independence assumptions.
Alternatively, a query feedback system (qfs) uses feedback
from query execution to increase optimizer accuracy. leo,
the DB2 learning optimizer [20], is a typical example of a
qfs. leo compares the actual selectivities of query results
with the optimizer’s estimated selectivities. In this way, leo
can detect errors caused by faulty independence assumptions
and create adjustment factors which can be applied in the
future to improve the optimizer’s selectivity estimates.
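
The mechanism can be pictured roughly as follows (our simplification, with hypothetical keys and numbers; leo's actual bookkeeping is far richer):

# Simplified picture of feedback-based adjustment factors (in the
# spirit of a QFS such as LEO, not its actual implementation).
adjustments = {}  # predicate-combination key -> correction factor

def record_feedback(key, estimated, actual):
    adjustments[key] = actual / estimated

def adjusted_estimate(key, estimated):
    return estimated * adjustments.get(key, 1.0)

record_feedback(("Make=Mazda", "Model=323"), estimated=1e-5, actual=1e-3)
print(adjusted_estimate(("Make=Mazda", "Model=323"), 1e-5))  # ~0.001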
In [1, 5], query feedback is used to incrementally build a
multidimensional histogram that can be used to estimate the
selectivity of conjunctive predicates. The algorithms in [1, 5]
do not discover correlation between columns, however: the
set of columns over which to build the histogram must be
specified a priori.
The sash algorithm [14] decomposes the set of columns
in a table into disjoint clusters. Columns within a clus-
ter are considered correlated and a multidimensional his-
togram is maintained for these columns. Columns in differ-
ent clusters are treated as independent. In other words, sash
approximates the full joint distribution of the columns by
maintaining detailed histograms on certain low-dimensional
marginals in accordance with a high-level statistical inter-
action model—namely, a Markov network model—and then
computing joint frequencies as a product of the marginals.
sash uses query feedback together with a steepest-descent
method to incrementally adjust both the structure of the
high-level model and the frequency counts in the various
histogram buckets.
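
The product form is easy to picture (our toy numbers, not sash's actual data structures): with clusters {City, Zip} and {Color}, a joint histogram is kept for the first cluster, a single-column histogram for the second, and any three-way frequency is estimated as their product:

# Toy illustration of a cluster/product synopsis in the SASH style
# (our simplification; all numbers hypothetical).  Cluster {City, Zip}
# keeps a joint histogram; {Color} is modeled as independent of it.
joint_city_zip = {("city0", "zip0_0"): 0.001, ("city0", "zip0_1"): 0.001}
marg_color = {"red": 0.3, "blue": 0.7}

def sel(city, zip_code, color):
    return joint_city_zip.get((city, zip_code), 0.0) * marg_color.get(color, 0.0)

print(sel("city0", "zip0_0", "red"))  # ~0.0003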
The advantage of query-driven approaches is that they
are efficient, scale well, and result in immediate performance
gains, since they focus their efforts on columns that appear
in actual queries. The disadvantage is that the system can
produce poor estimates—and hence choose poor plans—if it
has not yet received enough feedback, either during the ini-
tial start-up period or after a sudden change in the workload.
Indeed, during one of these slow learning phases the opti-
mizer is likely to avoid query plans with accurate feedback-
based cost estimates in favor of more expensive plans that
appear to be cheap due to cost estimates based on limited
real data and faulty independence assumptions.
2.2 Data-Driven Approaches
Data-driven methods analyze the base data to discover
correlations between database columns, usually without con-
sidering the query workload. These methods can comple-
ment the query-driven approaches by identifying correla-
tions between columns that have not yet been referenced
in the workload. By proactively gathering information, a
data-driven method can mitigate the poor performance of a
query-driven method during a slow learning phase.
Most data-driven methods use discovered correlations to
construct and maintain a synopsis (lossy compressed repre-
sentation) of the joint distribution of all the attributes. In
some methods, the synopsis takes the form of a graphical
statistical model. Getoor et al. [10], for example, use prob-
abilistic relational models—which extend Bayesian network
models to the relational setting—for selectivity estimation.
Deshpande et al. [8] provide a technique that, similarly
to sash, combines a Markov network model with a set of
low-dimensional histograms. In [8], however, the synopsis
is constructed based on a full scan of the base data rather
than from query feedback. Both of the foregoing techniques
search through the space of possible models and evaluate
them according to a scoring function. Because the number of possible models is exponential in the number of attributes …