Such graphs are of intrinsic interest. Our primary focus,
however, is on the application of cords in query optimiza-
tion. Our proposed scheme uses the output of cords to
recommend groups of columns on which to maintain cer-
tain simple joint statistics, such as the number of distinct
combinations of values in the columns. An optimizer can
then use such “column-group” (cg) statistics to avoid in-
accurate selectivity estimates caused by naive independence
assumptions. cords can be used in conjunction with a query
feedback system, leveraging the infrastructure of such a sys-
tem to correct bad selectivity estimates and ameliorating the
poor performance of feedback systems during slow learning
phases.
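To make the use of cg statistics concrete, here is a minimal numerical sketch of ours (the cardinalities are hypothetical, in the spirit of a Make/Model example): the distinct-combination count for a column pair replaces the product of per-column distinct counts when estimating the selectivity of a conjunctive equality predicate.

```python
# Sketch (ours, with hypothetical numbers): how a single column-group
# statistic -- the number of distinct value combinations -- replaces the
# independence assumption when estimating the selectivity of the
# conjunctive predicate "Make = ? AND Model = ?".

distinct_make = 50            # |distinct(Make)|, a per-column statistic
distinct_model = 1000         # |distinct(Model)|
distinct_make_model = 1050    # |distinct(Make, Model)|, the cg statistic;
                              # far below 50 * 1000 because Model largely
                              # determines Make

# Naive estimate under the independence assumption:
sel_independent = (1.0 / distinct_make) * (1.0 / distinct_model)  # 2e-05

# Estimate using the cg statistic (uniformity over combinations assumed):
sel_cg = 1.0 / distinct_make_model                                # ~9.5e-04

print(f"independence: {sel_independent:.1e}  cg-based: {sel_cg:.1e}")
# Independence underestimates the selectivity by a factor of roughly 48
# here, which can mislead the optimizer into a poor plan choice.
```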
cords currently considers only pairs of columns, and not
triples, quadruples, and so forth. This restriction vastly re-
duces the complexity of the algorithm, and our experiments
(see Section 5.3) indicate that most of the benefit in query
performance is obtained by maintaining the cg statistics
of order 2. The cords approach, because of its simplicity,
restriction to column pairs, and judicious use of sampling,
is relatively easy to implement in existing commercial sys-
tems, has very low overhead, and scales well to the large
numbers of columns and large table sizes found in real-world
databases.
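As a rough illustration of the sampling-based chi-squared analysis of column pairs mentioned above (the precise algorithm appears in Section 3), the following sketch tests a single pair for correlation on a small random sample; the sample size and significance level are arbitrary choices of ours, not values from the paper, and the real algorithm adds candidate pruning and bucketization of high-cardinality columns.

```python
import random
from collections import Counter

from scipy.stats import chi2_contingency  # standard chi-squared test

def looks_correlated(rows, i, j, sample_size=2000, alpha=0.01):
    """Rough illustration only: test whether columns i and j of `rows`
    look correlated, using a chi-squared test of independence on a
    contingency table built from a small random sample."""
    sample = random.sample(rows, min(sample_size, len(rows)))
    counts = Counter((row[i], row[j]) for row in sample)
    xs = sorted({x for x, _ in counts})
    ys = sorted({y for _, y in counts})
    if len(xs) < 2 or len(ys) < 2:
        return False  # a (near-)constant column cannot exhibit correlation
    table = [[counts.get((x, y), 0) for y in ys] for x in xs]
    chi2, p_value, dof, _ = chi2_contingency(table)
    return p_value < alpha  # small p-value: reject independence
```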
The rest of the paper is organized as follows. Section 2
gives an overview of recent technology for relaxing the in-
dependence assumption in query optimization, and relates
the results in the current paper to this body of work. Sec-
tion 3 describes the cords detection algorithm in detail,
including generation of candidate column pairs, pruning of
unpromising candidates, sampling-based chi-squared analy-
sis of correlation, and discovery of soft fds. In Section 4,
we discuss our simple but effective method for using the
output of cords to improve an optimizer’s selectivity esti-
mates and hence its choice of query plans. Section 5 contains
an experimental evaluation of cords in which we validate
the correlation detection algorithm using a database with
known correlations. We also explore the effect of varying
the sampling rate and of using higher-order cg statistics,
and demonstrate the speedup in query execution times that
can result from the use of cords. In Section 6 we describe an
application of cords to several real-world and benchmark
databases. Section 7 contains our conclusions and recom-
mendations for future work.
2. RELATED WORK
Recent years have witnessed a large body of work aimed at
relaxing the independence assumption in selectivity estima-
tion. As in [12], we can distinguish between “data driven”
and “query driven” approaches to detecting and exploiting
dependencies among database columns.
2.1 Query-Driven Approaches
Some query-driven approaches focus on information con-
tained in a query workload, i.e., a list of relevant queries. For
example, Bruno and Chaudhuri [4] use the query workload
together with optimizer estimates of query execution times
to determine a beneficial set of “sits” to retain. sits are
statistics (typically multidimensional histograms) on query
expressions that can be used to avoid large selectivity esti-
mation errors due to independence assumptions.
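As a loose illustration of the idea (not the actual construction of [4]), a sit can be pictured as a statistic built directly on the result of a query expression, so that predicates over that expression are estimated from its own distribution rather than composed from base-table statistics; all names below are hypothetical.

```python
import numpy as np

# Loose illustration of the idea behind sits (not the construction of
# [4]): build a histogram directly on a query expression -- here on
# attribute r_val of the equi-join of R and S -- and answer range
# predicates over the join result from that histogram.

def join_result(r_rows, s_keys):
    """Toy equi-join; r_rows are (key, r_val) pairs, s_keys is a set."""
    return [r_val for key, r_val in r_rows if key in s_keys]

def build_sit(values, nbins=10):
    """Equi-width histogram over the query expression's output."""
    counts, edges = np.histogram(np.asarray(values, dtype=float), bins=nbins)
    return counts, edges

def estimated_selectivity(sit, lo, hi):
    """Fraction of join-result rows with lo <= r_val < hi, counting
    only bins fully contained in [lo, hi) -- crude but illustrative."""
    counts, edges = sit
    inside = (edges[:-1] >= lo) & (edges[1:] <= hi)
    return counts[inside].sum() / counts.sum()
```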
Alternatively, a query feedback system (qfs) uses feedback
from query execution to increase optimizer accuracy. leo,
the DB2 learning optimizer [20], is a typical example of a
qfs. leo compares the actual selectivities of query results
with the optimizer’s estimated selectivities. In this way, leo
can detect errors caused by faulty independence assumptions
and create adjustment factors which can be applied in the
future to improve the optimizer’s selectivity estimates.
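The feedback loop can be pictured schematically as follows (our sketch of the general mechanism, not leo's actual implementation): each observed ratio of actual to estimated selectivity becomes an adjustment factor that is later multiplied into fresh estimates for the same predicate.

```python
# Schematic of a feedback loop in the style of leo (our sketch, not
# DB2's implementation): remember the ratio of actual to estimated
# selectivity, keyed by predicate, and apply it to future estimates.

adjustments = {}  # predicate signature -> adjustment factor

def record_feedback(predicate, estimated_sel, actual_sel):
    """Store the observed error ratio for this predicate."""
    if estimated_sel > 0:
        adjustments[predicate] = actual_sel / estimated_sel

def adjusted_estimate(predicate, estimated_sel):
    """Correct a fresh optimizer estimate with past feedback, if any."""
    return estimated_sel * adjustments.get(predicate, 1.0)

# Example: independence led to an estimate of 0.0001, but the query
# actually returned 2% of the rows.
record_feedback(("make=?", "model=?"), 0.0001, 0.02)
print(adjusted_estimate(("make=?", "model=?"), 0.0001))  # -> 0.02
```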
In [1, 5], query feedback is used to incrementally build a
multidimensional histogram that can be used to estimate the
selectivity of conjunctive predicates. The algorithms in [1, 5]
do not discover correlation between columns, however: the
set of columns over which to build the histogram must be
specified a priori.
The sash algorithm [14] decomposes the set of columns
in a table into disjoint clusters. Columns within a clus-
ter are considered correlated and a multidimensional his-
togram is maintained for these columns. Columns in differ-
ent clusters are treated as independent. In other words, sash
approximates the full joint distribution of the columns by
maintaining detailed histograms on certain low-dimensional
marginals in accordance with a high-level statistical inter-
action model—namely, a Markov network model—and then
computing joint frequencies as a product of the marginals.
sash uses query feedback together with a steepest-descent
method to incrementally adjust both the structure of the
high-level model and the frequency counts in the various
histogram buckets.
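The product-form approximation at the heart of sash can be pictured as follows (a simplified sketch of ours; sash learns both the clustering and the bucket counts from query feedback, and maintains histograms rather than the toy exact distributions shown here).

```python
# Simplified sketch of the product-form approximation used by sash:
# the joint distribution over all columns is approximated by
# low-dimensional distributions over disjoint clusters of correlated
# columns, with independence assumed across clusters. The clustering
# and the probabilities below are hypothetical and simply given.

clusters = [("A", "B"), ("C",)]  # {A, B} correlated, {C} on its own

# Per-cluster joint probabilities (the "marginals" of the model).
marginals = {
    ("A", "B"): {("a1", "b1"): 0.4, ("a1", "b2"): 0.1,
                 ("a2", "b1"): 0.1, ("a2", "b2"): 0.4},
    ("C",):     {("c1",): 0.7, ("c2",): 0.3},
}

def joint_probability(row):
    """P(A=a, B=b, C=c) is approximated as P(A=a, B=b) * P(C=c)."""
    p = 1.0
    for cols in clusters:
        key = tuple(row[c] for c in cols)
        p *= marginals[cols].get(key, 0.0)
    return p

print(joint_probability({"A": "a1", "B": "b2", "C": "c1"}))  # 0.1 * 0.7
```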
The advantage of query-driven approaches is that they
are efficient, scale well, and result in immediate performance
gains, since they focus their efforts on columns that appear
in actual queries. The disadvantage is that the system can
produce poor estimates—and hence choose poor plans—if it
has not yet received enough feedback, either during the ini-
tial start-up period or after a sudden change in the workload.
Indeed, during one of these slow learning phases the opti-
mizer is likely to avoid query plans with accurate feedback-
based cost estimates in favor of more expensive plans that
appear to be cheap due to cost estimates based on limited
real data and faulty independence assumptions.
2.2 Data-Driven Approaches
Data-driven methods analyze the base data to discover
correlations between database columns, usually without con-
sidering the query workload. These methods can comple-
ment the query-driven approaches by identifying correla-
tions between columns that have not yet been referenced
in the workload. By proactively gathering information, a
data-driven method can mitigate the poor performance of a
query-driven method during a slow learning phase.
Most data-driven methods use discovered correlations to
construct and maintain a synopsis (lossy compressed repre-
sentation) of the joint distribution of all the attributes. In
some methods, the synopsis takes the form of a graphical
statistical model. Getoor et al. [10], for example, use prob-
abilistic relational models—which extend Bayesian network
models to the relational setting—for selectivity estimation.
Deshpande et al. [8] provide a technique that, similarly
to sash, combines a Markov network model with a set of
low-dimensional histograms. In [8], however, the synopsis
is constructed based on a full scan of the base data rather
than from query feedback. Both of the foregoing techniques
search through the space of possible models and evaluate
them according to a scoring function. Because the num-
ber of possible models is exponential in the number of at-