CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies

Ihab F. Ilyas¹, Volker Markl², Peter Haas², Paul Brown², Ashraf Aboulnaga²

¹ Purdue University, 250 N. University Street, West Lafayette, Indiana 47906
  ilyas@cs.purdue.edu
² IBM Almaden Research Center, 650 Harry Road, K55/B1, San Jose, CA 95139
  {marklv, phaas, pbrown1, aashraf}@us.ibm.com
ABSTRACT
The rich dependency structure found in the columns of real-world
relational databases can be exploited to great advantage, but can
also cause query optimizers—which usually assume that columns
are statistically independent—to underestimate the selectivities
of conjunctive predicates by orders of magnitude. We introduce
cords, an efficient and scalable tool for automatic discovery of
correlations and soft functional dependencies between columns.
cords searches for column pairs that might have interesting and
useful dependency relations by systematically enumerating can-
didate pairs and simultaneously pruning unpromising candidates
using a flexible set of heuristics. A robust chi-squared analysis is
applied to a sample of column values in order to identify correla-
tions, and the number of distinct values in the sampled columns
is analyzed to detect soft functional dependencies. cords can be
used as a data mining tool, producing dependency graphs that
are of intrinsic interest. We focus primarily on the use of cords
in query optimization. Specifically, cords recommends groups
of columns on which to maintain certain simple joint statistics.
These “column-group” statistics are then used by the optimizer
to avoid naive selectivity estimates based on inappropriate inde-
pendence assumptions. This approach, because of its simplicity
and judicious use of sampling, is relatively easy to implement in
existing commercial systems, has very low overhead, and scales
well to the large numbers of columns and large table sizes found in
real-world databases. Experiments with a prototype implementa-
tion show that the use of cords in query optimization can speed
up query execution times by an order of magnitude. cords can
be used in tandem with query feedback systems such as the leo
learning optimizer, leveraging the infrastructure of such systems
to correct bad selectivity estimates and ameliorating the poor
performance of feedback systems during slow learning phases.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD 2004, June 13–18, 2004, Paris, France.
Copyright 2004 ACM 1-58113-859-8/04/06...$5.00.

1. INTRODUCTION
When a query optimizer in a commercial relational dbms
chooses a horrible query plan, the cause of the disaster is
usually an inappropriate independence assumption that the
optimizer has imposed on two or more columns. Independence
assumptions are used in virtually all query optimizers
because they greatly simplify selectivity estimation: e.g., if $p_1$ and $p_2$ are predicates on respective columns $C_1$ and $C_2$, then the selectivity of the conjunctive predicate $p_1 \wedge p_2$ is estimated by simply multiplying together the individual selectivities of $p_1$ and $p_2$. This approach, however, ignores the
rich dependency structure that is present in most real-world
data. Indeed, our experience indicates that dependency be-
tween columns is the rule, rather than the exception, in the
real world.
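
To make the danger concrete, the following sketch (a toy illustration of ours, not code from the paper; all column names and data are hypothetical) contrasts the independence-based estimate with the true conjunctive selectivity on a strongly correlated column pair:

# Toy illustration (not from the paper): the independence assumption
# underestimates conjunctive selectivity on correlated columns.
# Hypothetical data: each Zip value lies in exactly one City.
import random

random.seed(0)
zips_of = {f"city{i}": [f"zip{i}_{j}" for j in range(10)] for i in range(100)}
rows = []
for _ in range(100_000):
    city = random.choice(list(zips_of))
    rows.append((city, random.choice(zips_of[city])))

def sel(pred):
    return sum(map(pred, rows)) / len(rows)

s_city = sel(lambda r: r[0] == "city0")                       # ~0.01
s_zip = sel(lambda r: r[1] == "zip0_0")                       # ~0.001
s_conj = sel(lambda r: r[0] == "city0" and r[1] == "zip0_0")  # ~0.001

print(f"independence estimate: {s_city * s_zip:.6f}")  # ~0.00001
print(f"true selectivity:      {s_conj:.6f}")          # ~100x larger

Because each hypothetical Zip value occurs in exactly one City, the true conjunctive selectivity equals the Zip selectivity, roughly 100 times the independence estimate.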
This paper introduces cords (CORrelation Detection via
Sampling), an efficient and scalable tool for automatically
discovering statistical correlations and “soft” functional de-
pendencies (fds) between columns. By “correlations,” we
mean general statistical dependencies, not merely approxi-
mate linear relationships as measured, for example, by the
Pearson correlation coefficient [7, p. 265]. By a soft fd between columns $C_1$ and $C_2$, we mean a generalization of the classical notion of a “hard” fd in which the value of $C_1$ completely determines the value of $C_2$. In a soft fd (denoted by $C_1 \Rightarrow C_2$), the value of $C_1$ determines the value of $C_2$ not with certainty, but merely with high probability. An example of a hard fd is given by Country and Continent; the former completely determines the latter. On the other hand, for cars, Make is determined by Model via a soft dependency: given that Model = 323, we know that Make = Mazda with high probability, but there is also a small chance that Make = BMW.
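
One simple way to quantify how soft such a dependency is (our own illustration on hypothetical data; Section 3 describes what cords actually measures) is the fraction of rows that agree with the majority Make for their Model:

# Illustrative strength measure for a soft fd Model => Make: the
# fraction of rows agreeing with the most common Make per Model.
# A value of 1.0 would indicate a hard fd.  Toy data only.
from collections import Counter, defaultdict

rows = [("323", "Mazda")] * 95 + [("323", "BMW")] * 5 + [("528", "BMW")] * 50

makes_per_model = defaultdict(Counter)
for model, make in rows:
    makes_per_model[model][make] += 1

agree = sum(c.most_common(1)[0][1] for c in makes_per_model.values())
print(f"soft-fd strength: {agree / len(rows):.3f}")  # 0.967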
cords both builds upon and significantly modifies the
technology of the b-hunt system as described in [12]. As
with b-hunt, cords searches for column pairs that might
have interesting and useful correlations by systematically
enumerating candidate pairs and simultaneously pruning
unpromising candidates using a flexible set of heuristics.
Also as with b-hunt, cords analyzes a sample of rows in
order to ensure scalability to very large tables. The sim-
ilarities end here, however. Whereas b-hunt uses “bump
hunting” techniques to discover soft algebraic relationships
between numerical attributes, cords employs a robust chi-
squared analysis to identify correlations between both nu-
merical and categorical attributes, and an analysis of the
number of distinct values in the sampled columns to detect
soft fds. The sample size required for the chi-squared anal-
ysis is essentially independent of the database size, so that
the algorithm is highly scalable.
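
The following sketch shows the core of both tests (our simplification in Python; it relies on scipy's stock chi-squared test, whereas the paper's robust analysis additionally buckets high-cardinality columns and adds other safeguards):

# Sketch of the two sampling-based tests: a chi-squared test for
# correlation and a distinct-value comparison for soft fds.  A
# simplification: CORDS buckets high-cardinality columns first.
import random
from collections import Counter
from scipy.stats import chi2_contingency

def correlated(rows, sample_size=2000, alpha=0.01):
    sample = random.sample(rows, min(sample_size, len(rows)))
    counts = Counter(sample)  # (v1, v2) -> frequency in the sample
    v1s = sorted({a for a, _ in counts})
    v2s = sorted({b for _, b in counts})
    table = [[counts.get((a, b), 0) for b in v2s] for a in v1s]
    chi2, p_value, dof, _ = chi2_contingency(table)
    return p_value < alpha  # reject independence => correlated

def soft_fd(rows, sample_size=2000, eps=0.05):
    sample = random.sample(rows, min(sample_size, len(rows)))
    d1 = len({a for a, _ in sample})  # distinct values of C1
    d12 = len(set(sample))            # distinct (C1, C2) pairs
    return d12 <= (1 + eps) * d1      # C1 (almost) determines C2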
cords can serve as a data mining tool, e.g., its output
can be converted to a dependency graph as in Figure 6.
Such graphs are of intrinsic interest. Our primary focus,
however, is on the application of cords in query optimiza-
tion. Our proposed scheme uses the output of cords to
recommend groups of columns on which to maintain cer-
tain simple joint statistics, such as the number of distinct
combinations of values in the columns. An optimizer can
then use such “column-group” (cg) statistics to avoid in-
accurate selectivity estimates caused by naive independence
assumptions. cords can be used in conjunction with a query
feedback system, leveraging the infrastructure of such a sys-
tem to correct bad selectivity estimates and ameliorating the
poor performance of feedback systems during slow learning
phases.
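
For equality predicates the repair is especially simple: under a uniformity assumption, sel(C = v) ≈ 1/d(C), where d(C) is the number of distinct values in C, so a CG distinct-combination count d(C1, C2) directly replaces the independence product d(C1) d(C2). A sketch in that spirit (hypothetical catalog numbers; a production optimizer's formula differs in detail):

# How a column-group (CG) statistic repairs an equality-predicate
# estimate.  Illustrative only.  d1, d2, d12 are hypothetical catalog
# counts of distinct values in C1, C2, and the pair (C1, C2).
def conj_selectivity(d1, d2, d12=None):
    if d12 is None:                # no CG statistic: assume independence
        return 1.0 / (d1 * d2)
    return 1.0 / max(d12, d1, d2)  # CG statistic available

d_city, d_zip, d_pair = 100, 1000, 1000  # the toy City/Zip data above
print(conj_selectivity(d_city, d_zip))          # 1e-05: off by 100x
print(conj_selectivity(d_city, d_zip, d_pair))  # 0.001: matches the data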
cords currently considers only pairs of columns, and not
triples, quadruples, and so forth. This restriction vastly re-
duces the complexity of the algorithm, and our experiments
(see Section 5.3) indicate that most of the benefit in query
performance is obtained by maintaining the cg statistics
of order 2. The cords approach, because of its simplicity,
restriction to column pairs, and judicious use of sampling,
is relatively easy to implement in existing commercial sys-
tems, has very low overhead, and scales well to the large
numbers of columns and large table sizes found in real-world
databases.
The rest of the paper is organized as follows. Section 2
gives an overview of recent technology for relaxing the in-
dependence assumption in query optimization, and relates
the results in the current paper to this body of work. Sec-
tion 3 describes the cords detection algorithm in detail,
including generation of candidate column pairs, pruning of
unpromising candidates, sampling-based chi-squared analy-
sis of correlation, and discovery of soft fds. In Section 4,
we discuss our simple but effective method for using the
output of cords to improve an optimizer’s selectivity esti-
mates and hence its choice of query plans. Section 5 contains
an experimental evaluation of cords in which we validate
the correlation detection algorithm using a database with
known correlations. We also explore the effect of varying
the sampling rate and of using higher-order cg statistics,
and demonstrate the speedup in query execution times that
can result from the use of cords. In Section 6 we describe an
application of cords to several real-world and benchmark
databases. Section 7 contains our conclusions and recom-
mendations for future work.
2. RELATED WORK
Recent years have witnessed a large body of work aimed at
relaxing the independence assumption in selectivity estima-
tion. As in [12], we can distinguish between “data driven”
and “query driven” approaches to detecting and exploiting
dependencies among database columns.
2.1 Query-Driven Approaches
Some query-driven approaches focus on information con-
tained in a query workload, i.e., a list of relevant queries. For
example, Bruno and Chaudhuri [4] use the query workload
together with optimizer estimates of query execution times
to determine a beneficial set of “sits” to retain. sits are
statistics (typically multidimensional histograms) on query
expressions that can be used to avoid large selectivity esti-
mation errors due to independence assumptions.
Alternatively, a query feedback system (qfs) uses feedback
from query execution to increase optimizer accuracy. leo,
the DB2 learning optimizer [20], is a typical example of a
qfs. leo compares the actual selectivities of query results
with the optimizer’s estimated selectivities. In this way, leo
can detect errors caused by faulty independence assumptions
and create adjustment factors which can be applied in the
future to improve the optimizer’s selectivity estimates.
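
The mechanism can be pictured roughly as follows (our simplification, with hypothetical keys and numbers; leo's actual bookkeeping is far richer):

# Simplified picture of feedback-based adjustment factors (in the
# spirit of a QFS such as LEO, not its actual implementation).
adjustments = {}  # predicate-combination key -> correction factor

def record_feedback(key, estimated, actual):
    adjustments[key] = actual / estimated

def adjusted_estimate(key, estimated):
    return estimated * adjustments.get(key, 1.0)

record_feedback(("Make=Mazda", "Model=323"), estimated=1e-5, actual=1e-3)
print(adjusted_estimate(("Make=Mazda", "Model=323"), 1e-5))  # ~0.001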
In [1, 5], query feedback is used to incrementally build a
multidimensional histogram that can be used to estimate the
selectivity of conjunctive predicates. The algorithms in [1, 5]
do not discover correlation between columns, however: the
set of columns over which to build the histogram must be
specified a priori.
The sash algorithm [14] decomposes the set of columns
in a table into disjoint clusters. Columns within a clus-
ter are considered correlated and a multidimensional his-
togram is maintained for these columns. Columns in differ-
ent clusters are treated as independent. In other words, sash
approximates the full joint distribution of the columns by
maintaining detailed histograms on certain low-dimensional
marginals in accordance with a high-level statistical inter-
action model—namely, a Markov network model—and then
computing joint frequencies as a product of the marginals.
sash uses query feedback together with a steepest-descent
method to incrementally adjust both the structure of the
high-level model and the frequency counts in the various
histogram buckets.
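
The product form is easy to picture (our toy numbers, not sash's actual data structures): with clusters {City, Zip} and {Color}, a joint histogram is kept for the first cluster, a single-column histogram for the second, and any three-way frequency is estimated as their product:

# Toy illustration of a cluster/product synopsis in the SASH style
# (our simplification; all numbers hypothetical).  Cluster {City, Zip}
# keeps a joint histogram; {Color} is modeled as independent of it.
joint_city_zip = {("city0", "zip0_0"): 0.001, ("city0", "zip0_1"): 0.001}
marg_color = {"red": 0.3, "blue": 0.7}

def sel(city, zip_code, color):
    return joint_city_zip.get((city, zip_code), 0.0) * marg_color.get(color, 0.0)

print(sel("city0", "zip0_0", "red"))  # ~0.0003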
The advantage of query-driven approaches is that they
are efficient, scale well, and result in immediate performance
gains, since they focus their efforts on columns that appear
in actual queries. The disadvantage is that the system can
produce poor estimates—and hence choose poor plans—if it
has not yet received enough feedback, either during the ini-
tial start-up period or after a sudden change in the workload.
Indeed, during one of these slow learning phases the opti-
mizer is likely to avoid query plans with accurate feedback-
based cost estimates in favor of more expensive plans that
appear to be cheap due to cost estimates based on limited
real data and faulty independence assumptions.
2.2 Data-Driven Approaches
Data-driven methods analyze the base data to discover
correlations between database columns, usually without con-
sidering the query workload. These methods can comple-
ment the query-driven approaches by identifying correla-
tions between columns that have not yet been referenced
in the workload. By proactively gathering information, a
data-driven method can mitigate the poor performance of a
query-driven method during a slow learning phase.
Most data-driven methods use discovered correlations to
construct and maintain a synopsis (lossy compressed repre-
sentation) of the joint distribution of all the attributes. In
some methods, the synopsis takes the form of a graphical
statistical model. Getoor et al. [10], for example, use prob-
abilistic relational models—which extend Bayesian network
models to the relational setting—for selectivity estimation.
Deshpande et al. [8] provide a technique that, similarly
to sash, combines a Markov network model with a set of
low-dimensional histograms. In [8], however, the synopsis
is constructed based on a full scan of the base data rather
than from query feedback. Both of the foregoing techniques
search through the space of possible models and evaluate
them according to a scoring function. Because the number of possible models is exponential in the number of attributes …