1 Introduction
In today’s database market, performance and the total cost of ownership (TCO) are two
important factors that provide a competitive advantage. Performance depends on the
query optimizer’s choice of physical execution plans, which in turn relies heavily on the
accuracy of the available statistics [IC91, Lyn88]. The TCO is influenced, among other
factors, by the cost of administration. Traditionally, statistics are updated manually by the
database administrator (DBA) at intervals, following vendor recommendations, personal
experience, or poor observed performance. On the one hand, this approach requires
the DBA’s expertise and time, which increases the total cost of administration. On
the other hand, manually updating statistics at certain intervals may result in either too
frequent gathering of statistics (which is expensive and may impact the processing of user
queries) or too infrequent gathering of statistics, resulting in suboptimal query plans due
to inaccurate selectivity estimates. So far, no optimal solution exists for determining
when to update the statistics of which tables and columns. To tackle this problem
we propose a method for gathering statistics automatically and autonomously. It keeps
statistics accurate, because they evolve with changes in the database, and it
minimizes the need for manual intervention by the DBA, which reduces costs. The general
idea of feedback-based statistics, as described in [SLMK01, CR94], is to exploit the re-
sults of actual queries that run on the database, creating a feedback loop that improves future
selectivity estimates by “learning” from previous queries. Embodiments of this principle
include suggesting tables and columns for which to update statistics (e.g. based on a
significant discrepancy between estimated and actual cardinalities), suggesting statistical
parameters such as the number of frequent values and quantiles to gather [Beh05], or
creating multi-dimensional histograms (see chapter 2).
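The feedback-loop idea can be sketched in a few lines. The following Python fragment is purely illustrative: the names (`QueryFeedback`, `flag_stale_columns`), the ratio-based error metric, and the threshold are our own assumptions, not the mechanism of any of the cited systems.

```python
# Illustrative sketch of the feedback loop: flag columns whose estimated
# and actual cardinalities diverge beyond a threshold factor.
# All names and the threshold are hypothetical, not from any cited system.

from dataclasses import dataclass


@dataclass
class QueryFeedback:
    column: str        # column the predicate applied to
    estimated: float   # optimizer's cardinality estimate
    actual: float      # cardinality observed at run time


def flag_stale_columns(feedback, threshold=2.0):
    """Return columns whose estimate is off by more than `threshold`x."""
    stale = set()
    for fb in feedback:
        # Ratio-based error; guard against zero cardinalities.
        err = max(fb.estimated, 1.0) / max(fb.actual, 1.0)
        if err > threshold or err < 1.0 / threshold:
            stale.add(fb.column)
    return sorted(stale)


feedback = [
    QueryFeedback("orders.status", estimated=1000, actual=90),   # ~11x off
    QueryFeedback("orders.price",  estimated=500,  actual=400),  # 1.25x off
]
print(flag_stale_columns(feedback))  # → ['orders.status']
```

In a real system the flagged columns would then be candidates for a statistics refresh, closing the loop described above.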
In this work we present a query-feedback based method for creating and maintaining
single-column statistics in IBM Informix Dynamic Server. Our work is a specialisation of
ISOMER [SHM+06] to the one-dimensional case, which allows for performance improve-
ments (see chapter 2). We gather a sampling-based distribution for a given column and continu-
ously refine it using query feedback (QF). The refinement process consists of eliminating inconsistencies
based on a timestamp (recent information is preferred) and creating a maximum entropy
(ME) cumulative frequency distribution. This distribution is consistent with all available
information (QF, previous sampling information) and uniform in the absence of informa-
tion (see chapter 6). In contrast to the work described in [Beh05], we do not use the ME
distribution for recommending statistical parameters but make this distribution available to
the optimizer directly for selectivity estimation. Moreover, this approach remains useful
even when feedback delivers information on only small regions of
the column domain. Over time, the accuracy of the statistics degrades as
the data in the column changes. Hence, it is likely that queries not covered
by feedback will yield a large discrepancy between estimated and actual cardinalities. If
a significant discrepancy is detected, an update of the statistics via sampling can be
scheduled. Thus, between sampling runs the QF is used to continuously refine
the distribution until such a discrepancy is detected. If feedback continuously delivers
information on the whole column domain, it may not be necessary to sample the column
again.
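To illustrate the maximum-entropy principle described above, consider the simplified case of non-overlapping feedback intervals: each observed selectivity is spread uniformly inside its interval, and the remaining probability mass is spread uniformly over the uncovered part of the domain. The sketch below is our own toy construction under that assumption; ISOMER and the method presented here solve a general ME optimisation that also handles overlapping constraints.

```python
# Toy maximum-entropy construction for NON-overlapping feedback intervals.
# Consistent with all constraints, uniform in the absence of information.
# Function name and setting are illustrative, not the actual system's API.

def me_pieces(domain, constraints):
    """domain: (lo, hi); constraints: list of ((a, b), selectivity),
    assumed non-overlapping, with selectivities summing to <= 1.
    Returns a piecewise-constant density as a list of ((a, b), density)."""
    lo, hi = domain
    constraints = sorted(constraints)
    used_mass = sum(s for _, s in constraints)
    covered = sum(b - a for (a, b), _ in constraints)
    rest_len = (hi - lo) - covered
    # Leftover mass spread uniformly over the uncovered region.
    rest_density = (1.0 - used_mass) / rest_len if rest_len > 0 else 0.0
    pieces, cursor = [], lo
    for (a, b), s in constraints:
        if a > cursor:                         # gap before this interval
            pieces.append(((cursor, a), rest_density))
        pieces.append(((a, b), s / (b - a)))   # uniform within the interval
        cursor = b
    if cursor < hi:                            # trailing gap
        pieces.append(((cursor, hi), rest_density))
    return pieces


# Domain [0, 100); feedback: 60% of rows in [10, 20), 10% in [50, 60).
pieces = me_pieces((0, 100), [((10, 20), 0.6), ((50, 60), 0.1)])
for (a, b), d in pieces:
    print(f"[{a:3}, {b:3}): density {d:.5f}")
```

The densities integrate to 1 over the domain, and a range predicate's selectivity is then estimated by integrating the density over the queried interval.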