Integrating Query-Feedback Based Statistics
into Informix Dynamic Server

Alexander Behm¹, Volker Markl², Peter Haas³, Keshava Murthy⁴

¹ Berufsakademie Stuttgart / IBM Germany, alexbehm@gmx.de
² IBM Almaden Research Center, marklv@us.ibm.com
³ IBM Almaden Research Center, peterh@almaden.ibm.com
⁴ IBM Menlo Park, rkeshav@us.ibm.com
Abstract: Statistics that accurately describe the distribution of data values in the columns of relational tables are essential for effective query optimization in a database management system. Manually maintaining such statistics in the face of changing data is difficult and can lead to suboptimal query performance and high administration costs. In this paper, we describe a method and prototype implementation for automatically maintaining high quality single-column statistics, as used by the optimizer in IBM Informix Dynamic Server (IDS). Our method both refines and extends the ISOMER algorithm of Srivastava et al. for maintaining a multidimensional histogram based on query feedback (QF). Like ISOMER, our new method is based on the maximum entropy (ME) principle, and therefore incorporates information about the data distribution in a principled and consistent manner. However, because IDS only needs to maintain one-dimensional histograms, we can simplify the ISOMER algorithm in several ways, significantly speeding up performance. First, we replace the expensive STHoles data structure used by ISOMER with a simple binning scheme, using a sweep-line algorithm to determine bin boundaries. Next, we use an efficient method for incorporating new QF into the histogram; the idea is to aggregate, prior to the ME computation, those bins that do not overlap with the new feedback records. Finally, we introduce a fast pruning method to ensure that the number of bins in the frequency distribution stays below a specified upper bound. Besides refining ISOMER to deal efficiently with one-dimensional histograms, we extend previous work by combining the reactive QF approach with a proactive sampling approach. Sampling is triggered whenever (as determined from QF records) actual and estimated selectivities diverge to an unacceptably large degree. Our combined proactive/reactive approach greatly improves the robustness of the estimation mechanism, ensuring very high quality selectivity estimates for queries falling inside the range of available feedback while guaranteeing reasonably good estimates for queries outside of the range. By automatically updating statistics, query execution is improved due to better selectivity estimates, and the total cost of ownership (TCO) is reduced since the database administrator need not update statistics manually for monitored columns.
1 Introduction
In today's database market, performance and the total cost of ownership (TCO) are two important factors that provide a competitive advantage. Performance depends on the query optimizer's choices of physical execution plans, which in turn rely heavily on the accuracy of the available statistics [IC91, Lyn88]. The total cost of ownership is influenced, among other factors, by the administration cost. Traditionally, statistics are updated manually by the database administrator (DBA) at intervals according to vendor recommendations, based on experience, or in response to poor performance. On the one hand, this approach requires expertise and time from the DBA, which may increase the total cost of administration. On the other hand, manually updating statistics at fixed intervals may result in either too frequent gathering of statistics (which is expensive and may impact the processing of user queries) or too infrequent gathering of statistics, resulting in suboptimal query plans due to inaccurate selectivity estimates. So far, there exists no optimal solution for determining when to update the statistics for which tables and columns. To tackle this problem, we propose a method for automatically and autonomously gathering statistics. It ensures accurate statistics, because they automatically evolve with changes in the database, and it minimizes the need for manual intervention by the DBA, which reduces costs. The general idea of feedback-based statistics as described in [SLMK01, CR94] is to exploit the results of actual queries that run on the database, creating a feedback loop that improves future selectivity estimates by "learning" from previous queries. Embodiments of this principle include suggesting tables and columns for which to update statistics (e.g., based on a significant discrepancy between estimated and actual cardinalities), suggesting statistical parameters such as the number of frequent values and quantiles to gather [Beh05], creating multi-dimensional histograms (see chapter 2), and others.
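The feedback loop just described can be illustrated with a minimal sketch. The record fields, the threshold, and the function names below are illustrative assumptions, not the actual IDS monitoring interface:

```python
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    """One query-feedback (QF) record for a single-column range predicate."""
    low: float        # lower bound of the predicate range
    high: float       # upper bound of the predicate range
    estimated: int    # cardinality estimated by the optimizer
    actual: int       # cardinality observed at execution time
    timestamp: float  # used to resolve inconsistencies (recent info is preferred)

def needs_new_statistics(records, error_factor=2.0):
    """Flag a column when any QF record shows estimated and actual
    cardinalities diverging by more than `error_factor` in either direction."""
    for r in records:
        est, act = max(r.estimated, 1), max(r.actual, 1)
        if est / act > error_factor or act / est > error_factor:
            return True
    return False
```

For example, a record with an estimate of 50 rows but 400 actual rows exceeds a factor-of-2 error and would mark the column as a candidate for updated statistics, while an estimate of 100 against 95 actual rows would not.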
In this work we present a query-feedback based method for creating and maintaining single-column statistics in IBM Informix Dynamic Server. Our work is a specialization of ISOMER [SHM+06] to the one-dimensional case, which allows for performance improvements (see chapter 2). We gather a sampling-based distribution for a given column and continuously refine it using QF. The refinement process consists of eliminating inconsistencies based on a timestamp (recent information is preferred) and creating a maximum entropy (ME) cumulative frequency distribution. This distribution is consistent with all available information (QF, previous sampling information) and uniform in the absence of information (see chapter 6). In contrast to the work described in [Beh05], we do not use the ME distribution for recommending statistical parameters but make this distribution available to the optimizer directly for selectivity estimation. Moreover, this approach offers flexibility in cases in which feedback only delivers information on small regions of the column domain. Over time, the accuracy of the statistics will degrade as the data in the column changes. Hence, there is a good chance that queries not covered by feedback will exhibit a large discrepancy between estimated and actual cardinalities. If a significant discrepancy is detected, an update of the statistics via sampling can be scheduled. Thus, in between sampling procedures, the QF is used to continuously refine the distribution until such a discrepancy is detected. If feedback continuously delivers information on the whole column domain, it may not be necessary to sample the column
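One step of the combined reactive/proactive scheme might be organized roughly as follows. This is a sketch under stated assumptions, not the IDS implementation: the (low, high, count) bin triples, the per-bin uniformity assumption, the threshold, and the `resample` callback are all illustrative, and the ME refinement step is elided:

```python
def estimate_range(bins, low, high):
    """Estimate the cardinality of `low <= x < high` from a histogram given
    as (bin_low, bin_high, count) triples, assuming uniformity within bins."""
    total = 0.0
    for b_lo, b_hi, count in bins:
        overlap = max(0.0, min(high, b_hi) - max(low, b_lo))
        if b_hi > b_lo:
            total += count * overlap / (b_hi - b_lo)
    return total

def maintenance_step(bins, feedback, resample, threshold=2.0):
    """One reactive/proactive step: if any feedback record (low, high, actual)
    reveals a discrepancy beyond `threshold`, rebuild the histogram proactively
    via `resample()`; otherwise keep refining reactively (refinement elided)."""
    for low, high, actual in feedback:
        est = max(estimate_range(bins, low, high), 1.0)
        act = max(float(actual), 1.0)
        if est / act > threshold or act / est > threshold:
            return resample()  # proactive path: fresh sampling-based histogram
    return bins                # reactive path: current histogram still usable
```

With bins `[(0, 10, 100), (10, 20, 100)]`, a feedback record reporting 48 actual rows for the range [0, 5) (estimate: 50) stays within the threshold, whereas 500 actual rows for the same range would trigger re-sampling.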