1 Introduction
In today’s database market, performance and the total cost of ownership (TCO) are two
important factors that provide a competitive advantage. Performance depends on the
query optimizer’s choice of physical execution plans, which in turn relies heavily on the
accuracy of the available statistics [IC91, Lyn88]. The TCO is influenced, among other
factors, by the cost of administration. Traditionally, statistics are updated manually by the
database administrator (DBA) at intervals, following vendor recommendations, personal
experience, or poor observed performance. On the one hand, this approach requires
the DBA’s expertise and time, which increases the total cost of administration. On
the other hand, manually updating statistics at certain intervals may result in either too
frequent gathering of statistics (which is expensive and may impact the processing of user
queries) or too infrequent gathering of statistics, resulting in suboptimal query plans due
to inaccurate selectivity estimates. So far, no optimal solution exists for determining
when to update the statistics of which tables and columns. To tackle this problem
we propose a method for gathering statistics automatically and autonomously. It keeps
statistics accurate, because they evolve with changes in the database, and it
minimizes the need for manual intervention by the DBA, which reduces costs. The general
idea of feedback-based statistics, as described in [SLMK01, CR94], is to exploit the re-
sults of actual queries that run on the database, creating a feedback loop that improves future
selectivity estimates by “learning” from previous queries. Embodiments of this principle
include suggesting tables and columns for which to update statistics (e.g. based on a
significant discrepancy between estimated and actual cardinalities), suggesting statistical
parameters such as the number of frequent values and quantiles to gather [Beh05], or
creating multi-dimensional histograms (see chapter 2).
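The feedback-loop idea can be sketched in a few lines. The following Python fragment is purely illustrative: the names (`QueryFeedback`, `flag_stale_columns`), the ratio-based error metric, and the threshold are our own assumptions, not the mechanism of any of the cited systems.

```python
# Illustrative sketch of the feedback loop: flag columns whose estimated
# and actual cardinalities diverge beyond a threshold factor.
# All names and the threshold are hypothetical, not from any cited system.

from dataclasses import dataclass


@dataclass
class QueryFeedback:
    column: str        # column the predicate applied to
    estimated: float   # optimizer's cardinality estimate
    actual: float      # cardinality observed at run time


def flag_stale_columns(feedback, threshold=2.0):
    """Return columns whose estimate is off by more than `threshold`x."""
    stale = set()
    for fb in feedback:
        # Ratio-based error; guard against zero cardinalities.
        err = max(fb.estimated, 1.0) / max(fb.actual, 1.0)
        if err > threshold or err < 1.0 / threshold:
            stale.add(fb.column)
    return sorted(stale)


feedback = [
    QueryFeedback("orders.status", estimated=1000, actual=90),   # ~11x off
    QueryFeedback("orders.price",  estimated=500,  actual=400),  # 1.25x off
]
print(flag_stale_columns(feedback))  # → ['orders.status']
```

In a real system the flagged columns would then be candidates for a statistics refresh, closing the loop described above.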
In this work we present a query-feedback based method for creating and maintaining
single-column statistics in IBM Informix Dynamic Server. Our work is a specialisation of
ISOMER [SHM+06] to the one-dimensional case, which allows for performance improve-
ments (see chapter 2). We gather a sampling-based distribution for a given column and continu-
ously refine it using query feedback (QF). The refinement process consists of eliminating inconsistencies
based on a timestamp (recent information is preferred) and creating a maximum entropy
(ME) cumulative frequency distribution. This distribution is consistent with all available
information (QF, previous sampling information) and uniform in the absence of informa-
tion (see chapter 6). In contrast to the work described in [Beh05], we do not use the ME
distribution for recommending statistical parameters but make this distribution available to
the optimizer directly for selectivity estimation. Moreover, this approach remains useful
even when feedback delivers information on only small regions of
the column domain. Over time, the accuracy of the statistics degrades as
the data in the column changes. Hence, it is likely that queries not covered
by feedback will yield a large discrepancy between estimated and actual cardinalities. If
a significant discrepancy is detected, an update of the statistics via sampling can be
scheduled. Thus, between sampling runs the QF is used to continuously refine
the distribution until such a discrepancy is detected. If feedback continuously delivers
information on the whole column domain, it may not be necessary to sample the column
again.
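To illustrate the maximum-entropy principle described above, consider the simplified case of non-overlapping feedback intervals: each observed selectivity is spread uniformly inside its interval, and the remaining probability mass is spread uniformly over the uncovered part of the domain. The sketch below is our own toy construction under that assumption; ISOMER and the method presented here solve a general ME optimisation that also handles overlapping constraints.

```python
# Toy maximum-entropy construction for NON-overlapping feedback intervals.
# Consistent with all constraints, uniform in the absence of information.
# Function name and setting are illustrative, not the actual system's API.

def me_pieces(domain, constraints):
    """domain: (lo, hi); constraints: list of ((a, b), selectivity),
    assumed non-overlapping, with selectivities summing to <= 1.
    Returns a piecewise-constant density as a list of ((a, b), density)."""
    lo, hi = domain
    constraints = sorted(constraints)
    used_mass = sum(s for _, s in constraints)
    covered = sum(b - a for (a, b), _ in constraints)
    rest_len = (hi - lo) - covered
    # Leftover mass spread uniformly over the uncovered region.
    rest_density = (1.0 - used_mass) / rest_len if rest_len > 0 else 0.0
    pieces, cursor = [], lo
    for (a, b), s in constraints:
        if a > cursor:                         # gap before this interval
            pieces.append(((cursor, a), rest_density))
        pieces.append(((a, b), s / (b - a)))   # uniform within the interval
        cursor = b
    if cursor < hi:                            # trailing gap
        pieces.append(((cursor, hi), rest_density))
    return pieces


# Domain [0, 100); feedback: 60% of rows in [10, 20), 10% in [50, 60).
pieces = me_pieces((0, 100), [((10, 20), 0.6), ((50, 60), 0.1)])
for (a, b), d in pieces:
    print(f"[{a:3}, {b:3}): density {d:.5f}")
```

The densities integrate to 1 over the domain, and a range predicate's selectivity is then estimated by integrating the density over the queried interval.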