
How does AUTO_SAMPLE_SIZE work in Oracle Database 11g?

Translated by 赵勇 · 2023-03-25

Note: if you are interested in how AUTO_SAMPLE_SIZE works in Oracle Database 12c, there is an updated version of this post.

When it comes to gathering statistics, one of the most important decisions you have to make is: what sample size should be used? A 100% sample guarantees accurate statistics but can take a very long time, whereas a 1% sample finishes quickly but may produce poor statistics.

The ESTIMATE_PERCENT parameter of the DBMS_STATS.GATHER_*_STATS procedures controls the sample size used when gathering statistics, and its default value is AUTO_SAMPLE_SIZE.
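
For reference, here is a minimal sketch of a gather call that relies on the automatic sample size; the schema SCOTT and table T are placeholders, and omitting ESTIMATE_PERCENT entirely has the same effect, since AUTO_SAMPLE_SIZE is the default:

    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => 'SCOTT',                      -- placeholder schema
        tabname          => 'T',                          -- placeholder table
        estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,  -- the default, shown explicitly
        method_opt       => 'FOR ALL COLUMNS SIZE AUTO'); -- let Oracle decide which columns need histograms
    END;
    /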

In an earlier blog post, we discussed the new implementation of AUTO_SAMPLE_SIZE in Oracle Database 11g and how it improves both the speed and the accuracy of statistics gathering compared with the old AUTO_SAMPLE_SIZE used before Oracle Database 11g.

In this post, we take a closer look at how the new AUTO_SAMPLE_SIZE works and how it affects the accuracy of the statistics being gathered.

Before we delve into how the new algorithm works, let us briefly recap how the old algorithm worked and its downsides. The old AUTO_SAMPLE_SIZE used the following approach:

Step 1: Oracle starts with a small sampling percentage. If histograms need to be gathered, Oracle may materialize the sample, depending on the sampling percentage (translator's note: i.e., physically store the sampled rows).

Step 2: Oracle gathers basic column statistics on the sample. For example, suppose table T has a single column C1; the basic statistics gathering query then looks like the one described below (this is not the exact syntax Oracle uses, just a simplified version for illustration):


Query 1: the query used by the old AUTO_SAMPLE_SIZE (prior to 11g) to gather basic column statistics (a simplified sketch is given after the description below).

The items in the SELECT list correspond, respectively, to the number of rows in table T, the number of non-null values of C1, the number of distinct values of C1, the total column length, and the minimum and maximum values of C1. The "X.0000000000" in the FROM clause is the sampling percentage determined by Oracle.
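
Based on that description, a simplified, hypothetical sketch of what such a query could look like is shown below; the concrete SAMPLE percentage, the VSIZE() stand-in for the internal column-length expression, and the column aliases are illustrative only, not Oracle's internal syntax:

    SELECT COUNT(*)            AS num_rows,       -- number of rows seen in the sample
           COUNT(C1)           AS num_non_nulls,  -- number of non-null values of C1
           COUNT(DISTINCT C1)  AS ndv,            -- number of distinct values of C1
           SUM(VSIZE(C1))      AS total_col_len,  -- total column length (VSIZE is only a stand-in)
           MIN(C1)             AS min_val,        -- minimum value of C1
           MAX(C1)             AS max_val         -- maximum value of C1
    FROM   T SAMPLE (5.0000000000);               -- "X.0000000000" = sampling percentage chosen by Oracle; 5 is a placeholder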

Step 3: If histograms need to be gathered, Oracle issues a query against the sample for each column that requires a histogram.

Step 4: For each column that requires a histogram, Oracle uses several metrics to determine whether the current sample is sufficient:
Non-null value metric: whether the sample contains enough non-null values of the column;
NDV metric: whether the number of distinct values (NDV) can be properly scaled from the sample.

Step 5: If all the metrics in Step 4 pass, Oracle concludes that the current sample size is sufficient and the histogram creation for that column is complete. Otherwise, it bumps up the sample size and goes through the steps above again, until it finds a satisfactory sample or reaches 100% sampling.

Note that Steps 3 to 5 are done per column. For example, suppose there are 3 columns in the table that require histograms. In the first iteration, we get a sample and materialize it, then issue 3 queries, one per column, against the same materialized sample to gather histograms. Suppose Oracle determines that the sample is sufficient for columns 1 and 2 but insufficient for column 3. It then bumps up the sample size, and in the second iteration only one query is issued against the new sample, to gather the histogram for column 3.
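
Purely to illustrate the idea of these per-column passes (this is not Oracle's internal query; materialized_sample and C3 are hypothetical names), each pass conceptually amounts to grouping the sampled values of one column:

    -- Conceptual per-column pass over the materialized sample: the resulting
    -- value/frequency pairs are what a histogram for C3 would be built from.
    SELECT C3, COUNT(*) AS frequency
    FROM   materialized_sample
    GROUP  BY C3
    ORDER  BY C3;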

As you can see, the old AUTO_SAMPLE_SIZE can be inefficient if several iterations are required. A dominant cause of multiple iterations is the inability to gather an accurate NDV from a small sample: if the data is skewed, many low-frequency values may not make it into the sample, so the sample fails the NDV metric.

In Oracle Database 11g, we use a completely different approach for gathering basic column statistics. We issue the following query to gather basic column statistics (again, a simplified version for illustration):

Query 2: the query used by AUTO_SAMPLE_SIZE in 11g to gather basic column statistics (a simplified sketch is given after the description below).

You will notice that the new basic column statistics gathering query uses no sampling clause; instead it does a full table scan. Also, there is no longer a count(distinct C1) to gather the NDV of C1. Instead, during execution we inject a special statistics gathering row source into this query. This row source uses a one-pass, hash-based distinct algorithm to gather the NDV. More information on how the algorithm works can be found in the paper "Efficient and Scalable Statistics Gathering for Large Databases in Oracle 11g". The algorithm requires a full scan of the data, uses only a bounded amount of memory, and yields a highly accurate NDV that is nearly identical to that of 100% sampling (this can be proven mathematically). The special statistics gathering row source also gathers the number of rows, the number of nulls, and the average column length on the side. Since the table is fully scanned, the number of rows, average column length, and minimum and maximum values are 100% accurate.
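
Based on that description, a simplified sketch of what the 11g basic-statistics query might look like is shown below; note that the NDV never appears in the SQL text, because it is computed by the injected statistics-gathering row source at execution time (VSIZE() is again only a stand-in for the internal column-length expression):

    SELECT COUNT(*)        AS num_rows,       -- 100% accurate: the whole table is scanned
           COUNT(C1)       AS num_non_nulls,  -- used to derive the number of nulls
           SUM(VSIZE(C1))  AS total_col_len,  -- used to derive the average column length
           MIN(C1)         AS min_val,
           MAX(C1)         AS max_val
    FROM   T;                                 -- no SAMPLE clause and no COUNT(DISTINCT C1):
                                              -- the NDV comes from the one-pass, hash-based row source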

AUTO_SAMPLE_SIZE also affects histogram gathering and index statistics gathering in the following ways.

  • Effect of AUTO_SAMPLE_SIZE on histogram gathering
    With the new AUTO_SAMPLE_SIZE, histogram gathering is decoupled from basic column statistics gathering (they used to be gathered on the same sample). Therefore, when deciding whether the sample size needs to be bumped up, the new AUTO_SAMPLE_SIZE algorithm no longer performs the "NDV metric" check (see Step 4 above), because the NDV is no longer derived from the sample. The sample size for a histogram only needs to be increased when the sample contains too many nulls or too few rows. This helps reduce the number of iterations needed to create histograms; more details can be found in a separate blog post.

If the minimum (or maximum) value that appears in the sample used to gather the histogram is not the minimum (or maximum) value found during basic statistics gathering, the histogram is adjusted so that the minimum (or maximum) from basic statistics becomes the endpoint of the first (or last) bucket of the histogram.
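
One way to see the combined result on a live system is to compare the column-level minimum/maximum recorded by basic statistics with the histogram bucket endpoints; a hypothetical check using the standard dictionary views (the table name T is a placeholder):

    -- Column-level basic statistics: NDV, low/high values, histogram type
    SELECT column_name, num_distinct, low_value, high_value, histogram
    FROM   user_tab_col_statistics
    WHERE  table_name = 'T';

    -- Histogram buckets: the first and last endpoints should reflect the min/max from basic statistics
    SELECT column_name, endpoint_number, endpoint_value
    FROM   user_tab_histograms
    WHERE  table_name = 'T'
    ORDER  BY column_name, endpoint_number;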

  • Effect of AUTO_SAMPLE_SIZE on index statistics gathering

The new AUTO_SAMPLE_SIZE also affects how index statistics are gathered. The flow chart below shows how index statistics are gathered in 11g when AUTO_SAMPLE_SIZE is specified. Index statistics gathering is sampling based, and it can potentially go through several iterations, because the sample may contain too few blocks or the sample size may be too small to properly gather the number of distinct keys (NDK). With the new AUTO_SAMPLE_SIZE algorithm, however, if the index is defined on a single column, or on multiple columns that correspond to a column group, then the NDV of that column or column group is used as the NDK of the index. In such cases the index statistics gathering query does not gather NDK at all. This helps alleviate the need to bump up the sample size for index statistics gathering.

[Flow chart: how index statistics are gathered in 11g when AUTO_SAMPLE_SIZE is specified]
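
After gathering, the index-level figures this flow produces, including the number of distinct keys and the sample size actually used, can be inspected in the standard dictionary view; the index name T_IDX1 below is a placeholder:

    -- Distinct keys and sample size recorded for an index after statistics gathering
    SELECT index_name, num_rows, distinct_keys, sample_size, last_analyzed
    FROM   user_ind_statistics
    WHERE  index_name = 'T_IDX1';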

Summary:

  1. The new AUTO_SAMPLE_SIZE algorithm does a full table scan to gather basic column statistics.
  2. The NDV gathered by the new AUTO_SAMPLE_SIZE has an accuracy close to that of 100% sampling.
  3. The other basic column statistics, such as the number of nulls, average column length, and minimum and maximum values, are as accurate as with 100% sampling.
  4. Histogram and index statistics gathering under the new AUTO_SAMPLE_SIZE algorithm is still sampling based, but the new algorithm helps reduce the need to bump up the sample size.

Original post (by Hong Su, January 2, 2020): https://blogs.oracle.com/optimizer/post/how-does-auto-sample-size-work-in-oracle-database-11g
