When gathering optimizer statistics, the ESTIMATE_PERCENT parameter in the DBMS_STATS.GATHER_*_STATS procedures controls the percentage of rows to sample. What percentage of rows should you sample to achieve accurate statistics? 100% will ensure that the statistics are accurate, but it can take a long time. A 1% sample will finish much more quickly, but it can result in poor statistics. It is not an easy question to answer, which is why it is best practice to use the default: AUTO_SAMPLE_SIZE.
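For reference, here is a minimal sketch of a gather call that uses the default; the schema comes from the current user and the table name T is a placeholder:

-- Gather statistics for table T using the automatically determined sample size.
-- DBMS_STATS.AUTO_SAMPLE_SIZE is the package constant behind the default.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => 'T',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
END;
/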
In this post, I'll cover how the AUTO_SAMPLE_SIZE algorithm works in Oracle Database 12c and how it affects the accuracy of the statistics being gathered. If you want to learn more of the history prior to Oracle Database 12c, then the post "How does AUTO_SAMPLE_SIZE work in Oracle Database 11g?" is a good place to look. I will point out below where Oracle Database 11g and Oracle Database 12c differ.
It is not always appreciated that (in general) a large proportion of the time and resource cost of gathering statistics is spent evaluating the number of distinct values (NDV) for each column. Calculating NDV with an exact algorithm can be expensive, because the database must record and sort column values while statistics are being gathered. If the NDV is high, retaining and sorting those values can become resource-intensive, especially if the sort spills to the temporary tablespace. Auto sample size instead uses an approximate (but accurate) algorithm to calculate NDV, which avoids the need to sort column data or spill to TEMP. In return for this saving, the database can afford to use a full table scan to ensure that the other basic column statistics are accurate.
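Once statistics have been gathered, the resulting per-column values (including the NDV) can be checked in the data dictionary. A small sketch, again assuming a table named T:

-- NUM_DISTINCT is the gathered NDV; SAMPLE_SIZE and HISTOGRAM show
-- how many rows contributed and whether a histogram was created.
SELECT column_name, num_distinct, num_nulls, sample_size, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'T'
ORDER  BY column_name;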
Similarly, generating histograms can be resource-intensive, but the database mitigates this cost as follows:
Frequency and top frequency histograms are created as the database gathers the basic column statistics (such as NDV, MIN and MAX) from the full table scan mentioned above. Top frequency histograms are new in Oracle Database 12c. If a frequency or top frequency histogram is not feasible, the database collects a hybrid histogram using a sample of the column data. A top frequency histogram is only feasible when the top 254 values account for more than 99% of all non-null column values, and a frequency histogram is only feasible when the NDV is 254 or less.
For automatic histogram creation, the user specifies 'SIZE AUTO' in the METHOD_OPT clause, and the database then chooses which columns to consider for histograms based on the column usage data gathered by the optimizer. Columns that do not appear in WHERE-clause predicates or joins are not considered for histograms (a short sketch follows below).
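As an illustration of both points, here is a hedged sketch that requests automatic histogram selection and then prints the column usage report that 'SIZE AUTO' relies on (the table name T is a placeholder):

-- Let the database decide, per column, whether a histogram is worthwhile.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'T',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO');
END;
/

-- Show which columns have been seen in predicates and joins;
-- this is the usage information that drives the SIZE AUTO decision.
SELECT DBMS_STATS.REPORT_COL_USAGE(ownname => USER, tabname => 'T') FROM dual;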
Both Oracle Database 11g and Oracle Database 12c use the following query to gather basic column statistics (it is simplified here for illustrative purposes):
SELECT COUNT(c1), MIN(c1), MAX(c1)
FROM t;
The query reads the table (T) and scans all rows (rather than using a sample). The database also needs to calculate the number of distinct values (NDV) for each column, but the query does not use COUNT(DISTINCT c1) and so on. Instead, during execution, a special statistics gathering row source is injected into the query. The statistics gathering row source uses a one-pass, hash-based distinct algorithm to gather NDV. The algorithm requires a full scan of the data, uses a bounded amount of memory and yields a highly accurate NDV that is nearly identical to a 100% sample (a fact that can be proven mathematically). The statistics gathering row source also gathers the number of rows, the number of nulls and the average column length. Since a full scan is used, the number of rows, average column length, minimum and maximum values are 100% accurate.
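The statistics gathering row source is internal and cannot be written directly in SQL. Purely as a loose analogy (not the mechanism DBMS_STATS actually uses), the documented APPROX_COUNT_DISTINCT function in Oracle Database 12c computes a comparable one-pass, memory-bounded approximation of NDV:

-- Loose analogy only: approximate NDV in a single pass, with no distinct sort.
SELECT COUNT(c1)                 AS cnt_c1,
       MIN(c1)                   AS min_c1,
       MAX(c1)                   AS max_c1,
       APPROX_COUNT_DISTINCT(c1) AS approx_ndv_c1
FROM   t;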
Effect of auto sample size on histogram gathering
Hybrid histogram gathering is decoupled from basic column statistics gathering and uses a sample of column values. This technique was used in Oracle Database 11g to build height-balanced histograms; more information on this can be found in that earlier post. Oracle Database 12c replaced height-balanced histograms with hybrid histograms. (Translator's note: replaced, but not removed. When ESTIMATE_PERCENT is not left at its default of AUTO_SAMPLE_SIZE and a frequency histogram is not feasible, a height-balanced histogram is still created.)
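The histogram that was actually built can be inspected from the dictionary. A sketch, assuming a table T with a column C1:

-- Bucket endpoints for column C1; ENDPOINT_REPEAT_COUNT is populated
-- for hybrid histograms in Oracle Database 12c and shows how often
-- each endpoint value repeats.
SELECT endpoint_number, endpoint_value, endpoint_repeat_count
FROM   user_tab_histograms
WHERE  table_name  = 'T'
AND    column_name = 'C1'
ORDER  BY endpoint_number;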
Effect of auto sample size on index statistics gathering
AUTO_SAMPLE_SIZE also affects how index statistics are gathered. Index statistics gathering is sample-based, and it can potentially go through several iterations if the sample contains too few blocks or the sample size is too small to properly gather the number of distinct keys (NDKs). The algorithm has not changed since Oracle Database 11g, so I will leave the detail to the earlier post. There is one other thing to note:
At the time of writing, there are some cases where index sampling can lead to NDV mis-estimates for composite indexes. The best workaround is to create a column group on the relevant columns and use GATHER_TABLE_STATS. Alternatively, there is a one-off fix, 27268249. This patch changes the way NDV is calculated for indexes on large tables (and no column group is required). It is available for 12.2.0.1 at the moment, but note that it cannot be backported. As you might guess, it is significantly slower than index block sampling, but it is still very fast. If you find a case where index NDV is causing a problem with a query plan, the recommended approach is to add a column group rather than to apply this patch.
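A hedged sketch of the recommended column group workaround, assuming a composite index on columns C1 and C2 of a placeholder table T:

-- Create a column group (extended statistic) on the index columns,
-- then re-gather so the column group receives its own NDV.
DECLARE
  l_ext_name VARCHAR2(128);
BEGIN
  l_ext_name := DBMS_STATS.CREATE_EXTENDED_STATS(
                  ownname   => USER,
                  tabname   => 'T',
                  extension => '(C1,C2)');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'T');
END;
/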
Summary:
Note that top frequency and hybrid histograms are new to Oracle Database 12c. Oracle Database 11g had frequency and height-balanced histograms only. Hybrid histograms replaced height-balanced histograms.
- The auto sample size algorithm uses a full table scan (a 100% sample) to gather basic column statistics.
- The cost of the full table scan (versus row sampling) is mitigated by the approximate NDV algorithm, which eliminates the need to sort column data.
- The approximate NDV gathered by AUTO_SAMPLE_SIZE is close to the accuracy of a 100% sample.
- Other basic column statistics, such as the number of nulls, average column length, minimum and maximum values, have an accuracy equivalent to a 100% sample.
- Frequency and top frequency histograms are created using a 100%* sample of column values, and are created as the basic column statistics are gathered. This is different from Oracle Database 11g, which decoupled frequency histogram creation from basic column statistics gathering (and used a sample of column values).
- Hybrid histograms are created using a sample of column values. Internally, this step is decoupled from basic column statistics gathering.
- Index statistics are gathered using a sample of column values; the sample size is determined automatically.
*There is an exception to point 5 above: if OPTIONS=>'GATHER AUTO' is used after a bulk load where statistics were gathered by online statistics gathering, frequency histograms are created using a sample (a hedged sketch follows).
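A hedged sketch of that exception, assuming a CTAS bulk load into a placeholder table T_COPY: the load itself triggers online statistics gathering (basic column statistics, no histograms), and the follow-up call adds only the histograms that are needed, using a sample:

-- Direct-path load: online statistics gathering collects basic column
-- statistics during the CTAS, but does not create histograms.
CREATE TABLE t_copy AS SELECT * FROM t;

-- Create any histograms deemed necessary (from a sample) without
-- re-gathering the statistics collected during the load.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'T_COPY',
    options => 'GATHER AUTO');
END;
/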
Original post: https://blogs.oracle.com/optimizer/post/how-does-auto-sample-size-work-in-oracle-database-12c
How does AUTO_SAMPLE_SIZE work in Oracle Database 12c?
January 1, 2020
Nigel Bayliss
Product Manager