oracle12c 新特性--- Hybrid直方图混合直方图

刘洪云和恩墨交付战队 2021-10-20

956

oracle12c 新特性— Hybrid直方图混合直方图

官网：http://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm#TGSQL372

A hybrid histogram combines characteristics of both height-based histograms and frequency histograms. This “best of both worlds” approach enables the optimizer to obtain better selectivity estimates in some situations.
Hybrid Histograms结合了height-based histograms和frequency histograms的特点。这种“两全其美”的方法使优化器能够在某些情况下获得更好的选择性估计。

The height-based histogram sometimes produces inaccurate estimates for values that are almost popular. For example, a value that occurs as an endpoint value of only one bucket but almost occupies two buckets is not considered popular.
基于高度的直方图有时会对几乎流行的值产生不准确的估计。例如，仅作为一个桶的端点值出现但几乎占据两个桶的值不被认为是流行的。

To solve this problem, a hybrid histogram distributes values so that no value occupies more than one bucket, and then stores the endpoint repeat count value, which is the number of times the endpoint value is repeated, for each endpoint (bucket) in the histogram. By using the repeat count, the optimizer can obtain accurate estimates for almost popular values.

为了解决这个问题，混合直方图分配值，使得没有值占据一个以上的桶，然后为直方图中的每个端点(桶)存储端点重复计数值，端点重复计数值是端点值重复的次数。通过使用重复计数，优化器可以获得几乎流行的值的准确估计。

Endpoint是如何重复统计：


官网上的例子：col列的所有值：1 1 1 5 5 5 10 10 25 25 25 25 50 100 100
-- 1 第一次值的分布情况：
桶1： 1  1  1  5   5
桶2： 5  10 10 25  25
桶3： 25 25 50 100 100
-- 2 之后优化器会移动bucket的边界(就是同一个bucket中所包含的值)，以保证同一个值都在同一个bucket中。
-- 优化之后
桶1： 1  1   1   5   5   5    重复出现次数：3次
桶2： 10 10  25  25  25  25   重复出现次数：4次 
桶3： 50 100 100              重复出现次数：2次
-- 3 重新优化后，要重新计算每个bucket中endpoint重复值的总和。如下图中的，Bucket1中重复值5总和是3。

产生Hybrid Histograms的条件：

NDV is greater than n, where n is the number of histogram buckets (default is 254). --默认254个桶  
条件1： 不同值的个数要大于桶数  
  
The criteria for top frequency histograms do not apply. 关键是频率直方图不适用  
This is another way to stating that the percentage of rows occupied by the top n frequent values is less than threshold p, where p = (1-(1/n))\*100.  
条件2：与top frequency histograms相反，Hybrid Histograms是top n出现频率要小于P阈值。p = (1-(1/n))\*100。 p=(100\*(n-1))/n  
– 也就是收集top n包含值占所有值得比例要小于p，n是桶数 如果n=254 则p= 100\*(254-1)/254 = 99.6  
  
The estimate\_percent parameter in the DBMS\_STATS statistics gathering procedure is set to AUTO\_SAMPLE\_SIZE.  
If users specify their own percentage, then the database creates frequency or height-balanced histograms.  
条件3：收集时需要指定 estimate\_percent=>DBMS\_STATS.AUTO\_SAMPLE\_SIZE 如果用户指定收集百分比，那么就会创建高度平衡直方图，无法收集Hybrid直方图

测试Hybrid直方图

-- 创建测试表
drop table h1 purge;
create table h1(id number,name char(10));

insert into h1 select c id,to_char(c) name
from (select mod(rownum,ceil(dbms_random.value(1,12))) c from dual connect by rownum<=10000) t 
order by c;

commit;

select * from h1 where id=8 and rownum<=10;
select id,count(1) cnt from h1 group by id order by 1;  -- 12个不同值
select id,cnt,sum(cnt) over(order by id) lj_cnt from (select id,count(1) cnt from h1 group by id order by 1) t;  -- 12个不同值
      ID      CNT   LJ_CNT
-------- -------- --------
       0     1832     1832
       1     1953     3785
       2     1453     5238
       3     1213     6451
       4      937     7388
       5      744     8132
       6      606     8738
       7      443     9181
       8      339     9520
       9      254     9774
      10      168     9942
      11       58    10000

-- 收集直方图统计信息 10个桶
exec sys.dbms_stats.gather_table_stats('TEST','H1',method_opt=>'FOR COLUMNS ID SIZE 10');
@ti h1

P值=100*(10-1)/10 = 90%
top 10(10表示桶数)频繁的值所占比例 = 9774/10000 = 97% 大于P值，调整一下  生成的是TOP直方图 TOP-FREQUENCY
update h1 set id=11 where id=0 and rownum<=500;
update h1 set id=10 where id=1 and rownum<=500;
update h1 set id=9 where id=2 and rownum<=500;
update h1 set id=8 where id=0 and rownum<=500;
commit;
调整之后 top 10(10表示桶数)频繁的值所占比例 = 8774/10000 = 87%  小于P值，生成Hybrid直方图

col HISTOGRAM for a15
select COLUMN_NAME,HISTOGRAM from dba_tab_cols where owner='TEST' and table_name='H1';
COLUMN_NAME     HIS
--------------- ---------------
ID              HYBRID
NAME            NONE

select COLUMN_NAME,ENDPOINT_NUMBER,ENDPOINT_VALUE,substr(ENDPOINT_ACTUAL_VALUE,1,15) ACTUAL_VALUE,ENDPOINT_REPEAT_COUNT
from DBA_HISTOGRAMS where OWNER='TEST' and TABLE_NAME='H1';

COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ACTUAL_VALUE ENDPOINT_REPEAT_COUNT
----------- --------------- -------------- ------------ ---------------------
ID                      832              0 0                              832
ID                     2285              1 1                             1453
ID                     3238              2 2                              953
ID                     4451              3 3                             1213
ID                     5388              4 4                              937
ID                     6738              6 6                              606
ID                     7181              7 7                              443
ID                     8020              8 8                              839
ID                     8774              9 9                              754
ID                    10000             11 11                             558

select id,cnt,sum(cnt) over(order by id) lj_cnt from (select id,count(1) cnt from h1 group by id order by 1) t;
      ID      CNT   LJ_CNT
-------- -------- --------
       0      832      832
       1     1453     2285
       2      953     3238
       3     1213     4451
       4      937     5388
       5      744     6132  
       6      606     6738  
       7      443     7181
       8      839     8020
       9      754     8774
      10      668     9442  
      11      558    10000

-- 没有出现在直方图端点值中 可以用ENDPOINT_NUMBER - 上一个ENDPOINT_NUMBER - ENDPOINT_REPEAT_COUNT计算 
-- 也可以准确的估算除端点值之外的其它值
-- 没有出现在直方图端点值中

height-based histogram中会将10000条数据平均分配到10个bucket中，每个bucket有7条数据，可能会造成同一的数据分配到不同的bucket中。
而hybrid histogram只取top 10的值分配到10个bucket中，而且相同值只会存在一个bucket中。然后通过endpoint_repeat_count，优化器可以准确估计基数。


TEST@TDB(TDB): 1> @hist test h1
TAB  COL HIS ENDPOINT_NUMBER ENDPOINT_VALUE COL_PREKEY COL_VALUE OTH_VALUE
---- --- --- --------------- -------------- ---------- --------- ---------
H1   ID  HYB             832              0                  832         0
H1   ID  HYB            2285              1          0      1453         0
H1   ID  HYB            3238              2          1       953         0
H1   ID  HYB            4451              3          2      1213         0
H1   ID  HYB            5388              4          3       937         0
H1   ID  HYB            6738              6          4       606       744
H1   ID  HYB            7181              7          6       443         0
H1   ID  HYB            8020              8          7       839         0
H1   ID  HYB            8774              9          8       754         0
H1   ID  HYB           10000             11          9       558       668
-- 端点6的  oth_value 用ENDPOINT_NUMBER - 上一个ENDPOINT_NUMBER - ENDPOINT_REPEAT_COUNT计算  774和id=5的count(*)对应
-- 端点11的 oth_value 用ENDPOINT_NUMBER - 上一个ENDPOINT_NUMBER - ENDPOINT_REPEAT_COUNT计算  668和id=10的count(*)对应

参考：https://blog.csdn.net/qianglei6077/article/details/92990317

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

oracle12c 新特性--- Hybrid直方图 混合直方图

官网：http://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm#TGSQL372

Endpoint是如何重复统计：

产生Hybrid Histograms的条件：

测试Hybrid直方图

评论

oracle12c 新特性--- Hybrid直方图混合直方图