oracle12c 新特性— Hybrid直方图 混合直方图
官网:http://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm#TGSQL372
A hybrid histogram combines characteristics of both height-based histograms and frequency histograms. This “best of both worlds” approach enables the optimizer to obtain better selectivity estimates in some situations.
Hybrid Histograms结合了height-based histograms和frequency histograms的特点。这种“两全其美”的方法使优化器能够在某些情况下获得更好的选择性估计。
The height-based histogram sometimes produces inaccurate estimates for values that are almost popular. For example, a value that occurs as an endpoint value of only one bucket but almost occupies two buckets is not considered popular.
基于高度的直方图有时会对几乎流行的值产生不准确的估计。例如,仅作为一个桶的端点值出现但几乎占据两个桶的值不被认为是流行的。
To solve this problem, a hybrid histogram distributes values so that no value occupies more than one bucket, and then stores the endpoint repeat count value, which is the number of times the endpoint value is repeated, for each endpoint (bucket) in the histogram. By using the repeat count, the optimizer can obtain accurate estimates for almost popular values.
为了解决这个问题,混合直方图分配值,使得没有值占据一个以上的桶,然后为直方图中的每个端点(桶)存储端点重复计数值,端点重复计数值是端点值重复的次数。通过使用重复计数,优化器可以获得几乎流行的值的准确估计。
Endpoint是如何重复统计:
官网上的例子:col列的所有值:1 1 1 5 5 5 10 10 25 25 25 25 50 100 100
-- 1 第一次值的分布情况:
桶1: 1 1 1 5 5
桶2: 5 10 10 25 25
桶3: 25 25 50 100 100
-- 2 之后优化器会移动bucket的边界(就是同一个bucket中所包含的值),以保证同一个值都在同一个bucket中。
-- 优化之后
桶1: 1 1 1 5 5 5 重复出现次数:3次
桶2: 10 10 25 25 25 25 重复出现次数:4次
桶3: 50 100 100 重复出现次数:2次
-- 3 重新优化后,要重新计算每个bucket中endpoint重复值的总和。如下图中的,Bucket1中重复值5总和是3。
产生Hybrid Histograms的条件:
NDV is greater than n, where n is the number of histogram buckets (default is 254). --默认254个桶
条件1: 不同值的个数要大于桶数
The criteria for top frequency histograms do not apply. 关键是频率直方图不适用
This is another way to stating that the percentage of rows occupied by the top n frequent values is less than threshold p, where p = (1-(1/n))\*100.
条件2:与top frequency histograms相反,Hybrid Histograms是top n出现频率要小于P阈值。p = (1-(1/n))\*100。 p=(100\*(n-1))/n
– 也就是收集top n包含值占所有值得比例要小于p,n是桶数 如果n=254 则p= 100\*(254-1)/254 = 99.6
The estimate\_percent parameter in the DBMS\_STATS statistics gathering procedure is set to AUTO\_SAMPLE\_SIZE.
If users specify their own percentage, then the database creates frequency or height-balanced histograms.
条件3:收集时需要指定 estimate\_percent=>DBMS\_STATS.AUTO\_SAMPLE\_SIZE 如果用户指定收集百分比,那么就会创建高度平衡直方图,无法收集Hybrid直方图
测试Hybrid直方图
-- 创建测试表
drop table h1 purge;
create table h1(id number,name char(10));
insert into h1 select c id,to_char(c) name
from (select mod(rownum,ceil(dbms_random.value(1,12))) c from dual connect by rownum<=10000) t
order by c;
commit;
select * from h1 where id=8 and rownum<=10;
select id,count(1) cnt from h1 group by id order by 1; -- 12个不同值
select id,cnt,sum(cnt) over(order by id) lj_cnt from (select id,count(1) cnt from h1 group by id order by 1) t; -- 12个不同值
ID CNT LJ_CNT
-------- -------- --------
0 1832 1832
1 1953 3785
2 1453 5238
3 1213 6451
4 937 7388
5 744 8132
6 606 8738
7 443 9181
8 339 9520
9 254 9774
10 168 9942
11 58 10000
-- 收集直方图统计信息 10个桶
exec sys.dbms_stats.gather_table_stats('TEST','H1',method_opt=>'FOR COLUMNS ID SIZE 10');
@ti h1
P值=100*(10-1)/10 = 90%
top 10(10表示桶数)频繁的值所占比例 = 9774/10000 = 97% 大于P值,调整一下 生成的是TOP直方图 TOP-FREQUENCY
update h1 set id=11 where id=0 and rownum<=500;
update h1 set id=10 where id=1 and rownum<=500;
update h1 set id=9 where id=2 and rownum<=500;
update h1 set id=8 where id=0 and rownum<=500;
commit;
调整之后 top 10(10表示桶数)频繁的值所占比例 = 8774/10000 = 87% 小于P值,生成Hybrid直方图
col HISTOGRAM for a15
select COLUMN_NAME,HISTOGRAM from dba_tab_cols where owner='TEST' and table_name='H1';
COLUMN_NAME HIS
--------------- ---------------
ID HYBRID
NAME NONE
select COLUMN_NAME,ENDPOINT_NUMBER,ENDPOINT_VALUE,substr(ENDPOINT_ACTUAL_VALUE,1,15) ACTUAL_VALUE,ENDPOINT_REPEAT_COUNT
from DBA_HISTOGRAMS where OWNER='TEST' and TABLE_NAME='H1';
COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ACTUAL_VALUE ENDPOINT_REPEAT_COUNT
----------- --------------- -------------- ------------ ---------------------
ID 832 0 0 832
ID 2285 1 1 1453
ID 3238 2 2 953
ID 4451 3 3 1213
ID 5388 4 4 937
ID 6738 6 6 606
ID 7181 7 7 443
ID 8020 8 8 839
ID 8774 9 9 754
ID 10000 11 11 558
select id,cnt,sum(cnt) over(order by id) lj_cnt from (select id,count(1) cnt from h1 group by id order by 1) t;
ID CNT LJ_CNT
-------- -------- --------
0 832 832
1 1453 2285
2 953 3238
3 1213 4451
4 937 5388
5 744 6132
6 606 6738
7 443 7181
8 839 8020
9 754 8774
10 668 9442
11 558 10000
-- 没有出现在直方图端点值中 可以用ENDPOINT_NUMBER - 上一个ENDPOINT_NUMBER - ENDPOINT_REPEAT_COUNT计算
-- 也可以准确的估算除端点值之外的其它值
-- 没有出现在直方图端点值中
height-based histogram中会将10000条数据平均分配到10个bucket中,每个bucket有7条数据,可能会造成同一的数据分配到不同的bucket中。
而hybrid histogram只取top 10的值分配到10个bucket中,而且相同值只会存在一个bucket中。然后通过endpoint_repeat_count,优化器可以准确估计基数。
TEST@TDB(TDB): 1> @hist test h1
TAB COL HIS ENDPOINT_NUMBER ENDPOINT_VALUE COL_PREKEY COL_VALUE OTH_VALUE
---- --- --- --------------- -------------- ---------- --------- ---------
H1 ID HYB 832 0 832 0
H1 ID HYB 2285 1 0 1453 0
H1 ID HYB 3238 2 1 953 0
H1 ID HYB 4451 3 2 1213 0
H1 ID HYB 5388 4 3 937 0
H1 ID HYB 6738 6 4 606 744
H1 ID HYB 7181 7 6 443 0
H1 ID HYB 8020 8 7 839 0
H1 ID HYB 8774 9 8 754 0
H1 ID HYB 10000 11 9 558 668
-- 端点6的 oth_value 用ENDPOINT_NUMBER - 上一个ENDPOINT_NUMBER - ENDPOINT_REPEAT_COUNT计算 774和id=5的count(*)对应
-- 端点11的 oth_value 用ENDPOINT_NUMBER - 上一个ENDPOINT_NUMBER - ENDPOINT_REPEAT_COUNT计算 668和id=10的count(*)对应
参考:https://blog.csdn.net/qianglei6077/article/details/92990317




