MySQL8.0之前的版本中,只有简单的统计信息,并没有提供统计直方图(如果熟悉Oracle的小伙伴,肯定知道那个让我们有又爱又恨的统计直方图了)。8.0的版本中,MySQL实现了这个功能,通过column_statistics这个数据字典存储列的直方图信息,供优化器生成查询执行计划的使用。个人认为,由于MySQL的一些特性,在MySQL中,好处远远多于坏处。我们可以用ANALYZE TABLE语句对直方图进行管理。
column_statistics的特征:
该表包含除空间数据和JSON之外的所有数据类型的列的统计信息。
该表是持久化的,因此不必在每次服务器启动时创建列统计信息。
MySQL会对这张表执行更新. 用户没有办法用update语句修改。
column_statistics表是不能被用户直接访问的,因为它是数据字典的一部分。但是查看直方图信息,我们可以使用INFORMATION_SCHEMA.COLUMN_STATISTICS,它作为数据字典表上的一个视图实现。COLUMN_STATISTICS有以下几列:
SCHEMA_NAME, TABLE_NAME,COLUMN_NAME:这个看名字就知道没必要多说。
HISTOGRAM: 是一个JSON值, 以直方图的方式描述列统计信息。
列直方图包含存储在列中的值范围的部分bucket。直方图是JSON对象,可以灵活地表示列统计信息。例如,
CREATE TABLE `t_histograms` (`id` int NOT NULL,`c1` int DEFAULT NULL,`c2` varchar(10) DEFAULT NULL,PRIMARY KEY (`id`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
--收集列c1直方图统计信息
analyze table t_histograms UPDATE HISTOGRAM on c1;
--查看C1(int)直方图select json_pretty(histogram) fromINFORMATION_SCHEMA.COLUMN_STATISTICS where schema_name='test' and table_name='t_histograms' and column_name='c1';{"buckets": [[1,0.6],[2,0.8],[3,1.0]],"data-type": "int","null-values": 0.0,"collation-id": 8,"last-updated": "2020-07-05 10:57:03.429238","sampling-rate": 1.0,"histogram-type": "singleton","number-of-buckets-specified": 100
}
--收集列c2直方图统计信息analyze table t_histograms UPDATE HISTOGRAM on c2;--查看C2(字符串)直方图select json_pretty(histogram) from INFORMATION_SCHEMA.COLUMN_STATISTICS where schema_name='test' and table_name='t_histograms' and column_name='c2';{...[4800,4849,0.96999399879976,50],[4850,4899,0.9799959991998399,50],[4900,4949,0.98999799959992,50],[4950,4999,1.0,50]],"data-type": "int","null-values": 0.0,"collation-id": 8,"last-updated": "2020-07-05 15:07:24.070861","sampling-rate": 1.0,"histogram-type": "equi-height","number-of-buckets-specified": 100}
无法在主键或唯一键上收集统计直方图!
analyze table tt1 update histogram on c1;+----------+-----------+----------+-----------------------------------------------------------+| Table | Op | Msg_type | Msg_text |+----------+-----------+----------+-----------------------------------------------------------+| test.tt1 | histogram | Error | The column 'c1' is covered by a single-part unique index. |+----------+-----------+----------+-----------------------------------------------------------+
直方图的Key
*****Buckets:直方图bucket
1、对于singleton直方图,包括两个值:
Value 1: bucket的值, 取决于列的数据类型及数据
Value 2: 表示该值的累积频率(double类型). 例如{2,0.8}表示80%的值小于等于2,2的值占20%
2、对于equi-height(等高)直方图, 包含以下四个值:
Value 1,2 : bucket中的最大值和最小值
Value 3: 表示该值的累积频率(double类型)
Value 4: 在这个bucket中列值的num_distinct(不同的值有几个)
***** null-values: 取值范围0.0-1.0 列值中null值所占的比例。
***** last-updated: 直方生成的时间。
***** sample-rate: 收集时的采样比。
***** histogram-type:
1、 Singleton: 频度直方图.当列的NDV小于analyze table … with N buckets指定的桶数时, 一个直方图表示一个值的占比. N 默认为100。
2.、Equi-height: 等高直方图.当列的NDV超过analyze table时指定的桶数时, 只能用等高直方图表示了。
***** number-of-buckets-specified: 指定的直方图bucket数,默认为100。
***** data-type: 列的数据类型。int,double,decimal,datetime,string…
***** collation-id: 直方图数据的校对ID。当数据类型值为string时,它最有意义。值对应于INFORMATION_SCHEMA中的ID列值。
优化器应用直方图统计信息,根据列值与常数值比较的选择性(过滤效果)来确定行估计值。这些谓词表格符合直方图使用:
col_name = constant col_name <> constant col_name != constant col_name > constant col_name < constant col_name >= constant col_name <= constant col_name IS NULL col_name IS NOT NULL col_name BETWEEN constant AND constant col_name NOT BETWEEN constant AND constant col_name IN (constant[, constant] ...) col_name NOT IN (constant[, constant] ...) |
案例
Sample1:
mysql> select table_rows from information_schema.tables where table_name='tt3';+------------+| TABLE_ROWS |+------------+| 5 |+------------+mysql> select json_pretty(histogram) from information_schema.column_statistics where table_name='tt3' and column_name='c1';...{"buckets": [[1,0.6],[2,0.8],[3,1.0]],"data-type": "int","null-values": 0.0,"collation-id": 8,"last-updated": "2020-07-05 16:37:08.023031","sampling-rate": 1.0,"histogram-type": "singleton","number-of-buckets-specified": 100}
从上面的直方图信息,可以看到
如果 c1=2选择率为20%, 选择率*总行数=预估行数,应该为1行。

如果 c1=1选择率为60%,选择率*总行数=预估行数,应该为3行。

Sample2(等高直方图):
mysql> select json_pretty(histogram) from information_schema.column_statistics where table_name='bigtable' and column_name='c2';…["base64:type254:MDAwMDAwMDAwMDAwMDAwOTc4MDk=","base64:type254:MDAwMDAwMDAwMDAwMDAwOTg1Mzg=",0.9799942488600419,993],["base64:type254:MDAwMDAwMDAwMDAwMDAwOTg1Mzk=","base64:type254:MDAwMDAwMDAwMDAwMDAwOTkyNjk=",0.9900039710252092,994],["base64:type254:MDAwMDAwMDAwMDAwMDAwOTkyNzA=","base64:type254:MDAwMDAwMDAwMDAwMDAwOTk5OTk=",1.0,993]],"data-type": "string","null-values": 0.0,"collation-id": 255,"last-updated": "2020-07-06 06:01:04.501517","sampling-rate": 0.7353382379413353,"histogram-type": "equi-height","number-of-buckets-specified": 100}mysql> select from_base64('MDAwMDAwMDAwMDAwMDAwOTg1Mzk=');+------------------------------------------------------------------------------------------+| from_base64('MDAwMDAwMDAwMDAwMDAwOTg1Mzk=') |+------------------------------------------------------------------------------------------+| 0x3030303030303030303030303030303938353339 |+------------------------------------------------------------------------------------------+1 row in set (0.00 sec)mysql> select * from bigtable where c2=0x3030303030303030303030303030303938353339;+-------+------+----------------------+---------------------+| id | c1 | c2 | d1 |+-------+------+----------------------+---------------------+| 98539 | 1 | 00000000000000098539 | 2020-07-06 05:46:20 |+-------+------+----------------------+---------------------+1 row in set (0.05 sec)
我们以c2>'00000000000000098539' 为例,["base64:type254:MDAwMDAwMDAwMDAwMDAwOTg1Mzk=", <<===00000000000000098539"base64:type254:MDAwMDAwMDAwMDAwMDAwOTkyNjk=", <<===000000000000000992690.9900039710252092,994]
"c2>'00000000000000098539'" 这个条件选择率为:
(1 - 0.9900039710252092) +(0.9900039710252092 - 0.9799942488600419) = 0.0200057511399581=2%

总结:
网上很多人都说统计直方图这个功能很鸡肋,需要手动来建立更新,并且单纯的使用直方图对数据的查询帮助有限,但是正确的估算选择率与Cardinality对查询的访问路径及连接方式有至关重要的作用。MySQL 从8.0开始已经不再仅仅是互联网类的数据库,正在往全方位,企业级的数据迈进,期待以后的功能越做越好~
今天很正经,没有美女图🙈,只有单纯的技术分享,以上仅是我个人的学习分享,如有哪里理解得不对,欢迎大家指出,谢谢~




