在Hive数据库中对SQL进行调优的时候,往往需要了解表的统计信息,比如:分区数量,数据行数,表的大小,文件个数等等。获取Hive表统计信息之前,需要先对Hive表收集统计信息:
非分区表
ANALYZE TABLE table_name COMPUTE
STATISTICS;
分区表
ANALYZE TABLE table_name
PARTITION(partition_col='partition_value') COMPUTE STATISTICS;
下面提供两种方式来获取Hive表的统计信息。
方式一
方式一比较简单,只需执行命令DESC FORMATTED
table_name;即可。
方式二
方式二需要开发者具有连接Hive元数据数据库的权限,使用SQL语句来获取统计信息。
非分区表
SELECT
b.tbl_name
AS '表名'
, SUM(CASE WHEN a.param_key = 'numRows' THEN a.param_value ELSE 0
END) AS '表数据行'
, SUM(CASE WHEN a.param_key = 'numRows' THEN 1 ELSE 0 END) AS '表分区数'
, SUM(CASE WHEN a.param_key = 'totalSize' THEN a.param_value ELSE 0 END)
/ 1024 / 1024 / 1024 AS '数据量GB'
, SUM(CASE WHEN a.param_key = 'numFiles' THEN a.param_value ELSE 0
END) AS '文件数'
FROM
TABLE_PARAMS a
JOIN
TBLS b
ON
a.tbl_id = b.tbl_id
WHERE
b.tbl_name IN ('table_name')
AND
b.owner = 'hive'
;
分区表
SELECT
c.tbl_name
AS '表名'
, SUM(CASE WHEN b.param_key = 'numRows' THEN b.param_value ELSE 0
END) AS '表数据行'
, SUM(CASE WHEN b.param_key = 'numRows' THEN 1 ELSE 0 END) AS '表分区数'
, SUM(CASE WHEN b.param_key = 'totalSize' THEN b.param_value ELSE 0 END)
/ 1024 / 1024 / 1024 AS '数据量GB'
, SUM(CASE WHEN b.param_key = 'numFiles' THEN b.param_value ELSE 0
END) AS '文件数'
FROM
PARTITIONS a
JOIN
PARTITION_PARAMS b
ON
a.part_id = b.part_id
JOIN
TBLS c
ON
a.tbl_id = c.tbl_id
WHERE
c.tbl_name IN ('table_name')
AND
c.owner = 'hive'
GROUP BY c.tbl_name
;




