Hive Data Filtering: the HAVING Clause
hadoop-3.1.1
hive-3.1.1
Hive table:
hive> desc emp;
OK
empno int
ename varchar(10)
job varchar(9)
mgr int
hiredate date
sal float
comm float
deptno int
Time taken: 0.313 seconds, Fetched: 8 row(s)
Data filtering in Hive
WHERE-clause filtering, HAVING-clause filtering, DISTINCT filtering, table filtering, partition filtering, bucket filtering, index filtering, and column filtering.
HAVING-clause filtering
The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.
General usage:
-- aggregate function
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
-- GROUP BY clause
GROUP BY column_name
-- HAVING clause, containing an aggregate function
HAVING aggregate_function(column_name) operator value
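The evaluation order implied by this template (WHERE before grouping, HAVING after aggregation) can be sketched in Python. This is a minimal illustration, not Hive's implementation; `run_query` and the sample rows are hypothetical:

```python
from collections import defaultdict

def run_query(rows, where=None, group_key=None, having=None):
    """Minimal sketch of the WHERE -> GROUP BY -> HAVING evaluation order.

    rows: list of dicts; where/having: predicate functions; group_key: column name.
    """
    # WHERE runs first, on the raw rows, before any aggregation
    if where is not None:
        rows = [r for r in rows if where(r)]
    # GROUP BY collects rows per key; the aggregate here is count(1)
    groups = defaultdict(int)
    for r in rows:
        groups[r[group_key]] += 1
    # HAVING runs last, on the aggregated values
    return {k: v for k, v in groups.items() if having is None or having(v)}

# hypothetical sample rows (not the real emp data)
rows = [{"job": "CLERK"}, {"job": "CLERK"}, {"job": "CLERK"},
        {"job": "MANAGER"}]
print(run_query(rows, group_key="job", having=lambda cnt: cnt > 2))
# {'CLERK': 3}
```

The key point the sketch makes: `having` never sees individual rows, only per-group aggregates, which is exactly why an aggregate function is legal there but not in `where`.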
The HAVING clause is applied after the data has been aggregated; with the MapReduce engine, the HAVING filter runs in the Reduce phase. The example below groups by job, counts the rows in each group, and keeps only the jobs with a count greater than 2.
-- show the execution plan
hive> explain select job, count(1) from emp group by job having count(1)>2;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: emp
            Statistics: ....
            Select Operator
              expressions: job (type: varchar(9))
              outputColumnNames: job
              Statistics: ...
              Group By Operator
                aggregations: count()
                keys: job (type: varchar(9))
                mode: hash
                outputColumnNames: _col0, _col1
                Statistics: ...
                Reduce Output Operator
                  key expressions: _col0 (type: varchar(9))
                  sort order: +
                  Map-reduce partition columns: _col0 (type: varchar(9))
                  Statistics: ...
                  value expressions: _col1 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: varchar(9))
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: ...
          Filter Operator
            predicate: (_col1 > 2L) (type: boolean)
            Statistics: ...
            File Output Operator
              compressed: false
              Statistics: ...
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.675 seconds, Fetched: 52 row(s)
In the plan above you can clearly see that the Filter Operator sits under the Reduce Operator Tree, after the Group By Operator: the grouping, aggregation, and HAVING filtering all happen in the Reduce phase.
MapReduce pseudocode:
map(inkey, invalue, context):
    colsArray = invalue.split("\t")
    job = colsArray[2]
    // the output key is job; the value is a constant 1, marking one occurrence of that job
    context.write(job, 1)

reduce(inkey, invalues, context):
    // size of the value list, i.e. the total count for this key
    count = size(invalues)
    // emit the key only if its total count is greater than 2
    if count > 2:
        context.write(inkey, count)
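The pseudocode above can be exercised as a small Python simulation of the job: map emits (job, 1) pairs, a simulated shuffle groups them by key, and reduce drops groups with a count of 2 or less. The column layout and sample lines are illustrative assumptions, not the real emp data:

```python
from collections import defaultdict

def map_fn(line):
    # the job column is the third tab-separated field
    cols = line.split("\t")
    yield cols[2], 1

def reduce_fn(key, values):
    # emit the key only when its total count exceeds 2 (the HAVING step)
    count = len(values)
    if count > 2:
        yield key, count

def run_job(lines):
    # simulate the shuffle: collect all map outputs per key
    shuffled = defaultdict(list)
    for line in lines:
        for k, v in map_fn(line):
            shuffled[k].append(v)
    # run reduce over each key's value list
    out = {}
    for k, vs in sorted(shuffled.items()):
        for key, cnt in reduce_fn(k, vs):
            out[key] = cnt
    return out

# hypothetical input lines: empno \t ename \t job
lines = ["7369\tSMITH\tCLERK", "7499\tALLEN\tSALESMAN",
         "7876\tADAMS\tCLERK", "7900\tJAMES\tCLERK"]
print(run_job(lines))
# {'CLERK': 3}
```

Because the filter needs the per-key total, it can only run after the shuffle has brought all values for a key together, which is why HAVING lands in the Reduce phase.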
Differences between HAVING and WHERE
WHERE filters on the table's columns; HAVING filters on columns of the query result.
Aggregate functions cannot follow WHERE, but can follow HAVING.
HAVING is only used in GROUP BY aggregate statements.
-- filter in HAVING on the result of count
hive> select job, count(1) cnt from emp group by job having cnt>2;
...
Ended Job = job_local1323837859_0015
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 4584 HDFS Write: 3230 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
CLERK 4
MANAGER 3
SALESMAN 4
Time taken: 8.698 seconds, Fetched: 3 row(s)
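As a sanity check, the same group-and-filter logic can be reproduced with Python's `Counter`, assuming the job distribution of the classic 14-row scott emp sample table (4 CLERKs, 4 SALESMEN, 3 MANAGERs, 2 ANALYSTs, 1 PRESIDENT), which matches the output above:

```python
from collections import Counter

# assumed job column of the classic 14-row scott emp sample table
jobs = (["CLERK"] * 4 + ["SALESMAN"] * 4 + ["MANAGER"] * 3
        + ["ANALYST"] * 2 + ["PRESIDENT"])

# GROUP BY job with count(1), then HAVING cnt > 2
cnt = Counter(jobs)
result = {job: c for job, c in sorted(cnt.items()) if c > 2}
print(result)
# {'CLERK': 4, 'MANAGER': 3, 'SALESMAN': 4}
```

ANALYST (2) and PRESIDENT (1) fail the `> 2` predicate and are filtered out, just as in the Hive result.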
A small question: can WHERE be replaced by HAVING?
-- The answer: no
hive> select * from emp having mgr>7800;
FAILED: SemanticException HAVING specified without GROUP BY

This article is reposted from 大数据真有意思. To report suspected infringement, email contact@modb.pro with supporting evidence; once verified, 墨天轮 will remove the content immediately.