
Hive中提供了很多分析函数,用于完成负责的统计分析。
现在给出SUM、AVG、MIN、MAX的示例
用于实现分组内所有和连续累积的统计。
数据准备
/tmp/a.txt数据文件内容如下:
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
hive> CREATE TABLE xxm_t1 (
cookieid string,
createtime string, --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
hive> DESC xxm_t1;
cookieid STRING
createtime STRING
pv INT
hive> LOAD DATA INPATH '/tmp/a.txt' INTO TABLE xxm_t1;
hive> select * from xxm_t1;
OK
cookie1 2015-04-10 1
cookie1 2015-04-11 5
cookie1 2015-04-12 7
cookie1 2015-04-13 3
cookie1 2015-04-14 2
cookie1 2015-04-15 4
cookie1 2015-04-16 4
SUM — 结果和ORDER BY相关,默认为升序
SELECT cookieid,createtime,pv,SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1SUM(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行FROM xxm_t1;cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6-----------------------------------------------------------------------------cookie1 2015-04-10 1 1 1 26 1 6 26cookie1 2015-04-11 5 6 6 26 6 13 25cookie1 2015-04-12 7 13 13 26 13 16 20cookie1 2015-04-13 3 16 16 26 16 18 13cookie1 2015-04-14 2 18 18 26 17 21 10cookie1 2015-04-15 4 22 22 26 16 20 8cookie1 2015-04-16 4 26 26 26 13 13 4Time taken: 2.54 seconds, Fetched 7 row(s)
pv2: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW同pv1
pv3: 分组内(cookie1)所有的pv累加
pv4: ROWS BETWEEN 3 PRECEDING AND CURRENT ROW分组内当前行+往前3行
pv5: ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING分组内当前行+往前3行+往后1行
pv6: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING分组内当前行+往后所有行
如果不指定ORDER BY,则将分组内所有值累加;
ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点
SELECTcookieid,createtime,pv,AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1AVG(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行FROM xxm_t1;cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6-----------------------------------------------------------------------------cookie1 2015-04-10 1 1.0 1.0 3.7142857142857144 1.0 3.0 3.7142857142857144cookie1 2015-04-11 5 3.0 3.0 3.7142857142857144 3.0 4.333333333333333 4.166666666666667cookie1 2015-04-12 7 4.333333333333333 4.333333333333333 3.7142857142857144 4.333333333333333 4.0 4.0cookie1 2015-04-13 3 4.0 4.0 3.7142857142857144 4.0 3.6 3.25cookie1 2015-04-14 2 3.6 3.6 3.7142857142857144 4.25 4.2 3.3333333333333335cookie1 2015-04-15 4 3.6666666666666665 3.6666666666666665 3.7142857142857144 4.0 4.0 4.0cookie1 2015-04-16 4 3.7142857142857144 3.7142857142857144 3.7142857142857144 3.25 3.25 4.0Time taken: 2.619 seconds, Fetched 7 row(s)
--MINSELECT cookieid,createtime,pv,MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1MIN(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行FROM xxm_t1;cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6-----------------------------------------------------------------------------cookie1 2015-04-10 1 1 1 1 1 1 1cookie1 2015-04-11 5 1 1 1 1 1 2cookie1 2015-04-12 7 1 1 1 1 1 2cookie1 2015-04-13 3 1 1 1 1 1 2cookie1 2015-04-14 2 1 1 1 2 2 2cookie1 2015-04-15 4 1 1 1 2 2 4cookie1 2015-04-16 4 1 1 1 2 2 4Time taken: 1.703 seconds, Fetched 7 row(s)----MAXSELECT cookieid,createtime,pv,MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1MAX(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行FROM xxm_t1;cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6-----------------------------------------------------------------------------cookie1 2015-04-10 1 1 1 7 1 5 7cookie1 2015-04-11 5 5 5 7 5 7 7cookie1 2015-04-12 7 7 7 7 7 7 7cookie1 2015-04-13 3 7 7 7 7 7 4cookie1 2015-04-14 2 7 7 7 7 7 4cookie1 2015-04-15 4 7 7 7 7 7 4cookie1 2015-04-16 4 7 7 7 4 4 4Time taken: 1.492 seconds, Fetched 7 row(s)
文章转载自漫谈大数据与数据分析,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




