Benchmark通俗来讲就是跑分软件,像鲁大娘或者手机跑分各种软件一样。数据行业也有跑分软件,就是TPC(事务处理性能理事会)这个组织搞起来的一套评价体系。你常见的关系型数据库Mysql,PG和大数据体系的Hive,Flink,Spark,Kylin,Hbase,Presto等数据库和计算机引擎,都会有评价报告。以及其他基于TPC的规范产生的细化,如今天讲的SSB。
简单来说,它会提供统一的数据集,和统计的逻辑,给各大都说自己牛的数据库和引擎来跑分,看谁能跑完,跑得快。
一、关于TPC和SSB
TPC(事务处理性能理事会),一个非盈利组织,定义了一套在决策支持系统领域(数据库)的基准测试规范。SSB(全称Star Schema Benchmark),是一套用于测试数据库产品在星型模式下性能表现的基准测试规范,目前在学术界和工业界都得到了广泛的使用。它将TPC-H的雪花模式简化为了星型模式,将基准查询由TPC-H的复杂Ad-Hoc查询改为了结构更固定的OLAP查询。
没错,SSB就是为了做OLAP,基于大家都容易理解的电商场景,而且只有5张表。(订单事实表,客户表,供应商表,日期维表,商品表)

二、如何下载SSB的数据集
其实之前SSB他爹TPC的数据集是很容易下的,这几天发现需要各种验证了,可能是老美对我们做了限制,说好学术无GJ的。就找了一个Kylin官方提供的工具(数据由TPC->Hive->Kylin),我们只要到Hive就行。
整理官网步骤如下:
1、下载git代码
https://github.com/Kyligence/ssb-kylin
git clone https://github.com/Kyligence/ssb-kylin.git# 如果连不上githup可以在gitee拉一份git clone https://gitee.com/biyinggun/ssb-kylin.git
2、需要设置一个环境变量(这步一般是不需要的)
如果hadoop-streaming.jar找不到的话,指定环境变量。
export HADOOP_STREAMING_JAR=/usr/hdp/<version>/hadoop-mapreduce-client/hadoop-streaming.jar
3、 确保你的HDFS目录权限
hdfs dfs -chmod -R 777 user/root
4、生成数据并导入到Hive
.bin/run.sh# 后面参数10代表10G,可以不带参数 默认是0.1 范围是0.1~1000G# 生成10G是这样# bin/run.sh --scale 10
可能产生的问题:在docker容器下执行的话,可能会产生权限问题1)文件权限的问题 可以都加上777权限
5、结束后验证数据
hive> show databases;OKdefaultssbhive> use ssb;hive> show tables;customerdateslineorderp_lineorderpartsupplierTime taken: 0.052 seconds, Fetched: 6 row(s)hive> select count(1) from lineorder;600572Time taken: 7.336 seconds, Fetched: 1 row(s)
到此为止,我们就产生了10G的数据了,如果为了练习下载可以先下载1G。
如果是为了测试性能做压测,按需下载,重新执行下载进会清楚原来的数据。
三、欣赏一下专业的测试SQL写法
附上SSB提供的13个测试SQL(其实都没有你们写得漂亮)。
# Q1.1select sum(v_revenue) as revenuefrom p_lineorderleft join dates on lo_orderdate = d_datekeywhere d_year = 1993and lo_discount between 1 and 3and lo_quantity < 25;# Q1.2select sum(v_revenue) as revenuefrom p_lineorderleft join dates on lo_orderdate = d_datekeywhere d_yearmonthnum = 199401and lo_discount between 4 and 6and lo_quantity between 26 and 35;Q 1.3select sum(v_revenue) as revenuefrom p_lineorderleft join dates on lo_orderdate = d_datekeywhere d_weeknuminyear = 6 and d_year = 1994and lo_discount between 5 and 7and lo_quantity between 26 and 35;# Q2.1select sum(lo_revenue) as lo_revenue, d_year, p_brandfrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join part on lo_partkey = p_partkeyleft join supplier on lo_suppkey = s_suppkeywhere p_category = 'MFGR#12' and s_region = 'AMERICA'group by d_year, p_brandorder by d_year, p_brand;# Q2.2select sum(lo_revenue) as lo_revenue, d_year, p_brandfrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join part on lo_partkey = p_partkeyleft join supplier on lo_suppkey = s_suppkeywhere p_brand between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA'group by d_year, p_brandorder by d_year, p_brand;# Q 2.3select sum(lo_revenue) as lo_revenue, d_year, p_brandfrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join part on lo_partkey = p_partkeyleft join supplier on lo_suppkey = s_suppkeywhere p_brand = 'MFGR#2221' and s_region = 'EUROPE'group by d_year, p_brandorder by d_year, p_brand;# Q 3.1select c_nation, s_nation, d_year, sum(lo_revenue) as lo_revenuefrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join customer on lo_custkey = c_custkeyleft join supplier on lo_suppkey = s_suppkeywhere c_region = 'ASIA' and s_region = 'ASIA'and d_year >= 1992 and d_year <= 1997group by c_nation, s_nation, d_yearorder by d_year asc, lo_revenue desc;# Q 3.2select c_city, s_city, d_year, sum(lo_revenue) as lo_revenuefrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join customer on lo_custkey = c_custkeyleft join supplier on lo_suppkey = s_suppkeywhere c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES'and d_year >= 1992 and d_year <= 1997group by c_city, s_city, d_yearorder by d_year asc, lo_revenue desc;# Q3.3select c_city, s_city, d_year, sum(lo_revenue) as lo_revenuefrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join customer on lo_custkey = c_custkeyleft join supplier on lo_suppkey = s_suppkeywhere (c_city='UNITED KI1' or c_city='UNITED KI5')and (s_city='UNITED KI1' or s_city='UNITED KI5')and d_year >= 1992 and d_year <= 1997group by c_city, s_city, d_yearorder by d_year asc, lo_revenue desc;# Q3.4select c_city, s_city, d_year, sum(lo_revenue) as lo_revenuefrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join customer on lo_custkey = c_custkeyleft join supplier on lo_suppkey = s_suppkeywhere (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_yearmonth = 'Dec1997'group by c_city, s_city, d_yearorder by d_year asc, lo_revenue desc;# Q4.1select d_year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profitfrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join customer on lo_custkey = c_custkeyleft join supplier on lo_suppkey = s_suppkeyleft join part on lo_partkey = p_partkeywhere c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')group by d_year, c_nationorder by d_year, c_nation;# Q4.2select d_year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profitfrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join customer on lo_custkey = c_custkeyleft join supplier on lo_suppkey = s_suppkeyleft join part on lo_partkey = p_partkeywhere c_region = 'AMERICA'and s_region = 'AMERICA'and (d_year = 1997 or d_year = 1998)and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')group by d_year, s_nation, p_categoryorder by d_year, s_nation, p_category;# 4.3select d_year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profitfrom p_lineorderleft join dates on lo_orderdate = d_datekeyleft join customer on lo_custkey = c_custkeyleft join supplier on lo_suppkey = s_suppkeyleft join part on lo_partkey = p_partkeywhere c_region = 'AMERICA'and s_nation = 'UNITED STATES'and (d_year = 1997 or d_year = 1998)and p_category = 'MFGR#14'group by d_year, s_city, p_brandorder by d_year, s_city, p_brand;




