暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

想玩大数据没数据怎么办?1000G生产数据来了!

闲话少说聊数据 2021-06-14
1243


     Benchmark通俗来讲就是跑分软件,像鲁大娘或者手机跑分各种软件一样。数据行业也有跑分软件,就是TPC(事务处理性能理事会)这个组织搞起来的一套评价体系。你常见的关系型数据库Mysql,PG和大数据体系的Hive,Flink,Spark,Kylin,Hbase,Presto等数据库和计算机引擎,都会有评价报告。以及其他基于TPC的规范产生的细化,如今天讲的SSB。

     简单来说,它会提供统一的数据集,和统计的逻辑,给各大都说自己牛的数据库和引擎来跑分,看谁能跑完,跑得快。


一、关于TPC和SSB          

TPC(事务处理性能理事会),一个非盈利组织,定义了一套在决策支持系统领域(数据库)的基准测试规范。SSB(全称Star Schema Benchmark),是一套用于测试数据库产品在星型模式下性能表现的基准测试规范,目前在学术界和工业界都得到了广泛的使用。它将TPC-H的雪花模式简化为了星型模式,将基准查询由TPC-H的复杂Ad-Hoc查询改为了结构更固定的OLAP查询。

没错,SSB就是为了做OLAP,基于大家都容易理解的电商场景,而且只有5张表。(订单事实表,客户表,供应商表,日期维表,商品表)


二、如何下载SSB的数据集

    其实之前SSB他爹TPC的数据集是很容易下的,这几天发现需要各种验证了,可能是老美对我们做了限制,说好学术无GJ的。就找了一个Kylin官方提供的工具(数据由TPC->Hive->Kylin),我们只要到Hive就行。

整理官网步骤如下:

1、下载git代码

https://github.com/Kyligence/ssb-kylin

    git clone https://github.com/Kyligence/ssb-kylin.git


    # 如果连不上githup可以在gitee拉一份
    git clone https://gitee.com/biyinggun/ssb-kylin.git



    2、需要设置一个环境变量(这步一般是不需要的)

    如果hadoop-streaming.jar找不到的话,指定环境变量。

      export HADOOP_STREAMING_JAR=/usr/hdp/<version>/hadoop-mapreduce-client/hadoop-streaming.jar


      3、 确保你的HDFS目录权限

        hdfs dfs -chmod -R 777 user/root


        4、生成数据并导入到Hive

          .bin/run.sh# 后面参数10代表10G,可以不带参数 默认是0.1 范围是0.1~1000G
          # 生成10G是这样
          # bin/run.sh --scale 10

          可能产生的问题:在docker容器下执行的话,可能会产生权限问题
          1)文件权限的问题 可以都加上777权限


          5、结束后验证数据

            hive> show databases;
            OK
            default
            ssb
            hive> use ssb;
            hive> show tables;
            customer
            dates
            lineorder
            p_lineorder
            part
            supplier
            Time taken: 0.052 seconds, Fetched: 6 row(s)
            hive> select count(1) from lineorder;
            600572
            Time taken: 7.336 seconds, Fetched: 1 row(s)


            到此为止,我们就产生了10G的数据了,如果为了练习下载可以先下载1G。

            如果是为了测试性能做压测,按需下载,重新执行下载进会清楚原来的数据。


            三、欣赏一下专业的测试SQL写法 

            附上SSB提供的13个测试SQL(其实都没有你们写得漂亮)。

              # Q1.1
              select sum(v_revenue) as revenue
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              where d_year = 1993
              and lo_discount between 1 and 3
              and lo_quantity < 25;
              # Q1.2
              select sum(v_revenue) as revenue
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              where d_yearmonthnum = 199401
              and lo_discount between 4 and 6
              and lo_quantity between 26 and 35;


              Q 1.3
              select sum(v_revenue) as revenue
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              where d_weeknuminyear = 6 and d_year = 1994
              and lo_discount between 5 and 7
              and lo_quantity between 26 and 35;
              # Q2.1
              select sum(lo_revenue) as lo_revenue, d_year, p_brand
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join part on lo_partkey = p_partkey
              left join supplier on lo_suppkey = s_suppkey
              where p_category = 'MFGR#12' and s_region = 'AMERICA'
              group by d_year, p_brand
              order by d_year, p_brand;
              # Q2.2
              select sum(lo_revenue) as lo_revenue, d_year, p_brand
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join part on lo_partkey = p_partkey
              left join supplier on lo_suppkey = s_suppkey
              where p_brand between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA'
              group by d_year, p_brand
              order by d_year, p_brand;


              # Q 2.3
              select sum(lo_revenue) as lo_revenue, d_year, p_brand
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join part on lo_partkey = p_partkey
              left join supplier on lo_suppkey = s_suppkey
              where p_brand = 'MFGR#2221' and s_region = 'EUROPE'
              group by d_year, p_brand
              order by d_year, p_brand;


              # Q 3.1
              select c_nation, s_nation, d_year, sum(lo_revenue) as lo_revenue
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join customer on lo_custkey = c_custkey
              left join supplier on lo_suppkey = s_suppkey
              where c_region = 'ASIA' and s_region = 'ASIA'and d_year >= 1992 and d_year <= 1997
              group by c_nation, s_nation, d_year
              order by d_year asc, lo_revenue desc;


              # Q 3.2
              select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join customer on lo_custkey = c_custkey
              left join supplier on lo_suppkey = s_suppkey
              where c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES'
              and d_year >= 1992 and d_year <= 1997
              group by c_city, s_city, d_year
              order by d_year asc, lo_revenue desc;
              # Q3.3
              select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join customer on lo_custkey = c_custkey
              left join supplier on lo_suppkey = s_suppkey
              where (c_city='UNITED KI1' or c_city='UNITED KI5')
              and (s_city='UNITED KI1' or s_city='UNITED KI5')
              and d_year >= 1992 and d_year <= 1997
              group by c_city, s_city, d_year
              order by d_year asc, lo_revenue desc;


              # Q3.4
              select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join customer on lo_custkey = c_custkey
              left join supplier on lo_suppkey = s_suppkey
              where (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_yearmonth = 'Dec1997'
              group by c_city, s_city, d_year
              order by d_year asc, lo_revenue desc;


              # Q4.1
              select d_year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join customer on lo_custkey = c_custkey
              left join supplier on lo_suppkey = s_suppkey
              left join part on lo_partkey = p_partkey
              where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
              group by d_year, c_nation
              order by d_year, c_nation;


              # Q4.2
              select d_year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join customer on lo_custkey = c_custkey
              left join supplier on lo_suppkey = s_suppkey
              left join part on lo_partkey = p_partkey
              where c_region = 'AMERICA'and s_region = 'AMERICA'
              and (d_year = 1997 or d_year = 1998)
              and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
              group by d_year, s_nation, p_category
              order by d_year, s_nation, p_category;


              # 4.3
              select d_year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit
              from p_lineorder
              left join dates on lo_orderdate = d_datekey
              left join customer on lo_custkey = c_custkey
              left join supplier on lo_suppkey = s_suppkey
              left join part on lo_partkey = p_partkey
              where c_region = 'AMERICA'and s_nation = 'UNITED STATES'
              and (d_year = 1997 or d_year = 1998)
              and p_category = 'MFGR#14'
              group by d_year, s_city, p_brand
              order by d_year, s_city, p_brand;


              文章转载自闲话少说聊数据,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

              评论