Hive Data Filtering: Partitions and Buckets

大数据真有意思 2020-11-09

hadoop-3.1.1
hive-3.1.1

The Hive table:

hive> desc emp;
OK
empno                int                                      
ename                varchar(10)                              
job                  varchar(9)                               
mgr                  int                                      
hiredate             date                                     
sal                  float                                    
comm                 float                                    
deptno               int                                      
Time taken: 0.313 seconds, Fetched: 8 row(s)

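Data filtering in Hive

WHERE clause filtering
HAVING clause filtering
DISTINCT filtering
table filtering
partition filtering
bucket filtering
index filtering (applies to versions before Hive 3.0, which removed indexes)
column filtering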

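Partition filtering

Why partition

Hive introduced partitions to avoid full-table scans. Data is split into directories, and a query that names a partition reads only the matching directories. This shrinks the data set that has to be scanned and improves query efficiency.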

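Hive partitions vs. MySQL partitions

MySQL partitions on a column that belongs to the table. Hive partitions on a column outside the table's data columns, that is, a pseudo-column. The partition column is declared when the table is created, and its value is carried by the directory name rather than stored in the data files.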

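Dynamic partitions in Hive

-- Whether dynamic partitioning is allowed
SET hive.exec.dynamic.partition=true;
-- Dynamic partition mode (strict here)
-- strict: at least one partition column must be static (given a fixed value)
-- nonstrict: all partition columns may be dynamic
SET hive.exec.dynamic.partition.mode=strict;  -- or nonstrict
-- Maximum number of dynamic partitions that may be created in total
SET hive.exec.max.dynamic.partitions=1000;
-- Maximum number of dynamic partitions that may be created on a single node
SET hive.exec.max.dynamic.partitions.pernode=100;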
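
The difference between the strict and nonstrict modes can be illustrated with a small Java sketch. It only models the rule stated in the comments above and is not Hive code: the class DynamicPartitionModeSketch and the method isAllowed are hypothetical names, and a null value in the map is this sketch's own convention for marking a dynamic partition column.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the dynamic partition mode rule; not part of Hive.
class DynamicPartitionModeSketch {

    // partitionSpec maps each partition column to its fixed value,
    // or to null when the column is dynamic (filled from the query result).
    // Intended for inserts that use at least one dynamic partition column.
    static boolean isAllowed(String mode, Map<String, String> partitionSpec) {
        if ("nonstrict".equals(mode)) {
            return true; // every partition column may be dynamic
        }
        // strict: at least one partition column must have a fixed value
        for (String value : partitionSpec.values()) {
            if (value != null) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> allDynamic = new LinkedHashMap<>();
        allDynamic.put("year", null);
        allDynamic.put("month", null);

        Map<String, String> oneStatic = new LinkedHashMap<>();
        oneStatic.put("year", "2020");
        oneStatic.put("month", null);

        System.out.println(isAllowed("strict", allDynamic));
        System.out.println(isAllowed("strict", oneStatic));
        System.out.println(isAllowed("nonstrict", allDynamic));
    }
}
```

In strict mode the insert with every partition column dynamic is rejected, while fixing the value of one column is enough to make it acceptable.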

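Using partition filtering in HiveSQL

Partition filtering in HiveSQL means adding a condition on the partition column to the WHERE clause. An ordinary WHERE condition is evaluated in the Map stage, or directly in a Fetch task, where it discards the rows that fail the condition. A condition on a partition column is applied one step before the Map stage, at the data input stage, where it filters the input paths.

A Hive partition exists as a directory in the distributed file system: one partition corresponds to one directory. A partitioned table looks like this in HDFS: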

$ hdfs dfs -ls /user/hive/warehouse/dept_partition
Found 3 items
drwxr-xr-x   - shaozhipeng supergroup          0 2020-05-24 17:48 /user/hive/warehouse/dept_partition/month=202005
drwxr-xr-x   - shaozhipeng supergroup          0 2020-05-24 17:54 /user/hive/warehouse/dept_partition/month=202006
drwxr-xr-x   - shaozhipeng supergroup          0 2020-05-24 17:55 /user/hive/warehouse/dept_partition/month=202007
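
Because each partition is a directory, partition filtering amounts to choosing which directories become job input. The following is a minimal Java sketch of that idea, not Hive's actual pruning code. The class PartitionPruningSketch and its prune method are hypothetical names, and the sketch handles only a single-level partition with an equality condition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; Hive performs partition pruning in its own planner.
class PartitionPruningSketch {

    // Keeps the partition directories whose last path segment is exactly "column=value".
    static List<String> prune(List<String> partitionDirs, String column, String value) {
        String wanted = column + "=" + value;
        List<String> selected = new ArrayList<>();
        for (String dir : partitionDirs) {
            String leaf = dir.substring(dir.lastIndexOf('/') + 1);
            if (leaf.equals(wanted)) {
                selected.add(dir);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList(
            "/user/hive/warehouse/dept_partition/month=202005",
            "/user/hive/warehouse/dept_partition/month=202006",
            "/user/hive/warehouse/dept_partition/month=202007");
        // Similar in spirit to a query with: WHERE month = '202006'
        System.out.println(prune(dirs, "month", "202006"));
    }
}
```

Only the month=202006 directory is selected, so the files under the other two directories are never opened.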

How a MapReduce job sets its input file path:

@Test
// Test regular inputformat
public void testNewInputFormat() throws Exception {
  Job job = new Job(conf, "orc test");
  job.setInputFormatClass(OrcNewInputFormat.class);
  job.setJarByClass(TestNewInputOutputFormat.class);
  job.setMapperClass(OrcTestMapper1.class);
  job.setNumReduceTasks(0);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
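  // Set the input path, which may be a directory or a single file.
  // Partition filtering can take effect right here at FileInputFormat:
  // only the paths of the selected partitions are added as input.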
  FileInputFormat.addInputPath(job,
      new Path(HiveTestUtils.getFileFromClasspath("orc-file-11-format.orc")));
  Path outputPath = new Path(workDir,
      "TestOrcFile." + testCaseName.getMethodName() + ".txt");
  localFs.delete(outputPath, true);
  FileOutputFormat.setOutputPath(job, outputPath);
  boolean result = job.waitForCompletion(true);
  assertTrue(result);
  Path outputFilePath = new Path(outputPath, "part-m-00000");

  assertTrue(localFs.exists(outputFilePath));
  BufferedReader reader = new BufferedReader(
      new InputStreamReader(localFs.open(outputFilePath)));
  int count=0;
  String line;
  String lastLine=null;
  while ((line=reader.readLine()) != null) {
    count++;
    lastLine = line;
  }
  reader.close();
  assertEquals(count, 7500);
  assertEquals(lastLine, "{true, 100, 2048, 65536," +
      " 9223372036854775807, 2.0, -5.0" + 
      ", , bye, {[{1, bye}, {2, sigh}]}, [{100000000, cat}," +
      " {-100000, in}, {1234, hat}]," +
      " {chani={5, chani}, mauddib={1, mauddib}}," +
      " 2000-03-12 15:00:01, 12345678.6547457}");
  localFs.delete(outputPath, true);
}

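Bucket filtering

Bucketing hashes the value of a column and stores rows in different files according to the hash. Any Hive table or partition can be further divided into buckets.

The hash of the column value modulo the number of buckets decides which bucket each row goes into.

Use cases for bucketing: data sampling and map-side joins.

Partitioning filters directories; bucketing filters files.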

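The bucket for each record is computed as:
bucket = mod(hash(value of the bucketing column), number of buckets)

Hive stores rows with the same mod result together. When a query condition includes the bucketing column, Hive can go straight to the corresponding file and avoid scanning every file block. For large tables with tens or hundreds of thousands of files this can greatly shorten read time, and it also makes map-side joins easier.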
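
The bucket formula above can be sketched in Java as follows. This is a simplified illustration under stated assumptions: the class BucketSketch is a hypothetical name, and Java's own Integer.hashCode stands in for the hash, whereas the hash Hive really uses depends on the column type and on the table's bucketing_version. The sign bit is cleared so that a negative hash still yields a valid bucket number.

```java
// Hypothetical illustration of bucket = mod(hash(value), number of buckets).
class BucketSketch {

    // Clears the sign bit, then takes the remainder, so the result is in [0, numBuckets).
    static int bucketFor(int hash, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4; // assumed bucket count for this example
        for (int id = 1; id <= 9; id++) {
            // Stand-in hash: for an int, Java's hashCode is the value itself.
            int hash = Integer.hashCode(id);
            System.out.println("id=" + id + " -> bucket " + bucketFor(hash, numBuckets));
        }
    }
}
```

With this stand-in hash and four buckets, ids 1, 5 and 9 all map to bucket 1, so a lookup on one of those ids would only need the file that holds that bucket.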

A bucketed table looks like this:

hive> desc formatted stu_buck;
OK
# col_name             data_type            comment             
id                   int                                      
name                 string                                   
    
# Detailed Table Information    
Database:            default               
OwnerType:           USER                  
Owner:               shaozhipeng           
CreateTime:          Mon May 25 22:20:54 CST 2020  
LastAccessTime:      UNKNOWN               
Retention:           0                     
Location:            hdfs://localhost/user/hive/warehouse/stu_buck  
Table Type:          MANAGED_TABLE         
Table Parameters:    
 COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
 bucketing_version    2  
  # number of files
 numFiles             4                   
 numRows              19                  
 rawDataSize          159                 
 totalSize            178                 
 transient_lastDdlTime 1590417784          
    
# Storage Information    
SerDe Library:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
InputFormat:         org.apache.hadoop.mapred.TextInputFormat  
OutputFormat:        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  
Compressed:          No                    
Num Buckets:         4                     
Bucket Columns:      [id]                  
Sort Columns:        []                    
Storage Desc Params:    
 field.delim          \t                  
 serialization.format \t                  
Time taken: 0.296 seconds, Fetched: 33 row(s)

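Note that hive.enforce.bucketing no longer needs to be set: the property was removed in Hive 2.x, so bucketing is effectively always enforced on insert. From the Hive Configuration Properties documentation: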
Configuration Properties#hive.enforce.bucketing (Hive 0.x and 1.x only)

hive.enforce.bucketing
Default Value:
Hive 0.x: false
Hive 1.x: false
Hive 2.x: removed, which effectively makes it always true (HIVE-12331)
Added In: Hive 0.6.0
Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced.
