Hive Study Summary

数据与共享 2023-06-25

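I. Basic Concepts

1. What is Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-style querying, translating SQL statements into MapReduce jobs for execution. Hive serves as the entry point for querying and computing on data, so users do not need to think about how Hadoop's distributed architecture carries out the computation. Note: Hive tables do not support in-place row-level updates by default; in newer versions, UPDATE and DELETE are available only on transactional (ACID) tables.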
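
To see how a query is turned into execution stages, Hive's EXPLAIN statement can be placed in front of it. A minimal sketch, assuming a table named temp_tab (the one created in section II below) already exists; the plan that is printed depends on the Hive version and execution engine:

```sql
-- Print the execution plan instead of running the query.
explain
select name, count(*) as cnt
from temp_tab
group by name;
```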

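2. How to work with Hive

Through HQL statements. HQL is a dialect of SQL; in other words, if you have written SQL before, whether for Oracle, DB2, MySQL, or SQL Server, you can pick up HQL with little difficulty.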

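3. How Hive stores data

Storage has two parts: metadata (kept in the metastore) and the actual data.

Metadata: stored by default in an embedded Derby database; the metastore can also be configured to use MySQL. Metadata covers the basic properties of Hive tables: database name, table name, column names and types, partition columns and types, the table's partitions, and the location property of each partition.

Actual data: stored as files on Hadoop's HDFS, on the big data platform.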
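
One way to look at what the metastore records for a table is the describe formatted statement. A sketch, assuming the temp_tab table from section II exists:

```sql
-- Show the columns plus metastore details such as Location (the HDFS
-- path of the data files) and Table Type (MANAGED_TABLE or EXTERNAL_TABLE).
describe formatted temp_tab;
```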

II. Basic Operations

1. Log in to the Linux platform

Type: hive or beeline
Enter commands at the hive> prompt.

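2. Create your first table

# Create the table:
create table temp_tab(
id int,
name string) row format delimited fields terminated by '\t' stored as textfile;


# Notes:
The field delimiter is declared as a tab, so the text file being loaded
must also be tab-separated; otherwise the columns come back as NULL after loading. With row format delimited, Hive accepts only a single-character delimiter, and the default field delimiter is \001.


# Load data. With overwrite, the existing data in the table is replaced; without it, the new data is appended.
hive> load data local inpath '/home/ocdc/gxs/hive1.txt' overwrite into table temp_tab;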

3. Table types and how to create them

# Managed (internal) table (table names must be unique within a database, so pick another name if temp_tab already exists)
create table temp_tab(
id int,
name string) row format delimited fields terminated by ',' stored as textfile;


# External table
create external table temp_tab(
id int,
name string) row format delimited fields terminated by ',' stored as textfile;


# Partitioned table
create table test1_gxs1(
id int,
name string) partitioned by (sex string) row format delimited fields terminated by ',' stored as textfile;


# Load data into a partition:
load data local inpath '/home/ocdc/gxs/hive2.txt' overwrite into table test1_gxs1 partition(sex='unknown');

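4. Differences between managed and external tables

① A table created without the external keyword is a managed (internal) table; a table created with it is an external table.

② Hive manages the data of a managed table itself; the data of an external table stays under HDFS's control, and Hive only records where it is.

③ Managed table data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse); the storage location of external table data is specified by the user.

④ Dropping a managed table deletes both the metadata and the stored data; dropping an external table deletes only the metadata, and the files on HDFS are left in place.

⑤ Changes to a managed table are synchronized directly to the metadata, whereas after changing the structure or partitions of an external table, the metadata needs to be repaired (MSCK REPAIR TABLE table_name).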
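
Points ③ to ⑤ can be sketched as follows. The table name ext_logs, its columns, and the HDFS path /data/ext_logs are hypothetical, chosen only for illustration:

```sql
-- External table whose data lives at a user-chosen HDFS location.
create external table ext_logs(
  id int,
  name string)
partitioned by (dt string)
row format delimited fields terminated by ','
stored as textfile
location '/data/ext_logs';

-- If partition directories such as /data/ext_logs/dt=2023-06-25 are
-- created directly on HDFS, register them in the metastore.
msck repair table ext_logs;

-- Removes only the metadata; the files under /data/ext_logs remain.
drop table ext_logs;
```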

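5. Partitions and buckets

Partitions: a way of classifying and grouping data. A query that filters on the partition key reads only the matching partitions, which greatly improves query efficiency. In practice, "region" or "time" is commonly used as the partition key, which also makes the data easier to manage and maintain.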
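
For example, with the partitioned table test1_gxs1 from section II.3, a filter on the partition column lets Hive read only that partition's directory. A sketch, assuming a sex='boy' partition has been populated:

```sql
-- Only the files under the sex=boy partition directory are scanned.
select id, name
from test1_gxs1
where sex = 'boy';
```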

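Buckets: within each table or partition, Hive can further organize data into buckets, which divide the data at a finer granularity than partitions. Bucketing is also based on a column: Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record goes into.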
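
A bucketed table could be declared as below. The table name temp_bucketed and the bucket count of 4 are assumptions for illustration. The second query only approximates the bucket assignment with the built-in hash and pmod functions; the hashing Hive applies internally can differ by version and column type:

```sql
-- Rows are distributed into 4 buckets by the hash of id.
create table temp_bucketed(
  id int,
  name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
stored as textfile;

-- Illustrative only: hash the column, then take the remainder
-- modulo the number of buckets.
select id, pmod(hash(id), 4) as bucket_no
from temp_tab
limit 10;
```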

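Advantages of partitions and buckets:

More efficient query processing. Buckets add extra structure to a table, and Hive can exploit that structure for some queries, such as JOIN. When two tables are joined on a shared column and both are bucketed on it, only the buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data involved in the JOIN.

More efficient sampling. When working with large datasets, being able to trial-run a query on a small portion of the data while developing and revising it is very convenient.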
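
Sampling can be written with the tablesample clause. A sketch, assuming a table temp_bucketed (a hypothetical name) that is bucketed on id into 4 buckets:

```sql
-- Read roughly one quarter of the rows: those whose id hashes
-- into the first of 4 buckets.
select *
from temp_bucketed tablesample(bucket 1 out of 4 on id);
```

Because the table is already bucketed on the sampled column, Hive can read just the matching bucket instead of scanning the whole table.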

6. Importing and exporting data

# Load data from the local file system
load data local inpath '/home/ocdc/gxs/hive3.txt' overwrite into table temp_external1;


# Load data from HDFS (the file is moved into the table's directory)
load data inpath '***/hive3.txt' into table temp_external1;


# Export data to the local file system (the path is treated as a directory)
hive> insert overwrite local directory '/home/ocdc/gxs/ss1.txt' select id,name from temp_external1;


# Export to HDFS
hive> insert overwrite directory 'hdfs://master:9090/**/mate_load' select * from temp_external1;
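
By default the exported files use \001 as the field separator. To get comma-separated output, a row format can be added to the export statement (supported in Hive 0.11 and later). A sketch; the directory /tmp/hive_export is a hypothetical path:

```sql
-- Write the query result as comma-separated text files
-- into a local directory.
insert overwrite local directory '/tmp/hive_export'
row format delimited fields terminated by ','
select id, name from temp_external1;
```

Note that overwrite replaces whatever the target directory already contains, so it is safer to point it at a dedicated directory.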

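III. Basic Commands, Functions, and Query Statements

1. Basic table commands

# Show where a table is stored on HDFS (see the LOCATION clause in the output)
show create table table_name;


# Show a table's partitions:
hive> show partitions test1_gxs1;


# Rename a table:
hive> alter table temp_tab rename to temp_tab1;


# Add a column:
hive> alter table temp_tab1 add columns(sex string);


# Rename a column:
hive> alter table temp_tab1 change sex gender string;


# List built-in functions:
hive> show functions;


# Show what a specific function does:
hive> desc function year;


# Kill a job (a Hadoop command, run from the Linux shell rather than the hive> prompt; job_× stands for the job ID):
hadoop job -kill job_×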


# Show a table's structure:
hive> desc tab_name;


# Replace the column definitions:
hive> alter table temp_tab1 replace columns(id string,name string);


# Before the replacement:
hive> desc temp_tab1;
OK
id      int
name    string
gender  string


# After the replacement:
hive> desc temp_tab1;
OK
id      string
name    string

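2. Common built-in functions

Row count: count()
Average: avg()
Deduplication: distinct (a keyword rather than a function)
Minimum: min()
Maximum: max()
Substring extraction: substr()
Convert hexadecimal to decimal: conv(cell_id,16,10)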
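
A few of these in use. A sketch, assuming the two-column temp_tab table from section II; the literal arguments in the last two queries are made up for illustration, and a select without a from clause needs Hive 0.13 or later:

```sql
-- Aggregate functions over a table.
select count(*)             as row_cnt,
       count(distinct name) as distinct_names,
       min(id)              as min_id,
       max(id)              as max_id,
       avg(id)              as avg_id
from temp_tab;

-- substr positions start at 1, so this returns 'hi'.
select substr('hive', 1, 2);

-- conv works on string representations of numbers; this returns '255'.
select conv('ff', 16, 10);
```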

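3. Insert statements

# Dynamic-partition insert: the partition value (sex) is taken from the last column of the select.
# If Hive rejects it in strict mode, first run: set hive.exec.dynamic.partition.mode=nonstrict;
insert into table test1_gxs1 partition(sex)
select id,name,'boy' from temp_tab;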
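
When the partition value is a known constant, as here, it can instead be written directly in the partition clause (a static-partition insert), which does not depend on the dynamic-partition settings. A sketch using the same tables:

```sql
-- Static partition: the value is fixed in the partition clause,
-- so the select list contains only the non-partition columns.
insert into table test1_gxs1 partition(sex='boy')
select id, name from temp_tab;
```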

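4. Query statements:

# Limit the number of rows: the data volume in a big data center is naturally large, so always bound your queries; otherwise they run slowly, consume a lot of resources, and can bring everyone else's jobs to a halt.
select * from temp_tab limit 10;
# sort by orders rows within each reducer; use order by for a global ordering.
select * from temp_tab sort by id desc limit 5;


# Table joins follow the "small table first" rule: Hive buffers the earlier tables in a join and streams the last one, so putting the small table first keeps the buffered intermediate data small and avoids overflowing memory (for example, in a left join).


# This is also where the value of partitions (and buckets) becomes apparent.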
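
A join that follows the small-table-first rule might look like the sketch below. The names small_dim and big_fact are hypothetical, standing for a small lookup table and a large table that share an id column:

```sql
-- Smaller table first, larger table last, so the large one is streamed.
select s.id, s.name, b.name as big_name
from small_dim s
left join big_fact b
  on s.id = b.id
limit 10;
```

Hive also accepts a /*+ STREAMTABLE(alias) */ hint after select to name the table to stream explicitly, regardless of join order.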
