The Hive Data Warehouse

科技Dou知道 2021-01-20

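Hive is a data warehouse tool built on Hadoop for extracting, transforming, and loading (ETL) data. It provides a way to store, query, and analyze large-scale data held in Hadoop. Hive maps structured data files onto database tables, offers SQL query capability, and converts SQL statements into MapReduce jobs for execution.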

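Hive's main advantage is its low learning cost: SQL-like statements are enough to produce MapReduce statistics quickly, so there is no need to develop dedicated MapReduce applications. This makes Hive well suited to statistical analysis over a data warehouse.[1]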

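Hive is built on Hadoop, a static batch-processing system. Hadoop generally has high latency and incurs considerable overhead when submitting and scheduling jobs. As a result, Hive cannot deliver low-latency queries on large datasets; for example, a Hive query over a dataset of a few hundred MB typically takes on the order of minutes.[2]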

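Hive is therefore a poor fit for applications that need real-time responses, such as online transaction processing (OLTP). Query execution strictly follows the Hadoop MapReduce job model: Hive's interpreter translates the user's HiveQL statements into MapReduce jobs and submits them to the Hadoop cluster, Hadoop monitors the jobs as they run, and the results are then returned to the user. Hive was not designed for OLTP, and it offers neither real-time queries nor row-level updates. It works best for batch jobs over large datasets, such as web log analysis.[3]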
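As a rough analogy for the map, shuffle, and reduce stages behind an aggregation query, the sketch below counts log lines per status code with ordinary shell tools. It only mirrors the stages conceptually and is not how Hive runs a query. The log layout (status code in the third field) and the function name count_by_status are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Conceptual analogy only. Roughly the shape of:
#   SELECT status, COUNT(*) FROM access_log GROUP BY status;
# (access_log and its layout are hypothetical)
count_by_status() {
  # map: emit the grouping key (third whitespace-separated field)
  awk '{ print $3 }' |
  # shuffle: sort so that identical keys end up next to each other
  sort |
  # reduce: count each group and print "key<TAB>count"
  uniq -c | awk '{ printf "%s\t%s\n", $2, $1 }'
}

sample_log='10.0.0.1 /index.html 200
10.0.0.2 /missing 404
10.0.0.3 /index.html 200'

result="$(printf '%s\n' "$sample_log" | count_by_status)"
printf '%s\n' "$result"
```

In a real job each stage runs as distributed tasks across the cluster, and scheduling those tasks is where the startup latency described above comes from.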

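Hive metadata: put simply, metadata is the descriptive information about the data stored in Hive. It usually includes table names, each table's columns and partitions along with their attributes, table properties (such as whether a table is internal or external), and the directory that holds the table's data. By default, the metastore lives in the bundled Derby database. The drawbacks are that Derby does not support multiple concurrent users and that the storage location is not fixed: the database is created wherever Hive is started, which makes it very hard to manage. In practice, metadata is usually kept in a MySQL database that we create ourselves (local or remote), and Hive talks to MySQL through the MetaStore service.[4]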

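Hive can be deployed in three modes: embedded, local, and remote.

(1) Embedded mode: metadata is stored in the embedded Derby database. This is Hive's default installation mode. It is simple to configure, but only one client can connect at a time, so it suits testing rather than production.

(2) Local mode: metadata is stored in an external database. No separate Metastore service needs to be started, because the Metastore service runs in the same process as Hive.

(3) Remote mode: like local mode, metadata is stored in an external database. The difference is that the Metastore service must be started separately, and every client points to that Metastore service in its configuration file. In remote mode, the Metastore service and Hive run in different processes.[5]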
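The three modes can be read as a decision over two configuration values: the metastore JDBC URL (javax.jdo.option.ConnectionURL) and the remote Metastore address (hive.metastore.uris). The sketch below encodes that reading. The function name hive_deploy_mode is hypothetical, and treating an empty JDBC URL as the Derby default is a simplifying assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a Hive deployment from two configuration values.
#   $1: javax.jdo.option.ConnectionURL (may be empty)
#   $2: hive.metastore.uris            (may be empty)
hive_deploy_mode() {
  jdbc_url="$1"
  metastore_uris="$2"
  # A client that points at a separate Metastore service is in remote mode.
  if [ -n "$metastore_uris" ]; then
    echo "remote"
    return
  fi
  case "$jdbc_url" in
    # No URL (assumed Derby default) or an explicit Derby URL: embedded mode.
    ""|jdbc:derby:*) echo "embedded" ;;
    # Any other external database with an in-process Metastore: local mode.
    *)               echo "local" ;;
  esac
}

hive_deploy_mode "" ""
hive_deploy_mode "jdbc:mysql://slave2:3306/hive" ""
hive_deploy_mode "" "thrift://slave1:9083"
```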

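Deploying Hive

The steps below demonstrate a remote-mode deployment: the master node is the Hive client, slave1 is the Hive server (that is, the Metastore service), and slave2 hosts the MySQL service.

(1) Install and configure the MySQL service (on the slave2 node)

Install MySQL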

$ sudo apt-get update
$ sudo apt-get install -y mysql-server-5.7
$ sudo mysql_secure_installation

The initial setup asks a number of questions, shown below:

#1
VALIDATE PASSWORD PLUGIN can be used to test passwords...
Press y|Y for Yes, any other key for No: N (my choice)


#2
Please set the password for root here...
New password: 123456 (enter the password)
Re-enter new password: 123456 (enter it again)


#3
By default, a MySQL installation has an anonymous user,
allowing anyone to log into MySQL without having to have
a user account created for them...
Remove anonymous users? (Press y|Y for Yes, any other key for No) : N (my choice)


#4
Normally, root should only be allowed to connect from
'localhost'. This ensures that someone cannot guess at
the root password from the network...
Disallow root login remotely? (Press y|Y for Yes, any other key for No) : Y (my choice)


#5
By default, MySQL comes with a database named 'test' that
anyone can access...
Remove test database and access to it? (Press y|Y for Yes, any other key for No) : N (my choice)


#6
Reloading the privilege tables will ensure that all changes
made so far will take effect immediately.
Reload privilege tables now? (Press y|Y for Yes, any other key for No) : Y (my choice)

Check whether the MySQL service is running

$ sudo service mysql status

Output like the following shows that the MySQL service is running

hadoop@slave2:~$ sudo service mysql status
● mysql.service - MySQL Community Server
   Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-01-20 04:02:50 UTC; 11min ago
  Process: 4325 ExecStart=/usr/sbin/mysqld --daemonize --pid-file=/run/mysqld/mysqld.pid (code=exited, status=0/SUCCESS)
  Process: 4294 ExecStartPre=/usr/share/mysql/mysql-systemd-start pre (code=exited, status=0/SUCCESS)
 Main PID: 4327 (mysqld)
    Tasks: 28 (limit: 4215)
   CGroup: /system.slice/mysql.service
           └─4327 /usr/sbin/mysqld --daemonize --pid-file=/run/mysqld/mysqld.pid


Jan 20 04:02:49 slave2 systemd[1]: Starting MySQL Community Server...
Jan 20 04:02:50 slave2 systemd[1]: Started MySQL Community Server.

If the MySQL service is not running, start it with the following command

$ sudo service mysql start
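The check-then-start step can also be scripted. The sketch below assumes the status text contains the line "Active: active (running)" as in the output above; the helper name mysql_is_running is hypothetical, and the sample text stands in for real command output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `service mysql status` output on stdin and
# succeed only if it reports an active, running unit.
mysql_is_running() {
  grep -q 'Active: active (running)'
}

# Sample text standing in for the real command output.
sample_status='   Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-01-20 04:02:50 UTC; 11min ago'

if printf '%s\n' "$sample_status" | mysql_is_running; then
  state="running"
else
  state="stopped"
fi
echo "MySQL is $state"
```

On slave2 the same helper could be combined with the real commands, for example: sudo service mysql status | mysql_is_running || sudo service mysql start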

Configure remote access

$ sudo mysql -uroot -p123456
> use mysql;
> create user 'hadoop'@'%' identified by '123456';
> grant all privileges on *.* to 'hadoop'@'%' with grant option;
> flush privileges;
> exit;

(2) Install the Hive server (on the slave1 node)

Use Xftp to upload the apache-hive-2.1.1-bin.tar.gz package to ~/software/, then extract it

$ tar -zxvf ~/software/apache-hive-2.1.1-bin.tar.gz -C ~/servers

Rename the directory

$ mv ~/servers/apache-hive-2.1.1-bin ~/servers/hive

Edit the environment variables

$ vim ~/.bashrc

Append the following to the end of the file

export HIVE_HOME=/home/hadoop/servers/hive
export PATH=$PATH:$HIVE_HOME/bin

Apply the environment variables

$ source ~/.bashrc
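A quick sanity check after editing ~/.bashrc is to confirm that HIVE_HOME is set and that its bin directory is on PATH. The sketch below takes both values as arguments so it can be tried without touching the real environment; the function name check_hive_env is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that a Hive home is given and that its bin
# directory appears in a PATH string.
#   $1: value of HIVE_HOME, $2: value of PATH
check_hive_env() {
  hive_home="$1"
  path_value="$2"
  if [ -z "$hive_home" ]; then
    echo "HIVE_HOME is not set"
    return 1
  fi
  # Wrap PATH in colons so the first and last entries match like the others.
  case ":$path_value:" in
    *":$hive_home/bin:"*) echo "ok" ;;
    *) echo "$hive_home/bin is not on PATH"; return 1 ;;
  esac
}

# Example with the values used in this article.
env_check="$(check_hive_env /home/hadoop/servers/hive "/usr/bin:/bin:/home/hadoop/servers/hive/bin")"
echo "$env_check"
```

Against the live shell it would be called as check_hive_env "$HIVE_HOME" "$PATH".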

Configure the hive-env.sh file

$ cp ~/servers/hive/conf/hive-env.sh.template ~/servers/hive/conf/hive-env.sh
$ vim ~/servers/hive/conf/hive-env.sh

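Append the following to the end of the file

# Hadoop installation path
HADOOP_HOME=/home/hadoop/servers/hadoop
# Directory that holds Hive's configuration files
export HIVE_CONF_DIR=/home/hadoop/servers/hive/conf
# Directory that holds Hive's auxiliary jars
export HIVE_AUX_JARS_PATH=/home/hadoop/servers/hive/lib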

Configure the hive-site.xml file

$ vim ~/servers/hive/conf/hive-site.xml

Add the following

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://slave2:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
        <description>JDBC connection URL for the MySQL metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>JDBC driver class</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hadoop</value>
        <description>MySQL login user name</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>MySQL login password</description>
    </property>
</configuration>
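Note that the JDBC URL in hive-site.xml writes each & as &amp;, because a bare ampersand is not valid inside XML text. The sketch below shows one way to produce the escaped form from the plain URL; the helper name xml_escape is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: escape the characters that are special in XML text.
# The ampersand is handled first so the later entities are not re-escaped.
xml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Plain JDBC URL as a MySQL client would see it.
jdbc_url='jdbc:mysql://slave2:3306/hive?createDatabaseIfNotExist=true&useSSL=false&useUnicode=true&characterEncoding=UTF-8'

escaped="$(xml_escape "$jdbc_url")"
printf '<value>%s</value>\n' "$escaped"
```

Hive reads the value back with the entities decoded, which is why tools that print the connection URL show plain & characters.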

Use Xftp to upload the mysql-connector-java-5.1.47-bin.jar driver package to ~/software/, then copy the MySQL JDBC driver into the lib directory under the Hive installation path

$ cp ~/software/mysql-connector-java-5.1.47-bin.jar ~/servers/hive/lib

Distribute Hive to the client

$ scp -r ~/servers/hive master:~/servers

(3) Install the Hive client (on the master node)

Configure the hive-site.xml file

$ rm -f ~/servers/hive/conf/hive-site.xml
$ vim ~/servers/hive/conf/hive-site.xml

Add the following

<configuration>
    <property>
        <name>hive.metastore.local</name>
        <value>false</value>
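        <description>Whether to connect through a local Metastore (has no effect since Hive 0.10, where hive.metastore.uris alone selects remote mode)</description>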
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://slave1:9083</value>
        <description>URI of the remote Metastore server</description>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Show the current database name in the CLI prompt</description>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Print column headers in query results</description>
    </property>
</configuration>
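To double-check what a client will actually connect to, a property value can be read back from hive-site.xml. The sketch below is a minimal text-based lookup, not an XML parser: it assumes each <name> and <value> tag sits on its own line, as in the files in this article. The helper name get_hive_property and the scratch file are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the <value> that follows a given <name>.
#   $1: path to hive-site.xml, $2: property name
# Assumes one tag per line; this is not a general XML parser.
get_hive_property() {
  awk -v want="<name>$2</name>" '
    index($0, want) { found = 1; next }
    found && match($0, /<value>.*<\/value>/) {
      # Strip the 7-character opening tag and the 8-character closing tag.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Scratch copy standing in for ~/servers/hive/conf/hive-site.xml.
demo_site="$(mktemp)"
cat > "$demo_site" <<'EOF'
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://slave1:9083</value>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
</configuration>
EOF

uri="$(get_hive_property "$demo_site" hive.metastore.uris)"
echo "$uri"
```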

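Edit the environment variables

$ vim ~/.bashrc

Append the following to the end of the file

export HIVE_HOME=/home/hadoop/servers/hive
export PATH=$PATH:$HIVE_HOME/bin

Apply the environment variables

$ source ~/.bashrc

At this point, the remote-mode deployment of Hive is complete.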

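Starting Hive

(1) Start the Hadoop cluster (on the master node). If you have not set up a Hadoop cluster yet, reply with the keyword hadoop入门 in the official account's chat to get an illustrated tutorial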

$ start-all.sh

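(2) Start the MySQL service (on the slave2 node): see step (1) of the Hive deployment above; it is not repeated here

(3) Start the Metastore service (on the slave1 node)

Before the first start, initialize the metastore schema

$ schematool -dbType mysql -initSchema

Output like the following shows that initialization succeeded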

hadoop@slave1:~$ schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/servers/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/servers/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:   jdbc:mysql://slave2:3306/hive?createDatabaseIfNotExist=true&useSSL=false&useUnicode=true&characterEncoding=UTF-8
Metastore Connection Driver :   com.mysql.jdbc.Driver
Metastore connection User:   hadoop
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.mysql.sql
Initialization script completed
schemaTool completed

Output like the following, however, shows that initialization failed

hadoop@slave1:~$ schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/servers/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/servers/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:   jdbc:mysql://slave2:3306/hive?createDatabaseIfNotExist=true&useSSL=false&useUnicode=true&characterEncoding=UTF-8
Metastore Connection Driver :   com.mysql.jdbc.Driver
Metastore connection User:   hadoop
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
Underlying cause: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException : Communications link failure


The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
SQL Error code: 0
Use --verbose for detailed stacktrace.
*** schemaTool failed ***
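The two outcomes can be told apart by the last line schematool prints: "schemaTool completed" on success and "*** schemaTool failed ***" on failure. The sketch below classifies captured output on that basis. The helper name schematool_status is hypothetical, and matching on these two strings is an assumption drawn from the logs above rather than a documented interface.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read schematool output on stdin and print
# "completed", "failed", or "unknown" based on the marker lines.
schematool_status() {
  output="$(cat)"
  case "$output" in
    *"schemaTool failed"*)    echo "failed" ;;
    *"schemaTool completed"*) echo "completed" ;;
    *)                        echo "unknown" ;;
  esac
}

# Sample text standing in for real output.
ok_log='Initialization script completed
schemaTool completed'
bad_log='Use --verbose for detailed stacktrace.
*** schemaTool failed ***'

printf '%s\n' "$ok_log"  | schematool_status
printf '%s\n' "$bad_log" | schematool_status
```

On slave1 it could be fed the real output, for example: schematool -dbType mysql -initSchema 2>&1 | schematool_status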

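To resolve the error "The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.", run the following command on slave2

$ sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf

With bind-address=127.0.0.1, MySQL accepts connections only from the local machine, so slave1 cannot reach it. Change bind-address=127.0.0.1 to bind-address=0.0.0.0, restart MySQL with $ sudo service mysql restart, and then initialize the metastore schema again on the slave1 node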
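The bind-address change can also be made without an interactive editor. The sketch below rewrites the setting with sed and demonstrates it on a scratch file; the function name set_bind_address is hypothetical. Applying it to the real /etc/mysql/mysql.conf.d/mysqld.cnf would need root privileges, and a backup of the file beforehand is advisable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set bind-address in a mysqld config file.
#   $1: path to the config file, $2: address to listen on
set_bind_address() {
  cnf="$1"
  addr="$2"
  tmp="$(mktemp)" || return 1
  # Rewrite any existing bind-address line, whatever its spacing.
  if sed -E "s/^[[:space:]]*bind-address[[:space:]]*=.*/bind-address = ${addr}/" "$cnf" > "$tmp"; then
    # Write back through redirection so the file keeps its owner and mode.
    cat "$tmp" > "$cnf"
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}

# Scratch file standing in for mysqld.cnf.
demo_cnf="$(mktemp)"
printf '[mysqld]\nbind-address\t\t= 127.0.0.1\n' > "$demo_cnf"

set_bind_address "$demo_cnf" 0.0.0.0
grep 'bind-address' "$demo_cnf"
```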

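Next, start the Metastore service on the slave1 node (the trailing & runs it in the background, and nohup keeps it running after the terminal is closed)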

$ nohup hive --service metastore &

(4) Start the Hive client (on the master node)

Start hive directly

$ hive

Output like the following shows that Hive has started

hadoop@master:~$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/servers/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/servers/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]


Logging initialized using configuration in jar:file:/home/hadoop/servers/hive/lib/hive-common-2.1.1.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive (default)>