
The Ultimate Guide! Setting Up Data Lake Environments for Apache Hudi, Iceberg, and Delta

ApacheHudi 2021-04-20

1. Introduction

Delta, Hudi, and Iceberg are three open-source data lake frameworks that rely on Spark. This article prepares an environment for all three and compares them from the query side with Apache Spark, Hive, and Presto. It is organized into three parts:

  • Prepare a single-node cluster with Hadoop, Spark, Hive, Presto, and all dependencies.

  • Test how Delta, Hudi, and Iceberg behave for updates, deletes, time travel, and schema merging; also inspect the transaction logs and the size differences for the same data volume under default configurations.

  • Query the results with Apache Hive and Presto.

2. Environment Setup

2.1 Single-Node Cluster

The versions used are:

  1. ubuntu-18.04.3-live-server-amd64

  2. openjdk-8-jdk

  3. scala-2.11.12

  4. spark-2.4.4-bin-hadoop2.7

  5. hadoop-2.7.7

  6. apache-hive-2.3.6-bin

  7. presto-server-329.tar.gz

  8. org.apache.iceberg:iceberg-spark-runtime:0.7.0-incubating

  9. org.apache.hudi:hudi-spark-bundle:0.5.0-incubating

  10. io.delta:delta-core_2.11:0.5.0

On Ubuntu I work as the superuser spuser and generate the passwordless SSH key that Hadoop needs for that user:

  ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 0600 ~/.ssh/authorized_keys
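
A quick way to confirm that key-based login actually works before moving on (an optional check, not part of the original steps):

  #BatchMode disables password prompts, so this fails loudly if the key setup is wrong
  ssh -o BatchMode=yes -o StrictHostKeyChecking=no localhost echo "passwordless SSH to localhost works"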

Install Java 1.8 for Spark:

  #1.
  sudo add-apt-repository ppa:openjdk-r/ppa
  sudo apt-get update
  sudo apt-get install openjdk-8-jdk
  sudo update-alternatives --config java
  sudo update-alternatives --config javac

Confirm that the Java version is 1.8:

  #2.
  spuser@acid:~$ java -version
  openjdk version "1.8.0_232"
  OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-0ubuntu1~16.04.1-b09)
  OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

Download all the required packages:

  #3.
  mkdir downloads
  cd downloads/
  wget https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.deb
  wget http://apache.mirror.vu.lt/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
  wget http://apache.mirror.vu.lt/apache/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
  wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz
  wget http://apache.mirror.vu.lt/apache/hive/hive-2.3.6/apache-hive-2.3.6-bin.tar.gz
  wget https://repo1.maven.org/maven2/io/prestosql/presto-cli/329/presto-cli-329-executable.jar
  wget https://repo1.maven.org/maven2/io/prestosql/presto-server/329/presto-server-329.tar.gz

Check the downloads:

  #4.
  spuser@acid:~/downloads$ ll -h

Install Scala:

  #5.
  sudo dpkg -i scala-2.11.12.deb

Install everything under /usr/local and create a symbolic link for each versioned directory so future upgrades are easier:

  #6.
  sudo tar -xzf apache-hive-2.3.6-bin.tar.gz -C /usr/local/
  sudo tar -xzf hadoop-2.7.7.tar.gz -C /usr/local/
  sudo tar -xzf spark-2.4.4-bin-hadoop2.7.tgz -C /usr/local/
  sudo tar -xzf spark-3.0.0-preview2-bin-hadoop2.7.tgz -C /usr/local/
  sudo tar -xzf presto-server-329.tar.gz -C /usr/local
  sudo chown -R spuser /usr/local/apache-hive-2.3.6-bin/
  sudo chown -R spuser /usr/local/hadoop-2.7.7/
  sudo chown -R spuser /usr/local/spark-2.4.4-bin-hadoop2.7/
  sudo chown -R spuser /usr/local/spark-3.0.0-preview2-bin-hadoop2.7/
  sudo chown -R spuser /usr/local/presto-server-329/
  cd /usr/local/
  sudo ln -s /usr/local/apache-hive-2.3.6-bin/ /usr/local/hive
  sudo chown -h spuser:spuser /usr/local/hive
  sudo ln -s /usr/local/hadoop-2.7.7/ /usr/local/hadoop
  sudo chown -h spuser:spuser /usr/local/hadoop
  sudo ln -s /usr/local/spark-2.4.4-bin-hadoop2.7 /usr/local/spark
  sudo chown -h spuser:spuser /usr/local/spark
  sudo ln -s /usr/local/spark-3.0.0-preview2-bin-hadoop2.7 /usr/local/spark3
  sudo chown -h spuser:spuser /usr/local/spark3
  sudo ln -s /usr/local/presto-server-329 /usr/local/presto
  sudo chown -h spuser:spuser /usr/local/presto

Create a few directories for logs and HDFS data. Creating directories directly under the root of the file system is not best practice, but it is fine for a sandbox:

  #7.
  sudo mkdir /logs
  sudo chown -R spuser /logs
  mkdir /logs/hadoop
  #Add dir for data
  sudo mkdir /hadoop
  sudo chown -R spuser /hadoop
  mkdir -p /hadoop/hdfs/namenode
  mkdir -p /hadoop/hdfs/datanode
  #create tmp hadoop dir:
  mkdir -p /tmp/hadoop

Update the environment variables in .bashrc:

  #8.
  sudo nano ~/.bashrc
  #Add entries in existing file:
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  export PATH=$PATH:$JAVA_HOME/bin
  export HADOOP_HOME=/usr/local/hadoop
  export HIVE_HOME=/usr/local/hive
  export PATH=$PATH:$HADOOP_HOME/bin
  export PATH=$PATH:$HADOOP_HOME/sbin
  export PATH=$PATH:$HIVE_HOME/bin
  export HADOOP_MAPRED_HOME=$HADOOP_HOME
  export HADOOP_COMMON_HOME=$HADOOP_HOME
  export HADOOP_HDFS_HOME=$HADOOP_HOME
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  export YARN_HOME=$HADOOP_HOME
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
  export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
  export HADOOP_LOG_DIR=/logs/hadoop
  export SPARK_HOME=/usr/local/spark
  export PATH=$PATH:$SPARK_HOME/bin
  #Save it!
  #Source it:
  source ~/.bashrc
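
To make sure the new variables are picked up, a quick sanity check in the same shell (optional, not part of the original steps; it relies only on the paths exported above):

  #Both should resolve without typing full paths
  echo $HADOOP_HOME $HIVE_HOME $SPARK_HOME
  hadoop version | head -n 1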

2.2 Hadoop Configuration

Update the Hadoop configuration. Switch to the configuration directory:

  #9.
  cd /usr/local/hadoop/etc/hadoop

hadoop-env.sh

  #10.
  #Comment existing JAVA_HOME and add new one:
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

core-site.xml

  #11.
  <configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/tmp/hadoop</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

mapred-site.xml

  #12.
  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>

hdfs-site.xml

  #13.
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/hadoop/hdfs/namenode</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/hadoop/hdfs/datanode</value>
    </property>
  </configuration>

yarn-site.xml

  #14.
  <configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
  </configuration>

With HDFS configured, format the NameNode and start the services:

  #15.
  hdfs namenode -format
  start-all.sh

Check that everything is running:

  #16.
  spuser@acid:/usr/local/hadoop/etc/hadoop$ jps
  9890 DataNode
  10275 ResourceManager
  10115 SecondaryNameNode
  10613 NodeManager
  9705 NameNode
  10732 Jps
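
If jps looks good, the web UIs should respond as well; a quick check against the default Hadoop 2.x ports (optional, not part of the original steps):

  #NameNode UI on 50070, YARN ResourceManager UI on 8088 (Hadoop 2.x defaults)
  wget -q -O /dev/null http://localhost:50070 && echo "NameNode UI OK"
  wget -q -O /dev/null http://localhost:8088 && echo "ResourceManager UI OK"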

2.3 Hive Configuration

Create the HDFS directories for Hive:

  #17.
  #Create HDFS dirs:
  hdfs dfs -mkdir -p /user/hive/warehouse
  hdfs dfs -mkdir /tmp
  hdfs dfs -chmod g+w /user/hive/warehouse
  hdfs dfs -chmod g+w /tmp

Switch to the Hive conf directory:

  #18.
  cd /usr/local/hive/conf

hive-site.xml

  #19.
  <configuration>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=/usr/local/hive/metastore_db;create=true</value>
      <description>
        JDBC connect string for a JDBC metastore.
        To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
        For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
      </description>
    </property>
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value>
      <description>location of default database for the warehouse</description>
    </property>
    <property>
      <name>hive.metastore.uris</name>
      <value/>
      <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.EmbeddedDriver</value>
      <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.PersistenceManagerFactoryClass</name>
      <value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
      <description>class implementing the jdo persistence</description>
    </property>
    <property>
      <name>hive.metastore.schema.verification</name>
      <value>false</value>
      <description/>
    </property>
  </configuration>

hive-env.sh

  #20.
  # The heap size of the jvm stared by hive shell script can be controlled via:
  #
  export HADOOP_HEAPSIZE=512
  #
  # Larger heap size may be required when running queries over large number of files or partitions.
  # By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be
  # appropriate for hive server (hwi etc).
  # Set HADOOP_HOME to point to a specific hadoop install directory
  export HADOOP_HOME=/usr/local/hadoop
  # Hive Configuration Directory can be controlled by:
  export HIVE_CONF_DIR=/usr/local/hive/conf
  # Folder containing extra ibraries required for hive compilation/execution can be controlled by:
  export HIVE_AUX_JARS_PATH=/usr/local/hive/lib/*.jar

Before creating the Hive metastore, update hive-schema-2.3.0.derby.sql; otherwise Iceberg will not be able to create tables and you will see the following error:

  #21.
  ERROR metastore.RetryingHMSHandler: Retrying HMSHandler after 2000 ms (attempt 8 of 10) with error: javax.jdo.JDODataStoreException: Insert of object "org.apache.hadoop.hive.metastore.model.MTable@604201a0" using statement "INSERT INTO TBLS (TBL_ID,OWNER,CREATE_TIME,SD_ID,TBL_NAME,VIEW_EXPANDED_TEXT,LAST_ACCESS_TIME,DB_ID,RETENTION,VIEW_ORIGINAL_TEXT,TBL_TYPE) VALUES (?,?,?,?,?,?,?,?,?,?,?)" failed : Column 'IS_REWRITE_ENABLED' cannot accept a NULL value.

Update hive-schema-2.3.0.derby.sql:

  #22.
  nano /usr/local/hive/scripts/metastore/upgrade/derby/hive-schema-2.3.0.derby.sql
  #update statement: "APP"."TBLS"
  CREATE TABLE "APP"."TBLS" ("TBL_ID" BIGINT NOT NULL, "CREATE_TIME" INTEGER NOT NULL, "DB_ID" BIGINT, "LAST_ACCESS_TIME" INTEGER NOT NULL, "OWNER" VARCHAR(767), "RETENTION" INTEGER NOT NULL, "SD_ID" BIGINT, "TBL_NAME" VARCHAR(256), "TBL_TYPE" VARCHAR(128), "VIEW_EXPANDED_TEXT" LONG VARCHAR, "VIEW_ORIGINAL_TEXT" LONG VARCHAR, "IS_REWRITE_ENABLED" CHAR(1) NOT NULL DEFAULT 'N');

After the update, initialize the Hive metastore:

  #23.
  schematool -initSchema -dbType derby --verbose

Check that the schema was created successfully:

  #24.
  ...
  beeline> Initialization script completed
  schemaTool completed

Verify Hive from the CLI:

  #25.
  hive -e "show databases"

2.4 Presto Configuration

Create the config directory:

  #26.
  mkdir -p /usr/local/presto/etc

Create the configuration file /usr/local/presto/etc/config.properties:

  #27.
  coordinator=true
  node-scheduler.include-coordinator=true
  http-server.http.port=8080
  query.max-memory=5GB
  query.max-memory-per-node=1GB
  query.max-total-memory-per-node=2GB
  discovery-server.enabled=true
  discovery.uri=http://localhost:8080

Create the JVM configuration file /usr/local/presto/etc/jvm.config:

  #28.
  -server
  -Xmx16G
  -XX:+UseG1GC
  -XX:G1HeapRegionSize=32M
  -XX:+UseGCOverheadLimit
  -XX:+ExplicitGCInvokesConcurrent
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:+ExitOnOutOfMemoryError

Create the node configuration file /usr/local/presto/etc/node.properties:

  #29.
  node.environment=production
  node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
  node.data-dir=/var/presto/data

Create the data directories:

  #30.
  sudo mkdir -p /var/presto/data
  sudo chown spuser:spuser -h /var/presto
  sudo chown spuser:spuser -h /var/presto/data

Create the catalog directory and the Hive connector configuration file /usr/local/presto/etc/catalog/hive.properties:

  #31.
  connector.name=hive-hadoop2
  hive.metastore.uri=thrift://localhost:9083
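
The hive.properties file above expects a Hive metastore Thrift service on port 9083, which the embedded-Derby setup does not start by itself, and the presto-cli jar downloaded in step #3 has not been used yet. The following is a minimal sketch of how these pieces are usually started, assuming the paths used above (these commands are not part of the original article; also note the embedded Derby metastore only tolerates one client process at a time, so stop any running Hive CLI first):

  #Start the Hive metastore as a Thrift service (listens on port 9083 by default)
  hive --service metastore &
  #Start the Presto server in the background
  /usr/local/presto/bin/launcher start
  #Make the downloaded CLI executable and connect it to the hive catalog
  cp ~/downloads/presto-cli-329-executable.jar ~/presto
  chmod +x ~/presto
  ~/presto --server localhost:8080 --catalog hive --schema default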

2.5 Spark Configuration

Check the Scala version:

  #32.
  scala -version
  #make sure that you can see something like:
  Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
  #otherwise get back to step #5.

Switch to the Spark conf directory:

  #33.
  cd /usr/local/spark/conf

spark-env.sh

  #34.
  #add
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  export SPARK_CONF_DIR=/usr/local/spark/conf
  export SPARK_LOCAL_IP=127.0.0.1

Copy hive-site.xml so that the Delta, Hudi, and Iceberg behavior can later be tested with Hive and Presto:

  #35.
  cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf/

Fetch all the framework dependencies by launching spark-shell with the required packages:

  #36.
  spark-shell --packages org.apache.iceberg:iceberg-spark-runtime:0.7.0-incubating,org.apache.hudi:hudi-spark-bundle:0.5.0-incubating,io.delta:delta-core_2.11:0.5.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

2.6 Testing the Three Frameworks

Delta

  #37.
  import org.apache.spark.sql.SaveMode._
  spark.range(1000).toDF.write.format("delta").mode(Overwrite).save("/tmp/delta_tab01")
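
To confirm the write worked, the table can be read back in the same spark-shell session; the versionAsOf option is Delta's standard time-travel read (a quick sanity check, not part of the original article):

  // Read the Delta table back and count the rows
  spark.read.format("delta").load("/tmp/delta_tab01").count()
  // Time travel: explicitly read version 0 of the table
  spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta_tab01").show(5)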

Hudi

  #38.
  import org.apache.spark.sql.SaveMode._
  import org.apache.hudi.DataSourceWriteOptions._
  import org.apache.hudi.config.HoodieWriteConfig._
  spark.range(1000).write.format("org.apache.hudi").option(TABLE_NAME, "hudi_tab01").option(PRECOMBINE_FIELD_OPT_KEY, "id").option(RECORDKEY_FIELD_OPT_KEY, "id").mode(Overwrite).save("/tmp/hudi_tab01")
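
Assuming the write above succeeded, the Hudi table can be read back in the same session; the trailing glob is only illustrative and may need adjusting to the partition layout Hudi actually produced (not part of the original article):

  // Read the Hudi table back; the glob covers the partition directories under the base path
  val hudiDF = spark.read.format("org.apache.hudi").load("/tmp/hudi_tab01/*")
  hudiDF.count()
  // Hudi adds metadata columns such as the commit time of each record
  hudiDF.select("_hoodie_commit_time", "id").show(5)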

Iceberg

  #39.
  import org.apache.iceberg.hive.HiveCatalog
  import org.apache.iceberg.catalog._
  import org.apache.iceberg.Schema
  import org.apache.iceberg.types.Types._
  import org.apache.iceberg.PartitionSpec
  import org.apache.iceberg.spark.SparkSchemaUtil
  import org.apache.iceberg.hadoop.HadoopTables
  val name = TableIdentifier.of("default","iceberg_tab01");
  val df1=spark.range(1000).toDF.withColumn("level",lit("1"))
  val df1_schema = SparkSchemaUtil.convert(df1.schema)
  val partition_spec=PartitionSpec.builderFor(df1_schema).identity("level").build
  val tables = new HadoopTables(spark.sessionState.newHadoopConf())
  val table = tables.create(df1_schema, partition_spec, "hdfs:/tmp/iceberg_tab01")
  df1.write.format("iceberg").mode("append").save("hdfs:/tmp/iceberg_tab01")
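
As with the other two formats, a quick read-back in the same spark-shell session confirms the partitioned Iceberg table is queryable (a sanity check, not part of the original article):

  // Read the Iceberg table back from its HDFS location
  val icebergDF = spark.read.format("iceberg").load("hdfs:/tmp/iceberg_tab01")
  icebergDF.count()
  // Filter on the identity-partitioned column
  icebergDF.filter("level = '1'").show(5)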

Check the results on HDFS:

  #40.
  hdfs dfs -ls -h -R /tmp/delta* && hdfs dfs -ls -h -R /tmp/hudi* && hdfs dfs -ls -h -R /tmp/iceberg*

3. Summary

This article walked through setting up the complete environment needed to test the three data lake frameworks and ran a few simple smoke tests. Hopefully you find it useful.


