
Cloudera | Migrating MySQL Data to Hadoop

Nephilim 2020-05-17

01

MySQL Side: The Source Table


First, let's look at the source table on the MySQL side that we want to migrate into Hadoop:


    mysql> select database();
    +------------+
    | database() |
    +------------+
    | adamhuan   |
    +------------+
    1 row in set (0.00 sec)




    mysql> show tables;
    +-------------------------------+
    | Tables_in_adamhuan            |
    +-------------------------------+
    | wechat_record_detail_pingluns |
    | wechat_records                |
    | wechat_records_per_day        |
    +-------------------------------+
    3 rows in set (0.00 sec)




    mysql>
    mysql> select * from wechat_records limit 4;
    +-----------+---------------------+-----------------+----------------+----------------+------------+--------------+
    | record_id | record_create_date  | counts_zhengwen | counts_dianzan | counts_pinglun | counts_pic | counts_video |
    +-----------+---------------------+-----------------+----------------+----------------+------------+--------------+
    |         3 | 2017-10-10 19:26:00 |               1 |              1 |              3 |          1 |            0 |
    |         6 | 2017-10-10 19:29:00 |               1 |              0 |              0 |          0 |            0 |
    |         9 | 2017-10-11 10:44:00 |               1 |              0 |              2 |          0 |            0 |
    |        12 | 2017-10-11 11:18:00 |               1 |              0 |              0 |          0 |            0 |
    +-----------+---------------------+-----------------+----------------+----------------+------------+--------------+
    4 rows in set (0.01 sec)




    mysql>
    mysql> select count(*) from wechat_records;
    +----------+
    | count(*) |
    +----------+
    |     5925 |
    +----------+
    1 row in set (0.04 sec)




    mysql>
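
For reference, the table's structure can be reconstructed approximately from the rows above (a sketch only: the column types and the primary key on record_id are assumptions, though the import log later confirms record_id is the column Sqoop splits on):

    mysql> CREATE TABLE wechat_records (
        ->   record_id          INT PRIMARY KEY,  -- assumed; the import log shows Sqoop splitting on this column
        ->   record_create_date DATETIME,
        ->   counts_zhengwen    INT,
        ->   counts_dianzan     INT,
        ->   counts_pinglun     INT,
        ->   counts_pic         INT,
        ->   counts_video       INT
        -> );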



02

Cloudera: Adding the 【Sqoop 1 Client】 Service


Sqoop, as the name suggests, stands for SQL-to-Hadoop: SQL [SQ] + Hadoop [OOP].


This is done from the Cloudera Manager web portal:

The 【Sqoop 1 Client】 service now shows up, so it was added successfully.
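
With the gateway deployed, a quick sanity check from that node's shell confirms the client is usable (a minimal check; `sqoop version` should report 1.4.7-cdh6.3.1, matching the parcel seen in the import log below):

    [root@cloudera5 ~]# which sqoop
    [root@cloudera5 ~]# sqoop version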


03

Sqoop: MySQL Connector Support


In the step above, I deployed 【Sqoop】 on the 【node 5】 server.

This step should therefore be performed on 【node 5】:


    [root@cloudera5 ~]# find / -name "mysql-connector-java.jar"
    /usr/share/java/mysql-connector-java.jar
    [root@cloudera5 ~]#
    [root@cloudera5 ~]# ls -ltr /var/lib/sqoop/
    total 0
    [root@cloudera5 ~]#
    [root@cloudera5 ~]# ln -s /usr/share/java/mysql-connector-java.jar /var/lib/sqoop/
    [root@cloudera5 ~]#
    [root@cloudera5 ~]# ls -ltr /var/lib/sqoop/
    total 0
    lrwxrwxrwx 1 root root 40 May 17 12:57 mysql-connector-java.jar -> /usr/share/java/mysql-connector-java.jar
    [root@cloudera5 ~]#
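
Before attempting the full import, the connector wiring can be verified with a lightweight Sqoop call (a sketch using the same host and credentials as the import below; `-P` prompts for the password rather than exposing it on the command line):

    [root@cloudera5 ~]# sqoop list-databases \
        --connect jdbc:mysql://192.168.72.21:3306 \
        --username root -P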


04

HDFS: Creating a Directory for the MySQL Data


Create the directory on HDFS that will hold the data migrated from MySQL:

    [root@cloudera5 ~]# export HADOOP_USER_NAME=hdfs
    [root@cloudera5 ~]#
    [root@cloudera5 ~]# hdfs dfs -ls /
    Found 2 items
    drwxrwxrwt   - hdfs supergroup          0 2020-04-09 20:36 /tmp
    drwxr-xr-x   - hdfs supergroup          0 2020-04-10 17:20 /user
    [root@cloudera5 ~]#
    [root@cloudera5 ~]# hdfs dfs -mkdir /mysql_data
    [root@cloudera5 ~]#
    [root@cloudera5 ~]# hdfs dfs -ls /
    Found 3 items
    drwxr-xr-x   - hdfs supergroup          0 2020-05-17 13:03 /mysql_data
    drwxrwxrwt   - hdfs supergroup          0 2020-04-09 20:36 /tmp
    drwxr-xr-x   - hdfs supergroup          0 2020-04-10 17:20 /user
    [root@cloudera5 ~]#
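
The `export HADOOP_USER_NAME=hdfs` line matters: on a cluster without Kerberos it makes the shell act as the hdfs superuser, which is what allows creating a directory at the root of HDFS. An equivalent approach (a sketch; it assumes the local hdfs OS account exists, as it does on CDH nodes) is to run the command as that user directly:

    [root@cloudera5 ~]# sudo -u hdfs hdfs dfs -mkdir /mysql_data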


05

Sqoop: Importing the MySQL Data


Again on 【node 5】:

          [root@cloudera5 ~]# sqoop import --connect jdbc:mysql://192.168.72.21:3306/adamhuan --table wechat_records --username root --password your_password --target-dir /mysql_data/adamhuan
          Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
          Please set $ACCUMULO_HOME to the root of your Accumulo installation.
          SLF4J: Class path contains multiple SLF4J bindings.
          SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
          SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
          SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
          SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
          20/05/17 13:09:33 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
          20/05/17 13:09:33 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
          20/05/17 13:09:33 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
          20/05/17 13:09:33 INFO tool.CodeGenTool: Beginning code generation
          20/05/17 13:09:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `wechat_records` AS t LIMIT 1
          20/05/17 13:09:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `wechat_records` AS t LIMIT 1
          20/05/17 13:09:33 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
          20/05/17 13:09:36 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/40299f03a1a7f2c66c1b0882a8c35b33/wechat_records.java to /root/./wechat_records.java. Error: Destination '/root/./wechat_records.java' already exists
          20/05/17 13:09:36 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/40299f03a1a7f2c66c1b0882a8c35b33/wechat_records.jar
          20/05/17 13:09:36 WARN manager.MySQLManager: It looks like you are importing from mysql.
          20/05/17 13:09:36 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
          20/05/17 13:09:36 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
          20/05/17 13:09:36 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
          20/05/17 13:09:36 INFO mapreduce.ImportJobBase: Beginning import of wechat_records
          20/05/17 13:09:36 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
          20/05/17 13:09:38 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
          20/05/17 13:09:38 INFO client.RMProxy: Connecting to ResourceManager at cloudera2.adamhuan.com/192.168.72.12:8032
          20/05/17 13:09:39 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hdfs/.staging/job_1587192075886_0001
          20/05/17 13:09:54 INFO db.DBInputFormat: Using read commited transaction isolation
          20/05/17 13:09:54 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`record_id`), MAX(`record_id`) FROM `wechat_records`
          20/05/17 13:09:54 INFO db.IntegerSplitter: Split size: 4443; Num splits: 4 from: 3 to: 17775
          20/05/17 13:09:54 INFO mapreduce.JobSubmitter: number of splits:4
          20/05/17 13:09:55 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
          20/05/17 13:09:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1587192075886_0001
          20/05/17 13:09:56 INFO mapreduce.JobSubmitter: Executing with tokens: []
          20/05/17 13:09:56 INFO conf.Configuration: resource-types.xml not found
          20/05/17 13:09:56 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
          20/05/17 13:09:57 INFO impl.YarnClientImpl: Submitted application application_1587192075886_0001
          20/05/17 13:09:57 INFO mapreduce.Job: The url to track the job: http://cloudera2.adamhuan.com:8088/proxy/application_1587192075886_0001/
          20/05/17 13:09:57 INFO mapreduce.Job: Running job: job_1587192075886_0001


          20/05/17 13:11:20 INFO mapreduce.Job: Job job_1587192075886_0001 running in uber mode : false
          20/05/17 13:11:20 INFO mapreduce.Job: map 0% reduce 0%
          20/05/17 13:11:55 INFO mapreduce.Job: map 25% reduce 0%
          20/05/17 13:13:19 INFO mapreduce.Job: map 50% reduce 0%
          20/05/17 13:13:22 INFO mapreduce.Job: map 75% reduce 0%
          20/05/17 13:13:23 INFO mapreduce.Job: map 100% reduce 0%
          20/05/17 13:13:25 INFO mapreduce.Job: Job job_1587192075886_0001 completed successfully
          20/05/17 13:13:25 INFO mapreduce.Job: Counters: 33
            File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=974140
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=465
                HDFS: Number of bytes written=221551
                HDFS: Number of read operations=24
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=8
                HDFS: Number of bytes read erasure-coded=0
            Job Counters
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=376413
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=376413
                Total vcore-milliseconds taken by all map tasks=376413
                Total megabyte-milliseconds taken by all map tasks=385446912
            Map-Reduce Framework
                Map input records=5925
                Map output records=5925
                Input split bytes=465
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=24640
                CPU time spent (ms)=51840
                Physical memory (bytes) snapshot=1084702720
                Virtual memory (bytes) snapshot=10365595648
                Total committed heap usage (bytes)=897581056
                Peak Map Physical memory (bytes)=274231296
                Peak Map Virtual memory (bytes)=2595545088
            File Input Format Counters
                Bytes Read=0
            File Output Format Counters
                Bytes Written=221551
          20/05/17 13:13:25 INFO mapreduce.ImportJobBase: Transferred 216.3584 KB in 227.3224 seconds (974.6114 bytes/sec)
          20/05/17 13:13:25 INFO mapreduce.ImportJobBase: Retrieved 5925 records.
          [root@cloudera5 ~]#


This step takes a while to complete. Note that the ERROR from orm.CompilationManager above is harmless: it only means a wechat_records.java generated by a previous run already exists under /root, and it does not affect the import.


While the job is running, you can follow it in the 【YARN】 applications page:

As shown above, the Sqoop import completed successfully.
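
The log also prints two warnings worth acting on: putting `--password` on the command line is insecure, and `--direct` can use a MySQL-specific fast path. A hardened variant of the same import might look like this (a sketch; `--direct` additionally requires the mysqldump binary on every worker node, and `-m 4` just makes the default parallelism explicit):

    [root@cloudera5 ~]# sqoop import \
        --connect jdbc:mysql://192.168.72.21:3306/adamhuan \
        --table wechat_records \
        --username root -P \
        --direct \
        -m 4 \
        --target-dir /mysql_data/adamhuan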


06

Checking HDFS After the Sqoop Import


At this point, the state of HDFS:

    [root@cloudera5 ~]# hdfs dfs -ls /mysql_data
    Found 1 items
    drwxr-xr-x   - hdfs supergroup          0 2020-05-17 13:13 /mysql_data/adamhuan
    [root@cloudera5 ~]#
    [root@cloudera5 ~]# hdfs dfs -ls /mysql_data/adamhuan
    Found 5 items
    -rw-r--r--   3 hdfs supergroup          0 2020-05-17 13:13 /mysql_data/adamhuan/_SUCCESS
    -rw-r--r--   3 hdfs supergroup      54448 2020-05-17 13:13 /mysql_data/adamhuan/part-m-00000
    -rw-r--r--   3 hdfs supergroup      54815 2020-05-17 13:13 /mysql_data/adamhuan/part-m-00001
    -rw-r--r--   3 hdfs supergroup      55935 2020-05-17 13:13 /mysql_data/adamhuan/part-m-00002
    -rw-r--r--   3 hdfs supergroup      56353 2020-05-17 13:11 /mysql_data/adamhuan/part-m-00003
    [root@cloudera5 ~]#
    [root@cloudera5 ~]# hdfs dfs -cat /mysql_data/adamhuan/part-m-00000 | head -n 8
    3,2017-10-10 19:26:00.0,1,1,3,1,0
    6,2017-10-10 19:29:00.0,1,0,0,0,0
    9,2017-10-11 10:44:00.0,1,0,2,0,0
    12,2017-10-11 11:18:00.0,1,0,0,0,0
    15,2017-10-11 11:51:00.0,1,0,0,0,0
    18,2017-10-11 11:55:00.0,1,1,0,1,0
    21,2017-10-11 12:10:00.0,1,0,12,1,0
    24,2017-10-11 13:03:00.0,1,1,1,0,0
    [root@cloudera5 ~]#
    [root@cloudera5 ~]# hdfs dfs -cat /mysql_data/adamhuan/part-m-00000 | wc -l
    1481
    [root@cloudera5 ~]#
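
Each part file corresponds to one of the four map tasks. A quick cross-check against the source (the total should be 5925, matching both the count(*) from MySQL at the start and the "Retrieved 5925 records" line in the import log):

    [root@cloudera5 ~]# hdfs dfs -cat /mysql_data/adamhuan/part-m-* | wc -l
    5925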


07

Wrapping Up


With that, a simple import from MySQL into Hadoop is complete.

