1. Background of Spark
Limitations of MapReduce:
Verbose code
Only supports the map and reduce operations
Low execution efficiency
Unsuitable for iterative, interactive, or streaming workloads
Using a different framework for each workload drives up learning and operations costs:
Batch (offline) processing: MapReduce, Hive, Pig
Stream (real-time) processing: Storm, JStorm
Interactive computing: Impala
Spark addresses the high learning and operations costs of juggling multiple frameworks, as sketched below.
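A minimal sketch of the "one engine" idea, assuming a local PySpark environment; the tiny in-memory table and the built-in rate test source are only placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("unified-demo").getOrCreate()

# Batch / interactive: run SQL over a small in-memory DataFrame.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"]).createOrReplaceTempView("t")
spark.sql("SELECT count(*) AS n FROM t").show()

# Streaming: the built-in "rate" source generates rows continuously for testing.
stream = spark.readStream.format("rate").load()
query = stream.writeStream.format("console").start()
query.awaitTermination(10)  # let the micro-batch stream run for about 10 seconds
query.stop()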
2. Spark Overview and Features
Official website: http://spark.apache.org
Apache Spark™ is a unified analytics engine for large-scale data processing.
A lightning-fast distributed computing framework, with the following characteristics (as described on the official site):
Speed
Run workloads 100x faster. Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Ease of Use
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.
DataFrame example from the official site, run interactively in the PySpark shell:
df = spark.read.json("logs.json")
df.where("age > 21") .select("name.first").show()
Generality
Combine SQL, streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Runs Everywhere
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
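A minimal sketch of what "runs everywhere" looks like in code: the same application targets a different cluster manager just by changing the master (normally passed via spark-submit --master), and data sources are addressed by URI. The HDFS namenode address and path below are assumptions:
from pyspark.sql import SparkSession

# "local[*]" runs on all local cores; on a cluster this would be e.g. "yarn",
# "spark://host:7077" (standalone), or a k8s:// master instead.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("runs-everywhere")
         .getOrCreate())

# Storage is addressed by URI; an HDFS path is shown as an assumed example.
df = spark.read.text("hdfs://namenode:8020/logs/app.log")
print(df.count())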

3. Spark History

4. Spark Survey

5. Spark vs. Hadoop
The Hadoop ecosystem:

The Spark ecosystem:

Spark compared with Hadoop:

Hadoop compared with Spark:

MapReduce compared with Spark:

6. How Spark and Hadoop Work Together
Strengths of Hadoop

Strengths of Spark

Hadoop + Spark

7. Spark Development Languages

8. Spark Run Modes