
Spark introduction and architecture


Page 1: Spark introduction and architecture

Apache Spark


Page 2: Spark introduction and architecture

Agenda

• Introduction
• Features
  – Supports all major Big Data environments
  – General platform for major Big Data tasks
  – Access to diverse data sources
  – Speed
  – Ease of use
• Architecture
• Resilient Distributed Datasets
• In-memory Computing
• Performance


Page 3: Spark introduction and architecture

Spark Introduction

• Apache Spark is an
  – open source,
  – parallel data processing framework,
  – with a master-slave model,
  – that complements Hadoop to make it easy to develop fast, unified Big Data applications

• Cloudera offers commercial support for Spark with Cloudera Enterprise.

• With over 465 contributors, it is the most active Big Data project in the Apache Software Foundation; it was started at UC Berkeley in 2009


Page 4: Spark introduction and architecture

Supports All Major Big Data Environments

Runs everywhere:
• Standalone Cluster Mode
• Hadoop YARN
• Apache Mesos
• Amazon EC2

• Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS).
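As a sketch, the same application can be launched on each of these environments just by changing the `--master` URL passed to `spark-submit` (the host names, class name, and jar path below are placeholders, not values from this deck):

```shell
# Standalone cluster mode (placeholder master URL)
spark-submit --master spark://master-host:7077 --class com.example.App app.jar

# Hadoop YARN (assumes HADOOP_CONF_DIR points at the cluster configuration)
spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar

# Apache Mesos (placeholder master URL)
spark-submit --master mesos://mesos-host:5050 --class com.example.App app.jar

# Local testing on a single machine with 4 worker threads
spark-submit --master local[4] --class com.example.App app.jar
```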


Page 5: Spark introduction and architecture

General platform for all major Big Data tasks

• Common ETL (Sqoop)
• SQL and Analytics (Pig and Hive)
• Real-Time Streaming (Storm)
• Machine Learning (Mahout)
• Graphs (Data Visualization)
• Both interactive and batch mode processing
• Reuse the same code for batch and stream processing, even joining streaming data to historical data


Page 6: Spark introduction and architecture

Access Diverse Data Sources

Read and write anywhere:
• HDFS
• Cassandra
• HBase
• Text files
• RDBMS
• Kafka


Page 7: Spark introduction and architecture

Speed

• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

• Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

(Chart: logistic regression running time, Hadoop vs. Spark)


Page 8: Spark introduction and architecture

Ease of Use

• Write applications in
  – Java
  – Scala
  – Python

• Over 80 high-level operators make it easy to build parallel apps
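The classic word-count example shows how compact these operators make parallel code. The sketch below mimics Spark's `flatMap`/`map`/`reduceByKey` chain in plain Python (no Spark installation assumed), so only the shape of the pipeline is illustrated, not the PySpark API itself:

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words and flatten into one stream
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: pair each word with a count of 1, then sum per word.
# Counter performs the per-key sum that reduceByKey would do across partitions.
counts = Counter(words)

print(counts["to"])  # → 4
```

In actual Spark code, the same pipeline is a one-line chain of `flatMap`, `map`, and `reduceByKey` calls on an RDD.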


Page 9: Spark introduction and architecture

Unified Analytics with Cloudera’s Enterprise Data Hub

Example questions by decision type — Faster Decisions (Interactive), Better Decisions (Batch), Real-Time Action (Streaming and Applications):

Web Security
• Interactive: Why is my website slow?
• Batch: What are the common causes of performance issues?
• Streaming: How can I detect and block malicious attacks in real time?

Retail
• Interactive: What are our top-selling items across channels?
• Batch: What products and services do customers buy together?
• Streaming: How can I deliver relevant promotions to buyers at the point of sale?

Financial Services
• Interactive: Who opened multiple accounts in the past 6 months?
• Batch: What are the leading indicators of fraudulent activity?
• Streaming: How can I protect my customers from identity theft in real time?


Page 10: Spark introduction and architecture

ARCHITECTURE


Page 11: Spark introduction and architecture

(Architecture diagram)

Page 12: Spark introduction and architecture

Resilient Distributed Datasets

• A read-only collection of objects partitioned across a set of machines

• Immutable: a transformation does not modify the RDD it is applied to; it returns a new RDD, and the original remains unchanged.

• Can be rebuilt if a partition is lost.
• Fault tolerant, because lost partitions can be recreated and recomputed.

RDDs support two types of operations:
• Transformation
• Action


Page 13: Spark introduction and architecture

• Transformation: does not return a value; it returns a new RDD. Nothing is evaluated when you call a transformation; it simply takes an RDD and returns a new RDD.
  – Examples: map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.

• Action: evaluates the RDD and returns a value. When an action is called on an RDD, all of the queued-up transformations are computed at that time and the result is returned.
  – Examples: reduce, collect, count, first, take, countByKey, and foreach.
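This lazy-transformation / eager-action split can be mimicked with Python generators: nothing in the chain below runs until the terminal `sum` (playing the role of an action) forces evaluation. This is a plain-Python analogy, not the PySpark API:

```python
log = []

def traced(x):
    # record that this element was actually processed
    log.append(x)
    return x * 2

data = range(5)

# "Transformations": building generators evaluates nothing yet
doubled = (traced(x) for x in data)       # like map
big = (x for x in doubled if x >= 4)      # like filter

assert log == []   # no element touched so far: the chain is lazy

# "Action": sum forces the whole pipeline to execute in one pass
total = sum(big)   # 4 + 6 + 8
print(total)       # → 18
```

Spark behaves the same way: `map` and `filter` only extend the lineage graph; `reduce` or `collect` triggers the actual computation.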


Page 14: Spark introduction and architecture

How In-memory Computing Improves Performance

• Intermediate data is kept in memory rather than spilled to disk between stages
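The effect of keeping an intermediate dataset in memory can be sketched by caching a computed result and reusing it, instead of recomputing (or re-reading from disk) on every pass, as MapReduce-style pipelines must. A minimal plain-Python sketch of `RDD.cache()` semantics (all names here are illustrative):

```python
compute_calls = 0

def expensive_transform(data):
    # stands in for a disk read or a full recomputation of the lineage
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

class CachedDataset:
    """Materializes the result on the first action, then serves it from memory."""
    def __init__(self, source):
        self.source = source
        self._cached = None

    def collect(self):
        if self._cached is None:       # first action: compute and keep in memory
            self._cached = expensive_transform(self.source)
        return self._cached            # later actions: no recomputation

ds = CachedDataset(range(1000))
ds.collect()
ds.collect()
print(compute_calls)  # → 1: the expensive step ran only once
```

This is why iterative workloads (e.g. the logistic regression benchmark earlier) benefit so much: each iteration reuses the in-memory dataset instead of re-reading it from disk.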


Page 15: Spark introduction and architecture

Performance improvement in DAG executions

• New shuffle implementation: the sort-based shuffle uses a single buffer, which removes substantial memory overhead and can support very large workloads in a single stage. Earlier, many concurrent buffers were used.

• The revamped network module in Spark maintains its own pool of memory, thus bypassing JVM’s memory allocator, reducing the impact of garbage collection.

• New external shuffle service ensures that Spark can still serve shuffle files even when the executors are in GC pauses.


Page 16: Spark introduction and architecture

• Timsort, a hybrid of merge sort and insertion sort, replaced the earlier quicksort; it performs better on partially ordered datasets.

• Exploiting cache locality reduced sorting time by a factor of 5.
(Diagram: record layout — 10 b sort_key + 100 b record vs. a compact 10 b sort_key + 4 b location pair)

The Spark cluster was able to sustain 3GB/s/node I/O activity during the map phase, and 1.1 GB/s/node network activity during the reduce phase, saturating the 10Gbps link available on these machines.



Page 17: Spark introduction and architecture

• The memory consumed by RDDs can be tuned to the size of the data.
• Performance can be tuned by caching.
• MapReduce supports only Map and Reduce operations, so everything (join, groupBy, etc.) has to be fitted into the Map and Reduce model, which may not be the most efficient way. Spark supports about 80 other transformations and actions.
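For instance, a join — which MapReduce forces into a map/shuffle/reduce encoding — is a single operator in Spark. The plain-Python sketch below (illustrative names, not the PySpark API) shows the per-key pairing that a pair-RDD `join` performs:

```python
from collections import defaultdict

# Two keyed datasets, as (key, value) pairs — like two pair RDDs
orders = [("alice", "book"), ("bob", "pen"), ("alice", "lamp")]
emails = [("alice", "a@x.com"), ("bob", "b@x.com")]

def join(left, right):
    """Inner join on key: emits (key, (left_val, right_val)) for every match."""
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in by_key[k]]

joined = sorted(join(orders, emails))
print(joined)
# → [('alice', ('book', 'a@x.com')), ('alice', ('lamp', 'a@x.com')),
#    ('bob', ('pen', 'b@x.com'))]
```

In Spark this whole function collapses to `orders.join(emails)`, with the shuffle and per-key grouping handled by the engine.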


Page 19: Spark introduction and architecture

Thank You
