Learning Spark ch01 - Introduction to Data Analysis with Spark

CHAPTER 01: INTRODUCTION TO DATA ANALYSIS WITH SPARK

Learning Spark by Holden Karau et al.

Overview: Introduction to Data Analysis with Spark

- What Is Apache Spark?
- A Unified Stack
  - Spark Core
  - Spark SQL
  - Spark Streaming
  - MLlib
  - GraphX
  - Cluster Managers
- Who Uses Spark, and for What?
  - Data Science Tasks
  - Data Processing Applications
- A Brief History of Spark
- Spark Versions and Releases
- Storage Layers for Spark

1.1 What Is Apache Spark?

- Apache Spark is a cluster computing platform
- Spark extends the MapReduce model to support different kinds of computations
  - Batch applications, iterative algorithms, interactive queries, and streaming
- Runs computations in memory
- Highly accessible
  - Simple APIs in Python, Java, Scala, and SQL
  - Rich built-in libraries
  - Access to Hadoop clusters and data sources
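To make the MapReduce model that Spark extends more concrete, here is a minimal word-count sketch in plain Python. It does not use Spark's API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are hypothetical, chosen only to illustrate the map, group-by-key, and reduce steps on data held in memory.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input, kept in memory so later passes could reuse it
lines = ["spark extends mapreduce", "spark runs in memory"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster framework each phase would run in parallel across machines; the sketch only shows the shape of the computation.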

edX and Coursera Courses

- Introduction to Big Data with Apache Spark
- Spark Fundamentals I
- Functional Programming Principles in Scala

http://ouo.io/Mqc8L5

1.2 A Unified Stack

1.2.1 A Unified Stack: Core, SQL, Streaming

- Spark Core
  - Task scheduling
  - Memory management
  - Fault recovery
  - Storage system interaction
  - API that defines Resilient Distributed Datasets (RDDs)
- Spark SQL
  - Provides a SQL interface to Spark
  - Allows programmatic data manipulation to be mixed with SQL
- Spark Streaming
  - Enables processing of live streams of data, e.g. web logs
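The RDD and fault-recovery items under Spark Core can be illustrated with a toy model in plain Python. This is not Spark's implementation or API: `ToyRDD` and its methods are hypothetical names, and the sketch assumes only the general idea that a dataset is split into partitions and remembers the transformations used to build it, so that a lost partition can be recomputed from its source.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: source partitions plus a lineage."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions  # list of lists: the base data
        self._lineage = tuple(lineage)    # per-partition steps, in order

    def map(self, fn):
        # Lazy: record the step, compute nothing yet
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._lineage + (step,))

    def filter(self, pred):
        # Lazy: record the step, compute nothing yet
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._lineage + (step,))

    def compute_partition(self, index):
        """Build (or rebuild) one partition by replaying the lineage."""
        part = self._source[index]
        for step in self._lineage:
            part = step(part)
        return part

    def collect(self):
        # Action: triggers computation of every partition
        return [x
                for i in range(len(self._source))
                for x in self.compute_partition(i)]


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

result = even_squares.collect()
# Simulate recovering only the second partition after a failure
recovered = even_squares.compute_partition(1)
```

Only the affected partition is recomputed in the last line, which is the intuition behind recovering lost data without keeping full replicas of it.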

1.2.2 A Unified Stack: MLlib, GraphX, Cluster Managers

- MLlib
  - Contains common machine learning (ML) modules
  - Classification, regression, clustering, collaborative filtering
  - Model evaluation, data import, lower-level ML primitives
- GraphX
  - Extends the Spark RDD API, just like Spark SQL and Spark Streaming
  - Contains graph algorithms
- Cluster Managers
  - Hadoop YARN, Apache Mesos
  - Default: Standalone Scheduler

1.3 Who Uses Spark, and for What?

- General-purpose framework for cluster computing
  - Data scientists
  - Engineers
- Data Scientists
  - Analyze and model data
  - SQL, statistics, predictive modeling (ML) using Python, R
  - Use cases:
    - Interactive shells with Python, Scala, Spark SQL
    - MLlib libraries for machine learning
    - Calling out to external programs in MATLAB or R
- Engineers
  - Data processing applications
  - Principles of software engineering (encapsulation, OOP, interface design)

1.4 A Brief History of Spark

- 2009: Started as a research project in the UC Berkeley RAD Lab, which later became the AMPLab
  - Motivation: Hadoop MapReduce was inefficient for interactive computing jobs
  - Designed for fast interactive queries and iterative algorithms
    - In-memory storage
    - Efficient fault recovery
  - 10-20x faster than MapReduce for certain jobs
- Early adopters and community
  - Spark PoweredBy page
  - Spark Meetups
  - Spark Summit
- 2011: Berkeley Data Analytics Stack (BDAS)

1.5 Spark Versions and Releases

- May 2014: Spark 1.0.0
- April 2015: Spark 1.3.1
  https://spark.apache.org/releases/spark-release-1-3-1.html
- Spark Documentation
  https://spark.apache.org/documentation.html

1.6 Storage Layers for Spark

- Spark can create distributed datasets from any file stored in HDFS or in other storage systems supported by the Hadoop APIs
  - Local filesystem
  - Amazon S3
  - Cassandra
  - Hive
  - HBase
  - etc.
- Supported file formats
  - Text files
  - SequenceFiles
  - Avro
  - Parquet
  - Any other Hadoop InputFormat
