CHAPTER 01: INTRODUCTION TO DATA ANALYSIS WITH SPARK
Learning Spark by Holden Karau et al.

Learning spark ch01 - Introduction to Data Analysis with Spark




Overview: Introduction to Data Analysis with Spark

- What Is Apache Spark?
- A Unified Stack
  - Spark Core
  - Spark SQL
  - Spark Streaming
  - MLlib
  - GraphX
  - Cluster Managers
- Who Uses Spark, and for What?
  - Data Science Tasks
  - Data Processing Applications
- A Brief History of Spark
- Spark Versions and Releases
- Storage Layers for Spark


1.1 What Is Apache Spark?

- Apache Spark is a cluster computing platform
- Spark extends the MapReduce model to support different types of computation
  - Batch applications, iterative algorithms, interactive queries, and streaming
- Runs computations in memory
- Highly accessible
  - Simple APIs in Python, Java, Scala, and SQL
  - Rich built-in libraries
  - Access to Hadoop clusters and data sources
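To make the "MapReduce model" that Spark extends more concrete, the following is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It is plain Python, not Spark code, and every function name here is hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["spark extends mapreduce", "spark runs in memory"])
```

In a real cluster each phase runs in parallel across machines. The sketch only shows the data flow that batch, iterative, interactive, and streaming workloads all build on.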


edX and Coursera Courses

- Introduction to Big Data with Apache Spark
- Spark Fundamentals I
- Functional Programming Principles in Scala


1.2 A Unified Stack


1.2.1 A Unified Stack: Core, SQL, Streaming

Spark Core
- Task scheduling
- Memory management
- Fault recovery
- Storage system interaction
- API that defines Resilient Distributed Datasets (RDDs)

Spark SQL
- Provides a SQL interface to Spark
- Allows programmatic data manipulation to be mixed with SQL

Spark Streaming
- Enables processing of live streams of data, e.g. web logs
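The fault recovery and RDD items above can be illustrated with a toy model. The sketch below is plain Python, not the Spark API: `ToyRDD`, `lose_cache`, and `read_logs` are hypothetical names, and the class runs on a single machine. It assumes only the general idea that a dataset remembers the transformations that produced it, so a lost in-memory copy can be recomputed from the source.

```python
class ToyRDD:
    """Toy stand-in for an RDD: stores how to compute data, not the data."""

    def __init__(self, source, transforms=()):
        self._source = source          # callable that re-reads the base data
        self._transforms = transforms  # lineage: ordered (kind, function) steps
        self._cache = None             # optional in-memory copy of the result

    def map(self, fn):
        # Record the step only; nothing is computed yet.
        return ToyRDD(self._source, self._transforms + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._transforms + (("filter", fn),))

    def cache(self):
        # Simplification: this toy caches eagerly, whereas Spark's cache() is lazy.
        self._cache = self.collect()
        return self

    def lose_cache(self):
        # Simulate losing the in-memory copy, e.g. after a node failure.
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        # No cached copy: replay the lineage from the source.
        data = iter(self._source())
        for kind, fn in self._transforms:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


reads = []  # counts how often the base data is re-read


def read_logs():
    reads.append(1)
    return ["GET /a 200", "GET /b 500", "GET /c 500"]


errors = (ToyRDD(read_logs)
          .filter(lambda line: line.endswith("500"))
          .map(lambda line: line.split()[1]))
reads_before_action = len(reads)   # transformations alone read nothing
errors.cache()
cached_result = errors.collect()   # served from memory
reads_with_cache = len(reads)
errors.lose_cache()
recovered_result = errors.collect()  # recomputed from the lineage
reads_after_recovery = len(reads)
```

The design point is that recovery needs no replicated copy of the data: the recorded steps are enough to rebuild the lost result, at the cost of one more read of the source.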


1.2.2 A Unified Stack: MLlib, GraphX, Cluster Managers

MLlib
- Contains common machine learning (ML) modules
  - Classification, regression, clustering, collaborative filtering
  - Model evaluation, data import, lower-level ML primitives

GraphX
- Extends the Spark RDD API, just like Spark SQL and Spark Streaming
- Contains graph algorithms

Cluster Managers
- Hadoop YARN, Apache Mesos
- Default: Standalone Scheduler


1.3 Who Uses Spark, and for What?

- General-purpose framework for cluster computing
- Two main groups of users: data scientists and engineers

Data Scientists
- Analyze and model data
- SQL, statistics, predictive modeling (ML) using Python, R
- Use cases
  - Interactive shells with Python, Scala, and Spark SQL
  - Machine learning supported through the MLlib libraries
  - Calling out to external programs in Matlab or R

Engineers
- Build data processing applications
- Apply principles of software engineering (encapsulation, OOP, interface design)


1.4 A Brief History of Spark

- 2009: Spark started in the UC Berkeley RAD Lab, which later became the AMPLab
  - Started from the observation that Hadoop MapReduce was inefficient for interactive computing jobs
  - Designed for fast interactive queries and iterative algorithms
    - In-memory storage
    - Efficient fault recovery
  - 10-20x faster than MapReduce for certain jobs
- Early adopters
  - Spark PoweredBy page
  - Spark Meetups
  - Spark Summit
- 2011: Berkeley Data Analytics Stack (BDAS)


1.5 Spark Versions and Releases

- May 2014: Spark 1.0.0
- April 2015: Spark 1.3.1
- Spark Documentation


1.6 Storage Layers for Spark

- Spark can create distributed datasets from files stored in HDFS
- Other storage systems supported by the Hadoop APIs also work
  - Local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.
- Supported file formats
  - Text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat


Learn More about Apache Spark