
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin



Alex Zeltov
Solutions Engineer
@azeltov

http://tiny.cc/sparkmeetup



In this workshop
• Introduction to HDP and Spark
• Spark Programming: Scala, Python, R
  – Core Spark: working with RDDs
  – Spark SQL: structured data access
  – Spark MLlib: predictive analytics
  – Spark Streaming: real-time data processing
• Conclusion and Further Reading, Q&A


Apache Spark Background


What is Spark?

Apache Open Source Project - originally developed at AMPLab (University of California Berkeley)

Data Processing Engine - focused on in-memory distributed computing use-cases

API - Scala, Python, Java and R


Spark Ecosystem

[Diagram: Spark Core at the base, with Spark SQL, Spark Streaming, MLlib, and GraphX built on top]


Why Spark?

Elegant Developer APIs
– Single environment for data munging and Machine Learning (ML)

In-memory computation model – Fast!
– Effective for iterative computations and ML

Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)


Generality

• Combine SQL, streaming, and complex analytics.
• Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Runs Everywhere:

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3, and WASB.


Emerging Spark Patterns

Spark as query federation engine
• Bring data from multiple sources to join/query in Spark (see the sketch after this list)

Use multiple Spark libraries together
• Common to see Core, ML & SQL used together

Use Spark with various Hadoop ecosystem projects
• Spark & Hive together
• Spark & HBase together
• Spark & Solr, etc.
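As a minimal sketch of the federation pattern, assuming a Hive table named orders already exists in the metastore and a hypothetical JSON file of customers:

// Join a Hive table with a JSON file in a single Spark SQL query
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

val customers = hc.read.json("/data/customers.json") // hypothetical path
customers.registerTempTable("customers")

// "orders" is assumed to already exist in the Hive metastore
hc.sql("SELECT c.name, SUM(o.amount) AS total " +
       "FROM orders o JOIN customers c ON o.customer_id = c.id " +
       "GROUP BY c.name").show()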


More Data Sources APIs


What is Hadoop?

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS); a processing part (MapReduce); and the YARN ResourceManager, which allocates resources and schedules applications.


Access patterns enabled by YARN

[Diagram: YARN as the Data Operating System over HDFS (Hadoop Distributed File System), hosting Batch, Interactive, and Real-Time applications across nodes 1..N]

• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time


Why Spark on YARN?

• Utilize existing HDP cluster infrastructure
• Resource management
  – share Spark workloads with other workloads like Pig, Hive, etc.
• Scheduling and queues

[Diagram: the client launches a Spark Application Master in a YARN container; the Spark Driver coordinates multiple Spark Executors, each in its own YARN container and each running tasks in parallel]


Why HDFS?

Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation

[Diagram: a logical file is divided into blocks 1-4; each block is stored as 3 copies spread across the cluster's nodes]


Spark is certified as YARN Ready and is a part of HDP.

Hortonworks Data Platform 2.4

[Diagram: Hortonworks Data Platform 2.4. YARN: Data Operating System (cluster resource management) over HDFS (Hadoop Distributed File System). Batch, interactive & real-time data access engines: MapReduce, Apache Hive, Apache Pig, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, and ISV engines. Governance: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka, Apache Atlas. Operations: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak. Security: Apache Ranger, Apache Knox, Apache Atlas, HDFS encryption. Deployment choice: Linux, Windows, on-premises, cloud.]


Hortonworks Commitment to Spark

Hortonworks is focused on making Apache Spark enterprise ready so you can depend on it for mission critical applications

[Diagram: YARN: Data Operating System hosting engines side by side for batch, interactive & real-time data access: Script (Pig), Search (Solr), SQL (Hive/HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), other ISVs, Tez, and in-memory (Spark), with security, governance & integration, and operations frameworks spanning all of them]

1. YARN enables Spark to co-exist with other engines
Spark is "YARN Ready," so its memory- and CPU-intensive apps can work with predictable performance alongside other engines, all on the same set(s) of data.

2. Extend Spark with enterprise capabilities
Ensure Spark can be managed, secured, and governed via a single set of frameworks to ensure consistency. Ensure reliability and quality of service of Spark alongside other engines.

3. Actively collaborate within the open community
As with everything we do at Hortonworks, we work entirely within the open community across Spark and all related projects to improve this key Hadoop technology.


Interacting with Spark


Interacting with Spark

• Spark’s interactive REPL shell (in Python or Scala)
• Web-based Notebooks:
  • Zeppelin: a web-based notebook that enables interactive data analytics
  • Jupyter: evolved from the IPython project
  • SparkNotebook: forked from the scala-notebook
  • RStudio: for SparkR; Zeppelin support coming soon

https://community.hortonworks.com/articles/25558/running-sparkr-in-rstudio-using-hdp-24.html


Apache Zeppelin
• A web-based notebook that enables interactive data analytics
• Multiple language backends
• Multi-purpose notebook, the place for all your needs:
  – Data Ingestion
  – Data Discovery
  – Data Analytics
  – Data Visualization
  – Collaboration


Zeppelin – Multiple language backends: Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell.
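For illustration, each Zeppelin paragraph begins with an interpreter directive; a hedged sketch of mixing backends in one note (the file path and table name are hypothetical):

%md ## Exploring people data

%spark
val people = sqlContext.read.json("/tmp/people.json") // hypothetical path
people.registerTempTable("people")

%sql
SELECT age, count(1) AS cnt FROM people GROUP BY age

%sh
hdfs dfs -ls /tmp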


Zeppelin – Dependency Management

• Load libraries recursively from a Maven repository
• Load libraries from the local filesystem

%dep

// add a Maven repository
z.addRepo("RepoName").url("RepoURL")

// add an artifact from the local filesystem
z.load("/path/to.jar")

// add an artifact from a Maven repository, with no transitive dependencies
z.load("groupId:artifactId:version").excludeAll()
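Note that %dep must be evaluated before the Spark interpreter starts; if the interpreter is already running, restart it first. As a hypothetical example, pulling the spark-csv package from Maven:

%dep
z.load("com.databricks:spark-csv_2.10:1.4.0")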


Community Plugins

• 100+ connectors

http://spark-packages.org/


Apache Spark Basics


How Does Spark Work?

• RDD
  • Your data is loaded in parallel into structured collections
• Actions
  • Manipulate the state of the working model by forming new RDDs and performing calculations upon them
• Persistence
  • Long-term storage of an RDD’s state


RDD - Resilient Distributed Dataset

Primary abstraction in Spark
– An immutable collection of objects (or records, or elements) that can be operated on in parallel

Distributed
– Collection of elements partitioned across nodes in a cluster
– Each RDD is composed of one or more partitions
– User can control the number of partitions
– More partitions => more parallelism

Resilient
– Recover from node failures
– An RDD keeps its lineage information -> it can be recreated from parent RDDs

May be persisted in memory for efficient reuse across parallel operations (caching)
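A minimal shell sketch of partition control (the numbers are illustrative):

val rdd = sc.parallelize(1 to 1000, 8) // explicitly request 8 partitions
rdd.partitions.size                    // => 8
val wider = rdd.repartition(16)        // reshuffle into 16 partitions for more parallelism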


RDD – Resilient Distributed Dataset

[Diagram: an RDD of 25 items split into 5 partitions, each partition held by a Spark executor on a worker node]

More partitions = more parallelism. The programmer specifies the number of partitions for an RDD (a default value is used if unspecified).


RDDs
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is executed when an action runs on it
• Persist (cache) RDDs in memory or disk
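A small sketch of that laziness in the shell:

val nums = sc.parallelize(1 to 10)   // nothing computed yet
val squares = nums.map(n => n * n)   // transformation: still lazy
squares.cache()                      // mark for in-memory reuse
val total = squares.reduce(_ + _)    // action: triggers the actual computation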


Example RDD Transformations

map(func) filter(func) distinct()

• All create a new dataset from an existing one
• Do not create the dataset until an action is performed (lazy)
• Each element in an RDD is passed to the target function and the result forms a new RDD
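For illustration, each of the three on a small RDD:

val words = sc.parallelize(Seq("spark", "hive", "spark", "hbase"))
val upper = words.map(_.toUpperCase)         // SPARK, HIVE, SPARK, HBASE
val hWords = words.filter(_.startsWith("h")) // hive, hbase
val unique = words.distinct()                // spark, hive, hbase (order not guaranteed)
unique.collect()                             // an action forces the evaluation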


Example Action Operations

count() reduce(func) collect() take()

• Either:
  • Returns a value to the driver program, or
  • Exports state to an external system
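A quick sketch of each action on a small RDD:

val nums = sc.parallelize(1 to 100)
nums.count()        // 100 -> returned to the driver
nums.reduce(_ + _)  // 5050
nums.take(5)        // Array(1, 2, 3, 4, 5)
nums.collect()      // the entire dataset -> use with care on large RDDs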


Example Persistence Operations

persist() -- takes options
cache() -- only one option: in-memory

• Stores RDD values:
  • in memory (what doesn’t fit is recalculated when necessary)
    • replication is an option for in-memory
  • to disk
  • blended
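A sketch of the two calls, assuming Spark’s standard StorageLevel options (the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("/tmp/logs")        // hypothetical path
logs.persist(StorageLevel.MEMORY_AND_DISK) // blended: spill to disk what doesn't fit in memory
// other levels: MEMORY_ONLY (what cache() uses), MEMORY_ONLY_2 (replicated), DISK_ONLY
logs.count()                               // the first action materializes the cache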


Spark Applications

Are a definition in code of:
• RDD creation
• Actions
• Persistence

Results in the creation of a DAG (Directed Acyclic Graph) [workflow]:
• Each DAG is compiled into stages
• Each stage is executed as a series of tasks
• Each task operates in parallel on assigned partitions
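For intuition, a word-count sketch showing where the stage boundary falls (the input path is hypothetical):

val counts = sc.textFile("/tmp/input.txt") // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))  // narrow transformations: all in one stage
  .reduceByKey(_ + _)      // shuffle boundary: compiled into a second stage
counts.collect()           // action: builds the DAG and submits the job as tasks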


Spark Context

What is it?
• Main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code


Spark Context

• A Spark program first creates a SparkContext object

• Tells Spark how and where to access a cluster

• Use SparkContext to create RDDs

• SparkContext, SQLContext, ZeppelinContext:
  • are automatically created and exposed as the variable names 'sc', 'sqlContext' and 'z', respectively, in both the Scala and Python environments in Zeppelin
  • iPython and standalone programs must use a constructor to create a new SparkContext

Note: the Scala and Python environments share the same SparkContext, SQLContext, and ZeppelinContext instances.
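Outside Zeppelin, a standalone program builds its own context; a minimal sketch (the app name and master setting are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")
val sc = new SparkContext(conf)     // connection to the cluster
val sqlContext = new SQLContext(sc) // built on top of the SparkContext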


Processing A File in Scala

//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")

//Trim away any empty rows:
val fltr = file.filter(_.length > 0)

//Print out the remaining rows:
fltr.foreach(println)



A Word on Anonymous Functions

Scala programmers make great use of anonymous functions, as can be seen in the code:

flatMap( line => line.split(" ") )

Here line is the argument to the function and line.split(" ") is the body of the function.


Scala Functions Come in a Variety of Styles

flatMap( line => line.split(" ") )
– argument to the function (type inferred); body is line.split(" ")

flatMap( (line:String) => line.split(" ") )
– argument to the function (explicit type); body is line.split(" ")

flatMap( _.split(" ") )
– no argument declared; a placeholder is used instead. The body includes the placeholder _, which allows exactly one use of one argument for each _ present. _ essentially means ‘whatever you pass me’.


And Finally – the Formal ‘def’

def myFunc(line: String): Array[String] = {
  return line.split(",")
}

//and now that it has a name:
myFunc("Hi Mom, I’m home.").foreach(println)

Here line: String is the argument to the function, Array[String] is the return type, and line.split(",") is the body.


Lab Spark RDD – Philly Crime Dataset


Spark SQL


Spark SQL Overview

Spark module for structured data processing (e.g. DB tables, JSON files)

Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API

The same execution engine is used for all three. Spark SQL interfaces provide more information about both the structure and the computation being performed than the basic Spark RDD API.


DataFrames

• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R
• Richer optimizations (significantly faster than RDDs)
• Distributed collection of data organized into named columns
• Underneath is an RDD
• The Catalyst optimizer is used under the hood


DataFrames: Created from Various Sources

[Diagram: CSV, Avro, Hive, and text sources flow through Spark SQL into a DataFrame (with an RDD underneath), organized into rows and named columns Col1 … ColN]

DataFrames from Hive:
– Reading and writing Hive tables, including ORC

DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBase, Avro

DataFrames from existing RDDs:
– with the toDF() function

Data is described as a DataFrame with rows, columns, and a schema.
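A small sketch of the toDF() path mentioned above (the case class and rows are illustrative):

case class Person(name: String, age: Int)
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(Person("Ann", 32), Person("Bob", 45)))
val df = rdd.toDF() // schema (name: string, age: int) inferred from the case class
df.printSchema()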


Writing a DataFrame

val df = sqlContext.jsonFile("/tmp/people.json")

df.show()
df.printSchema()
df.select("First Name").show()
df.select("First Name", "Age").show()
df.filter(df("age") > 40).show()
df.groupBy("age").count().show()
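Note: in later Spark 1.x releases jsonFile was deprecated in favor of the DataFrameReader API; the equivalent call is:

val df = sqlContext.read.json("/tmp/people.json")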


Spark SQL Examples


SQL Context and Hive Context

Entry point into all functionality in Spark SQL.

SQLContext
– All you need is a SparkContext:
val sqlContext = new SQLContext(sc)

HiveContext
– Superset of the functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
– Use when your data resides in Hive:
val hc = new HiveContext(sc)


DataFrame Example

val df = sqlContext.table("flightsTbl")

df.select("Origin", "Dest", "DepDelay").show(5)

Reading Data From Table

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+


DataFrame Example

df.select("Origin", "Dest", "DepDelay”).filter($"DepDelay" > 15).show(5)

Using DataFrame API to Filter Data (show delays more than 15 min)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+


SQL Example

// Register Temporary Table
df.registerTempTable("flights")

// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show

Using SQL to Query and Filter Data (again, show delays more than 15 min)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+


RDD vs. DataFrame


RDDs vs. DataFrames

RDD:
– Lower-level API (more control)
– Lots of existing code & users
– Compile-time type-safety

DataFrame:
– Higher-level API (faster development)
– Faster sorting, hashing, and serialization
– More opportunities for automatic optimization
– Lower memory pressure


Data Frames are Intuitive

RDD Example / Equivalent Data Frame Example [code shown as images on the slide; see the sketch below]

dept | name     | age
-----+----------+----
Bio  | H Smith  |  48
CS   | A Turing |  54
Bio  | B Jones  |  43
Phys | E Witten |  61

Find average age by department?
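A hedged sketch of both versions for the table above (the tuple layout is illustrative):

// RDD version: compute (sum, count) per dept by hand, then divide
val people = sc.parallelize(Seq(
  ("Bio", "H Smith", 48), ("CS", "A Turing", 54),
  ("Bio", "B Jones", 43), ("Phys", "E Witten", 61)))
people.map { case (dept, _, age) => (dept, (age, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .map { case (dept, (sum, cnt)) => (dept, sum.toDouble / cnt) }
  .collect()

// DataFrame version: one line once the columns are named
import sqlContext.implicits._
val df = people.toDF("dept", "name", "age")
df.groupBy("dept").avg("age").show()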


Spark SQL Optimizations

Spark SQL uses an underlying optimization engine (Catalyst)
– Catalyst can perform intelligent optimization since it understands the schema

Spark SQL does not materialize all the columns (as with an RDD), only what’s needed.
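You can watch Catalyst at work by asking a DataFrame for its plan; a sketch reusing the flights DataFrame from the earlier example (assumes import sqlContext.implicits._ for the $ syntax):

df.filter($"DepDelay" > 15).select("Origin", "Dest").explain(true)
// prints the parsed, analyzed, and optimized logical plans plus the physical plan;
// the optimizer pushes the filter down and prunes the unused columns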


Lab DataFrames – Federated Spark SQL


Hortonworks Community Connection


community.hortonworks.com


community.hortonworks.com


HCC DS, Analytics, and Spark Related Questions Sample


Thank you!
community.hortonworks.com