Alex Zeltov, Solutions Engineer (@azeltov)
http://tiny.cc/sparkmeetup
Intro to Big Data Analytics using Apache Spark & Zeppelin
In this workshop
• Introduction to HDP and Spark
• Spark Programming: Scala, Python, R
  – Core Spark: working with RDDs
  – Spark SQL: structured data access
  – Spark MLlib: predictive analytics
  – Spark Streaming: real-time data processing
• Conclusion and Further Reading, Q&A
Apache Spark Background
What is Spark?
Apache Open Source Project - originally developed at AMPLab (University of California Berkeley)
Data Processing Engine - focused on in-memory distributed computing use-cases
APIs - Scala, Python, Java, and R
Spark Ecosystem
[Diagram: Spark Core with the Spark SQL, Spark Streaming, MLlib, and GraphX libraries layered on top.]
Why Spark?
• Elegant developer APIs
  – A single environment for data munging and machine learning (ML)
• In-memory computation model – fast!
  – Effective for iterative computations and ML
• Machine learning
  – Implementations of distributed ML algorithms
  – Pipeline API (Spark ML)
Generality
• Combine SQL, streaming, and complex analytics.
• Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Runs Everywhere:

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3, and WASB.
Emerging Spark Patterns
Spark as a query federation engine
• Bring data from multiple sources to join/query in Spark

Use multiple Spark libraries together
• Common to see Core, MLlib & SQL used together

Use Spark with various Hadoop ecosystem projects
• Spark & Hive together
• Spark & HBase together
• Spark & Solr, etc.
More Data Source APIs
What is Hadoop?
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS); a processing part (MapReduce); and the YARN ResourceManager, which allocates resources and schedules applications.
Access patterns enabled by YARN
YARN: Data Operating System
[Diagram: YARN manages cluster nodes 1…N on top of HDFS (Hadoop Distributed File System), running batch, interactive, and real-time applications side by side.]

Applications:
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-time: needs to happen at machine execution time
Why Spark on YARN?
• Utilize existing HDP cluster infrastructure
• Resource management
  – Share Spark workloads with other workloads like Pig, Hive, etc.
• Scheduling and queues
[Diagram: Spark on YARN. The client-side Spark Driver communicates with a Spark Application Master running in a YARN container; the Application Master obtains further YARN containers, each hosting a Spark Executor that runs multiple tasks.]
Why HDFS?
Fault-tolerant distributed storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing data locality
• Not just storage but computation
[Diagram: a logical file is split into blocks (1–4); each block is stored as 3 replicas distributed randomly across the cluster's nodes.]
Spark is certified as YARN Ready and is a part of HDP.
Hortonworks Data Platform 2.4
[Diagram: the HDP 2.4 stack]
• Batch, interactive & real-time data access: MapReduce, Apache Hive, Apache Pig, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, and ISV engines, all running on YARN: Data Operating System (cluster resource management)
• Governance & integration: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka, Apache Atlas
• Security: Apache Ranger, Apache Knox, Apache Atlas, HDFS encryption
• Operations: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak
• Storage: HDFS (Hadoop Distributed File System) across nodes 1…N
• Deployment choice: Linux, Windows, on-premises, cloud
Hortonworks Commitment to Spark
Hortonworks is focused on making Apache Spark enterprise-ready so you can depend on it for mission-critical applications.
[Diagram: Spark (in-memory) runs on YARN: Data Operating System alongside the other engines: Script (Pig), Search (Solr), SQL (Hive, HCatalog, on Tez), NoSQL (HBase, Accumulo), Stream (Storm), and other ISVs, all under shared security, governance & integration, and operations frameworks.]
1. YARN enables Spark to co-exist with other engines. Spark is "YARN Ready," so its memory- and CPU-intensive apps can work with predictable performance alongside other engines, all on the same set(s) of data.
2. Extend Spark with enterprise capabilities. Ensure Spark can be managed, secured, and governed via a single set of frameworks to ensure consistency, and ensure reliability and quality of service of Spark alongside other engines.
3. Actively collaborate within the open community. As with everything we do at Hortonworks, we work entirely within the open community across Spark and all related projects to improve this key Hadoop technology.
Interacting with Spark
Interacting with Spark
• Spark’s interactive REPL shell (in Python or Scala)
• Web-based notebooks:
  – Zeppelin: a web-based notebook that enables interactive data analytics
  – Jupyter: evolved from the IPython project
  – Spark Notebook: forked from the scala-notebook
  – RStudio: for SparkR; Zeppelin support coming soon
https://community.hortonworks.com/articles/25558/running-sparkr-in-rstudio-using-hdp-24.html
Apache Zeppelin
• A web-based notebook that enables interactive data analytics
• Multiple language backends
• A multi-purpose notebook: the place for all your needs
  – Data ingestion
  – Data discovery
  – Data analytics
  – Data visualization
  – Collaboration
Zeppelin – Multiple Language Backends
Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell.
Zeppelin – Dependency Management
• Load libraries recursively from a Maven repository
• Load libraries from the local filesystem

%dep

// add a Maven repository
z.addRepo("RepoName").url("RepoURL")

// add an artifact from the local filesystem
z.load("/path/to.jar")

// add an artifact from a Maven repository, with no dependencies
z.load("groupId:artifactId:version").excludeAll()
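For example, a minimal sketch of pulling in a Spark package (the coordinates below are illustrative; note that %dep must run before the Spark interpreter starts):

%dep
// hypothetical example: the Databricks CSV reader for Scala 2.10
z.load("com.databricks:spark-csv_2.10:1.4.0")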
Community Plugins
• 100+ connectors
http://spark-packages.org/
Apache Spark Basics
How Does Spark Work?
• RDDs
  – Your data is loaded in parallel into structured collections
• Actions
  – Manipulate the state of the working model by forming new RDDs and performing calculations upon them
• Persistence
  – Long-term storage of an RDD’s state
RDD – Resilient Distributed Dataset

The primary abstraction in Spark
– An immutable collection of objects (or records, or elements) that can be operated on in parallel

Distributed
– A collection of elements partitioned across the nodes in a cluster
– Each RDD is composed of one or more partitions
– The user can control the number of partitions
– More partitions => more parallelism

Resilient
– Recovers from node failures
– An RDD keeps its lineage information -> it can be recreated from parent RDDs

May be persisted in memory for efficient reuse across parallel operations (caching)
RDD – Resilient Distributed Dataset

[Diagram: an RDD of 25 items split into 5 partitions, distributed across Spark executors on worker nodes. The programmer specifies the number of partitions for an RDD (a default value is used if unspecified); more partitions = more parallelism.]
RDDs
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is executed when an action runs on it
• Persist (cache) RDDs in memory or on disk
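A minimal sketch of the lazy model (hypothetical data; assumes sc is predefined, as in the shell or Zeppelin):

val nums = sc.parallelize(1 to 10)    // an RDD of 10 integers
val evens = nums.filter(_ % 2 == 0)   // transformation: recorded, not computed
println(evens.count())                // action: triggers the computation, prints 5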
Example RDD Transformations
map(func), filter(func), distinct()

• All create a new RDD from an existing one
• The new RDD is not computed until an action is performed (lazy)
• Each element in the RDD is passed to the target function, and the results form the new RDD
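A short sketch of the three transformations above, on hypothetical data:

val words = sc.parallelize(Seq("spark", "spark", "hadoop", ""))
val upper = words.map(_.toUpperCase)      // one output element per input element
val nonEmpty = words.filter(_.nonEmpty)   // keeps elements matching the predicate
val unique = words.distinct()             // removes duplicates
// nothing has executed yet; an action such as unique.collect() triggers the work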
Example Action Operations
count(), reduce(func), collect(), take(n)

• Either:
  – Returns a value to the driver program, or
  – Exports state to an external system
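A sketch of these actions on a small hypothetical RDD:

val rdd = sc.parallelize(Seq(1, 2, 3, 4))
rdd.count()         // 4: number of elements, returned to the driver
rdd.reduce(_ + _)   // 10: aggregates the elements with the given function
rdd.collect()       // Array(1, 2, 3, 4): pulls ALL elements to the driver
rdd.take(2)         // Array(1, 2): just the first two elements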
Example Persistence Operations
persist() – takes options
cache() – only one option: in-memory

• Stores RDD values:
  – in memory (what doesn’t fit is recalculated when necessary); replication is an option for in-memory
  – to disk
  – blended (memory and disk)
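A minimal sketch of choosing a storage level (the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///tmp/logs.txt")
logs.persist(StorageLevel.MEMORY_AND_DISK)   // blended: in memory first, spills to disk
// logs.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)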
Spark Applications
Are a definition in code of:
• RDD creation
• Actions
• Persistence

Results in the creation of a DAG (Directed Acyclic Graph) [workflow]:
• Each DAG is compiled into stages
• Each stage is executed as a series of tasks
• Each task operates in parallel on assigned partitions
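One way to inspect the lineage behind the DAG is toDebugString; a quick sketch (the input path is hypothetical):

val counts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)            // the shuffle here marks a stage boundary
println(counts.toDebugString)    // prints the lineage; indentation marks stage boundaries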
Spark Context
What is it?
• The main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code
Spark Context
• A Spark program first creates a SparkContext object
• Tells Spark how and where to access a cluster
• Use the SparkContext to create RDDs
• SparkContext, SQLContext, and ZeppelinContext are automatically created and exposed as the variables 'sc', 'sqlContext', and 'z', respectively, in both the Scala and Python environments in Zeppelin
• iPython and standalone programs must use a constructor to create a new SparkContext

Note: the Scala and Python environments share the same SparkContext, SQLContext, and ZeppelinContext instances.
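A minimal sketch of the constructor route for standalone programs (Spark 1.x-era API; the app name and master are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("yarn-client")   // or "local[*]" on a single machine
val sc = new SparkContext(conf)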
Processing A File in Scala
//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")

//Trim away any empty rows:
val fltr = file.filter(_.length > 0)

//Print out the remaining rows:
fltr.foreach(println)
A Word on Anonymous Functions
Scala programmers make great use of anonymous functions, as can be seen in the code:

flatMap( line => line.split(" ") )

Here line is the argument to the function, and line.split(" ") is the body of the function.
Scala Functions Come in a Variety of Styles

flatMap( line => line.split(" ") )
// argument to the function (type inferred); the body follows the =>

flatMap( (line: String) => line.split(" ") )
// argument to the function (explicit type); the body follows the =>

flatMap( _.split(" ") )
// no argument declared; the placeholder _ is used instead
// each _ allows exactly one use of one argument; it essentially means "whatever you pass me"
And Finally – the Formal ‘def’

def myFunc(line: String): Array[String] = {
  return line.split(",")
}

//and now that it has a name:
myFunc("Hi Mom, I’m home.").foreach(println)

Here line: String is the argument to the function, Array[String] is the return type, and line.split(",") is the body.
Lab Spark RDD – Philly Crime Dataset
Spark SQL
Spark SQL Overview
A Spark module for structured data processing (e.g. DB tables, JSON files)

Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API

The same execution engine is used for all three. Spark SQL interfaces provide more information about both the structure and the computation being performed than the basic Spark RDD API.
DataFrames
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R
• Richer optimizations (significantly faster than RDDs)
• A distributed collection of data organized into named columns
• Underneath is an RDD
• The Catalyst optimizer is used under the hood
DataFrames: Created from Various Sources

[Diagram: sources such as CSV, Avro, Hive, and text flow through Spark SQL into a DataFrame (rows and named columns Col1 … ColN, with an RDD underneath).]

DataFrames from Hive:
– Reading and writing Hive tables, including ORC

DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBase, Avro

DataFrames from existing RDDs:
– With the toDF() function

Data is described as a DataFrame with rows, columns, and a schema.
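A minimal sketch of the toDF() route, with a hypothetical case class:

import sqlContext.implicits._

case class Person(name: String, age: Int)
val rdd = sc.parallelize(Seq(Person("Ada", 36), Person("Bob", 42)))
val df = rdd.toDF()   // column names are inferred from the case class fields
df.printSchema()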
Writing a DataFrame

val df = sqlContext.jsonFile("/tmp/people.json")

df.show()
df.printSchema()
df.select("First Name").show()
df.select("First Name", "Age").show()
df.filter(df("age") > 40).show()
df.groupBy("age").count().show()
Spark SQL Examples
SQL Context and Hive Context
SQLContext
• The entry point into all functionality in Spark SQL
• All you need is a SparkContext:

val sqlContext = new SQLContext(sc)

HiveContext
• A superset of the functionality provided by the basic SQLContext
  – Read data from Hive tables
  – Access to Hive functions (UDFs)
• Use when your data resides in Hive:

val hc = new HiveContext(sc)
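A quick sketch of querying Hive through it (the database and table names are hypothetical):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val top = hc.sql("SELECT * FROM default.sample_07 LIMIT 10")
top.show()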
DataFrame Example
Reading Data from a Table:

val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+
DataFrame Example
df.select("Origin", "Dest", "DepDelay”).filter($"DepDelay" > 15).show(5)
Using DataFrame API to Filter Data (show delays more than 15 min)
+------+----+--------+|Origin|Dest|DepDelay|+------+----+--------+| IAD| TPA| 19|| IND| BWI| 34|| IND| JAX| 25|| IND| LAS| 67|| IND| MCO| 94|+------+----+--------+
SQL Example
Using SQL to Query and Filter Data (again, showing delays of more than 15 minutes):

// Register a temporary table
df.registerTempTable("flights")

// Use SQL to query the dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
RDD vs. DataFrame
RDDs vs. DataFrames

RDD
• Lower-level API (more control)
• Lots of existing code & users
• Compile-time type-safety

DataFrame
• Higher-level API (faster development)
• Faster sorting, hashing, and serialization
• More opportunities for automatic optimization
• Lower memory pressure
Data Frames are Intuitive
Find the average age by department, given this data:

dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61

RDD example vs. the equivalent DataFrame example: see the sketch below.
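A minimal sketch of the two approaches, using hypothetical tuples that match the table above (the DataFrame version assumes the sqlContext implicits):

val people = sc.parallelize(Seq(
  ("Bio", "H Smith", 48), ("CS", "A Turing", 54),
  ("Bio", "B Jones", 43), ("Phys", "E Witten", 61)))

// RDD version: hand-rolled sum/count aggregation
val avgByDept = people
  .map { case (dept, _, age) => (dept, (age, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }

// DataFrame version: declarative, and Catalyst can optimize it
import sqlContext.implicits._
val df = people.toDF("dept", "name", "age")
df.groupBy("dept").avg("age").show()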
Spark SQL Optimizations
• Spark SQL uses an underlying optimization engine (Catalyst)
  – Catalyst can perform intelligent optimization because it understands the schema
• Spark SQL does not materialize all the columns (as an RDD would); only what is needed
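One way to watch Catalyst at work is explain(); a quick sketch against the flights DataFrame from the earlier slides:

// prints the logical and physical plans; Catalyst pushes the filter down
// and prunes unread columns before anything executes
df.filter(df("DepDelay") > 15).select("Origin", "Dest").explain(true)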
Lab DataFrames – Federated Spark SQL
Hortonworks Community Connection
community.hortonworks.com
A sample of HCC data science, analytics, and Spark-related questions
Thank you!
community.hortonworks.com