Enterprise Spark At Scale

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

hortonworks.com/wp-content/uploads/2016/09/5-Tech-Track-Enterprise-Spark-at-Scale...

Agile Analytics with Enterprise Apache Spark at Scale

[Diagram: Spark on YARN, surrounded by Operations, Security, Governance, and Storage]

Powering Agile Analytics: via data science notebooks and automation for the most common analytics (including geospatial analysis and entity resolution)

Seamless Data Access: brings together as many data types as possible

Unmatched Economics: combining the speed of in-memory processing with HDP's cost efficiencies at scale

Ready for the Enterprise: with robust security, governance, and operations coordinated centrally by Apache Hadoop and YARN

What Is Apache Spark?

Apache open source project originally developed at AMPLab (University of California, Berkeley)

Unified data processing engine that operates across varied data workloads and platforms

Why Apache Spark?

Elegant Developer APIs

– Single environment for data munging and Machine Learning (ML)

In‐memory computation model – Fast!

– Effective for iterative computations and ML

Machine Learning

– Implementation of distributed ML algorithms

– Pipeline API (Spark ML)

Why Apache Spark on YARN?

Resource management 

– Share cluster resources between Spark and other workloads (Pig, Hive, etc.)

Utilizes existing HDP cluster infrastructure

Scheduling and queues

[Diagram: a Client, the Spark Driver, and the Spark Application Master (in a YARN container), plus three Spark Executors, each in its own YARN container and running two tasks]
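
As a minimal sketch of the scheduling-and-queues point, the snippet below assembles a `spark-submit` command line that targets YARN and a specific queue. The queue name `analytics`, the application file `job.py`, and the executor count are hypothetical placeholders; `--master yarn`, `--deploy-mode`, `--queue`, and `--num-executors` are standard `spark-submit` options.

```python
import shlex


def build_submit_command(app, queue, num_executors, deploy_mode="cluster"):
    """Return the argv list for submitting `app` to a YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--queue", queue,
        "--num-executors", str(num_executors),
        app,
    ]


# Hypothetical job, queue, and executor count.
command = build_submit_command("job.py", queue="analytics", num_executors=3)
print(" ".join(shlex.quote(part) for part in command))
```

Because the queue is just a submit-time option, the same application can be pointed at a different YARN queue without code changes.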

Emerging Apache Spark Patterns on HDP 

Spark as query federation processing engine and caching tool

– Bring data from multiple sources to join/query in Spark

Use multiple Spark libraries together

– Common to see Core, ML, and SQL used together

Use Spark with various Hadoop ecosystem projects

– Hive, HBase, Solr, etc.

– HDFS for long-running secure clusters/TDE

– Secure Kafka connection in Kerberos clusters

Structured Streaming with Apache Spark 

Single, high‐level streaming API on DataFrames

Scalable, high‐throughput, fault‐tolerant stream processing of live data streams

Support batch and interactive queries

– Aggregate Data in a stream, then serve using JDBC 

– Build and process ML models on the stream
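
The "aggregate data in a stream, then serve" pattern can be illustrated without Spark. The sketch below is plain Python, not the Structured Streaming API: it folds hypothetical micro-batches of `(page, hits)` events into running totals, which is the kind of continuously updated table a JDBC layer could then serve.

```python
from collections import Counter


def update_counts(running, batch):
    """Fold one micro-batch of (key, value) events into the running totals."""
    for key, value in batch:
        running[key] += value
    return running


# Hypothetical micro-batches of (page, hits) events arriving over time.
batches = [
    [("home", 2), ("cart", 1)],
    [("home", 1), ("checkout", 4)],
    [("cart", 3)],
]

totals = Counter()
for batch in batches:
    # After each batch, `totals` is the table a query layer would serve.
    update_counts(totals, batch)

print(dict(totals))
```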

Data Processing in Apache Spark

Single, high-level, seamless mix of SQL queries with Spark programs

Connect to any data source the same way

– Hive, Avro, Parquet, HBase, ORC, JSON, JDBC, etc.

Connect through JDBC or ODBC

Apache Spark Use Cases 

Massive Volumes of Weblogs Fueled Webtrends Growth—but also its Skyrocketing Storage Costs 

Webtrends provides digital marketing solutions for more than 2,000 companies in 60 countries – processing 13 billion daily online events

Data used to be processed in relational databases, stored on large NAS appliances, which were not economical at scale

Processing occurred on‐premises, without cloud‐based capabilities

Diseconomies of scale hampered the company objective to help its customers predict optimal online ad placement

Webtrends’ Journey

Webtrends’ Journey

Petabytes of Weblogs Analyzed with Spark at Scale

Data streams from a vast array of desktop and mobile devices

13 billion daily events collected in fewer than milliseconds per event

No data cleansing necessary prior to analysis with Apache Spark

Two clusters consolidated into one YARN‐based HDP cluster

Launched new product Webtrends Explore™ – powered by HDP

“We’re able to…look at this data set and process it and do predictions, behavioral analysis. We can do things that allow us to determine ROI for different actions and behavioral patterns.”

Peter Crossley, Chief Architect

[Diagram: Webtrends' use cases on the Renovate and Innovate tracks. Categories: Active Archive, Data Discovery, Single View, Predictive Analytics. Use cases: SQL Server Offload, Web Log Analysis, Per-Customer Click Path, Behavioral Segmentation, Ad Click Predictions, LCV Analysis, Personalized Online Ads]

Customer Use Cases with Apache Spark

Web Analytics for Marketing (web analytics company)

– Ingesting 13 billion events/day

– Use Spark Streaming for data ingest

– Extremely low latency: 40 milliseconds

– Need more metrics for Spark Streaming

– Wants two-way SSL for the Kafka Spark receiver

Optimize Advertising (cable company)

– Monitor channel changes with Spark Streaming

– Correlate changes with ads/programming

– Allocate ads in real time: show ads to users who are watching a show and will stay for over 20 seconds

– How to optimize Spark app development

Real-Time Fraud Detection (bank/credit card)

– Monitor ATMs with NiFi

– Log aggregation and fraud detection

Smart Meters (utility company)

– Now getting data every 15 minutes

– Improve theft/fraud detection

– Text customers on power outage

Interacting with Apache Spark

Interacting with Apache Spark

[Diagram: four ways to reach Spark on YARN, each with its own driver: Spark Thrift Server, REST Server, Spark Shell, and Zeppelin]

Apache Zeppelin GA:  The Data Science Notebook

Web‐based data science notebook

Interactive data ingestion and data exploration 

Easy sharing and collaboration 

Secure with single sign‐on and encryption  

How Apache Zeppelin Works

[Diagram: a Notebook Author and Collaborators/Report Viewers work through Zeppelin, which connects to the HDP cluster (Spark | Hive | HBase | Solr) or any of 30+ back ends]

Bringing Multitenancy to Apache Zeppelin

Introducing Livy

Livy is the open source REST interface for interacting with Apache Spark from anywhere 

Installed as Spark Ambari Service, not yet exposed outside of Zeppelin

[Diagram: a Livy Client talks HTTP to the Livy Server, which talks HTTP (RPC) to a Spark Interactive Session and a Spark Batch Session, each with its own SparkContext]
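
To make the REST interaction concrete, here is a minimal sketch that builds, but does not send, the HTTP requests a Livy client would issue. The host name, the session id `0`, and the HDFS path are hypothetical; the `/sessions`, `/sessions/{id}/statements`, and `/batches` endpoints and port 8998 follow Livy's REST API and its default configuration. Authentication (for example SPNego on a Kerberized cluster) is deliberately left out.

```python
import json
import urllib.request

# Hypothetical host; 8998 is Livy's default port.
LIVY_URL = "http://livy-host.example.com:8998"


def livy_post(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Interactive session: create it, then run statements inside it.
# The session id (0 here) is assigned by the server in the create response.
create_session = livy_post("/sessions", {"kind": "pyspark"})
run_statement = livy_post(
    "/sessions/0/statements", {"code": "sc.parallelize(range(100)).sum()"}
)

# Batch session: submit an application file (hypothetical HDFS path).
submit_batch = livy_post("/batches", {"file": "hdfs:///apps/example/job.py"})

for request in (create_session, run_statement, submit_batch):
    print(request.get_method(), request.full_url, request.data.decode("utf-8"))
    # urllib.request.urlopen(request) would send it to a running Livy server.
```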

Security Across Zeppelin‐Livy‐Spark

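[Diagram: in Zeppelin, Shiro authenticates users against LDAP; the Ispark Group Interpreter calls the Livy APIs on the Livy Server using SPNego (Kerberos); the Livy Server connects with Kerberos to the driver in Spark on YARN]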

Reasons to Integrate with Livy

Bring Sessions to Apache Zeppelin

– Isolation

– Session sharing 

Enable efficient cluster resource utilization

– Default Spark interpreter keeps the YARN/Spark job running forever

– Livy interpreter is recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)

Identity propagation

– Send user identity from Zeppelin > Livy > Spark on YARN
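
The inactivity-based recycling can be pictured with a small, self-contained sketch. This is an illustration of the idea only, not Livy's implementation: the `SessionRegistry` class and its method names are hypothetical, and time is passed in explicitly (in seconds) so the behavior is easy to follow.

```python
class SessionRegistry:
    """Tracks last-activity times and recycles sessions idle past a timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_active = {}

    def touch(self, session_id, now):
        """Record activity for a session at time `now` (seconds)."""
        self.last_active[session_id] = now

    def recycle_idle(self, now):
        """Remove and return the sessions idle for longer than the timeout."""
        idle = sorted(
            sid for sid, seen in self.last_active.items()
            if now - seen > self.timeout_seconds
        )
        for sid in idle:
            del self.last_active[sid]
        return idle


registry = SessionRegistry(timeout_seconds=60 * 60)  # 60 minutes, as on the slide
registry.touch("session-1", now=0)
registry.touch("session-2", now=50 * 60)
recycled = registry.recycle_idle(now=61 * 60)
print(recycled)  # ['session-1']
```

Recycling idle sessions releases their YARN containers, which is the resource-utilization benefit the slide contrasts with the default interpreter.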

SparkContext Sharing

[Diagram: Client 1 and Client 2 both use Session-1, and Client 3 uses Session-2; inside the Livy Server, Session-1 maps to SparkSession-1 and Session-2 to SparkSession-2, each with its own SparkContext]

New Features of Spark in HDP 2.5

Apache Spark 2.0 Technical Preview

API Improvements

– SparkSession – new entry point

– Unified DataFrame & Dataset API

– Structured Streaming/Continuous Applications

Performance Improvements

– Tungsten Phase 2: multi-stage code generation

Machine Learning

– ML Pipeline (DataFrame-based) is the primary API; the RDD-based MLlib API is in maintenance mode

– Distributed R algorithms (GLM, Naïve Bayes, K‐Means, Survival Regression)

SparkSQL

– More SQL support (new ANSI SQL parser, subquery support)

First Hadoop distribution with Spark 2.0

Side‐by‐Side Apache Spark Installs within HDP 2.5 

Spark 1.6.2 and 2.0 can be installed on the same cluster, and on the same nodes

Spark 1.6 and Spark 2.0 are separate Ambari services

– Each service gets its own Spark History Server, Thrift Server, and Spark clients

– Each service's configuration is independent

Spark 1.6 job history goes only to the Spark 1.6 History Server

Spark 2.0 job history goes only to the Spark 2.0 History Server

How to experiment with Spark 2.0 TP
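
One way to experiment with the Spark 2.0 TP next to 1.6 is to choose the version per process. The sketch below assumes the HDP convention of selecting the Spark version through the `SPARK_MAJOR_VERSION` environment variable; treat that as an assumption to verify against the documentation for your HDP release. The helper `spark_env` is hypothetical.

```python
import os


def spark_env(major_version, base_env=None):
    """Return a copy of the environment that selects a Spark major version."""
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_MAJOR_VERSION"] = str(major_version)
    return env


env_for_spark2 = spark_env(2)
env_for_spark1 = spark_env(1)
print(env_for_spark2["SPARK_MAJOR_VERSION"], env_for_spark1["SPARK_MAJOR_VERSION"])

# A launcher could then run, for example:
# subprocess.run(["spark-submit", "--version"], env=env_for_spark2, check=True)
```

Setting the variable per process rather than globally keeps existing Spark 1.6 jobs untouched while individual jobs try 2.0.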

Apache Spark + HBase Connector – GA within HDP 2.5

Brings DataFrame-based Spark analytics to HBase

See blog for usage patterns:  http://bit.ly/sparkhbaseconnector

[Diagram: the Driver and four Spark Executors, each in its own YARN container and running two tasks, issue Scans and BulkGets against four HBase Region Servers]
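
DataFrame access to HBase relies on a mapping from DataFrame columns to the HBase row key and column families. The sketch below builds such a mapping as JSON, following the catalog layout described for the connector (`table`, `rowkey`, `columns` with `cf`/`col`/`type`). That layout is an assumption to check against the connector's documentation, and the table, column family, and column names are hypothetical.

```python
import json

# Hypothetical HBase table "customers" with one column family, "info".
catalog = {
    "table": {"namespace": "default", "name": "customers"},
    "rowkey": "key",
    "columns": {
        # The row key is exposed as an ordinary DataFrame column.
        "customer_id": {"cf": "rowkey", "col": "key", "type": "string"},
        "region": {"cf": "info", "col": "region", "type": "string"},
        "total_spend": {"cf": "info", "col": "total_spend", "type": "int"},
    },
}

# The connector is given this mapping as a JSON string option.
catalog_json = json.dumps(catalog)
print(catalog_json)
```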

Key Features: Apache Spark Column Security with LLAP

Fine‐Grained Column Level Access Control for SparkSQL

Fully dynamic policies per user ‐ doesn’t require views

Use Standard Ranger policies and tools to control access and masking policies

Flow:

1. SparkSQL gets data locations, known as “splits”, from HiveServer2 and plans the query

2. HiveServer2 authorizes access using Ranger; per-user policies like row filtering are applied

3. Spark gets a modified query plan based on the dynamic security policy

4. Spark reads data from LLAP; filtering/masking is guaranteed by the LLAP server

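[Diagram: the Spark Client contacts HiveServer2 (steps 1 and 2), which handles authorization, uses data locations and view definitions from the Hive Metastore, and applies dynamic policies from the Ranger Server; the client then receives the plan (step 3) and reads data from LLAP with filter pushdown (step 4)]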

Example: Per‐User Row Filtering by Region in SparkSQL

Original Query:

SELECT * FROM CUSTOMERS
WHERE total_spend > 10000

Query rewrites based on dynamic Ranger policies (LLAP data access):

User ID   Region   Total Spend
1         East     5,131
2         East     27,828
3         West     55,493
4         West     7,193
5         East     18,193

Spark User 1 (West Region), Dynamic Rewrite:

SELECT * FROM CUSTOMERS
WHERE total_spend > 10000
AND region = 'west'

Spark User 2 (East Region), Dynamic Rewrite:

SELECT * FROM CUSTOMERS
WHERE total_spend > 10000
AND region = 'east'
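
The per-user rewrite in this example can be reproduced with a small, self-contained sketch. It uses Python's built-in `sqlite3` as a stand-in for SparkSQL and LLAP, and a plain dictionary as a stand-in for Ranger row-filter policies; the user names and the `rewrite_for_user` helper are hypothetical. The rows are the ones from the table on this slide.

```python
import sqlite3

# Region each user may see; stands in for a per-user Ranger row-filter policy.
ROW_FILTERS = {"spark_user_1": "west", "spark_user_2": "east"}

ORIGINAL_QUERY = "SELECT * FROM customers WHERE total_spend > 10000"


def rewrite_for_user(query, user):
    """Append the user's region predicate as a bound parameter."""
    return query + " AND region = ?", (ROW_FILTERS[user],)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (user_id INTEGER, region TEXT, total_spend INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    # Rows from the slide; regions lowercased to match the rewritten predicates.
    [(1, "east", 5131), (2, "east", 27828), (3, "west", 55493),
     (4, "west", 7193), (5, "east", 18193)],
)


def run_as(user):
    """Run the original query as `user`, with that user's filter applied."""
    sql, params = rewrite_for_user(ORIGINAL_QUERY, user)
    return conn.execute(sql + " ORDER BY user_id", params).fetchall()


print(run_as("spark_user_1"))  # [(3, 'west', 55493)]
print(run_as("spark_user_2"))  # [(2, 'east', 27828), (5, 'east', 18193)]
```

Both users issue the same query text; only the appended predicate differs, which is why no per-user views are needed.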

Apache Zeppelin Security: Authentication + SSL

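[Diagram: user Tommy Callahan reaches Zeppelin over SSL through a firewall; Zeppelin authenticates the user against LDAP and works with Spark on YARN; the steps are numbered 1 to 3]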

Apache Zeppelin + Livy End‐to‐End Security

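[Diagram: user Tommy Callahan signs in to Zeppelin, which authenticates against LDAP; the Ispark Group Interpreter calls the Livy APIs using SPNego (Kerberos); the Livy Server talks to Spark on YARN over Kerberos/RPC; the job runs as Tommy Callahan]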

Apache Spark & Zeppelin Timeline from Hortonworks 

GA releases

– HDP 2.2.4: Spark 1.2.1 GA

– HDP 2.3.0: Spark 1.3.1 GA

– HDP 2.3.2: Spark 1.4.1 GA

– HDP 2.3.4: Spark 1.5.2* GA

– HDP 2.4.0: Spark 1.6 GA

– HDP 2.4.2: Spark 1.6.1 GA

– HDP 2.5: Spark 1.6.2 (GA) + Spark 2.0 (TP), Zeppelin GA (Aug 2016)

Spark technical previews (TP)

– Spark 1.3.1 TP: May 2015

– Spark 1.4.1 TP: Aug 2015

– Spark 1.5.1 TP: Nov 2015

– Spark 1.6 TP: Jan 2015

Zeppelin

– Hortonworks' first Zeppelin contribution: Mar 2015

– Zeppelin TP #1: Oct 2015

– Zeppelin TP #2: Mar 2016

– Zeppelin Final TP: Apr 2016

– Zeppelin TLP: May 2016

[Timeline graphic; further date markers: Dec 2015, Mar 2016]

Demo