
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Enterprise Spark At Scale


Agile Analytics with Enterprise Apache Spark at Scale

[Diagram: Spark on YARN, with Operations, Security, Governance and Storage]

Powering Agile Analytics: via data science notebooks and automation for the most common analytics (including geospatial analysis and entity resolution)

Seamless Data Access: brings together as many data types as possible

Unmatched Economics: combining the speed of in-memory processing with HDP’s cost efficiencies at scale

Ready for the Enterprise: with robust security, governance and operations coordinated centrally by Apache Hadoop and YARN


What Is Apache Spark?

Apache open source project originally developed at AMPLab (University of California, Berkeley)

Unified data processing engine that operates across varied data workloads and platforms


Why Apache Spark?

Elegant Developer APIs

– Single environment for data munging and Machine Learning (ML)

In‐memory computation model – Fast!

– Effective for iterative computations and ML

Machine Learning

– Implementation of distributed ML algorithms

– Pipeline API (Spark ML)


Why Apache Spark on YARN?

Resource management

– Share cluster resources between Spark and other workloads (Pig, Hive, etc.)

Utilizes existing HDP cluster infrastructure

Scheduling and queues

[Diagram: a client with the Spark Driver, a Spark Application Master in a YARN container, and three Spark Executors, each in its own YARN container and running two tasks]
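To make the resource management and queue points concrete, here is a minimal sketch of how a Spark job might be directed at a YARN queue. The flags are standard `spark-submit` options; the queue name `analytics`, the application file `job.py`, and the executor sizing are hypothetical values chosen for illustration. The helper only builds the command line and does not launch anything.

```python
import shlex


def build_spark_submit(app, queue="default", num_executors=3,
                       executor_cores=2, executor_memory="2g",
                       deploy_mode="client"):
    """Build a spark-submit argument list for YARN (nothing is executed)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",                 # let YARN manage the resources
        "--deploy-mode", deploy_mode,       # where the driver runs
        "--queue", queue,                   # YARN scheduler queue (hypothetical name)
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        app,
    ]


# Three executors with two cores each, similar in shape to the diagram.
cmd = build_spark_submit("job.py", queue="analytics")
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list could be handed to `subprocess.run` on a node that has a Spark client installed.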


Emerging Apache Spark Patterns on HDP

Spark as query federation processing engine and caching tool

– Bring data from multiple sources to join/query in Spark

Use multiple Spark libraries together

– Common to see Core, ML and SQL used together

Use Spark with various Hadoop ecosystem projects

– Hive, HBase, Solr, etc.

– HDFS for long running secure clusters/TDE

– Secure Kafka connection in Kerberos clusters


Structured Streaming with Apache Spark

Single, high‐level streaming API on DataFrames

Scalable, high‐throughput, fault‐tolerant stream processing of live data streams

Support batch and interactive queries

– Aggregate Data in a stream, then serve using JDBC

– Build and process ML models on the stream


Data Processing in Apache Spark

Seamless mix of SQL queries with Spark programs

Connect to any data source the same way

– Hive, Avro, Parquet, HBase, ORC, JSON, JDBC, etc.

Connect through JDBC or ODBC


Apache Spark Use Cases


Massive Volumes of Weblogs Fueled Webtrends Growth—but also its Skyrocketing Storage Costs

Webtrends provides digital marketing solutions for more than 2,000 companies in 60 countries – processing 13 billion daily online events

Data used to be processed in relational databases, stored on large NAS appliances, which were not economical at scale

Processing occurred on‐premises, without cloud‐based capabilities

Diseconomies of scale hampered the company objective to help its customers predict optimal online ad placement


Webtrends’ Journey

Petabytes of Weblogs Analyzed with Spark at Scale

Data streams from a vast array of desktop and mobile devices

13 billion daily events collected in fewer than milliseconds per event

No data cleansing necessary prior to analysis with Apache Spark

Two clusters consolidated into one YARN‐based HDP cluster

Launched new product Webtrends Explore™ – powered by HDP

Innovate

Renovate

Personalized Online Ads

“We’re able to…look at this data set and process it and do predictions, behavioral analysis.

We can do things that allow us to determine ROI for different actions and behavioral patterns.”

Peter Crossley, Chief Architect

[Diagram: solution categories Active Archive, Data Discovery, Single View and Predictive Analytics; use cases Per-Customer Click Path, Web Log Analysis, SQL Server Offload, Behavioral Segmentation, Ad Click Predictions and LCV Analysis]


Customer Use Cases with Apache Spark

Web Analytics for Marketing (Web Analytics)

– Ingesting 13 billion events/day

– Use Spark Streaming for data ingest

– Extremely low latency: 40 milliseconds

– Need more metrics for Spark Streaming

– Wants two-way SSL for the Kafka Spark receiver

Optimize Advertising (Cable Company)

– Monitor channel changes with Spark Streaming

– Correlate changes with ads/programming

– Allocate ads in real time: show ads to users who are watching a show and will stay for more than 20 seconds

– How to optimize Spark app development

Real-Time Fraud Detection (Bank/Credit Card)

– Monitor ATMs with NiFi

– Log aggregation and fraud detection

Smart Meters (Utility Company)

– Now getting data every 15 minutes

– Improve theft/fraud detection

– Text customers on power outage


Interacting with Apache Spark


[Diagram: four entry points, each with its own driver, on top of Spark on YARN: Spark Thrift Server, REST Server, Spark Shell and Zeppelin]



Apache Zeppelin GA: The Data Science Notebook

Web‐based data science notebook

Interactive data ingestion and data exploration

Easy sharing and collaboration

Secure with single sign‐on and encryption


How Apache Zeppelin Works

[Diagram: a notebook author and collaborators/report viewers use Zeppelin, which connects to the HDP cluster (Spark | Hive | HBase | Solr) or any of 30+ back ends]


Bringing Multitenancy to Apache Zeppelin


Introducing Livy

Livy is the open source REST interface for interacting with Apache Spark from anywhere

Installed as Spark Ambari Service, not yet exposed outside of Zeppelin

[Diagram: Livy Client -> HTTP -> Livy Server -> HTTP (RPC) -> Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)]
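As an illustration of what "interacting with Spark over REST" looks like, the sketch below builds the JSON requests a client might send to Livy for an interactive session, a statement, and a batch job. The endpoints (`/sessions`, `/sessions/{id}/statements`, `/batches`) and port 8998 follow Livy's REST API as I understand it; the host name, the session id `0`, and the HDFS path are hypothetical. The requests are only constructed, never sent, and a Kerberized cluster would additionally need SPNEGO authentication, which is left out here.

```python
import json
import urllib.request

# Hypothetical host; 8998 is assumed to be the Livy server port.
LIVY_URL = "http://livy-host.example.com:8998"


def livy_post(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 1. Ask Livy to start an interactive PySpark session.
create_session = livy_post("/sessions", {"kind": "pyspark"})

# 2. Run a snippet inside session 0 (the real id comes from the step 1 response).
run_statement = livy_post(
    "/sessions/0/statements",
    {"code": "sc.parallelize(range(100)).sum()"},
)

# 3. Submit a batch job instead of an interactive session (hypothetical file).
submit_batch = livy_post("/batches", {"file": "hdfs:///apps/example/job.py"})

for req in (create_session, run_statement, submit_batch):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing any of these objects to `urllib.request.urlopen` would perform the call against a reachable Livy server.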


Security Across Zeppelin‐Livy‐Spark

[Diagram: Zeppelin (Shiro, LDAP) with the Ispark Group Interpreter -> SPNego: Kerberos -> Livy APIs on the Livy Server -> Kerberos -> Driver on Spark on YARN]


Reasons to Integrate with Livy

Bring Sessions to Apache Zeppelin

– Isolation

– Session sharing

Enable efficient cluster resource utilization

– Default Spark interpreter keeps YARN/Spark job running forever

– Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)

Identity propagation

– Send user identity from Zeppelin > Livy > Spark on YARN
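The resource-utilization argument above rests on idle sessions being recycled. The following is a minimal sketch of that idea only; it is not Livy's implementation, and the `Session` and `SessionRegistry` names, the user names, and the timestamps are all hypothetical. The timeout constant mirrors the 60-minute inactivity window mentioned above.

```python
from dataclasses import dataclass, field

SESSION_TIMEOUT_SECONDS = 60 * 60  # mirrors the 60-minute inactivity window


@dataclass
class Session:
    owner: str            # identity propagated from the notebook user
    last_activity: float  # seconds on an arbitrary clock


@dataclass
class SessionRegistry:
    timeout: float = SESSION_TIMEOUT_SECONDS
    sessions: dict = field(default_factory=dict)

    def touch(self, session_id, owner, now):
        """Create the session, or record fresh activity on it."""
        self.sessions[session_id] = Session(owner, now)

    def recycle_idle(self, now):
        """Drop sessions idle for longer than the timeout; return their ids."""
        expired = [sid for sid, s in self.sessions.items()
                   if now - s.last_activity > self.timeout]
        for sid in expired:
            del self.sessions[sid]  # the cluster resources would be freed here
        return expired


registry = SessionRegistry()
registry.touch("session-1", "alice", now=0)
registry.touch("session-2", "bob", now=3000)

expired = registry.recycle_idle(now=3700)
print(expired)                    # session-1 has been idle for 3700 s
print(sorted(registry.sessions))  # session-2 has been idle for only 700 s
```

By contrast, an interpreter that never expires would keep holding its YARN containers however long the notebook sits unused.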


[Diagram: SparkContext sharing on the Livy Server. Session-1 and Session-2 each have their own SparkSession and SparkContext; Client 1 and Client 2 connect to Session-1, and Client 3 connects to Session-2]


New Features of Spark in HDP 2.5


Apache Spark 2.0 Technical Preview

API Improvements

– SparkSession – new entry point

– Unified DataFrame & Dataset API

– Structured Streaming/Continuous Applications

Performance Improvements

– Tungsten Phase 2: multi-stage code generation

Machine Learning

– DataFrame-based ML Pipeline API (spark.ml) is the new primary API; the RDD-based MLlib API (spark.mllib) is in maintenance mode

– Distributed R algorithms (GLM, Naïve Bayes, K‐Means, Survival Regression)

SparkSQL

– More SQL support (new ANSI SQL parser, subquery support)

First Hadoop distribution with Spark 2.0


Side‐by‐Side Apache Spark Installs within HDP 2.5

Can install Spark 1.6.2 and 2.0 on the same cluster, even on the same nodes

Spark 1.6 & Spark 2.0 are separate Ambari services

– Each service gets its own Spark History Server, Thrift Server, Spark Clients

– Each Service configuration is independent

Spark 1.6 Jobs history only goes to Spark 1.6 History Server

Spark 2.0 Jobs history only goes to Spark 2.0 History Server

How to experiment with Spark 2.0 TP
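One way to experiment with both versions side by side is to select the Spark version per process. The sketch below assumes the HDP convention of choosing the version through a `SPARK_MAJOR_VERSION` environment variable; treat that variable name as an assumption to verify against the HDP 2.5 documentation. The helper `spark_env` is hypothetical and only prepares an environment dictionary.

```python
import os


def spark_env(major_version, base_env=None):
    """Return a copy of the environment that selects a Spark major version.

    Assumes the client scripts honor SPARK_MAJOR_VERSION (an HDP convention,
    to be verified for the installed release).
    """
    if major_version not in (1, 2):
        raise ValueError("expected Spark major version 1 or 2")
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_MAJOR_VERSION"] = str(major_version)
    return env


# Prepare an environment for the Spark 2.0 technical preview.
env = spark_env(2, base_env={"PATH": "/usr/bin"})
print(env["SPARK_MAJOR_VERSION"])
```

The dictionary could then be passed as `env=` to `subprocess.run` when launching `spark-submit` or `spark-shell`, which leaves the default Spark 1.6 setup untouched for other users.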


Apache Spark + HBase Connector – GA within HDP 2.5

Brings DataFrame-based Spark analytics to HBase

See blog for usage patterns: http://bit.ly/sparkhbaseconnector

[Diagram: a Driver and four Spark Executors, each in its own YARN container and running two tasks; the executors issue Scans and BulkGets against four HBase Region Servers]


Key Features: Apache Spark Column Security with LLAP

Fine‐Grained Column Level Access Control for SparkSQL

Fully dynamic policies per user ‐ doesn’t require views

Use Standard Ranger policies and tools to control access and masking policies

Flow:

1. SparkSQL gets data locations, known as “splits”, from HiveServer2 and plans the query

2. HiveServer2 authorizes access using Ranger; per-user policies like row filtering are applied

3. Spark gets a modified query plan based on the dynamic security policy

4. Spark reads data from LLAP; filtering/masking is guaranteed by the LLAP server

[Diagram: Spark Client, HiveServer2 (Authorization), Hive Metastore (Data Locations, View Definitions), LLAP (Data Read, Filter Pushdown) and Ranger Server (Dynamic Policies), with arrows numbered 1 to 4]


Example: Per‐User Row Filtering by Region in SparkSQL

Original Query:

SELECT * from CUSTOMERS
WHERE total_spend > 10000

Query rewrites based on dynamic Ranger policies:

Spark User 1 (West Region), dynamic rewrite: AND region = “west”

Spark User 2 (East Region), dynamic rewrite: AND region = “east”

LLAP Data Access:

User ID   Region   Total Spend
1         East     5,131
2         East     27,828
3         West     55,493
4         West     7,193
5         East     18,193
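The row-filtering example above can be traced by hand. The sketch below is a plain-Python illustration of the effect of the rewrite, not how Ranger, HiveServer2 or LLAP implement it; the user keys `spark_user_1` and `spark_user_2` and the policy dictionary are hypothetical stand-ins for the per-user Ranger policies, while the rows are the ones shown on the slide.

```python
# Rows from the slide's CUSTOMERS table.
CUSTOMERS = [
    {"user_id": 1, "region": "East", "total_spend": 5131},
    {"user_id": 2, "region": "East", "total_spend": 27828},
    {"user_id": 3, "region": "West", "total_spend": 55493},
    {"user_id": 4, "region": "West", "total_spend": 7193},
    {"user_id": 5, "region": "East", "total_spend": 18193},
]

# Hypothetical stand-in for per-user row-filter policies.
ROW_FILTER_POLICIES = {"spark_user_1": "west", "spark_user_2": "east"}

ORIGINAL_QUERY = "SELECT * from CUSTOMERS WHERE total_spend > 10000"


def rewrite_query(query, user):
    """Append the user's region predicate, as in the dynamic rewrite."""
    region = ROW_FILTER_POLICIES[user]
    return f"{query} AND region = '{region}'"


def visible_rows(user, min_spend=10000):
    """Evaluate the rewritten query against the in-memory sample rows."""
    region = ROW_FILTER_POLICIES[user]
    return [row for row in CUSTOMERS
            if row["total_spend"] > min_spend
            and row["region"].lower() == region]


for user in ("spark_user_1", "spark_user_2"):
    print(rewrite_query(ORIGINAL_QUERY, user))
    print([row["user_id"] for row in visible_rows(user)])
```

Both users issue the same query, yet the West-region user sees only customer 3 and the East-region user sees customers 2 and 5, with no per-user views involved.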


Apache Zeppelin Security: Authentication + SSL

[Diagram: user Tommy Callahan, Firewall, SSL, Zeppelin, LDAP and Spark on YARN, with steps numbered 1 to 3]


Apache Zeppelin + Livy End‐to‐End Security

[Diagram: user Tommy Callahan -> Zeppelin (LDAP) with the Ispark Group Interpreter -> SPNego: Kerberos -> Livy APIs on the Livy Server -> Kerberos/RPC -> Spark on YARN; the job runs as Tommy Callahan]


Apache Spark & Zeppelin Timeline from Hortonworks

Spark GA releases

– HDP 2.2.4: Spark 1.2.1 GA

– HDP 2.3.0: Spark 1.3.1 GA

– HDP 2.3.2: Spark 1.4.1 GA

– HDP 2.3.4: Spark 1.5.2* GA

– HDP 2.4.0: Spark 1.6 GA

– HDP 2.4.2: Spark 1.6.1 GA

– HDP 2.5 (Aug 2016): Spark 1.6.2 (GA) + Spark 2.0 (TP), Zeppelin GA

Spark technical previews (TP)

– Spark 1.3.1 TP: May 2015

– Spark 1.4.1 TP: Aug 2015

– Spark 1.5.1 TP: Nov 2015

– Spark 1.6 TP: Jan 2015

Zeppelin

– Hortonworks first Zeppelin contribution: Mar 2015

– Zeppelin TP #1: Oct 2015

– Zeppelin TP #2: Mar 2016

– Zeppelin Final TP: Apr 2016

– Zeppelin TLP: May 2016

– Zeppelin GA: HDP 2.5, Aug 2016

Additional timeline markers: Dec 2015, Mar 2016


Demo
