
Page 1: Thing you didn't know you could do in Spark

Things You Didn't Know You Could Do in Spark

Version 0.1 | © Snappydata Inc 2017

Barzan Mozafari: Strategic Advisor @ Snappydata, Assistant Prof. @ Univ. of Michigan

Jags Ramnarayan: CTO | Co-founder @ Snappydata

Sudhir Menon: COO | Co-founder @ Snappydata

Page 2: Thing you didn't know you could do in Spark

© SnappyData Inc. 2017

Current Market Trends

Analytic workloads are moving to the cloud
- Scale-out infrastructure -> increasingly expensive
- Excessive data shuffling -> longer response times

Real-time operational analytics are on the rise
- Mixed workloads combine streams, operational data, and historical data
- Real-time matters
- Concurrency is increasing: shared infrastructure (10s of users and apps, 100s of queries)
- Exploratory analytics, BI and visualization tools, concurrent queries

Data volumes are growing rapidly
- Industrial IoT (Internet of Things)
- Data growing faster than Moore's law

Page 3: Thing you didn't know you could do in Spark

Our Observations

Achieving interactive analytics is hard
- Analysts want response times < 2 seconds
- Hard with concurrent queries in the cluster
- Hard on an analyst's laptop
- Hard on large datasets
- Hard when there are network shuffles

After a 99.9%-accurate early result, the final answer is often no longer needed
- 200x query speedup
- 200x reduction in cluster load, network shuffles, and I/O waits

Lambda architecture is extremely complex (and slow!)
- Dealing with multiple data formats, disparate product APIs
- Wasted resources

Querying multiple remote data sources is slow
- Curating data from multiple sources is slow

© SnappyData Inc. 2017

Page 4: Thing you didn't know you could do in Spark

Agenda

• Mixed workloads: Structured Streaming in Spark 2.0, State Management in Spark 2.0

• Interactive Analytics in Spark 2.0

• The SnappyData Solution

• Q&A

Page 5: Thing you didn't know you could do in Spark

Stream Processing

Mixed workloads are way too common

Transactions (point lookups, small updates)

Interactive Analytics

Analytics on mutating data

Maintaining state or counters while processing streams

Correlating and joining streams w/ large histories

Page 6: Thing you didn't know you could do in Spark

Mixed Workload in IoT Analytics

- Map sensors to tags
- Monitor temperature
- Send alerts

Correlate current temperature trend with history….

Anomaly detection – score against models

Interact using dynamic queries

Page 7: Thing you didn't know you could do in Spark

How Are Mixed Workloads Supported Today?

Lambda Architecture

Page 8: Thing you didn't know you could do in Spark

Scalable Store
HDFS, MPP DB

Data-in-motion

Analytics

Application

Streams

Alerts

Enrich

Reference DB

Lambda Architecture is Complex

Enrich – requires reference data join

State – manage windows with mutating state
Analytics – joins to history, trend patterns
Model maintenance

Models

Transaction writes

Interactive OLAP queries

Updates

Interactive queries

Page 9: Thing you didn't know you could do in Spark

Scalable Store
Updates

HDFS, MPP DB

Data-in-motion

Analytics

Application

Streams

Alerts

Enrich

Reference DB

Lambda Architecture is Complex

Storm, Spark streaming, Samza…

NoSQL – Cassandra, Redis, HBase …

Models

Enterprise DB – Oracle, Postgres..

OLAP SQL DB

Interactive queries

Page 10: Thing you didn't know you could do in Spark

Lambda Architecture is Complex

• Complexity: learn and master multiple products, data models, disparate APIs, configs

• Slower

• Wasted resources

Page 11: Thing you didn't know you could do in Spark

Can we simplify & optimize?

How about a single clustered DB that can manage stream state, transactional data and run OLAP queries?

RDB

Tables
Txn

Framework for stream processing, etc

Stream processing

HDFS

MPP DB

Scalable writes, point reads, OLAP queries

Apps

Page 12: Thing you didn't know you could do in Spark

One idea: Start from Apache Spark

Spark Core

Spark Streaming

Spark ML

Spark SQL

GraphX

A stream program can leverage ML, SQL, Graph

Why?
• Unified programming across streaming, interactive, batch
• Multiple data formats, data sources
• Rich library for Streaming, SQL, Machine Learning, …

But: Spark is a computational engine, not a store or a database

Page 13: Thing you didn't know you could do in Spark

Structured Streaming in Spark 2.0

Page 14: Thing you didn't know you could do in Spark

Source: Databricks presentation

Page 15: Thing you didn't know you could do in Spark

Source: Databricks presentation

read...stream() creates a streaming DataFrame and does not start any computation

write...startStream() defines where & how to output the data and starts the processing
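For concreteness, here is a minimal Scala sketch of that lazy-definition, explicit-start pattern. It uses the readStream/writeStream names from the released Spark 2.x API and a socket source purely for illustration; the preview builds on this slide spelled the same calls read...stream() and write...startStream().

import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    // Creates a streaming DataFrame; nothing runs yet
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Still lazy: just defines the aggregation
    val counts = lines.groupBy("value").count()

    // Starting the sink kicks off the incremental execution
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}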

Page 16: Thing you didn't know you could do in Spark

Source: Databricks presentation

Page 17: Thing you didn't know you could do in Spark

New State Management API in Spark 2.0 – KV Store

• Preserves state for streaming aggregations across multiple batches

• Fetch, store KV pairs

• Transactional

• Supports versioning (batch)

• Built-in store that persists to HDFS
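As a rough illustration of what that state store buys you, the hedged sketch below runs a streaming aggregation whose per-publisher counts survive across micro-batches because Spark checkpoints the state (e.g. to HDFS). The schema, paths, and column names are hypothetical, not from the deck.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("state-sketch").getOrCreate()

// Hypothetical event schema and input directory
val eventSchema = new StructType()
  .add("publisher", StringType)
  .add("bid", DoubleType)

val events = spark.readStream.schema(eventSchema).json("/data/adevents")

// Per-key aggregation state lives in the executors' state store
val counts = events.groupBy("publisher").count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  // State and source offsets are persisted here, so the aggregation
  // resumes with its accumulated counts after a restart
  .option("checkpointLocation", "hdfs:///checkpoints/publisher-counts")
  .start()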

Page 18: Thing you didn't know you could do in Spark

State Management in Spark 2.0

Page 19: Thing you didn't know you could do in Spark

Deeper Look – Ad Impression Analytics
Ad Network Architecture – Analyze log impressions in real time

Page 20: Thing you didn't know you could do in Spark

Ad Impression Analytics

Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/

Page 21: Thing you didn't know you could do in Spark

Bottlenecks in the write path
- Stream micro-batches in parallel from Kafka to each Spark executor
- Filter and collect events for 1 minute; reduce to 1 event per Publisher, Geo every minute
- Execute GROUP BY … expensive Spark shuffle …
- Shuffle again in the DB cluster … data format changes … serialization costs

val input: DataFrame = sqlContext.read
  .options(kafkaOptions).format("kafkaSource")
  .stream()

val result: DataFrame = input
  .where("geo != 'UnknownGeo'")
  .groupBy(window("event-time", "1min"), "publisher", "geo")
  .agg(avg("bid"), count("geo"), countDistinct("cookie"))

val query = result.write.format("org.apache.spark.sql.cassandra")
  .outputMode("append")
  .startStream("dest-path")

Page 22: Thing you didn't know you could do in Spark

Bottlenecks in the Write Path

• Aggregations – GroupBy, MapReduce
• Joins with other streams, reference data

Shuffle costs (copying, serialization)
Excessive copying in Java-based scale-out stores

Page 23: Thing you didn't know you could do in Spark

Avoid the bottleneck – localize processing with state

Spark Executor

Spark Executor

Kafka queue

KV

KV

Publisher: 1,3,5 | Geo: 2,4,6

Avoid shuffle by localizing writes

Input queue, stream, and KV store all share the same partitioning strategy

Publisher: 2,4,6 | Geo: 1,3,5
Publisher: 1,3,5 | Geo: 2,4,6
Publisher: 2,4,6 | Geo: 1,3,5

Spark 2.0 StateStore?

Page 24: Thing you didn't know you could do in Spark

Impedance mismatch with KV stores

We want to run interactive “scan”-intensive queries:

- Find total uniques for Ads grouped on geography

- Impression trends for Advertisers (group-by query)

Two alternatives: row-stores vs. column stores

Page 25: Thing you didn't know you could do in Spark

Goal – Localize processing
Fast key-based lookup

But too slow to run aggregations and scan-based interactive queries

Fast scans, aggregations

Updates and random writes are very difficult

Consume too much memory

Page 26: Thing you didn't know you could do in Spark

Why columnar storage in-memory?

Page 27: Thing you didn't know you could do in Spark

Interactive Analyticsin Spark 2.0

Page 28: Thing you didn't know you could do in Spark

But, are in-memory column-stores fast enough?

SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)

Berkeley AMPLab Big Data Benchmark -- AWS m2.4xlarge; total of 342 GB
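The same benchmark query can also be expressed with the DataFrame API; a small sketch, with X fixed to 8 purely for illustration and uservisits assumed to be an already-loaded DataFrame:

import org.apache.spark.sql.functions._

val x = 8  // illustrative prefix length for SUBSTR(sourceIP, 1, X)
val revenueByPrefix = uservisits
  .groupBy(substring(col("sourceIP"), 1, x).as("ipPrefix"))
  .agg(sum("adRevenue").as("totalAdRevenue"))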

Page 29: Thing you didn't know you could do in Spark

Why are distributed, in-memory, columnar formats not enough for interactive speeds?

1. Distributed shuffles can take minutes

2. Software overheads, e.g., (de)serialization

3. Concurrent queries competing for the same resources

So, how can we ensure interactive-speed analytics?

Page 30: Thing you didn't know you could do in Spark

• Most apps are happy to trade off 1% accuracy for a 200x speedup!

• Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data!

• Often can make perfectly accurate decisions without having perfectly accurate answers!

• A/B testing, visualization, ...
• The data itself is usually noisy

• Processing the entire dataset doesn't necessarily mean exact answers!

• Inference is probabilistic anyway

Use statistical techniques to shrink data!
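Plain Spark already hints at the idea: a uniform sample of a DataFrame answers an aggregate approximately at a fraction of the cost. A hedged one-liner, with table and column names illustrative:

import org.apache.spark.sql.functions.avg

// ~0.1% uniform sample; the average over the sample approximates the exact answer
val approxAvg = uservisits
  .sample(withReplacement = false, fraction = 0.001)
  .agg(avg("adRevenue"))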

Page 31: Thing you didn't know you could do in Spark

Our Solution: SnappyData

Page 32: Thing you didn't know you could do in Spark

SnappyData: A New Approach

A Single Unified Cluster: OLTP + OLAP + Streaming

for real-time analytics

Batch design, high throughput

Real-time design: low latency, HA, concurrency

Vision: Drastically reduce the cost and complexity in modern big data

Rapidly maturing
Matured over 13 years

Page 33: Thing you didn't know you could do in Spark

SnappyData: A New Approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics

Batch design, high throughput

Real-time operational Analytics – TBs in memory

RDB

Rows

Txn
Columnar
API

Stream processing

ODBC, JDBC, REST

Spark - Scala, Java, Python, R

HDFS
AQP

First commercial product with Approximate Query Processing (AQP)

MPP DB

Index

Page 34: Thing you didn't know you could do in Spark

SnappyData Data Model

Rows

Columnar

Stream processing

ODBC, JDBC
Probabilistic data

Snappy Data Server – Spark Executor + Store

Index

Spark Program SQL

Parser

RDD Table

Hybrid Storage

RDD

Cluster Mgr

Scheduler

High Latency/OLAP

Low Latency

Add/remove Servers

Catalog

Page 35: Thing you didn't know you could do in Spark

Features

- Deeply integrated database for Spark
- 100% compatible with Spark
- Extensions for transactions (updates), SQL stream processing
- Extensions for high availability
- Approximate query processing for interactive OLAP
- OLTP + OLAP store
- Replicated and partitioned tables
- Tables can be row- or column-oriented (in-memory & on-disk)
- SQL extensions for compatibility with the SQL standard
  - create table, view, indexes, constraints, etc.

Page 36: Thing you didn't know you could do in Spark

Time Series Analytics (kparc.com)
http://kparc.com/q4/readme.txt
Trade and Quote Analytics

Page 37: Thing you didn't know you could do in Spark

Time Series Analytics – kparc.com
http://kparc.com/q4/readme.txt

- Find the last quoted price for Top 100 instruments for a given date
  "select quote.sym, avg(bid) from quote join S on (quote.sym = S.sym) where date='$d' group by quote.sym"
  // Replaced LAST with AVG as MemSQL doesn't support LAST

- Max price for Top 100 instruments across exchanges for a given date
  "select trade.sym, ex, max(price) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, ex"

- Find the average trade size in each hour for Top 100 instruments for a given date

"select trade.sym, hour(time), avg(size) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, hour(time)"

Page 38: Thing you didn't know you could do in Spark

Results for a small day (40 million) and a big day (640 million)

[Charts: query response time in milliseconds (smaller is better) for Q1, Q2, Q3, comparing SnappyData, MemSQL, Impala, and Greenplum]

Page 39: Thing you didn't know you could do in Spark

TPC-H: 10X-20X faster than Spark 2.0

Page 40: Thing you didn't know you could do in Spark

Synopses Data Engine

Page 41: Thing you didn't know you could do in Spark

Use Case Patterns
1. Analytics cache
   - Caching for analytics over any "big data" store (esp. MPP)
   - Federate queries between samples and the backend
2. Stream analytics for Spark
   - Process streams, transform, real-time scoring, store, query
3. In-memory transactional store
   - Highly concurrent apps, SQL cache, OLTP + OLAP

Page 42: Thing you didn't know you could do in Spark

Interactive-Speed Analytic Queries – Exact or Approximate

Select avg(Bid), Advertiser from T1 group by Advertiser
Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1
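Issued from a Spark SQL context against SnappyData, the two statements above differ only in the error clause; a hedged sketch, with the context name illustrative:

val exact  = context.sql("select avg(Bid), Advertiser from T1 group by Advertiser")
val approx = context.sql("select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1")
approx.show()   // answered from a sample within the requested 10% error bound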

Speed/accuracy tradeoff
[Chart: error (%) vs. execution time. Executing on the entire dataset takes ~30 minutes; interactive queries return in ~2 seconds. Example: 100 secs exact vs. 2 secs with 1% error.]

Page 43: Thing you didn't know you could do in Spark

Approximate Query Processing Features
• Support for uniform sampling
• Support for stratified sampling
  - Solutions exist for stored data (BlinkDB)
  - SnappyData works for infinite streams of data too
• Support for exponentially decaying windows over time
• Support for synopses
  - Top-K queries, heavy hitters, outliers, ...
• [future] Support for joins
• Workload mining (http://CliffGuard.org)

Page 44: Thing you didn't know you could do in Spark

Uniform (Random) Sampling

Original Table
ID | Advertiser | Geo | Bid
1  | adv10 | NY | 0.0001
2  | adv10 | VT | 0.0005
3  | adv20 | NY | 0.0002
4  | adv10 | NY | 0.0003
5  | adv20 | NY | 0.0001
6  | adv30 | VT | 0.0001

Uniform Sample
ID | Advertiser | Geo | Bid | Sampling Rate
3  | adv20 | NY | 0.0002 | 1/3
5  | adv20 | NY | 0.0001 | 1/3

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Page 45: Thing you didn't know you could do in Spark

Uniform (Random) Sampling

Original Table
ID | Advertiser | Geo | Bid
1  | adv10 | NY | 0.0001
2  | adv10 | VT | 0.0005
3  | adv20 | NY | 0.0002
4  | adv10 | NY | 0.0003
5  | adv20 | NY | 0.0001
6  | adv30 | VT | 0.0001

Larger Uniform Sample
ID | Advertiser | Geo | Bid | Sampling Rate
3  | adv20 | NY | 0.0002 | 2/3
5  | adv20 | NY | 0.0001 | 2/3
1  | adv10 | NY | 0.0001 | 2/3
2  | adv10 | VT | 0.0005 | 2/3

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Page 46: Thing you didn't know you could do in Spark

Stratified Sampling

Original Table
ID | Advertiser | Geo | Bid
1  | adv10 | NY | 0.0001
2  | adv10 | VT | 0.0005
3  | adv20 | NY | 0.0002
4  | adv10 | NY | 0.0003
5  | adv20 | NY | 0.0001
6  | adv30 | VT | 0.0001

Stratified Sample on Geo
ID | Advertiser | Geo | Bid | Sampling Rate
3  | adv20 | NY | 0.0002 | 1/4
2  | adv10 | VT | 0.0005 | 1/2

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'
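The estimates behind these examples come from weighting each sampled row by the inverse of its sampling rate. A minimal sketch of that arithmetic, illustrative only and not SnappyData's implementation:

case class SampledRow(bid: Double, samplingRate: Double)

// Horvitz-Thompson style estimate of SUM(bid): each sampled row stands in
// for 1/samplingRate rows of the original table
def estimateSum(sample: Seq[SampledRow]): Double =
  sample.map(r => r.bid / r.samplingRate).sum

// Estimated AVG(bid) = estimated sum / estimated row count
def estimateAvg(sample: Seq[SampledRow]): Double = {
  val estimatedCount = sample.map(r => 1.0 / r.samplingRate).sum
  estimateSum(sample) / estimatedCount
}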

Page 47: Thing you didn't know you could do in Spark

Sketching techniques
● Sampling is not effective for outlier detection
  ○ MAX/MIN etc.
● Other probabilistic structures like CMS, heavy hitters, etc.
● SnappyData implements Hokusai
  ○ Capturing item frequencies in time series
● Design permits TopK queries over arbitrary time intervals (Top 100 popular URLs)

SELECT pageURL, count(*) frequency FROM Table
WHERE ….
GROUP BY ….
ORDER BY frequency DESC
LIMIT 100
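For reference, the hedged sketch below shows a minimal count-min sketch, one of the probabilistic structures mentioned above; it is illustrative Scala, not SnappyData's code:

import scala.util.hashing.MurmurHash3

class CountMinSketch(depth: Int, width: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, derived by seeding MurmurHash3 with the row index
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row) % width
    if (h < 0) h + width else h
  }

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Only overestimates: the minimum across rows is the tightest upper bound on the true count
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}

// Usage: track URL frequencies approximately
// val cms = new CountMinSketch(depth = 5, width = 1 << 16)
// urls.foreach(url => cms.add(url))
// cms.estimate("http://example.com/page")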

Page 48: Thing you didn't know you could do in Spark

Airline Analytics Demo

Zeppelin Spark Interpreter (Driver)

Zeppelin Server

Spark Executor JVMs (each with a row cache and compressed columnar storage)

Page 49: Thing you didn't know you could do in Spark

Free cloud trial service – Project iSight
● Free AWS/Azure credits for anyone to try out SnappyData
● One-click launch of a private SnappyData cluster with Zeppelin
● Multiple notebooks with comprehensive descriptions of concepts and value
● Bring your own data sets to try 'immediate visualization' using Synopses data
● http://www.snappydata.io/cloudbuilder (URL for the AWS service)

Send email to [email protected] to be notified. Release expected next week

Page 50: Thing you didn't know you could do in Spark

How SnappyData Extends Spark

Page 51: Thing you didn't know you could do in Spark

Snappy Spark Cluster Deployment Topologies
• Snappy store and Spark executor share the JVM memory
• Reference-based access – zero copy
• SnappyStore is isolated but uses the same column format as Spark for high throughput

Unified Cluster

Split Cluster

Page 52: Thing you didn't know you could do in Spark

Simple API – Spark Compatible

● Access tables as DataFrames; the catalog is automatically recovered

● RDD[T]/DataFrames can be stored in SnappyData tables

● Access from remote SQL clients

● Additional APIs for updates, inserts, deletes

// Save a DataFrame using the Snappy or Spark context …
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… Now use any of the DataFrame APIs …>

Page 53: Thing you didn't know you could do in Spark

Extends Spark

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name
  ( <column definition> )
USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',   // default: none
  PARTITION_BY 'PRIMARY KEY | column name',   // replicated table by default
  REDUNDANCY '1',   // manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",   // empty string maps to the default disk store
  OFFHEAP "true | false",
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  …
)
[AS select_statement];
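As a concrete, hypothetical instantiation of that template, a column table partitioned by publisher could be created from a Snappy/Spark SQL context roughly like this; the table and column names are illustrative:

// Hedged example: instantiating the DDL template above (names illustrative)
context.sql("""
  CREATE TABLE adImpressions (
    time BIGINT, publisher VARCHAR(16), geo VARCHAR(8), bid DOUBLE
  ) USING COLUMN OPTIONS (
    PARTITION_BY 'publisher',
    REDUNDANCY '1'
  )
""")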

Page 54: Thing you didn't know you could do in Spark

Simple to Ingest Streams using SQL

Consume from stream

Transform raw data

Continuous Analytics

Ingest into in-memory Store

Overflow table to HDFS

CREATE STREAM TABLE AdImpressionLog (<columns>)
USING directkafka_stream OPTIONS (
  <socket endpoints>,
  "topics 'adnetwork-topic'",
  "rowConverter 'AdImpressionLogAvroDecoder'"
)

// Register a continuous query; write each result DataFrame to the column table
streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques " +
  "from AdImpressionLog window (duration '2' seconds, slide '2' seconds) " +
  "where geo != 'unknown' group by publisher, geo")
  .foreachDataFrame(df => {
    df.write.format("column").mode(SaveMode.Append)
      .saveAsTable("adImpressions")
  })

Page 55: Thing you didn't know you could do in Spark

AdImpression Demo
Spark, SQL code walkthrough, interactive SQL

Page 56: Thing you didn't know you could do in Spark

AdImpression Ingest Performance

Loading from Parquet files using 4 cores
Stream ingest (Kafka + Spark Streaming to store, using 1 core)

Page 57: Thing you didn't know you could do in Spark

Unified Cluster Architecture

Page 58: Thing you didn't know you could do in Spark

How do we extend Spark for Real Time?
• Spark executors are long-running; driver failure doesn't shut down executors

• Driver HA – Drivers run “Managed” with standby secondary

• Data HA – consensus-based clustering integrated for eager replication

Page 59: Thing you didn't know you could do in Spark

How do we extend Spark for Real Time?
• Bypass the scheduler for low-latency SQL

• Deep integration with Spark Catalyst (SQL) – colocation optimizations, index use, etc.

• Full SQL support – persistent catalog, transactions, DML

Page 60: Thing you didn't know you could do in Spark

Unified OLAP/OLTP streaming w/ Spark

● Far fewer resources: a TB problem becomes a GB problem
  ○ CPU contention drops
● Far less complex
  ○ single cluster for stream ingestion, continuous queries, interactive queries and machine learning
● Much faster
  ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive

Page 61: Thing you didn't know you could do in Spark

www.snappydata.io

SnappyData is Open Source
● Ad Analytics example/benchmark: https://github.com/SnappyDataInc/snappy-poc
● https://github.com/SnappyDataInc/snappydata
● Learn more: www.snappydata.io/blog
● Connect:
  ○ twitter: www.twitter.com/snappydata
  ○ facebook: www.facebook.com/snappydata
  ○ slack: http://snappydata-slackin.herokuapp.com