Things You Didn't Know You Could Do in Spark
Version 0.1 | © SnappyData Inc. 2017

Barzan Mozafari | Strategic Advisor @ SnappyData, Assistant Prof. @ Univ. of Michigan
Jags Ramnarayan | CTO, Co-founder @ SnappyData
Sudhir Menon | COO, Co-founder @ SnappyData
Current Market Trends
- Analytic workloads are moving to the cloud: scale-out infrastructure is increasingly expensive, and excessive data shuffling means longer response times
- Real-time operational analytics are on the rise: mixed workloads combine streams, operational data, and historical data; real-time matters
- Concurrency is increasing: shared infrastructure (10's of users and apps, 100's of queries), exploratory analytics, BI and visualization tools, concurrent queries
- Data volumes are growing rapidly: industrial IoT (Internet of Things); data is growing faster than Moore's law
Our Observations
- Achieving interactive analytics is hard: analysts want response times under 2 seconds, which is hard with concurrent queries in the cluster, hard on an analyst's laptop, hard on large datasets, and hard when there are network shuffles
- Once a 99.9%-accurate early result is available, the exact final answer is often no longer needed: a 200x query speedup, and a 200x reduction in cluster load, network shuffles, and I/O waits
- The Lambda architecture is extremely complex (and slow!): multiple data formats, disparate product APIs, wasted resources
- Querying multiple remote data sources is slow; so is curating data from multiple sources
Agenda
• Mixed workloads: Structured Streaming and state management in Spark 2.0
• Interactive analytics in Spark 2.0
• The SnappyData solution
• Q&A
Mixed Workloads Are Far Too Common
- Stream processing
- Transactions (point lookups, small updates)
- Interactive analytics
- Analytics on mutating data
- Maintaining state or counters while processing streams
- Correlating and joining streams with large histories
Mixed Workload in IoT Analytics
- Map sensors to tags- Monitor temperature- Send alerts
Correlate current temperature trend with history….
Anomaly detection – score against models
Interact using dynamic queries
How Are Mixed Workloads Supported Today?
Lambda Architecture
[Diagram: Streams are enriched via a Reference DB, flow through data-in-motion analytics to applications and alerts, and land in a scalable store (HDFS, MPP DB)]
The Lambda Architecture Is Complex
- Enrich: requires reference-data joins
- State: manage windows with mutating state
- Analytics: joins to history, trend patterns
- Model maintenance
- Transaction writes and updates
- Interactive OLAP queries
The Lambda Architecture Is Complex: each layer is a different product
- Stream processing: Storm, Spark Streaming, Samza, ...
- State and models: NoSQL stores such as Cassandra, Redis, HBase, ...
- Reference data: an enterprise DB such as Oracle or Postgres
- Interactive queries: an OLAP SQL DB
The Lambda Architecture Is Complex
• Complexity: learn and master multiple products' data models, disparate APIs, and configs
• Slower
• Wasted resources

Can we simplify and optimize? How about a single clustered DB that can manage stream state and transactional data, and run OLAP queries?
[Diagram: today, apps span an RDB (tables, transactions), a framework for stream processing, HDFS, and an MPP DB for scalable writes, point reads, and OLAP queries]
One Idea: Start from Apache Spark
[Diagram: Spark Core with Spark Streaming, Spark SQL, Spark ML, and GraphX; a stream program can leverage ML, SQL, and Graph]
Why?
• Unified programming across streaming, interactive, and batch
• Multiple data formats and data sources
• Rich libraries for streaming, SQL, machine learning, ...
But: Spark is a computational engine, not a store or a database.
Structured Streaming in Spark 2.0
(Source: Databricks presentation)

read...stream() creates a streaming DataFrame but does not start any computation.
write...startStream() defines where and how to output the data, and starts the processing.
New State Management API in Spark 2.0: a KV Store
• Preserves state across streaming aggregations and across multiple batches
• Fetch and store KV pairs
• Transactional
• Supports versioning (per batch)
• Built-in store that persists to HDFS
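As a concrete picture of the versioning idea, here is a toy batch-versioned KV store in plain Python. The class and method names are invented for illustration; this is not the actual Spark StateStore interface, which is a Scala API and persists its deltas to HDFS.

```python
# Toy sketch of a batch-versioned KV state store (invented names, plain Python).
class VersionedStateStore:
    def __init__(self):
        self._committed = {}   # version number -> committed snapshot
        self._latest = 0

    def begin(self):
        # Start an update batch from the latest committed version.
        self._pending = dict(self._committed.get(self._latest, {}))
        return self._pending

    def commit(self):
        # Atomically publish the pending batch as a new version.
        self._latest += 1
        self._committed[self._latest] = self._pending
        return self._latest

    def get(self, key, version=None):
        v = self._latest if version is None else version
        return self._committed.get(v, {}).get(key)

store = VersionedStateStore()

batch = store.begin()
batch["pub1:NY"] = 3          # counter maintained while processing batch 1
v1 = store.commit()

batch = store.begin()
batch["pub1:NY"] += 2         # batch 2 builds on batch 1's committed state
v2 = store.commit()

print(store.get("pub1:NY"))        # latest value: 5
print(store.get("pub1:NY", v1))    # value as of version 1: 3
```

Keeping each batch's snapshot addressable by version is what lets a failed micro-batch be retried against exactly the state it started from.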
Deeper Look: Ad Impression Analytics
Ad network architecture: analyze log impressions in real time.
Ref: https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
Bottlenecks in the Write Path
- Stream micro-batches in parallel from Kafka to each Spark executor
- Filter and collect events for 1 minute; reduce to 1 event per publisher and geo every minute
- Execute GROUP BY: an expensive Spark shuffle
- Shuffle again in the DB cluster: data format changes and serialization costs
val input: DataFrame = sqlContext.read
  .options(kafkaOptions).format("kafkaSource")
  .stream()

val result: DataFrame = input
  .where("geo != 'UnknownGeo'")
  .groupBy(window("event-time", "1min"), "Publisher", "geo")
  .agg(avg("bid"), count("geo"), countDistinct("cookie"))

val query = result.write.format("org.apache.spark.sql.cassandra")
  .outputMode("append")
  .startStream("dest-path")
Bottlenecks in the Write Path
• Aggregations: GroupBy, MapReduce
• Joins with other streams and with reference data
• Shuffle costs (copying, serialization)
• Excessive copying in Java-based scale-out stores
Avoid the Bottleneck: Localize Processing with State
[Diagram: a Kafka queue feeds two Spark executors, each co-located with a KV store. The input queue, the stream, and the KV store all share the same partitioning strategy (e.g., Publisher 1,3,5 / Geo 2,4,6 on one executor, Publisher 2,4,6 / Geo 1,3,5 on the other), avoiding shuffles by localizing writes]
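The shared-partitioning trick can be sketched in plain Python (a toy stand-in, not the SnappyData or Spark API): as long as the queue, the stream processor, and the store route a key through the same hash function, every update for that key is purely local to one node.

```python
# Sketch of co-partitioning: when the input queue, the stream processor, and
# the KV store all use the same hash partitioner, records for a given
# publisher always land on the same node, so the per-publisher aggregation
# needs no network shuffle.

NUM_PARTITIONS = 2

def partition_for(key: str) -> int:
    # Any deterministic hash works, as long as every layer uses the same one.
    return sum(ord(c) for c in key) % NUM_PARTITIONS

# Local per-partition state (stands in for the co-located KV store).
state = [dict() for _ in range(NUM_PARTITIONS)]

def ingest(publisher: str, bid: float) -> None:
    p = partition_for(publisher)            # same partition as the input queue
    total, count = state[p].get(publisher, (0.0, 0))
    state[p][publisher] = (total + bid, count + 1)   # purely local update

for pub, bid in [("pub1", 0.10), ("pub2", 0.30), ("pub1", 0.20)]:
    ingest(pub, bid)

# Each publisher's running aggregate lives in exactly one partition.
p = partition_for("pub1")
total, count = state[p]["pub1"]
print(round(total / count, 2))   # average bid for pub1
```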
Could the Spark 2.0 StateStore play this role?
Impedance Mismatch with KV Stores
We want to run interactive, scan-intensive queries:
- Find total uniques for ads grouped by geography
- Impression trends for advertisers (GROUP BY queries)
Two alternatives, row stores vs. column stores, with the goal of localizing processing:
- Row stores: fast key-based lookups, but too slow for aggregations and scan-based interactive queries
- Column stores: fast scans and aggregations, but updates and random writes are very difficult, and they consume too much memory
Why columnar storage in-memory?
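A toy Python illustration of the tradeoff (made-up rows, no real store): the same three-column table laid out row-wise and column-wise. An aggregate over one column only has to touch that column's dense array in the columnar layout, which is also what makes it compress so well.

```python
# The same table stored row-wise vs. column-wise (illustrative toy data).
rows = [("adv10", "NY", 0.0001), ("adv10", "VT", 0.0005), ("adv20", "NY", 0.0002)]

# Columnar layout: one contiguous array per column.
advertiser = [r[0] for r in rows]
geo        = [r[1] for r in rows]
bid        = [r[2] for r in rows]

# Row store: the scan drags every field of every row through the CPU cache.
total_row = sum(r[2] for r in rows)

# Column store: the scan reads a single dense array.
total_col = sum(bid)

assert total_row == total_col
print(total_col)
```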
Interactive Analyticsin Spark 2.0
But, are in-memory column-stores fast enough?
SELECT SUBSTR(sourceIP, 1, X),
SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
Berkeley AMPLab Big Data Benchmark -- AWS m2.4xlarge ; total of 342 GB
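For a feel of the query's shape, here is a runnable miniature using Python's sqlite3 with a handful of made-up rows (the real benchmark scans 342 GB; X is fixed to 7 here):

```python
# Miniature of the AMPLab benchmark query: group ad revenue by IP prefix.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uservisits (sourceIP TEXT, adRevenue REAL)")
conn.executemany(
    "INSERT INTO uservisits VALUES (?, ?)",
    [("10.0.0.1", 1.0), ("10.0.0.2", 2.0), ("20.1.1.1", 5.0)],
)

X = 7
rows = conn.execute(
    "SELECT SUBSTR(sourceIP, 1, ?), SUM(adRevenue) "
    "FROM uservisits GROUP BY SUBSTR(sourceIP, 1, ?)",
    (X, X),
).fetchall()
print(rows)   # one (prefix, revenue) pair per IP prefix
```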
Why are distributed, in-memory, columnar formats still not enough for interactive speeds?
1. Distributed shuffles can take minutes
2. Software overheads, e.g., (de)serialization
3. Concurrent queries competing for the same resources
So, how can we ensure interactive-speed analytics?
• Most apps are happy to trade off 1% accuracy for a 200x speedup!
• You can usually get a 99.9%-accurate answer by looking at only a tiny fraction of the data!
• You can often make perfectly accurate decisions without perfectly accurate answers!
  - A/B testing, visualization, ...
  - The data itself is usually noisy
  - Processing the entire dataset doesn't necessarily mean exact answers
  - Inference is probabilistic anyway
Use statistical techniques to shrink the data!
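The claim is easy to demonstrate with plain Python on synthetic data: a 1,000-row uniform sample of a 100,000-row table estimates the mean to within a fraction of a percent. (A real AQP engine such as SnappyData's also returns a confidence interval, which this sketch omits.)

```python
import random

# Synthetic "table": 100,000 numeric values.
random.seed(42)
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]

exact = sum(population) / len(population)

# Look at only 1% of the rows.
sample = random.sample(population, 1_000)
estimate = sum(sample) / len(sample)

rel_error = abs(estimate - exact) / exact
print(f"exact={exact:.2f} estimate={estimate:.2f} rel_error={rel_error:.4%}")
# The standard error of a 1,000-row sample here is about 15/sqrt(1000) ~ 0.5,
# i.e., typically well under 1% relative error, for 1% of the scan cost.
```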
Our Solution: SnappyData

SnappyData: A New Approach
A single unified cluster (OLTP + OLAP + streaming) for real-time analytics.
[Diagram: marries Spark's batch design and high throughput (rapidly maturing) with a real-time design offering low latency, HA, and concurrency (matured over 13 years)]
Vision: drastically reduce the cost and complexity of modern big data.
SnappyData: A New Approach
A single unified HA cluster (OLTP + OLAP + streams) for real-time analytics:
- Batch design, high throughput
- Real-time operational analytics: TBs in memory
[Diagram: rows, columnar tables, transactions, indexes, stream processing, HDFS, MPP DB, and AQP behind one API: ODBC, JDBC, REST, and Spark (Scala, Java, Python, R)]
First commercial product with Approximate Query Processing (AQP).
SnappyData Data Model
[Diagram: a Snappy Data Server is a Spark executor plus a store. Hybrid storage holds rows, columnar data, probabilistic data, indexes, and stream state, and RDDs map to tables. Spark programs and SQL pass through the parser, catalog, cluster manager, and scheduler, with separate high-latency (OLAP) and low-latency paths; servers can be added and removed, and ODBC/JDBC clients connect directly]
Features
- Deeply integrated database for Spark; 100% compatible with Spark
- Extensions for transactions (updates) and SQL stream processing
- Extensions for high availability
- Approximate query processing for interactive OLAP
- OLTP + OLAP store: replicated and partitioned tables
- Tables can be row- or column-oriented (in-memory and on-disk)
- SQL extensions for compatibility with the SQL standard: create table, view, indexes, constraints, etc.
Time Series Analytics (kparc.com): Trade and Quote Analytics
http://kparc.com/q4/readme.txt

- Find the last quoted price for the top 100 instruments for a given date:
  "select quote.sym, avg(bid) from quote join S on (quote.sym = S.sym) where date='$d' group by quote.sym"
  // replaced LAST with AVG, as MemSQL doesn't support LAST
- Max price for the top 100 instruments across exchanges for a given date:
  "select trade.sym, ex, max(price) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, ex"
- Find the average trade size in each hour for the top 100 instruments for a given date:
  "select trade.sym, hour(time), avg(size) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, hour(time)"

Results for a small day (40 million rows) and a big day (640 million rows).
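The third query can be run in miniature with Python's sqlite3 (made-up rows; 'S' stands in for the top-100 symbol list, and sqlite's substr(time, 1, 2) replaces hour(time), which sqlite lacks):

```python
# Miniature of the hourly average-trade-size query over toy trade data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S (sym TEXT);
CREATE TABLE trade (sym TEXT, time TEXT, size INTEGER, price REAL, date TEXT);
INSERT INTO S VALUES ('AAPL');
INSERT INTO trade VALUES
  ('AAPL', '09:30:01', 100, 150.0, '2017-01-03'),
  ('AAPL', '09:45:10', 300, 151.0, '2017-01-03'),
  ('AAPL', '10:05:00', 200, 151.5, '2017-01-03'),
  ('MSFT', '09:31:00', 500,  62.0, '2017-01-03');  -- not in S, filtered by the join
""")

d = "2017-01-03"
rows = conn.execute(
    """select trade.sym, substr(time, 1, 2) as hr, avg(size)
       from trade join S on (trade.sym = S.sym)
       where date = ? group by trade.sym, hr""",
    (d,),
).fetchall()
print(rows)   # average trade size per (symbol, hour)
```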
[Charts: query response times in milliseconds for Q1-Q3 (smaller is better), comparing SnappyData, MemSQL, Impala, and Greenplum; one chart scales to 3,500 ms, the other to 60,000 ms]
TPC-H: 10X-20X faster than Spark 2.0
Synopses Data Engine
Use Case Patterns
1. Analytics cache
   - Caching for analytics over any "big data" store (esp. MPP)
   - Federate queries between samples and the backend
2. Stream analytics for Spark
   - Process streams, transform, real-time scoring, store, query
3. In-memory transactional store
   - Highly concurrent apps, SQL cache, OLTP + OLAP
Interactive-Speed Analytic Queries: Exact or Approximate

Select avg(Bid), Advertiser from T1 group by Advertiser
Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1
Speed/Accuracy Tradeoff
[Chart: error (%) vs. execution time. Running on the entire dataset takes minutes (up to 30 mins; 100 secs in the example shown), while interactive queries return in about 2 secs with roughly 1% error]
Approximate Query Processing Features
• Support for uniform sampling
• Support for stratified sampling
  - Solutions exist for stored data (BlinkDB); SnappyData works for infinite streams of data too
• Support for exponentially decaying windows over time
• Support for synopses: Top-K queries, heavy hitters, outliers, ...
• [future] Support for joins
• Workload mining (http://CliffGuard.org)
Uniform (Random) Sampling

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Original table:
ID | Advertiser | Geo | Bid
1  | adv10 | NY | 0.0001
2  | adv10 | VT | 0.0005
3  | adv20 | NY | 0.0002
4  | adv10 | NY | 0.0003
5  | adv20 | NY | 0.0001
6  | adv30 | VT | 0.0001

Uniform sample (rate 1/3):
ID | Advertiser | Geo | Bid | Sampling Rate
3  | adv20 | NY | 0.0002 | 1/3
5  | adv20 | NY | 0.0001 | 1/3
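In plain Python (toy helper functions, not SnappyData code), the danger of a small uniform sample is easy to see: the 1/3 sample above contains no VT rows at all, so the VT query cannot be answered from it.

```python
# The slide's 6-row table and its 1/3 uniform sample, as plain Python.
table = [
    (1, "adv10", "NY", 0.0001),
    (2, "adv10", "VT", 0.0005),
    (3, "adv20", "NY", 0.0002),
    (4, "adv10", "NY", 0.0003),
    (5, "adv20", "NY", 0.0001),
    (6, "adv30", "VT", 0.0001),
]

def avg_bid(rows, geo):
    bids = [bid for (_id, _adv, g, bid) in rows if g == geo]
    return sum(bids) / len(bids) if bids else None

uniform_sample = [table[2], table[4]]    # rows 3 and 5, sampling rate 1/3

print(avg_bid(table, "VT"))              # exact answer is 0.0003
print(avg_bid(uniform_sample, "VT"))     # None: the sample missed every VT row
```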
Uniform (Random) Sampling: a larger sample

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Uniform sample (rate 2/3) of the same table:
ID | Advertiser | Geo | Bid | Sampling Rate
3  | adv20 | NY | 0.0002 | 2/3
5  | adv20 | NY | 0.0001 | 2/3
1  | adv10 | NY | 0.0001 | 2/3
2  | adv10 | VT | 0.0005 | 2/3

Only with this larger sample does a VT row finally appear.
Stratified Sampling

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Stratified sample on Geo (over the same table):
ID | Advertiser | Geo | Bid | Sampling Rate
3  | adv20 | NY | 0.0002 | 1/4
2  | adv10 | VT | 0.0005 | 1/2

Every stratum (geo) is represented, even the rare VT.
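A minimal Python sketch of the idea (toy code over the same six rows; SnappyData additionally maintains such samples incrementally over streams): sampling within each geo stratum guarantees that even the rare 'VT' group is represented.

```python
import random

table = [
    (1, "adv10", "NY", 0.0001),
    (2, "adv10", "VT", 0.0005),
    (3, "adv20", "NY", 0.0002),
    (4, "adv10", "NY", 0.0003),
    (5, "adv20", "NY", 0.0001),
    (6, "adv30", "VT", 0.0001),
]

def stratified_sample(rows, key_idx, per_stratum):
    # Group rows by the stratification key, then sample within each group.
    strata = {}
    for row in rows:
        strata.setdefault(row[key_idx], []).append(row)
    sample = []
    for stratum_rows in strata.values():
        k = min(per_stratum, len(stratum_rows))
        sample.extend(random.sample(stratum_rows, k))
    return sample

random.seed(7)
sample = stratified_sample(table, key_idx=2, per_stratum=1)

# Unlike the small uniform sample, every geo (including 'VT') is present.
print(sorted({row[2] for row in sample}))
```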
Sketching Techniques
● Sampling is not effective for outlier detection (MAX/MIN, etc.)
● Other probabilistic structures: Count-Min Sketch (CMS), heavy hitters, etc.
● SnappyData implements Hokusai: capturing item frequencies in time series
● The design permits Top-K queries over arbitrary time intervals, e.g., the top 100 popular URLs:

SELECT pageURL, count(*) frequency FROM Table
WHERE .... GROUP BY ....
ORDER BY frequency DESC
LIMIT 100
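The flavor of such synopses can be shown with a minimal Count-Min Sketch in Python. This is an illustrative toy, not SnappyData's Hokusai implementation: bounded memory regardless of how many distinct items arrive, and estimates that can only overcount, never undercount.

```python
# Minimal Count-Min Sketch: depth hash rows, width counters each.
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth, self.width = depth, width
        self.counts = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield i, int(h, 16) % self.width

    def add(self, item, count=1):
        for i, j in self._buckets(item):
            self.counts[i][j] += count

    def estimate(self, item):
        # True count <= estimate; hash collisions can only inflate it.
        return min(self.counts[i][j] for i, j in self._buckets(item))

cms = CountMinSketch()
urls = ["/home"] * 500 + ["/checkout"] * 30 + ["/about"] * 5
for u in urls:
    cms.add(u)

print(cms.estimate("/home"))   # >= 500; usually exactly 500 at this size
```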
Airline Analytics Demo
[Diagram: a Zeppelin Server with a Zeppelin Spark interpreter (driver) drives Spark executor JVMs, each holding a row cache and a compressed columnar store]
Free Cloud Trial Service: Project iSight
● Free AWS/Azure credits for anyone to try out SnappyData
● One-click launch of a private SnappyData cluster with Zeppelin
● Multiple notebooks with comprehensive descriptions of concepts and value
● Bring your own data sets to try "immediate visualization" using synopses data

http://www.snappydata.io/cloudbuilder (URL for the AWS service)
Send email to [email protected] to be notified. Release expected next week.
How SnappyData Extends Spark
Snappy Spark cluster deployment topologies: unified cluster and split cluster.
• The Snappy store and the Spark executor share JVM memory
• Reference-based access: zero copy
• SnappyStore is isolated, but uses the same column format as Spark for high throughput
Simple API: Spark Compatible
● Access a table as a DataFrame; the catalog is automatically recovered
● Store an RDD[T] or DataFrame in SnappyData tables
● Access from remote SQL clients
● Additional APIs for updates, inserts, and deletes

// Save a DataFrame using the Snappy or Spark context
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<... now use any of the DataFrame APIs ...>
Extends Spark

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name ( <column definition> )
USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',               // default: none
  PARTITION_BY 'PRIMARY KEY | column name', // replicated table by default
  REDUNDANCY '1',                           // manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
                                            // an empty string maps to the default disk store
  OFFHEAP "true | false",
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  .....
) [AS select_statement];
Simple to Ingest Streams Using SQL
- Consume from a stream
- Transform raw data
- Run continuous analytics
- Ingest into the in-memory store
- Overflow tables to HDFS
CREATE STREAM TABLE AdImpressionLog (<columns>)
USING directkafka_stream OPTIONS (
  <socket endpoints>
  "topics 'adnetwork-topic'",
  "rowConverter 'AdImpressionLogAvroDecoder'")

// Register a continuous query
streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps,
          count(distinct(cookie)) uniques
   from AdImpressionLog window (duration '2' seconds, slide '2' seconds)
   where geo != 'unknown' group by publisher, geo")
  .foreachDataFrame(df => {
    df.write.format("column").mode(SaveMode.Append)
      .saveAsTable("adImpressions")
  })
AdImpression Demo
Spark and SQL code walkthrough, interactive SQL.

AdImpression Ingest Performance
Loading from Parquet files using 4 cores; stream ingest (Kafka + Spark Streaming to the store) using 1 core.
Unified Cluster Architecture
How Do We Extend Spark for Real Time?
• Spark executors are long-running; a driver failure doesn't shut down the executors
• Driver HA: drivers run "managed" with a standby secondary
• Data HA: consensus-based clustering integrated for eager replication
• Bypass the scheduler for low-latency SQL
• Deep integration with Spark Catalyst (SQL): colocation optimizations, index use, etc.
• Full SQL support: persistent catalog, transactions, DML
Unified OLAP/OLTP/Streaming with Spark
● Far fewer resources: a TB problem becomes a GB problem, and CPU contention drops
● Far less complex: a single cluster for stream ingestion, continuous queries, interactive queries, and machine learning
● Much faster: compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
www.snappydata.io
SnappyData is Open Source
● Ad analytics example/benchmark: https://github.com/SnappyDataInc/snappy-poc
● https://github.com/SnappyDataInc/snappydata
● Learn more: www.snappydata.io/blog
● Connect:
  ○ twitter: www.twitter.com/snappydata
  ○ facebook: www.facebook.com/snappydata
  ○ slack: http://snappydata-slackin.herokuapp.com