SnappyData
Unified stream and interactive analytics in a single in-memory cluster with Spark
www.snappydata.io

Jags Ramnarayan, CTO, Co-founder, SnappyData
Feb 2016
SnappyData - an EMC/Pivotal spin-out
● New Spark-based open source project started by Pivotal GemFire founders and engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an OLTP + OLAP database
Lambda Architecture (LA) for Analytics
Is Lambda Complex?
(Diagram: an application federating across a batch layer - Impala, HBase - and a speed layer - Cassandra, Redis, …)
• Application has to federate queries?
• Complex application - deal with multiple data models? Disparate APIs?
• Slow?
• Expensive to maintain?
Can we simplify and optimize? Can the batch and speed layers use a single, unified serving DB?
Deeper Look – Ad Impression Analytics
Ad Impression Analytics
Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
Bottlenecks in the write path
• Stream micro-batches in parallel from Kafka to each Spark executor
• Emit key-value pairs, so we can group by Publisher, Geo
• Execute GROUP BY … expensive Spark shuffle …
• Shuffle again in the DB cluster … data format changes … serialization costs
Bottlenecks in the Write Path
• Aggregations - GroupBy, MapReduce
• Joins with other streams, reference data
• Replication (HA) in the fast data store

Shuffle costs
• Data models across Spark and the fast data store are different
• Data flows through too many processes
• In-JVM copying

Copying, serialization: excessive copying in Java-based scale-out stores
Goal – Localize processing
• Can we localize processing with state and avoid the shuffle?
• Can Kafka, Spark partitions and the data store share the same partitioning policy? (partition by Advertiser, Geo)
• Embedded state: Spark's native mapWithState, Apache Samza, Flink? Good, scalable KV stores - but is this enough?
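To make the "same partitioning policy" idea concrete, here is a minimal, hypothetical sketch (not SnappyData code): if the Kafka producer, the Spark stage, and the store all hash the same key (advertiser, geo) modulo the same partition count, every record lands on the same logical partition end to end, so the GROUP BY becomes partition-local and needs no network shuffle.

```python
# Hypothetical sketch: a single partitioning function shared by every layer.
import zlib

NUM_PARTITIONS = 8

def partition_for(advertiser: str, geo: str) -> int:
    """Deterministic partition id used by producer, stream stage, and store."""
    key = f"{advertiser}|{geo}".encode()
    return zlib.crc32(key) % NUM_PARTITIONS

# The same record is routed identically by every layer:
record = ("acme-ads", "US-CA")
assert partition_for(*record) == partition_for(*record)

# Grouping by (advertiser, geo) is now partition-local: all rows of a group
# already live in one partition, so the GROUP BY needs no cross-node traffic.
groups = {}
for adv, geo, bid in [("acme-ads", "US-CA", 1.2), ("acme-ads", "US-CA", 0.8)]:
    groups.setdefault((partition_for(adv, geo), adv, geo), []).append(bid)
```

The names and the CRC32 choice are illustrative; any deterministic hash agreed on by all layers gives the same locality.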
Impedance mismatch for interactive queries

On ingested ad impressions we want to run queries like:
- Find total uniques for a certain ad, grouped on geography, month
- Impression trends for advertisers, or detect outliers in the bidding price

Unfortunately, in most KV-oriented stores this is very inefficient:
- Not suited for scans, aggregations, distributed joins on large volumes
- Memory utilization in a KV store is poor
Why columnar storage?

In-memory columns, but still slow?

SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)

Berkeley AMPLab Big Data Benchmark -- AWS m2.4xlarge; total of 342 GB
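The benchmark query touches only two fields of `uservisits`, which is exactly where a columnar layout pays off. A small illustrative sketch (not SnappyData's engine; table contents are made up): with one contiguous array per column, the aggregation scans just the two referenced columns instead of every field of every row.

```python
# Illustrative row vs. column layout for the query above.
from collections import defaultdict

# Row layout: each visit is a tuple of many fields; a scan touches all of them.
rows = [("10.0.0.1", "/a", 1.5, "US"), ("10.0.0.2", "/b", 2.0, "DE"),
        ("10.0.0.1", "/c", 0.5, "US")]

# Column layout: one contiguous array per referenced column.
source_ip = [r[0] for r in rows]
ad_revenue = [r[2] for r in rows]

# SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) GROUP BY SUBSTR(sourceIP, 1, 8)
totals = defaultdict(float)
for ip, rev in zip(source_ip, ad_revenue):
    totals[ip[:8]] += rev
```

Contiguous per-column arrays also compress well and keep the CPU cache warm during the scan, which is the other half of the columnar argument.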
Can we use statistical methods to shrink data?
• It is not always possible to store the data in full - many applications (telecoms, ISPs, search engines) can't keep everything
• It is inconvenient to work with data in full - just because we can, doesn't mean we should
• It is faster to work with a compact summary - better to explore data on a laptop than a cluster

Ref: Graham Cormode - Sampling for Big Data

Can we use statistical techniques to understand data, synthesize something very small, but still answer analytical queries?
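The simplest such summary is a uniform sample. A sketch of the idea from Cormode's tutorial (numbers and sizes are illustrative): keep a small reservoir sample of a stream and scale sample aggregates by N/k to estimate aggregates over data too large to retain in full.

```python
# Reservoir sampling: a uniform k-subset of a stream of unknown length,
# used to estimate an aggregate over the full data.
import random

def reservoir_sample(stream, k, rng):
    """Classic Algorithm R: each element ends up in the sample with prob k/n."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = x
    return sample

rng = random.Random(42)
n = 100_000
data = [rng.random() for _ in range(n)]        # stand-in for the full data
sample = reservoir_sample(iter(data), 1_000, rng)

estimate = sum(sample) * (n / len(sample))     # scaled-up estimate of sum(data)
exact = sum(data)
# With 1,000 uniform samples the relative error is typically a few percent.
```

Stratified samples (used later in the deck) refine this by sampling each group separately, so rare groups are not lost.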
SnappyData: A new approach

Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics

Spark: batch design, high throughput; rapidly maturing
GemFire: real-time design center - low latency, HA, concurrent; matured over 13 years

Vision: drastically reduce the cost and complexity in modern big data
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics

Real-time operational analytics - TBs in memory
(Diagram: RDB with rows, columnar storage, transactions and indexes; stream processing; ODBC, JDBC, REST APIs; Spark - Scala, Java, Python, R; HDFS; AQP; MPP DB)

First commercial project on Approximate Query Processing (AQP)
Tenets, Guiding Principles
● Memory is the new bottleneck for speed
● Memory densities will follow Moore's law
● 100% Spark compatible - powerful, concise; simplify the runtime
● Aim for Google-Search-like speed for analytic queries
● Dramatically reduce costs associated with analytics - distributed systems are expensive, especially in production
Use Case Patterns
• Stream ingestion database for Spark - process streams, transform, real-time scoring, store, query
• In-memory database for apps - highly concurrent apps, SQL cache, OLTP + OLAP
• Analytic caching pattern - caching for analytics over any "big data" store (esp. MPP); federate queries between samples and the backend
Why Spark – Confluence of streaming, interactive, batch
Unifies batch, streaming and interactive computation
Easy to build sophisticated applications
Supports iterative, graph-parallel algorithms
Powerful APIs in Scala, Python, Java

(Diagram: Spark core with Spark Streaming for streaming; SQL and BlinkDB for batch/interactive; GraphX for data-parallel, iterative work; MLlib for sophisticated algorithms)

Source: Spark Summit presentation by Ion Stoica
Snappy Spark Cluster deployment topologies
• Unified cluster: Snappy store and Spark executor share the JVM memory; reference-based access - zero copy
• Split cluster: SnappyStore is isolated but uses the same column format as Spark for high throughput
Simple API – Spark Compatible
● Access a table as a DataFrame; the catalog is automatically recovered
● Store: an RDD[T]/DataFrame can be stored in SnappyData tables
● Access from remote SQL clients
● Additional API for updates, inserts, deletes

// Save a DataFrame using the Snappy or Spark context
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
// … now use any of the DataFrame APIs …
Extends Spark

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name ( <column definition> )
USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',               -- default: none
  PARTITION_BY 'PRIMARY KEY | column name', -- replicated table by default
  REDUNDANCY '1',                           -- manage HA
  PERSISTENT 'DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS',
                                            -- empty string maps to the default disk store
  OFFHEAP 'true | false',
  EVICTION_BY 'MEMSIZE 200 | COUNT 200 | HEAPPERCENT',
  …
) [AS select_statement];
Simple to ingest streams using SQL

Consume from stream → transform raw data → continuous analytics → ingest into the in-memory store → overflow table to HDFS

CREATE STREAM TABLE AdImpressionLog (<columns>) USING directkafka_stream OPTIONS (
  <socket endpoints>,
  "topics 'adnetwork-topic'",
  "rowConverter 'AdImpressionLogAvroDecoder'"
)

// Register a continuous query
streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques
   from AdImpressionLog window (duration '2' seconds, slide '2' seconds)
   where geo != 'unknown' group by publisher, geo")
.foreachDataFrame(df => {
  df.write.format("column").mode(SaveMode.Append).saveAsTable("adImpressions")
})
AdImpression Demo
Spark, SQL code walkthrough, interactive SQL

AdImpression Demo
Performance comparison to Cassandra, MemSQL - TBD

AdImpression Ingest Performance
Loading from Parquet files using 4 cores; stream ingest (Kafka + Spark Streaming to store, using 1 core)
Why is ingestion and querying fast?
● Linear scaling with partition pruning
● Input queue, stream and in-memory DB all share the same partitioning strategy
How does it scale with concurrency?
● Parallel query engine skips Spark SQL scheduling for low-latency queries
● Column tables: for fast scans/aggregations; also automatically compressed
● Row tables: fast key-based or selective queries; can have any number of secondary indices
How does it scale with concurrency?
● Distributed shuffle for joins, ordering, etc. is expensive
● Two techniques to minimize this:
  1) Colocation of related tables - all related records are collocated. For instance, with the 'Users' table partitioned across nodes, all related 'Ad impressions' can be collocated on the same partition; the parallel query then executes the join locally on each partition.
  2) Replication of 'dimension' tables - joins to dimension tables are always localized.
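The colocation technique above can be sketched in a few lines (a toy model, not SnappyData internals; table contents are made up): when both tables hash-partition on the same key, the distributed join is just a union of independent per-partition local joins.

```python
# Toy model of colocated tables: Users and AdImpressions both partitioned
# by user id, so every join match is found inside a single partition.
NUM_PARTITIONS = 4

def bucket(key: int) -> int:
    return key % NUM_PARTITIONS

users = [(1, "alice"), (2, "bob"), (5, "carol")]
impressions = [(1, "ad-9"), (5, "ad-3"), (1, "ad-7")]

# Colocation: the same key function places related rows together.
partitions = [{"users": [], "imps": []} for _ in range(NUM_PARTITIONS)]
for uid, name in users:
    partitions[bucket(uid)]["users"].append((uid, name))
for uid, ad in impressions:
    partitions[bucket(uid)]["imps"].append((uid, ad))

# The "distributed" join is a union of per-partition local joins:
joined = []
for p in partitions:
    local_users = dict(p["users"])
    for uid, ad in p["imps"]:
        if uid in local_users:
            joined.append((uid, local_users[uid], ad))
# No cross-partition traffic was needed to produce `joined`.
```

Replicating a small dimension table has the same effect by a different route: a full copy on every node means the probe side of the join is always local.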
Low-latency interactive analytic queries – exact or approximate

Exact:       SELECT avg(volume), symbol FROM T1 WHERE <time range> GROUP BY symbol
Approximate: SELECT avg(volume), symbol FROM T1 WHERE <time range> GROUP BY symbol
             WITH ERROR 0.1 CONFIDENCE 0.8
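A sketch of how an "error 0.1, confidence 0.8" contract can be checked on a uniform sample: a CLT-based confidence interval on the sample mean (the data and sizes are made up, and SnappyData's AQP uses stratified samples; this only illustrates the underlying arithmetic).

```python
# Check whether a sample can answer avg(volume) within the requested bound.
import math
import random

rng = random.Random(7)
population = [rng.gauss(100.0, 15.0) for _ in range(200_000)]  # stand-in volumes

sample = rng.sample(population, 2_000)
n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)

z80 = 1.2816                           # z-score for an 80% two-sided interval
half_width = z80 * math.sqrt(var / n)  # CLT half-width of the interval
relative_error = half_width / mean

# The engine can answer from the sample iff the bound meets the request:
answer_from_sample = relative_error <= 0.1
```

If the bound is not met, the engine would fall back to scanning the base table (or a larger sample), which is the speed/accuracy tradeoff the next slide charts.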
Speed/accuracy tradeoff
(Chart: error vs. execution time as a function of sample size - interactive queries answer in ~2 secs on a sample, versus ~100 secs to ~30 mins to execute on the entire dataset)
Key feature: Synopses Data
● Maintain stratified samples
  ○ Intelligent sampling to keep error bounds low
● Probabilistic data structures
  ○ TopK for time series (using time aggregation, CMS, item aggregation)
  ○ Histograms, HyperLogLog, Bloom filters, wavelets

CREATE SAMPLE TABLE sample-table-name USING columnar OPTIONS (
  BASETABLE 'table_name'  -- source column table or stream table
  [ SAMPLINGMETHOD 'stratified | uniform' ]
  STRATA name (
    QCS ('comma-separated-column-names')
    [ FRACTION 'frac' ]
  ),+  -- one or more QCS
)
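The stratified option above can be illustrated with a small sketch (a toy model of the idea, not SnappyData's sampler; column names and the minimum-per-stratum rule are assumptions for the example): sample each QCS group independently, so rare groups keep representation that a uniform sample would likely miss.

```python
# Toy stratified sampler over a QCS (query column set).
import random
from collections import defaultdict

def stratified_sample(rows, qcs, fraction, min_per_stratum=2, seed=0):
    """Sample each distinct QCS group independently at the given fraction."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in qcs)].append(row)
    sample = []
    for group_rows in strata.values():
        k = max(min_per_stratum, int(len(group_rows) * fraction))
        k = min(k, len(group_rows))
        sample.extend(rng.sample(group_rows, k))
    return sample

rows = ([{"publisher": "p1", "geo": "US", "bid": 1.0}] * 1000
        + [{"publisher": "p2", "geo": "NZ", "bid": 2.0}] * 5)
sample = stratified_sample(rows, qcs=("publisher", "geo"), fraction=0.01)
# The tiny (p2, NZ) stratum is still present in the 1% sample.
```

Grouped aggregates then scale each stratum by its own inverse sampling rate, which is what keeps per-group error bounds low.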
Unified Cluster Architecture
How do we extend Spark for real time?
• Spark executors are long-running; driver failure doesn't shut down executors
• Driver HA - drivers run "managed" with a standby secondary
• Data HA - consensus-based clustering integrated for eager replication

How do we extend Spark for real time?
• Bypass the scheduler for low-latency SQL
• Deep integration with Spark Catalyst (SQL) - colocation optimizations, index use, etc.
• Full SQL support - persistent catalog, transactions, DML
Performance – Spark vs Snappy (TPC-H)
See the ACM SIGMOD 2016 paper for details, available on the snappydata.io blog

Performance – Snappy vs. in-memory DB (YCSB)
Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: a TB problem becomes a GB problem; CPU contention drops
● Far less complex: a single cluster for stream ingestion, continuous queries, interactive queries and machine learning
● Much faster: compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
SnappyData is Open Source
● Available for download on GitHub today: https://github.com/SnappyDataInc/snappydata
● Learn more: www.snappydata.io/blog
● Connect:
  ○ twitter: www.twitter.com/snappydata
  ○ facebook: www.facebook.com/snappydata
  ○ linkedin: www.linkedin.com/snappydata
  ○ slack: http://snappydata-slackin.herokuapp.com
EXTRAS
Colocated row/column tables in Spark

(Diagram: each node runs a Spark executor with tasks, the Spark block manager, stream processing, and colocated row and column tables)

● Spark executors are long-lived and shared across multiple apps
● Snappy memory manager and Spark block manager are integrated
Tables can be partitioned or replicated

(Diagram: each node holds a consistent replica of replicated tables plus a subset of a partitioned table's buckets - e.g. buckets A-H, I-P, Q-W - with replica buckets hosted on other nodes)

● Replicated tables: a consistent replica on each node
● Partitioned tables: data partitioned with one or more replicas

Use partitioned tables for large fact tables, replicated tables for small dimension tables
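A minimal sketch of these placement rules (a toy model, not SnappyData's actual bucket assignment; the round-robin scheme and node names are assumptions): a partitioned table spreads its buckets across nodes with REDUNDANCY extra copies, while a replicated dimension table keeps a full copy on every node.

```python
# Toy placement model for partitioned vs. replicated tables.
NODES = ["node0", "node1", "node2"]
NUM_BUCKETS = 12
REDUNDANCY = 1  # one secondary copy per bucket

def placement(bucket_id: int) -> list:
    """Primary owner plus REDUNDANCY secondaries on the next nodes, round-robin."""
    primary = bucket_id % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(1 + REDUNDANCY)]

# Fact table: each bucket has two distinct owners, so losing one node loses no data.
fact_owners = {b: placement(b) for b in range(NUM_BUCKETS)}

# Dimension table: every node holds the full table, so joins to it never leave a node.
dimension_owners = list(NODES)
```

Real systems refine this with load-aware rebalancing, but the invariant is the same: every bucket has 1 + REDUNDANCY owners, and dimension tables are everywhere.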
SnappyData Components

(Diagram: Spark core with micro-batch streaming; Spark SQL Catalyst with transactions; OLAP job scheduler handling OLAP queries, OLTP queries and AQP; P2P cluster replication service; row, column, index and sample tables over shared-nothing logs; accessed from Spark programs, JDBC and ODBC)