Things You Didn't Know You Could Do in Spark
Version 0.1 | © SnappyData Inc. 2017

Barzan Mozafari | Strategic Advisor @ SnappyData, Assistant Prof. @ Univ. of Michigan
Jags Ramnarayan | CTO, Co-founder @ SnappyData
Sudhir Menon | COO, Co-founder @ SnappyData
Current Market Trends
- Analytic workloads are moving to the cloud: scale-out infrastructure is increasingly expensive, and excessive data shuffling means longer response times
- Real-time operational analytics are on the rise: mixed workloads combine streams, operational data, and historical data; real-time matters
- Concurrency is increasing: shared infrastructure (10's of users and apps, 100's of queries), exploratory analytics, BI and visualization tools, concurrent queries
- Data volumes are growing rapidly: industrial IoT (Internet of Things); data is growing faster than Moore's law
Our Observations
- Achieving interactive analytics is hard: analysts want response times under 2 seconds, which is hard with concurrent queries in the cluster, hard on an analyst's laptop, hard on large datasets, and hard when there are network shuffles
- Once a 99.9%-accurate early result is available, the exact final answer is often no longer needed: a 200x query speedup, and a 200x reduction in cluster load, network shuffles, and I/O waits
- The Lambda architecture is extremely complex (and slow!): multiple data formats, disparate product APIs, wasted resources
- Querying multiple remote data sources is slow; so is curating data from multiple sources
Agenda
• Mixed workloads: Structured Streaming and state management in Spark 2.0
• Interactive analytics in Spark 2.0
• The SnappyData solution
• Q&A
Mixed Workloads Are Far Too Common
- Stream processing
- Transactions (point lookups, small updates)
- Interactive analytics
- Analytics on mutating data
- Maintaining state or counters while processing streams
- Correlating and joining streams with large histories
Mixed Workload in IoT Analytics
- Map sensors to tags- Monitor temperature- Send alerts
Correlate current temperature trend with history….
Anomaly detection – score against models
Interact using dynamic queries
How Are Mixed Workloads Supported Today?
Lambda Architecture
[Diagram: Streams are enriched via a Reference DB, flow through data-in-motion analytics to applications and alerts, and land in a scalable store (HDFS, MPP DB)]
The Lambda Architecture Is Complex
- Enrich: requires reference-data joins
- State: manage windows with mutating state
- Analytics: joins to history, trend patterns
- Model maintenance
- Transaction writes and updates
- Interactive OLAP queries
The Lambda Architecture Is Complex: each layer is a different product
- Stream processing: Storm, Spark Streaming, Samza, ...
- State and models: NoSQL stores such as Cassandra, Redis, HBase, ...
- Reference data: an enterprise DB such as Oracle or Postgres
- Interactive queries: an OLAP SQL DB
The Lambda Architecture Is Complex
• Complexity: learn and master multiple products' data models, disparate APIs, and configs
• Slower
• Wasted resources

Can we simplify and optimize? How about a single clustered DB that can manage stream state and transactional data, and run OLAP queries?
[Diagram: today, apps span an RDB (tables, transactions), a framework for stream processing, HDFS, and an MPP DB for scalable writes, point reads, and OLAP queries]
One Idea: Start from Apache Spark
[Diagram: Spark Core with Spark Streaming, Spark SQL, Spark ML, and GraphX; a stream program can leverage ML, SQL, and Graph]
Why?
• Unified programming across streaming, interactive, and batch
• Multiple data formats and data sources
• Rich libraries for streaming, SQL, machine learning, ...
But: Spark is a computational engine, not a store or a database.
Structured Streaming in Spark 2.0
(Source: Databricks presentation)

read...stream() creates a streaming DataFrame but does not start any computation.
write...startStream() defines where and how to output the data, and starts the processing.
New State Management API in Spark 2.0: a KV Store
• Preserves state across streaming aggregations and across multiple batches
• Fetch and store KV pairs
• Transactional
• Supports versioning (per batch)
• Built-in store that persists to HDFS
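As a concrete picture of the versioning idea, here is a toy batch-versioned KV store in plain Python. The class and method names are invented for illustration; this is not the actual Spark StateStore interface, which is a Scala API and persists its deltas to HDFS.

```python
# Toy sketch of a batch-versioned KV state store (invented names, plain Python).
class VersionedStateStore:
    def __init__(self):
        self._committed = {}   # version number -> committed snapshot
        self._latest = 0

    def begin(self):
        # Start an update batch from the latest committed version.
        self._pending = dict(self._committed.get(self._latest, {}))
        return self._pending

    def commit(self):
        # Atomically publish the pending batch as a new version.
        self._latest += 1
        self._committed[self._latest] = self._pending
        return self._latest

    def get(self, key, version=None):
        v = self._latest if version is None else version
        return self._committed.get(v, {}).get(key)

store = VersionedStateStore()

batch = store.begin()
batch["pub1:NY"] = 3          # counter maintained while processing batch 1
v1 = store.commit()

batch = store.begin()
batch["pub1:NY"] += 2         # batch 2 builds on batch 1's committed state
v2 = store.commit()

print(store.get("pub1:NY"))        # latest value: 5
print(store.get("pub1:NY", v1))    # value as of version 1: 3
```

Keeping each batch's snapshot addressable by version is what lets a failed micro-batch be retried against exactly the state it started from.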
Deeper Look: Ad Impression Analytics
Ad network architecture: analyze log impressions in real time.
Ref: https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
Bottlenecks in the Write Path
- Stream micro-batches in parallel from Kafka to each Spark executor
- Filter and collect events for 1 minute; reduce to 1 event per publisher and geo every minute
- Execute GROUP BY: an expensive Spark shuffle
- Shuffle again in the DB cluster: data format changes and serialization costs
val input: DataFrame = sqlContext.read
  .options(kafkaOptions).format("kafkaSource")
  .stream()

val result: DataFrame = input
  .where("geo != 'UnknownGeo'")
  .groupBy(window("event-time", "1min"), "Publisher", "geo")
  .agg(avg("bid"), count("geo"), countDistinct("cookie"))

val query = result.write.format("org.apache.spark.sql.cassandra")
  .outputMode("append")
  .startStream("dest-path")
Bottlenecks in the Write Path
• Aggregations: GroupBy, MapReduce
• Joins with other streams and with reference data
• Shuffle costs (copying, serialization)
• Excessive copying in Java-based scale-out stores
Avoid the Bottleneck: Localize Processing with State
[Diagram: a Kafka queue feeds two Spark executors, each co-located with a KV store. The input queue, the stream, and the KV store all share the same partitioning strategy (e.g., Publisher 1,3,5 / Geo 2,4,6 on one executor, Publisher 2,4,6 / Geo 1,3,5 on the other), avoiding shuffles by localizing writes]
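The shared-partitioning trick can be sketched in plain Python (a toy stand-in, not the SnappyData or Spark API): as long as the queue, the stream processor, and the store route a key through the same hash function, every update for that key is purely local to one node.

```python
# Sketch of co-partitioning: when the input queue, the stream processor, and
# the KV store all use the same hash partitioner, records for a given
# publisher always land on the same node, so the per-publisher aggregation
# needs no network shuffle.

NUM_PARTITIONS = 2

def partition_for(key: str) -> int:
    # Any deterministic hash works, as long as every layer uses the same one.
    return sum(ord(c) for c in key) % NUM_PARTITIONS

# Local per-partition state (stands in for the co-located KV store).
state = [dict() for _ in range(NUM_PARTITIONS)]

def ingest(publisher: str, bid: float) -> None:
    p = partition_for(publisher)            # same partition as the input queue
    total, count = state[p].get(publisher, (0.0, 0))
    state[p][publisher] = (total + bid, count + 1)   # purely local update

for pub, bid in [("pub1", 0.10), ("pub2", 0.30), ("pub1", 0.20)]:
    ingest(pub, bid)

# Each publisher's running aggregate lives in exactly one partition.
p = partition_for("pub1")
total, count = state[p]["pub1"]
print(round(total / count, 2))   # average bid for pub1
```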
Could the Spark 2.0 StateStore play this role?
Impedance Mismatch with KV Stores
We want to run interactive, scan-intensive queries:
- Find total uniques for ads grouped by geography
- Impression trends for advertisers (GROUP BY queries)
Two alternatives, row stores vs. column stores, with the goal of localizing processing:
- Row stores: fast key-based lookups, but too slow for aggregations and scan-based interactive queries
- Column stores: fast scans and aggregations, but updates and random writes are very difficult, and they consume too much memory
Why columnar storage in-memory?
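A toy Python illustration of the tradeoff (made-up rows, no real store): the same three-column table laid out row-wise and column-wise. An aggregate over one column only has to touch that column's dense array in the columnar layout, which is also what makes it compress so well.

```python
# The same table stored row-wise vs. column-wise (illustrative toy data).
rows = [("adv10", "NY", 0.0001), ("adv10", "VT", 0.0005), ("adv20", "NY", 0.0002)]

# Columnar layout: one contiguous array per column.
advertiser = [r[0] for r in rows]
geo        = [r[1] for r in rows]
bid        = [r[2] for r in rows]

# Row store: the scan drags every field of every row through the CPU cache.
total_row = sum(r[2] for r in rows)

# Column store: the scan reads a single dense array.
total_col = sum(bid)

assert total_row == total_col
print(total_col)
```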
Interactive Analyticsin Spark 2.0
But, are in-memory column-stores fast enough?
SELECT SUBSTR(sourceIP, 1, X),
SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
Berkeley AMPLab Big Data Benchmark -- AWS m2.4xlarge ; total of 342 GB
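For a feel of the query's shape, here is a runnable miniature using Python's sqlite3 with a handful of made-up rows (the real benchmark scans 342 GB; X is fixed to 7 here):

```python
# Miniature of the AMPLab benchmark query: group ad revenue by IP prefix.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uservisits (sourceIP TEXT, adRevenue REAL)")
conn.executemany(
    "INSERT INTO uservisits VALUES (?, ?)",
    [("10.0.0.1", 1.0), ("10.0.0.2", 2.0), ("20.1.1.1", 5.0)],
)

X = 7
rows = conn.execute(
    "SELECT SUBSTR(sourceIP, 1, ?), SUM(adRevenue) "
    "FROM uservisits GROUP BY SUBSTR(sourceIP, 1, ?)",
    (X, X),
).fetchall()
print(rows)   # one (prefix, revenue) pair per IP prefix
```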
Why are distributed, in-memory, columnar formats still not enough for interactive speeds?
1. Distributed shuffles can take minutes
2. Software overheads, e.g., (de)serialization
3. Concurrent queries competing for the same resources
So, how can we ensure interactive-speed analytics?
• Most apps are happy to trade off 1% accuracy for a 200x speedup!
• You can usually get a 99.9%-accurate answer by looking at only a tiny fraction of the data!
• You can often make perfectly accurate decisions without perfectly accurate answers!
  - A/B testing, visualization, ...
  - The data itself is usually noisy
  - Processing the entire dataset doesn't necessarily mean exact answers
  - Inference is probabilistic anyway
Use statistical techniques to shrink the data!
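The claim is easy to demonstrate with plain Python on synthetic data: a 1,000-row uniform sample of a 100,000-row table estimates the mean to within a fraction of a percent. (A real AQP engine such as SnappyData's also returns a confidence interval, which this sketch omits.)

```python
import random

# Synthetic "table": 100,000 numeric values.
random.seed(42)
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]

exact = sum(population) / len(population)

# Look at only 1% of the rows.
sample = random.sample(population, 1_000)
estimate = sum(sample) / len(sample)

rel_error = abs(estimate - exact) / exact
print(f"exact={exact:.2f} estimate={estimate:.2f} rel_error={rel_error:.4%}")
# The standard error of a 1,000-row sample here is about 15/sqrt(1000) ~ 0.5,
# i.e., typically well under 1% relative error, for 1% of the scan cost.
```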
Our Solution: SnappyData

SnappyData: A New Approach
A single unified cluster (OLTP + OLAP + streaming) for real-time analytics.
[Diagram: marries Spark's batch design and high throughput (rapidly maturing) with a real-time design offering low latency, HA, and concurrency (matured over 13 years)]
Vision: drastically reduce the cost and complexity of modern big data.
SnappyData: A New Approach
A single unified HA cluster (OLTP + OLAP + streams) for real-time analytics:
- Batch design, high throughput
- Real-time operational analytics: TBs in memory
[Diagram: rows, columnar tables, transactions, indexes, stream processing, HDFS, MPP DB, and AQP behind one API: ODBC, JDBC, REST, and Spark (Scala, Java, Python, R)]
First commercial product with Approximate Query Processing (AQP).
SnappyData Data Model
[Diagram: a Snappy Data Server is a Spark executor plus a store. Hybrid storage holds rows, columnar data, probabilistic data, indexes, and stream state, and RDDs map to tables. Spark programs and SQL pass through the parser, catalog, cluster manager, and scheduler, with separate high-latency (OLAP) and low-latency paths; servers can be added and removed, and ODBC/JDBC clients connect directly]
Features
- Deeply integrated database for Spark; 100% compatible with Spark
- Extensions for transactions (updates) and SQL stream processing
- Extensions for high availability
- Approximate query processing for interactive OLAP
- OLTP + OLAP store: replicated and partitioned tables
- Tables can be row- or column-oriented (in-memory and on-disk)
- SQL extensions for compatibility with the SQL standard: create table, view, indexes, constraints, etc.
Time Series Analytics (kparc.com): Trade and Quote Analytics
http://kparc.com/q4/readme.txt

- Find the last quoted price for the top 100 instruments for a given date:
  "select quote.sym, avg(bid) from quote join S on (quote.sym = S.sym) where date='$d' group by quote.sym"
  // replaced LAST with AVG, as MemSQL doesn't support LAST
- Max price for the top 100 instruments across exchanges for a given date:
  "select trade.sym, ex, max(price) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, ex"
- Find the average trade size in each hour for the top 100 instruments for a given date:
  "select trade.sym, hour(time), avg(size) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, hour(time)"

Results for a small day (40 million rows) and a big day (640 million rows).
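The third query can be run in miniature with Python's sqlite3 (made-up rows; 'S' stands in for the top-100 symbol list, and sqlite's substr(time, 1, 2) replaces hour(time), which sqlite lacks):

```python
# Miniature of the hourly average-trade-size query over toy trade data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S (sym TEXT);
CREATE TABLE trade (sym TEXT, time TEXT, size INTEGER, price REAL, date TEXT);
INSERT INTO S VALUES ('AAPL');
INSERT INTO trade VALUES
  ('AAPL', '09:30:01', 100, 150.0, '2017-01-03'),
  ('AAPL', '09:45:10', 300, 151.0, '2017-01-03'),
  ('AAPL', '10:05:00', 200, 151.5, '2017-01-03'),
  ('MSFT', '09:31:00', 500,  62.0, '2017-01-03');  -- not in S, filtered by the join
""")

d = "2017-01-03"
rows = conn.execute(
    """select trade.sym, substr(time, 1, 2) as hr, avg(size)
       from trade join S on (trade.sym = S.sym)
       where date = ? group by trade.sym, hr""",
    (d,),
).fetchall()
print(rows)   # average trade size per (symbol, hour)
```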
[Charts: query response times in milliseconds for Q1-Q3 (smaller is better), comparing SnappyData, MemSQL, Impala, and Greenplum; one chart scales to 3,500 ms, the other to 60,000 ms]
TPC-H: 10X-20X faster than Spark 2.0
Synopses Data Engine
Use Case Patterns
1. Analytics cache
   - Caching for analytics over any "big data" store (esp. MPP)
   - Federate queries between samples and the backend
2. Stream analytics for Spark
   - Process streams, transform, real-time scoring, store, query
3. In-memory transactional store
   - Highly concurrent apps, SQL cache, OLTP + OLAP
Interactive-Speed Analytic Queries: Exact or Approximate

Select avg(Bid), Advertiser from T1 group by Advertiser
Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1
Speed/Accuracy Tradeoff
[Chart: error (%) vs. execution time. Running on the entire dataset takes minutes (up to 30 mins; 100 secs in the example shown), while interactive queries return in about 2 secs with roughly 1% error]
Approximate Query Processing Features
• Support for uniform sampling
• Support for stratified sampling
  - Solutions exist for stored data (BlinkDB); SnappyData works for infinite streams of data too
• Support for exponentially decaying windows over time
• Support for synopses: Top-K queries, heavy hitters, outliers, ...
• [future] Support for joins
• Workload mining (http://CliffGuard.org)
Uniform (Random) Sampling

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Original table:
ID | Advertiser | Geo | Bid
1  | adv10 | NY | 0.0001
2  | adv10 | VT | 0.0005
3  | adv20 | NY | 0.0002
4  | adv10 | NY | 0.0003
5  | adv20 | NY | 0.0001
6  | adv30 | VT | 0.0001

Uniform sample (rate 1/3):
ID | Advertiser | Geo | Bid | Sampling Rate
3  | adv20 | NY | 0.0002 | 1/3
5  | adv20 | NY | 0.0001 | 1/3
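In plain Python (toy helper functions, not SnappyData code), the danger of a small uniform sample is easy to see: the 1/3 sample above contains no VT rows at all, so the VT query cannot be answered from it.

```python
# The slide's 6-row table and its 1/3 uniform sample, as plain Python.
table = [
    (1, "adv10", "NY", 0.0001),
    (2, "adv10", "VT", 0.0005),
    (3, "adv20", "NY", 0.0002),
    (4, "adv10", "NY", 0.0003),
    (5, "adv20", "NY", 0.0001),
    (6, "adv30", "VT", 0.0001),
]

def avg_bid(rows, geo):
    bids = [bid for (_id, _adv, g, bid) in rows if g == geo]
    return sum(bids) / len(bids) if bids else None

uniform_sample = [table[2], table[4]]    # rows 3 and 5, sampling rate 1/3

print(avg_bid(table, "VT"))              # exact answer is 0.0003
print(avg_bid(uniform_sample, "VT"))     # None: the sample missed every VT row
```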
Uniform (Random) Sampling: a larger sample

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Uniform sample (rate 2/3) of the same table:
ID | Advertiser | Geo | Bid | Sampling Rate
3  | adv20 | NY | 0.0002 | 2/3
5  | adv20 | NY | 0.0001 | 2/3
1  | adv10 | NY | 0.0001 | 2/3
2  | adv10 | VT | 0.0005 | 2/3

Only with this larger sample does a VT row finally appear.
Stratified Sampling

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Stratified sample on Geo (over the same table):
ID | Advertiser | Geo | Bid | Sampling Rate
3  | adv20 | NY | 0.0002 | 1/4
2  | adv10 | VT | 0.0005 | 1/2

Every stratum (geo) is represented, even the rare VT.
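A minimal Python sketch of the idea (toy code over the same six rows; SnappyData additionally maintains such samples incrementally over streams): sampling within each geo stratum guarantees that even the rare 'VT' group is represented.

```python
import random

table = [
    (1, "adv10", "NY", 0.0001),
    (2, "adv10", "VT", 0.0005),
    (3, "adv20", "NY", 0.0002),
    (4, "adv10", "NY", 0.0003),
    (5, "adv20", "NY", 0.0001),
    (6, "adv30", "VT", 0.0001),
]

def stratified_sample(rows, key_idx, per_stratum):
    # Group rows by the stratification key, then sample within each group.
    strata = {}
    for row in rows:
        strata.setdefault(row[key_idx], []).append(row)
    sample = []
    for stratum_rows in strata.values():
        k = min(per_stratum, len(stratum_rows))
        sample.extend(random.sample(stratum_rows, k))
    return sample

random.seed(7)
sample = stratified_sample(table, key_idx=2, per_stratum=1)

# Unlike the small uniform sample, every geo (including 'VT') is present.
print(sorted({row[2] for row in sample}))
```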
Sketching Techniques
● Sampling is not effective for outlier detection (MAX/MIN, etc.)
● Other probabilistic structures: Count-Min Sketch (CMS), heavy hitters, etc.
● SnappyData implements Hokusai: capturing item frequencies in time series
● The design permits Top-K queries over arbitrary time intervals, e.g., the top 100 popular URLs:

SELECT pageURL, count(*) frequency FROM Table
WHERE .... GROUP BY ....
ORDER BY frequency DESC
LIMIT 100
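The flavor of such synopses can be shown with a minimal Count-Min Sketch in Python. This is an illustrative toy, not SnappyData's Hokusai implementation: bounded memory regardless of how many distinct items arrive, and estimates that can only overcount, never undercount.

```python
# Minimal Count-Min Sketch: depth hash rows, width counters each.
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth, self.width = depth, width
        self.counts = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield i, int(h, 16) % self.width

    def add(self, item, count=1):
        for i, j in self._buckets(item):
            self.counts[i][j] += count

    def estimate(self, item):
        # True count <= estimate; hash collisions can only inflate it.
        return min(self.counts[i][j] for i, j in self._buckets(item))

cms = CountMinSketch()
urls = ["/home"] * 500 + ["/checkout"] * 30 + ["/about"] * 5
for u in urls:
    cms.add(u)

print(cms.estimate("/home"))   # >= 500; usually exactly 500 at this size
```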
Airline Analytics Demo
[Diagram: a Zeppelin Server with a Zeppelin Spark interpreter (driver) drives Spark executor JVMs, each holding a row cache and a compressed columnar store]
Free Cloud Trial Service: Project iSight
● Free AWS/Azure credits for anyone to try out SnappyData
● One-click launch of a private SnappyData cluster with Zeppelin
● Multiple notebooks with comprehensive descriptions of concepts and value
● Bring your own data sets to try "immediate visualization" using synopses data

http://www.snappydata.io/cloudbuilder (URL for the AWS service)
Send email to [email protected] to be notified. Release expected next week.
How SnappyData Extends Spark
Snappy Spark cluster deployment topologies: unified cluster and split cluster.
• The Snappy store and the Spark executor share JVM memory
• Reference-based access: zero copy
• SnappyStore is isolated, but uses the same column format as Spark for high throughput
Simple API: Spark Compatible
● Access a table as a DataFrame; the catalog is automatically recovered
● Store an RDD[T] or DataFrame in SnappyData tables
● Access from remote SQL clients
● Additional APIs for updates, inserts, and deletes

// Save a DataFrame using the Snappy or Spark context
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<... now use any of the DataFrame APIs ...>
Extends Spark

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name ( <column definition> )
USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',               // default: none
  PARTITION_BY 'PRIMARY KEY | column name', // replicated table by default
  REDUNDANCY '1',                           // manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
                                            // an empty string maps to the default disk store
  OFFHEAP "true | false",
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  .....
) [AS select_statement];
Simple to Ingest Streams Using SQL
- Consume from a stream
- Transform raw data
- Run continuous analytics
- Ingest into the in-memory store
- Overflow tables to HDFS
CREATE STREAM TABLE AdImpressionLog (<columns>)
USING directkafka_stream OPTIONS (
  <socket endpoints>
  "topics 'adnetwork-topic'",
  "rowConverter 'AdImpressionLogAvroDecoder'")

// Register a continuous query
streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps,
          count(distinct(cookie)) uniques
   from AdImpressionLog window (duration '2' seconds, slide '2' seconds)
   where geo != 'unknown' group by publisher, geo")
  .foreachDataFrame(df => {
    df.write.format("column").mode(SaveMode.Append)
      .saveAsTable("adImpressions")
  })
AdImpression Demo
Spark and SQL code walkthrough, interactive SQL.

AdImpression Ingest Performance
Loading from Parquet files using 4 cores; stream ingest (Kafka + Spark Streaming to the store) using 1 core.
Unified Cluster Architecture
How Do We Extend Spark for Real Time?
• Spark executors are long-running; a driver failure doesn't shut down the executors
• Driver HA: drivers run "managed" with a standby secondary
• Data HA: consensus-based clustering integrated for eager replication
• Bypass the scheduler for low-latency SQL
• Deep integration with Spark Catalyst (SQL): colocation optimizations, index use, etc.
• Full SQL support: persistent catalog, transactions, DML
Unified OLAP/OLTP/Streaming with Spark
● Far fewer resources: a TB problem becomes a GB problem, and CPU contention drops
● Far less complex: a single cluster for stream ingestion, continuous queries, interactive queries, and machine learning
● Much faster: compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
www.snappydata.io
SnappyData is Open Source
● Ad analytics example/benchmark: https://github.com/SnappyDataInc/snappy-poc
● https://github.com/SnappyDataInc/snappydata
● Learn more: www.snappydata.io/blog
● Connect:
  ○ twitter: www.twitter.com/snappydata
  ○ facebook: www.facebook.com/snappydata
  ○ slack: http://snappydata-slackin.herokuapp.com