SnappyData
Unified stream and interactive analytics in a single in-memory cluster with Spark
www.snappydata.io

Jags Ramnarayan, CTO, Co-founder, SnappyData
Feb 2016
SnappyData - an EMC/Pivotal spin-out
● New Spark-based open source project started by Pivotal GemFire founders and engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an OLTP + OLAP database
Lambda Architecture (LA) for Analytics
Is Lambda Complex?
(Diagram: an application federating across a batch layer - Impala, HBase - and a speed layer - Cassandra, Redis, …)
• Application has to federate queries?
• Complex application - deal with multiple data models? Disparate APIs?
• Slow?
• Expensive to maintain?
Can we simplify and optimize? Can the batch and speed layers use a single, unified serving DB?
Deeper Look – Ad Impression Analytics
Ad Impression Analytics
Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
Bottlenecks in the write path
• Stream micro-batches in parallel from Kafka to each Spark executor
• Emit key-value pairs, so we can group by Publisher, Geo
• Execute GROUP BY … expensive Spark shuffle …
• Shuffle again in the DB cluster … data format changes … serialization costs
Bottlenecks in the Write Path
• Aggregations - GroupBy, MapReduce
• Joins with other streams, reference data
• Replication (HA) in the fast data store

Shuffle costs
• Data models across Spark and the fast data store are different
• Data flows through too many processes
• In-JVM copying

Copying, serialization: excessive copying in Java-based scale-out stores
Goal – Localize processing
• Can we localize processing with state and avoid the shuffle?
• Can Kafka, Spark partitions and the data store share the same partitioning policy? (partition by Advertiser, Geo)
• Embedded state: Spark's native mapWithState, Apache Samza, Flink? Good, scalable KV stores - but is this enough?
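To make the "same partitioning policy" idea concrete, here is a minimal, hypothetical sketch (not SnappyData code): if the Kafka producer, the Spark stage, and the store all hash the same key (advertiser, geo) modulo the same partition count, every record lands on the same logical partition end to end, so the GROUP BY becomes partition-local and needs no network shuffle.

```python
# Hypothetical sketch: a single partitioning function shared by every layer.
import zlib

NUM_PARTITIONS = 8

def partition_for(advertiser: str, geo: str) -> int:
    """Deterministic partition id used by producer, stream stage, and store."""
    key = f"{advertiser}|{geo}".encode()
    return zlib.crc32(key) % NUM_PARTITIONS

# The same record is routed identically by every layer:
record = ("acme-ads", "US-CA")
assert partition_for(*record) == partition_for(*record)

# Grouping by (advertiser, geo) is now partition-local: all rows of a group
# already live in one partition, so the GROUP BY needs no cross-node traffic.
groups = {}
for adv, geo, bid in [("acme-ads", "US-CA", 1.2), ("acme-ads", "US-CA", 0.8)]:
    groups.setdefault((partition_for(adv, geo), adv, geo), []).append(bid)
```

The names and the CRC32 choice are illustrative; any deterministic hash agreed on by all layers gives the same locality.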
Impedance mismatch for interactive queries

On ingested ad impressions we want to run queries like:
- Find total uniques for a certain ad, grouped on geography, month
- Impression trends for advertisers, or detect outliers in the bidding price

Unfortunately, in most KV-oriented stores this is very inefficient:
- Not suited for scans, aggregations, distributed joins on large volumes
- Memory utilization in a KV store is poor
Why columnar storage?

In-memory columns, but still slow?

SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)

Berkeley AMPLab Big Data Benchmark -- AWS m2.4xlarge; total of 342 GB
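The benchmark query touches only two fields of `uservisits`, which is exactly where a columnar layout pays off. A small illustrative sketch (not SnappyData's engine; table contents are made up): with one contiguous array per column, the aggregation scans just the two referenced columns instead of every field of every row.

```python
# Illustrative row vs. column layout for the query above.
from collections import defaultdict

# Row layout: each visit is a tuple of many fields; a scan touches all of them.
rows = [("10.0.0.1", "/a", 1.5, "US"), ("10.0.0.2", "/b", 2.0, "DE"),
        ("10.0.0.1", "/c", 0.5, "US")]

# Column layout: one contiguous array per referenced column.
source_ip = [r[0] for r in rows]
ad_revenue = [r[2] for r in rows]

# SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) GROUP BY SUBSTR(sourceIP, 1, 8)
totals = defaultdict(float)
for ip, rev in zip(source_ip, ad_revenue):
    totals[ip[:8]] += rev
```

Contiguous per-column arrays also compress well and keep the CPU cache warm during the scan, which is the other half of the columnar argument.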
Can we use statistical methods to shrink data?
• It is not always possible to store the data in full - many applications (telecoms, ISPs, search engines) can't keep everything
• It is inconvenient to work with data in full - just because we can, doesn't mean we should
• It is faster to work with a compact summary - better to explore data on a laptop than a cluster

Ref: Graham Cormode - Sampling for Big Data

Can we use statistical techniques to understand data, synthesize something very small, but still answer analytical queries?
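The simplest such summary is a uniform sample. A sketch of the idea from Cormode's tutorial (numbers and sizes are illustrative): keep a small reservoir sample of a stream and scale sample aggregates by N/k to estimate aggregates over data too large to retain in full.

```python
# Reservoir sampling: a uniform k-subset of a stream of unknown length,
# used to estimate an aggregate over the full data.
import random

def reservoir_sample(stream, k, rng):
    """Classic Algorithm R: each element ends up in the sample with prob k/n."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = x
    return sample

rng = random.Random(42)
n = 100_000
data = [rng.random() for _ in range(n)]        # stand-in for the full data
sample = reservoir_sample(iter(data), 1_000, rng)

estimate = sum(sample) * (n / len(sample))     # scaled-up estimate of sum(data)
exact = sum(data)
# With 1,000 uniform samples the relative error is typically a few percent.
```

Stratified samples (used later in the deck) refine this by sampling each group separately, so rare groups are not lost.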
SnappyData: A new approach

Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics

Spark: batch design, high throughput; rapidly maturing
GemFire: real-time design center - low latency, HA, concurrent; matured over 13 years

Vision: drastically reduce the cost and complexity in modern big data
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics

Real-time operational analytics - TBs in memory
(Diagram: RDB with rows, columnar storage, transactions and indexes; stream processing; ODBC, JDBC, REST APIs; Spark - Scala, Java, Python, R; HDFS; AQP; MPP DB)

First commercial project on Approximate Query Processing (AQP)
Tenets, Guiding Principles
● Memory is the new bottleneck for speed
● Memory densities will follow Moore's law
● 100% Spark compatible - powerful, concise; simplify the runtime
● Aim for Google-Search-like speed for analytic queries
● Dramatically reduce costs associated with analytics - distributed systems are expensive, especially in production
Use Case Patterns
• Stream ingestion database for Spark - process streams, transform, real-time scoring, store, query
• In-memory database for apps - highly concurrent apps, SQL cache, OLTP + OLAP
• Analytic caching pattern - caching for analytics over any "big data" store (esp. MPP); federate queries between samples and the backend
Why Spark – Confluence of streaming, interactive, batch
Unifies batch, streaming and interactive computation
Easy to build sophisticated applications
Supports iterative, graph-parallel algorithms
Powerful APIs in Scala, Python, Java

(Diagram: Spark core with Spark Streaming for streaming; SQL and BlinkDB for batch/interactive; GraphX for data-parallel, iterative work; MLlib for sophisticated algorithms)

Source: Spark Summit presentation by Ion Stoica
Snappy Spark Cluster deployment topologies
• Unified cluster: Snappy store and Spark executor share the JVM memory; reference-based access - zero copy
• Split cluster: SnappyStore is isolated but uses the same column format as Spark for high throughput
Simple API – Spark Compatible
● Access a table as a DataFrame; the catalog is automatically recovered
● Store: an RDD[T]/DataFrame can be stored in SnappyData tables
● Access from remote SQL clients
● Additional API for updates, inserts, deletes

// Save a DataFrame using the Snappy or Spark context
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
// … now use any of the DataFrame APIs …
Extends Spark

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name ( <column definition> )
USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',               -- default: none
  PARTITION_BY 'PRIMARY KEY | column name', -- replicated table by default
  REDUNDANCY '1',                           -- manage HA
  PERSISTENT 'DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS',
                                            -- empty string maps to the default disk store
  OFFHEAP 'true | false',
  EVICTION_BY 'MEMSIZE 200 | COUNT 200 | HEAPPERCENT',
  …
) [AS select_statement];
Simple to ingest streams using SQL

Consume from stream → transform raw data → continuous analytics → ingest into the in-memory store → overflow table to HDFS

CREATE STREAM TABLE AdImpressionLog (<columns>) USING directkafka_stream OPTIONS (
  <socket endpoints>,
  "topics 'adnetwork-topic'",
  "rowConverter 'AdImpressionLogAvroDecoder'"
)

// Register a continuous query
streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques
   from AdImpressionLog window (duration '2' seconds, slide '2' seconds)
   where geo != 'unknown' group by publisher, geo")
.foreachDataFrame(df => {
  df.write.format("column").mode(SaveMode.Append).saveAsTable("adImpressions")
})
AdImpression Demo
Spark, SQL code walkthrough, interactive SQL

AdImpression Demo
Performance comparison to Cassandra, MemSQL - TBD

AdImpression Ingest Performance
Loading from Parquet files using 4 cores; stream ingest (Kafka + Spark Streaming to store, using 1 core)
Why is ingestion and querying fast?
● Linear scaling with partition pruning
● Input queue, stream and in-memory DB all share the same partitioning strategy
How does it scale with concurrency?
● Parallel query engine skips Spark SQL scheduling for low-latency queries
● Column tables: for fast scans/aggregations; also automatically compressed
● Row tables: fast key-based or selective queries; can have any number of secondary indices
How does it scale with concurrency?
● Distributed shuffle for joins, ordering, etc. is expensive
● Two techniques to minimize this:
  1) Colocation of related tables - all related records are collocated. For instance, with the 'Users' table partitioned across nodes, all related 'Ad impressions' can be collocated on the same partition; the parallel query then executes the join locally on each partition.
  2) Replication of 'dimension' tables - joins to dimension tables are always localized.
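The colocation technique above can be sketched in a few lines (a toy model, not SnappyData internals; table contents are made up): when both tables hash-partition on the same key, the distributed join is just a union of independent per-partition local joins.

```python
# Toy model of colocated tables: Users and AdImpressions both partitioned
# by user id, so every join match is found inside a single partition.
NUM_PARTITIONS = 4

def bucket(key: int) -> int:
    return key % NUM_PARTITIONS

users = [(1, "alice"), (2, "bob"), (5, "carol")]
impressions = [(1, "ad-9"), (5, "ad-3"), (1, "ad-7")]

# Colocation: the same key function places related rows together.
partitions = [{"users": [], "imps": []} for _ in range(NUM_PARTITIONS)]
for uid, name in users:
    partitions[bucket(uid)]["users"].append((uid, name))
for uid, ad in impressions:
    partitions[bucket(uid)]["imps"].append((uid, ad))

# The "distributed" join is a union of per-partition local joins:
joined = []
for p in partitions:
    local_users = dict(p["users"])
    for uid, ad in p["imps"]:
        if uid in local_users:
            joined.append((uid, local_users[uid], ad))
# No cross-partition traffic was needed to produce `joined`.
```

Replicating a small dimension table has the same effect by a different route: a full copy on every node means the probe side of the join is always local.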
Low-latency interactive analytic queries – exact or approximate

Exact:       SELECT avg(volume), symbol FROM T1 WHERE <time range> GROUP BY symbol
Approximate: SELECT avg(volume), symbol FROM T1 WHERE <time range> GROUP BY symbol
             WITH ERROR 0.1 CONFIDENCE 0.8
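A sketch of how an "error 0.1, confidence 0.8" contract can be checked on a uniform sample: a CLT-based confidence interval on the sample mean (the data and sizes are made up, and SnappyData's AQP uses stratified samples; this only illustrates the underlying arithmetic).

```python
# Check whether a sample can answer avg(volume) within the requested bound.
import math
import random

rng = random.Random(7)
population = [rng.gauss(100.0, 15.0) for _ in range(200_000)]  # stand-in volumes

sample = rng.sample(population, 2_000)
n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)

z80 = 1.2816                           # z-score for an 80% two-sided interval
half_width = z80 * math.sqrt(var / n)  # CLT half-width of the interval
relative_error = half_width / mean

# The engine can answer from the sample iff the bound meets the request:
answer_from_sample = relative_error <= 0.1
```

If the bound is not met, the engine would fall back to scanning the base table (or a larger sample), which is the speed/accuracy tradeoff the next slide charts.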
Speed/accuracy tradeoff
(Chart: error vs. execution time as a function of sample size - interactive queries answer in ~2 secs on a sample, versus ~100 secs to ~30 mins to execute on the entire dataset)
Key feature: Synopses Data
● Maintain stratified samples
  ○ Intelligent sampling to keep error bounds low
● Probabilistic data structures
  ○ TopK for time series (using time aggregation, CMS, item aggregation)
  ○ Histograms, HyperLogLog, Bloom filters, wavelets

CREATE SAMPLE TABLE sample-table-name USING columnar OPTIONS (
  BASETABLE 'table_name'  -- source column table or stream table
  [ SAMPLINGMETHOD 'stratified | uniform' ]
  STRATA name (
    QCS ('comma-separated-column-names')
    [ FRACTION 'frac' ]
  ),+  -- one or more QCS
)
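The stratified option above can be illustrated with a small sketch (a toy model of the idea, not SnappyData's sampler; column names and the minimum-per-stratum rule are assumptions for the example): sample each QCS group independently, so rare groups keep representation that a uniform sample would likely miss.

```python
# Toy stratified sampler over a QCS (query column set).
import random
from collections import defaultdict

def stratified_sample(rows, qcs, fraction, min_per_stratum=2, seed=0):
    """Sample each distinct QCS group independently at the given fraction."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in qcs)].append(row)
    sample = []
    for group_rows in strata.values():
        k = max(min_per_stratum, int(len(group_rows) * fraction))
        k = min(k, len(group_rows))
        sample.extend(rng.sample(group_rows, k))
    return sample

rows = ([{"publisher": "p1", "geo": "US", "bid": 1.0}] * 1000
        + [{"publisher": "p2", "geo": "NZ", "bid": 2.0}] * 5)
sample = stratified_sample(rows, qcs=("publisher", "geo"), fraction=0.01)
# The tiny (p2, NZ) stratum is still present in the 1% sample.
```

Grouped aggregates then scale each stratum by its own inverse sampling rate, which is what keeps per-group error bounds low.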
Unified Cluster Architecture
How do we extend Spark for real time?
• Spark executors are long-running; driver failure doesn't shut down executors
• Driver HA - drivers run "managed" with a standby secondary
• Data HA - consensus-based clustering integrated for eager replication

How do we extend Spark for real time?
• Bypass the scheduler for low-latency SQL
• Deep integration with Spark Catalyst (SQL) - colocation optimizations, index use, etc.
• Full SQL support - persistent catalog, transactions, DML
Performance – Spark vs Snappy (TPC-H)
See the ACM SIGMOD 2016 paper for details, available on the snappydata.io blog

Performance – Snappy vs. in-memory DB (YCSB)
Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: a TB problem becomes a GB problem; CPU contention drops
● Far less complex: a single cluster for stream ingestion, continuous queries, interactive queries and machine learning
● Much faster: compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
SnappyData is Open Source
● Available for download on GitHub today: https://github.com/SnappyDataInc/snappydata
● Learn more: www.snappydata.io/blog
● Connect:
  ○ twitter: www.twitter.com/snappydata
  ○ facebook: www.facebook.com/snappydata
  ○ linkedin: www.linkedin.com/snappydata
  ○ slack: http://snappydata-slackin.herokuapp.com
EXTRAS
Colocated row/column tables in Spark

(Diagram: each node runs a Spark executor with tasks, the Spark block manager, stream processing, and colocated row and column tables)

● Spark executors are long-lived and shared across multiple apps
● Snappy memory manager and Spark block manager are integrated
Tables can be partitioned or replicated

(Diagram: each node holds a consistent replica of replicated tables plus a subset of a partitioned table's buckets - e.g. buckets A-H, I-P, Q-W - with replica buckets hosted on other nodes)

● Replicated tables: a consistent replica on each node
● Partitioned tables: data partitioned with one or more replicas

Use partitioned tables for large fact tables, replicated tables for small dimension tables
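A minimal sketch of these placement rules (a toy model, not SnappyData's actual bucket assignment; the round-robin scheme and node names are assumptions): a partitioned table spreads its buckets across nodes with REDUNDANCY extra copies, while a replicated dimension table keeps a full copy on every node.

```python
# Toy placement model for partitioned vs. replicated tables.
NODES = ["node0", "node1", "node2"]
NUM_BUCKETS = 12
REDUNDANCY = 1  # one secondary copy per bucket

def placement(bucket_id: int) -> list:
    """Primary owner plus REDUNDANCY secondaries on the next nodes, round-robin."""
    primary = bucket_id % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(1 + REDUNDANCY)]

# Fact table: each bucket has two distinct owners, so losing one node loses no data.
fact_owners = {b: placement(b) for b in range(NUM_BUCKETS)}

# Dimension table: every node holds the full table, so joins to it never leave a node.
dimension_owners = list(NODES)
```

Real systems refine this with load-aware rebalancing, but the invariant is the same: every bucket has 1 + REDUNDANCY owners, and dimension tables are everywhere.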
SnappyData Components

(Diagram: Spark core with micro-batch streaming; Spark SQL Catalyst with transactions; OLAP job scheduler handling OLAP queries, OLTP queries and AQP; P2P cluster replication service; row, column, index and sample tables over shared-nothing logs; accessed from Spark programs, JDBC and ODBC)