Streaming Analytics on AWS Dmitri Tchikatilov AdTech BD, AWS [email protected]

Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv


Page 1: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Streaming Analytics on AWS
Dmitri Tchikatilov
AdTech BD, [email protected]

Page 2: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Agenda

1. Streaming principles
2. Streaming analytics on AWS
3. Kinesis and Apache Spark on EMR
4. Querying and Scaling
5. Best Practices

Page 3: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Batch vs. Stream

Data scope
  Batch processing: Queries or processing over all or most of the data
  Stream processing: Queries or processing over data in a rolling window or on the most recent data record

Data size
  Batch processing: Large batches of data
  Stream processing: Individual records or micro batches of a few records

Performance
  Batch processing: Latencies in minutes to hours
  Stream processing: Requires latency on the order of seconds or milliseconds

Analytics
  Batch processing: Complex analytics
  Stream processing: Simple response functions, aggregates, and rolling metrics

Page 4: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Streaming App Challenges

• Simple & Flexible Analytics
• Elastic - adapt to input surges and back pressure
• Fast ~ 1s to 100ms for the majority of apps
• Scalable ~ 1M records/sec
• Available - low tolerance for record losses

[Diagram axes: Usability, Performance]

Page 5: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

“We are our choices...”

J.P. Sartre

Page 6: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Stream Processing Choices on AWS

Storm
  Operations: Zookeeper/Nimbus for HA
  Analytics: SQL - 3rd party, roll your own

Kafka
  Operations: Zookeeper (failure detection, partitioning, replication)
  Analytics: SQL - 3rd party, roll your own

Druid
  Operations: Zookeeper, multiple node roles scale independently
  Analytics: OLAP engine (JSON) on denormalized data, real-time indexing

Kinesis
  Operations: AWS Service
  Analytics: SQL - Kinesis Analytics (in development)

Spark Streaming
  Operations: EMR bootstraps latest 1.6, Yarn, Monitoring
  Analytics: SparkSQL on DataFrames, Joins, Zeppelin notebooks

Page 7: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Components

Storage layer: Ingest (record storing, ordering, strong consistency and replayable reads)

Processing layer: Analytics (consume data from storage layer, run computations, removal from storage)

[Diagram: Storage, Processing]

Page 8: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Storage - Amazon Kinesis Streams

• Real-Time Streaming Data Ingestion
• Custom-built Streaming Applications (KCL)
• Inexpensive: $0.014 per 1,000,000 PUT Payload Units

Kinesis Stream
• 1 Shard: < 1 MB/s in, 2 MB/s out
• Each record < 1 MB
• PutRecords(): < 500 records (5 MB) per call
• Increased retention: 7 days
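The PutRecords limits on this slide imply that a producer has to split its output into batches. The following is a minimal sketch of that batching in plain Python. The function name `chunk_for_put_records` is hypothetical, and the sketch counts only payload bytes (it ignores partition keys), so treat the limits as an approximation of what the service enforces.

```python
MAX_RECORDS_PER_CALL = 500             # record-count limit per PutRecords call (from the slide)
MAX_BYTES_PER_CALL = 5 * 1024 * 1024   # 5 MB per PutRecords call (from the slide)
MAX_BYTES_PER_RECORD = 1024 * 1024     # 1 MB per record (from the slide)


def chunk_for_put_records(payloads):
    """Group byte payloads into batches that respect the limits above.

    Simplification: only payload bytes are counted, not partition keys.
    """
    batches, current, current_bytes = [], [], 0
    for payload in payloads:
        size = len(payload)
        if size > MAX_BYTES_PER_RECORD:
            raise ValueError("record exceeds the 1 MB per-record limit")
        too_many = len(current) == MAX_RECORDS_PER_CALL
        too_big = current_bytes + size > MAX_BYTES_PER_CALL
        # Close the current batch before it would break either limit.
        if current and (too_many or too_big):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(payload)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# 1,200 small messages end up in three calls: 500 + 500 + 200 records.
small_batches = chunk_for_put_records([b"x" * 100] * 1200)
```

Each resulting batch could then be sent with one PutRecords call, for example through the `put_records` method of a boto3 Kinesis client, which is not shown here because it needs AWS credentials and a real stream.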

Page 9: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Processing - Spark Streaming

[Diagram: Input data streams are read by Receivers into a DStream, which the Spark job processes; results are published to destinations.]

RDD = Resilient Distributed Dataset
DStream = Collection of RDDs

Page 10: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Spark Streaming – Long Running Spark App

[Diagram: The Driver Program (StreamingContext, SparkContext) launches Spark jobs to process received data. On one Worker Node, an Executor runs a long task, the Receiver, which reads the input stream. On another Worker Node, an Executor runs Tasks; the Worker Node processes the data and outputs the batch.]

Page 11: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Analytics - DataFrames on Streaming Data

• KCL – Kinesis Client Library (helps take data off Kinesis)
• Spark Streaming uses KCL - reads data from Kinesis and forms a DStream (Pull Mechanism)
• Creates DataFrame in Spark Streaming

[Diagram: Kinesis, KCL]

Page 12: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Kinesis and Spark Streaming

[Diagram: Kinesis, EMR]

Page 13: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Full Kinesis + Spark Pipeline

Page 14: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

What About Analytics?

What operations are possible?
Filter, GroupBy, Join, Window Operations

Not all queries make sense to run on the stream.
Large joins on RDDs in DStreams can be expensive.

Page 15: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Spark Streaming – Operations on DStreams

Window Operations
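To make the idea of a window operation concrete, here is a small sketch in plain Python rather than Spark. It treats each list as one micro-batch and, every `slide_interval` batches, emits key counts over the last `window_length` batches. The function name and the use of batch counts are assumptions for illustration; the DStream window methods in Spark Streaming take durations instead.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length, slide_interval):
    """Yield key counts over the last `window_length` batches,
    once every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    for index, batch in enumerate(batches, start=1):
        window.append(Counter(batch))
        if index % slide_interval == 0:
            total = Counter()
            for counts in window:
                total.update(counts)
            yield dict(total)


micro_batches = [["a", "b"], ["a"], ["b", "b"], ["c"]]
results = list(windowed_counts(micro_batches, window_length=3, slide_interval=2))
```

With a window of three batches sliding every two, the second result no longer includes the first batch, which is the behaviour a rolling metric relies on.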

Page 16: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Query the Data in DStreams?

This is all great, but I’d like to query my data!

StreamingContext > DStream (RDDs) > DataFrame

The DataFrame is registered as a temporary table and queried with SQL through HiveContext.
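The pattern on this slide (take one micro-batch, expose it as a temporary table, run SQL against it) can be illustrated without a cluster. The sketch below deliberately swaps Spark for Python's built-in `sqlite3` module, so it shows only the shape of the approach and not the Spark API. The table name `events` and its columns are hypothetical.

```python
import sqlite3


def query_micro_batch(rows, sql):
    """Load one micro-batch into a temporary table and run a SQL query on it.

    sqlite3 stands in for the Spark temp table and HiveContext here.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TEMP TABLE events (user_id TEXT, action TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


batch = [("u1", "click"), ("u2", "view"), ("u1", "click")]
result = query_micro_batch(
    batch,
    "SELECT user_id, COUNT(*) FROM events GROUP BY user_id ORDER BY user_id",
)
```

In a Spark 1.6 application the same steps would typically happen inside `foreachRDD`: build a DataFrame from the RDD, call `registerTempTable`, then issue the query through the SQL or Hive context.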

Page 17: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Example: Querying DStreams with SQL

Courtesy: Amo Abeyarante, AWS Big Data Blog

Page 18: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Setup

1. Kinesis Stream with data provided by Python script
2. KCL Scala app launched as spark-job
   • Checks the number of Shards and instantiates the same number of Streams
   • Receives data from Kinesis in small batches
   • Creates DataFrame, registers as temp table
   • Creates HiveContext
3. Use Hive app to query the data

Page 19: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Demo – Querying Streams

Page 20: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Analytics – Choosing Where to Join Data

• Storage side: join the data in a custom KCL app – denormalize and publish to another Kinesis Stream
• Processing side: join the streaming data using DStreams

Page 21: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Amazon Kinesis + Spark on EMR

[Diagram: Producers 1 to N write to a Kinesis stream with Shard 1 and Shard 2. On EMR, Yarn Executor 1 runs Receiver 1 (KCL Worker 1), which holds RecordProcessor 1 and RecordProcessor 2. Yarn Executor 2 runs no receiver.]

Page 22: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Create DStream to Scale Out

from pyspark import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# Bracketed arguments are placeholders to fill in for your application.
kinesisStream = KinesisUtils.createStream(
    streamingContext, [Kinesis app name], [Kinesis stream name], [endpoint URL],
    [region name], [initial position], [checkpoint interval], StorageLevel.MEMORY_AND_DISK_2)
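Scaling out usually means creating one receiver-backed DStream per shard and combining them. The sketch below shows one way this might look. The helper names `create_unioned_stream` and `kinesis_stream_factory` are hypothetical, the initial position `LATEST` is an assumed choice, and the PySpark imports are deferred so that the helper itself can be exercised without a Spark installation.

```python
def create_unioned_stream(ssc, num_receivers, create_stream):
    """Create `num_receivers` input DStreams and union them into one.

    `create_stream` is any callable that takes the StreamingContext
    and returns a DStream.
    """
    streams = [create_stream(ssc) for _ in range(num_receivers)]
    return ssc.union(*streams)


def kinesis_stream_factory(app_name, stream_name, endpoint_url, region_name,
                           checkpoint_seconds):
    """Return a callable that builds one Kinesis DStream (needs PySpark)."""
    def _create(ssc):
        # Imported lazily so this module loads without PySpark installed.
        from pyspark import StorageLevel
        from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
        return KinesisUtils.createStream(
            ssc, app_name, stream_name, endpoint_url, region_name,
            InitialPositionInStream.LATEST,  # assumed starting position
            checkpoint_seconds, StorageLevel.MEMORY_AND_DISK_2)
    return _create
```

On a cluster, a call such as `create_unioned_stream(ssc, 2, kinesis_stream_factory(...))` would be expected to start two receivers, matching a stream with two shards.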

Page 23: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Amazon Kinesis + Spark on EMR

[Diagram: Producers 1 to N write to a Kinesis stream with Shard 1 and Shard 2. On EMR, Yarn Executor 1 runs Receiver 1 (KCL Worker 1) with RecordProcessor 1, and Yarn Executor 2 runs Receiver 2 (KCL Worker 2) with RecordProcessor 2.]

Page 24: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Scaling Kinesis

• Can accumulate data at any rate, but needs input batching for high rates of small messages to optimize cost
• Scales inputs by splitting shards
• Never “pressures” Spark – Spark and KCL are pulling data
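The cost effect of input batching can be estimated with a few lines of arithmetic. This sketch uses the price of $0.014 per 1,000,000 PUT Payload Units quoted earlier in the deck and assumes that one payload unit covers up to 25 KB of a record, with each record rounded up to whole units. The function names are hypothetical.

```python
import math

PAYLOAD_UNIT_BYTES = 25 * 1024     # assumption: one PUT Payload Unit covers up to 25 KB
PRICE_PER_MILLION_UNITS = 0.014    # USD, as quoted in this deck


def put_payload_units(record_bytes):
    """Number of payload units billed for one record (rounded up)."""
    return max(1, math.ceil(record_bytes / PAYLOAD_UNIT_BYTES))


def put_cost(num_messages, message_bytes, messages_per_record=1):
    """Estimated PUT cost in USD when messages are packed into records."""
    records = math.ceil(num_messages / messages_per_record)
    units = records * put_payload_units(message_bytes * messages_per_record)
    return units * PRICE_PER_MILLION_UNITS / 1_000_000


unbatched = put_cost(1_000_000, 100)                        # one message per record
batched = put_cost(1_000_000, 100, messages_per_record=250)  # 250 messages per record
```

Under these assumptions, packing 250 messages of 100 bytes into each record cuts the number of billed units from 1,000,000 to 4,000.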

Page 25: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Scaling EMR/Spark

• Scales by adding task nodes – can be EC2 Spot instances
• Yarn can be configured for “dynamic resource allocation” with a variable number of executors per app. New default for the upcoming EMR 4.4 release. Works well for batch, but not always for Streaming.
• Automatic – same number of Receivers (in case of shard split/merge operations)
• Manual (app restart) – if you need to change the number of Receivers
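Since dynamic resource allocation does not always suit streaming, one option is to switch it off for a streaming app and pin the executor count. The sketch below only renders Spark properties as `spark-submit --conf` arguments. The helper name is hypothetical and the executor count of 4 is an arbitrary example value, not a recommendation from the talk.

```python
def to_spark_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Assumed settings for a streaming app with a fixed number of executors.
streaming_conf = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "4",  # hypothetical value
}

submit_args = to_spark_submit_args(streaming_conf)
```

The resulting list can be appended to a `spark-submit` command line when the application is launched or restarted.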

Page 26: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Stability in Spark Streaming

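Stable: Tp <= Tb
Tb (batch) = 4s, Tp (process) = 2s. Batches arrive at 0s, 4s and 8s, and each is processed in 2s, before the next batch arrives.

Unstable: Tp > Tb
Tb (batch) = 4s, Tp (process) = 5s. Batches arrive at 0s, 4s and 8s, but each takes 5s to process, so every batch starts later than the one before.

Unstable state – increase in scheduling delay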
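The growth of the scheduling delay can be reproduced with a short simulation. This is a simplified model, assuming a constant processing time and one batch processed at a time; the function name is hypothetical.

```python
def scheduling_delays(batch_interval, processing_time, num_batches):
    """Return how long each batch waits before its processing starts.

    Batch i becomes ready at (i + 1) * batch_interval and can only start
    once the previous batch has finished.
    """
    delays = []
    previous_finish = 0
    for i in range(num_batches):
        ready_at = (i + 1) * batch_interval
        start = max(ready_at, previous_finish)
        delays.append(start - ready_at)
        previous_finish = start + processing_time
    return delays


stable = scheduling_delays(batch_interval=4, processing_time=2, num_batches=4)
unstable = scheduling_delays(batch_interval=4, processing_time=5, num_batches=4)
```

With Tb = 4s and Tp = 2s the delay stays at zero, while with Tp = 5s it grows by one second per batch and never recovers.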

Page 27: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Spark Backpressure Feature

• After every micro batch finishes, its statistics are used to estimate the processing rate
• A PID (proportional-integral-derivative) controller estimates the maximum rate of ingest for the system (rows/sec)
• The PID controller limits the ingest

SparkConf: spark.streaming.backpressure.enabled = true
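A PID-style rate update can be sketched in a few lines. This is a simplified illustration of the idea and should not be read as Spark's actual implementation. The function name, the gain values (`kp`, `ki`, `kd`) and the minimum rate are assumptions chosen for the example.

```python
def next_ingest_rate(latest_rate, processing_rate, scheduling_delay,
                     batch_interval, latest_error=0.0, elapsed=1.0,
                     kp=1.0, ki=0.2, kd=0.0, min_rate=100.0):
    """Estimate the next ingest limit in records/sec.

    Rates are in records/sec; delays and intervals are in seconds.
    Returns the new rate and the current error for the next update.
    """
    # Proportional term: how much faster we ingest than we process.
    error = latest_rate - processing_rate
    # Integral term: backlog implied by the scheduling delay.
    historical_error = scheduling_delay * processing_rate / batch_interval
    # Derivative term: how quickly the error is changing.
    d_error = (error - latest_error) / elapsed
    new_rate = latest_rate - kp * error - ki * historical_error - kd * d_error
    return max(new_rate, min_rate), error


rate, err = next_ingest_rate(latest_rate=1000.0, processing_rate=800.0,
                             scheduling_delay=1.0, batch_interval=4.0)
```

In this example the system ingests 1000 records/sec but processes only 800, with one second of scheduling delay on a 4s batch interval, so the suggested limit drops below the processing rate to let the backlog drain.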

Page 28: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Analytics on Streaming Data

Analytics on streaming data is here today, but requires some work. Major advancements are coming soon in Kinesis Analytics and Spark 2.0.

A lot of analytics can be done simply in a custom KCL app (moving averages, joins, filters, etc.).

[Diagram labels: Flexibility, Performance]

Page 29: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Streaming Best Practices Summary

1. Total processing time is less than the batch interval (Tp < Tb).
2. Load is well balanced: # of Receivers is a multiple of # of Executors.
3. Spark Streaming reading from Kinesis defaults to 1 sec.
4. Enable Spark checkpoints for reliable (at-least-once) semantics. Use Spark 1.6 with EMRFS for S3.
5. Give streaming apps different names so that they do not use the same DynamoDB table.

Page 30: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv

Dmitri Tchikatilov
Digital Advertising [email protected]
