Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv


Streaming Analytics on AWS
Dmitri Tchikatilov
AdTech BD, AWS
dmitrit@amazon.com

Agenda

1. Streaming principles
2. Streaming analytics on AWS
3. Kinesis and Apache Spark on EMR
4. Querying and scaling
5. Best practices

Batch vs. Stream

Data scope:
  Batch: queries or processing over all or most of the data
  Stream: queries or processing over a rolling window or the most recent record

Data size:
  Batch: large batches of data
  Stream: individual records or micro-batches of a few records

Performance:
  Batch: latencies of minutes to hours
  Stream: requires latencies on the order of seconds or milliseconds

Analytics:
  Batch: complex analytics
  Stream: simple response functions, aggregates, and rolling metrics

Streaming App Challenges

Simple and flexible analytics
Elastic: adapts to input surges and backpressure
Fast: ~1 s to 100 ms for the majority of apps
Scalable: ~1M records/sec
Available: low tolerance for record losses

(Trade-off axes: usability vs. performance)

“We are our choices...”

J.P. Sartre

Stream Processing Choices on AWS

Storm:
  Operations: Zookeeper/Nimbus for HA
  Analytics: SQL via 3rd party, or roll your own
Kafka:
  Operations: Zookeeper (failure detection, partitioning, replication)
  Analytics: SQL via 3rd party, or roll your own
Druid:
  Operations: Zookeeper; multiple node roles scale independently
  Analytics: OLAP engine (JSON) on denormalized data, real-time indexing
Kinesis:
  Operations: managed AWS service
  Analytics: SQL via Kinesis Analytics (in development)
Spark Streaming:
  Operations: EMR bootstraps the latest Spark 1.6, YARN, monitoring
  Analytics: SparkSQL on DataFrames, joins, Zeppelin notebooks

Components

Storage layer: ingest (record storing, ordering, strong consistency, and replayable reads)

Processing layer: analytics (consume data from the storage layer, run computations, remove processed data from storage)

Real-Time Streaming Data Ingestion

Custom-built streaming applications (KCL)

Inexpensive: $0.014 per 1,000,000 PUT Payload Units (a PUT Payload Unit is a 25 KB chunk of a record)

Storage - Amazon Kinesis Streams

Kinesis stream limits: each shard supports up to 1 MB/s in and 2 MB/s out; each record can be up to 1 MB; PutRecords() accepts up to 500 records (5 MB) per call; retention can be increased to 7 days.
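A minimal producer sketch, assuming boto3 and hypothetical names (stream "my-stream", region "us-east-1", field "user_id"): batching records into a single PutRecords call to stay under the 500-record / 5 MB limit.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # hypothetical region

def put_batch(records, stream_name="my-stream"):  # hypothetical stream name
    """Send up to 500 records (max 5 MB total) in one PutRecords call."""
    entries = [
        {"Data": json.dumps(r).encode("utf-8"),
         "PartitionKey": str(r["user_id"])}  # hypothetical partition key field
        for r in records
    ]
    resp = kinesis.put_records(StreamName=stream_name, Records=entries)
    # Retry handling for resp["FailedRecordCount"] > 0 is omitted for brevity.
    return resp["FailedRecordCount"]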

Processing - Spark Streaming

(Diagram: Receivers consume the input data streams; a Spark job processes them and publishes the results to destinations.)

DStream:
RDD = Resilient Distributed Dataset
DStream = collection of RDDs

Spark Streaming – Long-Running Spark App

(Architecture diagram: the Driver Program holds the StreamingContext and SparkContext and submits Spark jobs to process the received data. On one worker node, an executor runs the Receiver as a long-running task against the input stream; executors on the other worker nodes run the processing tasks and emit output batches.)

Analytics - DataFrames on Streaming Data

• KCL – the Kinesis Client Library (helps take data off Kinesis)
• Spark Streaming uses the KCL to read data from Kinesis and form a DStream (pull mechanism)
• DataFrames are then created from the DStream in Spark Streaming

Kinesis and Spark Streaming

(Diagram: Kinesis feeding Spark on EMR, the full Kinesis + Spark pipeline.)

What About Analytics?

What operations are possible? Filter, GroupBy, Join, and window operations.

Not all queries make sense to run on the stream: large joins on the RDDs in DStreams can be expensive.

Spark Streaming – Operations on DStreams: Window Operations
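A minimal windowed-count sketch, assuming a hypothetical socket test source on localhost:9999 and 10 s micro-batches; reduceByKeyAndWindow keeps a rolling per-key count over the last 60 s.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="window-sketch")
ssc = StreamingContext(sc, 10)  # 10 s micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical test source
pairs = lines.map(lambda word: (word, 1))

# Rolling count per key over the last 60 s, recomputed every 10 s.
counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 60, 10)
counts.pprint()

ssc.start()
ssc.awaitTermination()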

Query the Data in DStreams?

This is all great, but I’d like to query my data!

StreamingContext > DStream (RDDs) > DataFrame

The DataFrame is converted to a temporary table and queried with SQL through a HiveContext.
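A minimal sketch of that flow in PySpark, assuming kinesisStream is the DStream created with KinesisUtils (shown later) and that each element is a JSON record with hypothetical fields user_id and amount; a HiveContext, as on the slide, works the same way as the SQLContext used here.

import json
from pyspark.sql import SQLContext

def run_query(time, rdd):
    if rdd.isEmpty():
        return
    sqlContext = SQLContext.getOrCreate(rdd.context)
    # Each element is assumed to be JSON, e.g. {"user_id": "u1", "amount": 3}
    df = sqlContext.createDataFrame(rdd.map(json.loads))
    df.registerTempTable("events")  # hypothetical table name
    sqlContext.sql(
        "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id"
    ).show()

kinesisStream.foreachRDD(run_query)  # runs once per micro-batch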

Example: Querying DStreams with SQL

Courtesy of Amo Abeyaratne, AWS Big Data Blog

Setup

1. A Kinesis stream with data provided by a Python script
2. A KCL Scala app launched as a Spark job that:
   • checks the number of shards and instantiates the same number of streams
   • receives data from Kinesis in small batches
   • creates a DataFrame and registers it as a temp table
   • creates a HiveContext
3. A Hive app used to query the data

Demo – Querying Streams

Analytics – Choosing Where to Join Data

Option 1 (storage layer): join the data in a custom KCL app, denormalize, and publish to another Kinesis stream.

Option 2 (processing layer): join the streaming data using DStreams, as in the sketch below.
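A minimal DStream join sketch; the two queueStream inputs are hypothetical stand-ins for real streams (e.g. impressions and clicks keyed by ad ID). join() pairs records that share a key within the same micro-batch.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="join-sketch")
ssc = StreamingContext(sc, 4)  # 4 s micro-batches

# Hypothetical stand-ins for two real key-value streams.
impressions = ssc.queueStream([sc.parallelize([("ad-1", "imp"), ("ad-2", "imp")])])
clicks = ssc.queueStream([sc.parallelize([("ad-1", "click")])])

# Matches keys within the same micro-batch: ("ad-1", ("imp", "click"))
impressions.join(clicks).pprint()

ssc.start()
ssc.awaitTermination()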

Amazon Kinesis + Spark on EMR

(Diagram: Producers 1..N write to Kinesis Shards 1 and 2. On EMR, a single Receiver inside KCL Worker 1 on YARN Executor 1 runs RecordProcessor 1 and RecordProcessor 2, one per shard; YARN Executor 2 runs no receiver.)

Create DStream to Scale Out

from pyspark import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

kinesisStream = KinesisUtils.createStream(
    streamingContext,
    [Kinesis app name],      # also names the DynamoDB checkpoint table
    [Kinesis stream name],
    [endpoint URL],
    [region name],
    [initial position],      # e.g. InitialPositionInStream.LATEST
    [checkpoint interval],   # seconds between KCL checkpoints
    StorageLevel.MEMORY_AND_DISK_2)
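Each createStream call creates one receiver. A scale-out sketch, with hypothetical app, stream, and region names: create one DStream per shard, then union them into a single DStream.

from pyspark import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

numShards = 2  # match the stream's shard count (e.g. via DescribeStream)
streams = [
    KinesisUtils.createStream(
        streamingContext, "my-spark-app", "my-stream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, 10, StorageLevel.MEMORY_AND_DISK_2)
    for _ in range(numShards)
]
unioned = streamingContext.union(*streams)  # one DStream across all receivers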

Amazon Kinesis + Spark on EMR

(Diagram: Producers 1..N write to Kinesis Shards 1 and 2. On EMR, Receiver 1 in KCL Worker 1 on YARN Executor 1 runs RecordProcessor 1, and Receiver 2 in KCL Worker 2 on YARN Executor 2 runs RecordProcessor 2: one receiver per shard.)

Scaling Kinesis
• Can ingest data at any rate, but batch inputs at high rates of small messages to optimize cost
• Scales input by splitting shards
• Never "pressures" Spark: Spark and the KCL pull the data

Scaling EMR/Spark
• Scales by adding task nodes, which can be EC2 Spot instances
• YARN can be configured for dynamic resource allocation, with a variable number of executors per app (the new default in the upcoming EMR 4.4 release); this works well for batch, but not always for streaming
• Automatic: the number of Receivers stays the same across shard split/merge operations
• Manual (app restart): required if you need to change the number of Receivers
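A minimal shard-split sketch, assuming boto3 and a hypothetical stream name; splitting a shard at the midpoint of its hash-key range doubles that shard's write capacity.

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # hypothetical region

desc = kinesis.describe_stream(StreamName="my-stream")  # hypothetical stream
shard = desc["StreamDescription"]["Shards"][0]
lo = int(shard["HashKeyRange"]["StartingHashKey"])
hi = int(shard["HashKeyRange"]["EndingHashKey"])

# Split the shard at the midpoint of its hash-key range.
kinesis.split_shard(
    StreamName="my-stream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((lo + hi) // 2),
)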

Stability in Spark Streaming

Example: with a batch interval Tb = 4 s and a processing time Tp = 2 s, each batch finishes before the next one arrives. With Tb = 4 s and Tp = 5 s, each batch finishes after the next one arrives.

Stable: Tp <= Tb
Unstable: Tp > Tb

In the unstable state, the scheduling delay keeps increasing.

Spark Backpressure Feature

After every micro-batch finishes, its statistics are used to estimate the processing rate.

A PID (proportional-integral-derivative) controller estimates the maximum ingest rate the system can sustain (rows/sec) and limits the ingest accordingly.

SparkConf: spark.streaming.backpressure.enabled = true
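A minimal sketch of enabling backpressure; the receiver rate cap (spark.streaming.receiver.maxRate, a standard Spark setting) is an optional addition, not something the slide prescribes.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("backpressure-sketch")
        .set("spark.streaming.backpressure.enabled", "true")
        # Optional hard cap per receiver while the PID controller warms up.
        .set("spark.streaming.receiver.maxRate", "10000"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 4)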

Analytics on Streaming Data

Analytics on streaming data is here today, but requires some work. Major advancements are coming soon in Kinesis Analytics and Spark 2.0.

A lot of analytics can be done simply in a custom KCL app (moving averages, joins, filters, etc.).

(Trade-off axes: flexibility vs. performance)

Streaming Best Practices Summary

1. Keep total processing time below the batch interval (Tp < Tb).
2. Balance the load: make the number of Receivers a multiple of the number of Executors.
3. Spark Streaming reading from Kinesis defaults to a 1-second interval.
4. Enable Spark checkpoints for reliable (at-least-once) semantics; use Spark 1.6 with EMRFS for S3.
5. Give each streaming app a different Kinesis application name so they don't share the same DynamoDB checkpoint table.
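A minimal checkpointing sketch for point 4, assuming a hypothetical S3 bucket; with EMRFS the checkpoint directory can live on S3.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def create_context():
    sc = SparkContext(appName="checkpoint-sketch")
    ssc = StreamingContext(sc, 4)
    ssc.checkpoint("s3://my-bucket/spark-checkpoints")  # hypothetical bucket
    # ... define the Kinesis DStream and processing here ...
    return ssc

# Recover from the checkpoint if one exists; otherwise build a fresh context.
ssc = StreamingContext.getOrCreate("s3://my-bucket/spark-checkpoints", create_context)
ssc.start()
ssc.awaitTermination()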

Dmitri Tchikatilov
Digital Advertising BD
dmitrit@amazon.com