Introduction to stream processing with Apache Flink

Seif Haridi, KTH/SICS


Stream processing

Data Science Summit 2015

Why streaming


[Figure: timeline of data availability over the years 2000, 2008 and 2015, moving from Data Warehouse to Batch to Streaming]

- Which data?
- When?
- Who?


3 Parts of a Streaming Infrastructure


Gathering → Broker → Analysis

Data sources:
• Sensors
• Transaction logs
• Server logs
• …


Example: Bouygues Telecom


• Network and subscriber data gathered
• Added to Broker in raw format
• Transformed and analyzed by streaming engine
• Stored back for further processing

http://data-artisans.com/flink-at-bouygues.html

What is Apache Flink?


1 year of Flink - code

[Figure: the Flink code base in April 2014 compared with April 2015]


What is Apache Flink


Distributed Data Flow Processing System

▪ Focused on large-scale data analytics
▪ Unified real-time stream and batch processing
▪ Expressive and rich APIs in Java / Scala (+ Python)
▪ Robust and fast execution backend

[Figure: example dataflow graph built from Source, Map, Filter, Join, Reduce, Iterate and Sink operators]


Flink Stack


[Figure: the Flink stack]
• Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R, Storm, Zeppelin
• APIs: DataSet (Java/Scala), DataStream (Java/Scala)
• Runtime: Streaming dataflow runtime
• Deployment: Local, Cluster, Yarn, Tez, Embedded


Stream Processing with Flink


What is Flink Streaming


• Native, low-latency stream processor
• Expressive functional API
• Flexible operator state, iterations, windows
• Exactly-once processing semantics


Native vs non-native streaming


Non-native streaming

Stream discretizer → Job, Job, Job, Job

    while (true) {
      // get next few records
      // issue batch computation
    }

Native streaming

Long-standing operators

    while (true) {
      // process next record
    }
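The difference between the two loops can be sketched in plain Scala, without any Flink dependency. This is only an illustration of the two execution models; the object and method names (`StreamingModels`, `microBatchSums`, `runningSums`) are hypothetical and the "computation" is assumed to be a simple sum.

```scala
object StreamingModels {
  // Non-native (micro-batch): cut the stream into small batches and run a
  // batch computation (here: a sum) once per batch.
  def microBatchSums(records: Iterator[Int], batchSize: Int): List[Int] =
    records.grouped(batchSize).map(_.sum).toList

  // Native: a long-standing operator updates its state for every record
  // and can emit an updated result immediately.
  def runningSums(records: Iterator[Int]): List[Int] =
    records.scanLeft(0)(_ + _).drop(1).toList
}
```

For the input 1, 2, 3, 4, 5 and a batch size of 2, `microBatchSums` yields `List(3, 7, 5)`, one result per batch, while `runningSums` yields `List(1, 3, 6, 10, 15)`, one result per record.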


Stream processing in Flink

• Continuous streaming model
• Low processing latency
• O(1) state updates per operator
• Exactly-once semantics for state operators


DataStream API


Overview of the API


Windowing Semantics


• Trigger and Eviction policies
• window(<eviction>).every(<trigger>)
• Built-in policies:
  – Time: Time.of(length, TimeUnit/Custom timestamp)
      window(Time.of(20, SECONDS))
  – Count: Count.of(windowSize)
      window(Count.of(20)).every(Count.of(10))
  – Delta: Delta.of(Threshold, Distance function, Start value)
      window(Delta.of(0.1, priceDistanceFun, initPrice))
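To make the eviction/trigger split concrete, here is a minimal sketch of a count-based policy pair in plain Scala. It reflects one reading of `window(Count.of(size)).every(Count.of(slide))`: keep the most recent `size` elements and emit the buffer after every `slide` arrivals. The names `CountWindowSketch` and `countWindows` are hypothetical, and this is not how Flink implements its policies.

```scala
object CountWindowSketch {
  // Eviction: keep at most `size` most recent elements.
  // Trigger: emit the current window contents after every `slide` elements.
  def countWindows[A](input: Seq[A], size: Int, slide: Int): List[Vector[A]] = {
    var buffer = Vector.empty[A]
    val out = List.newBuilder[Vector[A]]
    input.zipWithIndex.foreach { case (elem, i) =>
      buffer = (buffer :+ elem).takeRight(size) // evict the oldest elements
      if ((i + 1) % slide == 0) out += buffer   // trigger fires
    }
    out.result()
  }
}
```

With `countWindows(1 to 6, 4, 2)` the trigger fires three times, emitting `Vector(1, 2)`, `Vector(1, 2, 3, 4)` and `Vector(3, 4, 5, 6)`.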


Word count in Batch and Streaming

case class Word (word: String, frequency: Int)

DataSet API (batch):

    val lines: DataSet[String] = env.readTextFile(...)

    lines.flatMap {line => line.split(" ")
        .map(word => Word(word,1))}
      .groupBy("word").sum("frequency")
      .print()

DataStream API (streaming):

    val lines: DataStream[String] = env.fromSocketStream(...)

    lines.flatMap {line => line.split(" ")
        .map(word => Word(word,1))}
      .keyBy("word").window(Time.of(5,SECONDS))
      .every(Time.of(1,SECONDS)).sum("frequency")
      .print()
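The semantics of the two word-count programs can be imitated on in-memory collections. The sketch below is plain Scala rather than Flink code: `batchCount` counts all lines once, and `windowCount` counts only the lines whose (assumed, integer-second) timestamp falls inside a window of the given length ending at `end`. Calling it for successive `end` values mimics the 5-second window that is evaluated every second. All names besides `Word` are hypothetical.

```scala
object WordCountSketch {
  case class Word(word: String, frequency: Int)

  // Batch: group all words and sum their frequencies once.
  def batchCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ")).map(w => Word(w, 1))
      .groupBy(_.word)
      .map { case (w, ws) => w -> ws.map(_.frequency).sum }

  // Streaming: for a window ending at `end`, count only the lines whose
  // timestamp lies in [end - length, end).
  def windowCount(timed: Seq[(Long, String)], end: Long, length: Long): Map[String, Int] =
    batchCount(timed.collect {
      case (ts, line) if ts >= end - length && ts < end => line
    })
}
```

For the timestamped lines `(0, "a b")`, `(3, "a")`, `(6, "b")` and a window length of 5, the window ending at 5 gives `Map(a -> 2, b -> 1)`, the one ending at 6 gives `Map(a -> 1)`, and the one ending at 7 gives `Map(a -> 1, b -> 1)`.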


Flexible windows


More at: http://flink.apache.org/news/2015/02/09/streaming-example.html

• Stream of stocks
• Trigger warning if price fluctuates by 5%
• Count the number of warnings per stock in a 30-second (tumbling) window
• Do it continuously

Stream types along the pipeline: Data Stream, Keyed Stream, Windowed Stream, Keyed Stream, Windowed Stream


[Figure: StockStream → keyBy symbol → Delta 5% of price → Warning → Count → keyBy symbol → 30 sec window → Sum]

Flexible windows


    case class Count(symbol: String, count: Int)
    val defaultPrice = StockPrice("", 1000)

Use delta policy to create change warnings:

    val priceWarnings = stockStream.keyBy("symbol")
      .window(Delta.of(0.05, priceChange, defaultPrice))
      .mapWindow(sendWarning _)

Count number of warnings per stock every half a minute:

    val warningPerStock = priceWarnings.flatten()
      .map(Count(_, 1))
      .keyBy("symbol")
      .window(Time.of(30, SECONDS))
      .sum("count")
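The logic of the stock example can be traced on a small in-memory sequence. The sketch below is plain Scala, not the Flink program: it assumes `priceChange` is the relative price difference, it uses the first price seen per symbol as the start value (the slide passes a `defaultPrice` instead), and it counts warnings for a single window only. `StockWarningSketch`, `warnings` and `countPerSymbol` are hypothetical names.

```scala
object StockWarningSketch {
  case class StockPrice(symbol: String, price: Double)
  case class Count(symbol: String, count: Int)

  // Assumed distance function: relative change between two prices.
  def priceChange(old: StockPrice, current: StockPrice): Double =
    math.abs(old.price - current.price) / old.price

  // Emit the symbol whenever its price has moved by more than `threshold`
  // since the last price that raised a warning (or the first price seen).
  def warnings(stream: Seq[StockPrice], threshold: Double): List[String] = {
    val reference = scala.collection.mutable.Map.empty[String, StockPrice]
    val out = List.newBuilder[String]
    stream.foreach { p =>
      reference.get(p.symbol) match {
        case Some(ref) if priceChange(ref, p) > threshold =>
          out += p.symbol
          reference(p.symbol) = p // the new price becomes the reference
        case Some(_) => ()        // change too small, no warning
        case None => reference(p.symbol) = p
      }
    }
    out.result()
  }

  // Count warnings per symbol, as the 30-second window would for one window.
  def countPerSymbol(ws: Seq[String]): Map[String, Int] =
    ws.map(Count(_, 1)).groupBy(_.symbol)
      .map { case (s, cs) => s -> cs.map(_.count).sum }
}
```

For the prices A 100, A 103, A 106, B 50, A 100, B 51 and a threshold of 0.05, stock A raises two warnings (at 106 and again at 100) and stock B raises none, so `countPerSymbol` returns `Map(A -> 2)`.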


Iterative stream processing


Motivation
• Many applications require cyclic streams
• Machine learning applications (parallel model training, evaluation)

Iterations in Flink Streaming
• Native support for cyclic dataflows
• Integrated with functional API
• High performance and expressivity

[Figure: cyclic dataflow with Input, Train and Evaluate operators]
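A cyclic dataflow can be pictured as a loop in which elements that are not yet finished are fed back to the head of the pipeline. The following plain-Scala sketch models that feedback edge with a queue; it is an illustration of the idea only, not Flink's iteration API, and the names `IterationSketch`, `iterate`, `step` and `done` are hypothetical.

```scala
import scala.collection.immutable.Queue

object IterationSketch {
  // Each element passes through `step`; if `done` holds it leaves the cycle,
  // otherwise it is fed back behind the pending input.
  // Terminates only if every element eventually satisfies `done`.
  def iterate[A](input: Seq[A])(step: A => A, done: A => Boolean): List[A] = {
    @annotation.tailrec
    def loop(queue: Queue[A], out: List[A]): List[A] =
      queue.dequeueOption match {
        case None => out.reverse
        case Some((head, rest)) =>
          val next = step(head)
          if (done(next)) loop(rest, next :: out)
          else loop(rest.enqueue(next), out) // feedback edge
      }
    loop(Queue(input: _*), Nil)
  }
}
```

For example, `IterationSketch.iterate(Seq(1, 5, 9))(_ * 2, _ >= 10)` doubles each value until it reaches at least 10 and returns `List(10, 18, 16)`: 5 and 9 leave after one pass, while 1 circulates four times.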


Fault tolerance


Exactly-once processing for operator state


• Based on consistent global snapshots
• Low runtime overhead, stateful exactly-once semantics


Checkpointing / Recovery


Detailed algorithm: Lightweight Asynchronous Snapshots for Distributed Dataflows
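The recovery idea behind snapshot-based fault tolerance can be shown with a single stateful operator. The sketch below is plain Scala and deliberately much simpler than the asynchronous snapshot algorithm referenced above: it assumes one operator, a replayable input, and a snapshot that stores the operator state together with the input offset it reflects. `CheckpointSketch`, `Snapshot` and `run` are hypothetical names.

```scala
object CheckpointSketch {
  // A snapshot pairs the operator state with the input position it reflects.
  case class Snapshot(offset: Int, counts: Map[String, Int])

  // Process `input` from the snapshot's offset up to (excluding) `until`
  // and return the resulting state as a new snapshot.
  def run(input: Vector[String], from: Snapshot, until: Int): Snapshot = {
    val counts = input.slice(from.offset, until).foldLeft(from.counts) {
      (acc, key) => acc.updated(key, acc.getOrElse(key, 0) + 1)
    }
    Snapshot(until, counts)
  }
}
```

Suppose a checkpoint is taken after three records, the operator then processes two more and fails. Restoring the checkpoint and replaying the input from offset 3 produces the same counts as a failure-free run, so each record affects the state exactly once:

```scala
import CheckpointSketch._

val input = Vector("a", "b", "a", "c", "a", "b")
val start = Snapshot(0, Map.empty)
val checkpoint = run(input, start, 3)                // consistent snapshot
val lost = run(input, checkpoint, 5)                 // progress lost in the failure
val recovered = run(input, checkpoint, input.length) // restore and replay
assert(recovered == run(input, start, input.length))
```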

Fault tolerance

Checkpointing and recovery of operator state is very fast
• Data processing does not block

Executions based on CPU/operator time are not idempotent

Other execution modes are based on timestamps of input streams (Event/Ingress time)
• Allows idempotent executions
• End-to-End exactly-once semantics
• In Flink version 0.10
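Why timestamp-based execution is idempotent can be seen in a small sketch: if every event is assigned to a window by its own timestamp, the result depends only on the data and not on when, or in which order, the events are processed or replayed. The code is plain Scala under the assumption of non-negative millisecond timestamps and tumbling windows; `EventTimeSketch`, `Event` and `countByEventTime` are hypothetical names.

```scala
object EventTimeSketch {
  case class Event(key: String, timestamp: Long)

  // Assign each event to a tumbling window using its own timestamp.
  // The window is identified by its start time.
  def countByEventTime(events: Seq[Event], windowMillis: Long): Map[(String, Long), Int] =
    events
      .groupBy(e => (e.key, e.timestamp / windowMillis * windowMillis))
      .map { case (window, es) => window -> es.size }
}
```

Running `countByEventTime` with a 1000 ms window on the events (a, 100), (a, 900), (b, 1200), (a, 1500) gives two events for key a in the window starting at 0 and one event each for a and b in the window starting at 1000. Reversing the input yields the same map, whereas windows cut by the operator's own clock would depend on arrival time.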


Streaming in Apache Flink

True streaming over stateful distributed dataflow engine

Expressive Streaming API in Java/Scala
• Flexible window semantics
• Iterative computation

Low streaming latency, exactly-once semantics depending on execution mode, and low overhead for recovery


Special Thanks to

• Gyula Fora, SICS
• Paris Carbone, KTH
• Kostas Tzoumas, Data Artisans
• Stephan Ewen, Data Artisans
• Volker Markl, TU-Berlin
