Why streaming
Figure: data availability timeline with marks at 2000, 2008, and 2015; labels: Data Warehouse, Batch, Streaming
• Which data?
• When?
• Who?
Data Science Summit 2015 S. Haridi
3 Parts of a Streaming Infrastructure
• Gathering: sensors, transaction logs, server logs, …
• Broker
• Analysis
Example: Bouygues Telecom
• Network and subscriber data gathered
• Added to Broker in raw format
• Transformed and analyzed by streaming engine
• Stored back for further processing
http://data-artisans.com/flink-at-bouygues.html
What is Apache Flink
Distributed Data Flow Processing System
▪Focused on large-scale data analytics
▪Unified real-time stream and batch processing
▪Expressive and rich APIs in Java / Scala (+ Python)
▪Robust and fast execution backend
Figure: example dataflow built from the operators Source, Map, Filter, Join, Reduce, Iterate, and Sink
Flink Stack
Figure: components of the Flink stack
▪ Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, Storm, Zeppelin
▪ APIs: DataSet (Java/Scala), DataStream (Java/Scala)
▪ Runtime: Streaming dataflow runtime
▪ Deployment: Local, Cluster, Yarn, Tez, Embedded
What is Flink Streaming
• Native, low-latency stream processor
• Expressive functional API
• Flexible operator state, iterations, windows
• Exactly-once processing semantics
Native vs non-native streaming
Non-native streaming: a stream discretizer issues a sequence of batch jobs (Job, Job, Job, Job)
while (true) {
  // get next few records
  // issue batch computation
}
Native streaming: long-standing operators
while (true) {
  // process next record
}
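The two loops above can be contrasted with a small plain-Scala sketch. It does not use any Flink API; the object name `StreamingModels` and both function names are made up for illustration. Both variants keep a running sum, but the micro-batch variant only emits a result once per batch, while the native variant updates and emits per record.

```scala
object StreamingModels {
  // Non-native (micro-batch) sketch: collect a few records, then run one
  // batch computation over them. One result is emitted per batch.
  def microBatch(records: Iterator[Int], batchSize: Int): List[Int] = {
    var total = 0
    records.grouped(batchSize).map { batch =>
      total += batch.sum // one "job" per batch
      total
    }.toList
  }

  // Native sketch: a long-standing operator updates its state for every
  // record and can emit a result immediately.
  def native(records: Iterator[Int]): List[Int] = {
    var total = 0
    records.map { r =>
      total += r
      total
    }.toList
  }
}
```

With the input 1, 2, 3, 4, 5 and a batch size of 2, the micro-batch variant emits three results and the native variant emits five; the final value is the same in both.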
Stream processing in Flink
• Continuous Streaming model
• Low processing latency
• O(1) state updates per operator
• Exactly once semantics for state operators
Windowing Semantics
• Trigger and Eviction policies
• window(<eviction>).every(<trigger>)
• Built-in policies:
– Time: Time.of(length, TimeUnit/Custom timestamp)
– window(Time.of(20, SECONDS))
– Count: Count.of(windowSize)
– window(Count.of(20)).every(Count.of(10))
– Delta: Delta.of(Threshold, Distance function, Start value)
– window(Delta.of(0.1, priceDistanceFun, initPrice))
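To make the split between eviction and trigger concrete, here is a minimal plain-Scala sketch of what a call such as window(Count.of(20)).every(Count.of(10)) could mean. It is a simulation over a finite sequence, not Flink code; `CountWindows` and `slidingCount` are hypothetical names, and details such as whether partially filled windows are emitted are assumptions.

```scala
object CountWindows {
  // Eviction: keep at most `size` of the most recent elements.
  // Trigger: emit the current buffer after every `every` arrivals.
  def slidingCount[A](input: Seq[A], size: Int, every: Int): List[List[A]] = {
    require(size > 0 && every > 0, "size and every must be positive")
    var buffer = Vector.empty[A]
    val out = List.newBuilder[List[A]]
    input.zipWithIndex.foreach { case (x, i) =>
      buffer = (buffer :+ x).takeRight(size) // evict the oldest elements
      if ((i + 1) % every == 0) out += buffer.toList // trigger fires
    }
    out.result()
  }
}
```

For example, slidingCount(1 to 6, size = 4, every = 2) emits a window after the 2nd, 4th, and 6th element, each holding at most the last four elements. Time and delta policies follow the same pattern with a different eviction and trigger condition.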
Word count in Batch and Streaming
case class Word (word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
      .map(word => Word(word,1))}
  .keyBy("word")
  .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
  .sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
      .map(word => Word(word,1))}
  .groupBy("word")
  .sum("frequency")
  .print()
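The relationship between the two word-count programs can be illustrated without a cluster. The sketch below is plain Scala, not the DataSet or DataStream API: it assumes each line carries a timestamp in seconds, and `WindowedWordCount`, `batchCount`, and `windowCount` are names chosen for illustration. The windowed count is simply the batch count applied to the lines whose timestamps fall into one window.

```scala
object WindowedWordCount {
  case class Word(word: String, frequency: Int)

  // Batch flavour: count words over a finite collection of lines.
  def batchCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ").toSeq)
      .filter(_.nonEmpty)
      .map(Word(_, 1))
      .groupBy(_.word)
      .map { case (w, ws) => w -> ws.map(_.frequency).sum }

  // Streaming flavour (simulated): count only the lines whose timestamp
  // lies in the window (end - length, end].
  def windowCount(events: Seq[(Long, String)], end: Long, length: Long): Map[String, Int] =
    batchCount(events.collect {
      case (ts, line) if ts > end - length && ts <= end => line
    })
}
```

Evaluating windowCount with length 5 at end = 5, 6, 7, … mimics a 5-second window that slides every second.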
Flexible windows
• Stream of stocks
• Trigger warning if price fluctuates by 5%
• Count the number of warnings per stock in 30 second (tumbling) window
• Do it continuously
Figure: pipeline, in order: StockStream, keyBy symbol, Delta 5% of price, Warning, Count, keyBy symbol, 30 sec window, Sum (stream types: Data Stream, Keyed Stream, Windowed Stream)
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
Flexible windows
case class Count(symbol: String, count: Int)
val defaultPrice = StockPrice("", 1000)

// Use delta policy to create change warnings
val priceWarnings = stockStream.keyBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

// Count number of warnings per stock every half a minute
val warningPerStock = priceWarnings.flatten()
  .map(Count(_, 1))
  .keyBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
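The logic of this pipeline can be checked on a small in-memory sequence. The following is a plain-Scala simulation, not Flink code. It assumes the delta policy compares each price with a per-symbol reference price, raises a warning when the relative change exceeds the threshold, and then resets the reference; `StockWarnings`, `warnings`, and `countPerSymbol` are illustrative names, and the 30-second window is left out, so the count covers the whole input.

```scala
object StockWarnings {
  case class StockPrice(symbol: String, price: Double)
  case class Count(symbol: String, count: Int)

  // Delta policy sketch: emit the symbol whenever its price moved by more
  // than `threshold` (relative) since the last reference price.
  def warnings(prices: Seq[StockPrice], threshold: Double, start: Double): List[String] = {
    var reference = Map.empty[String, Double]
    val out = List.newBuilder[String]
    prices.foreach { p =>
      val ref = reference.getOrElse(p.symbol, start)
      if (math.abs(p.price - ref) / ref > threshold) {
        out += p.symbol
        reference = reference.updated(p.symbol, p.price) // reset reference
      }
    }
    out.result()
  }

  // Count warnings per symbol (what the sum over one window would produce).
  def countPerSymbol(ws: Seq[String]): Map[String, Int] =
    ws.map(Count(_, 1))
      .groupBy(_.symbol)
      .map { case (s, cs) => s -> cs.map(_.count).sum }
}
```

Applying countPerSymbol to the warnings that fall into each 30-second slice would give the per-window counts described on the slide.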
Iterative stream processing
Motivation
• Many applications require cyclic streams
• Machine learning applications (parallel model training, evaluation)
Iterations in Flink Streaming
• Native support for cyclic dataflows
• Integrated with functional API
• High performance and expressivity
Figure: cyclic dataflow with the stages Input, Train, and Evaluate
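A cyclic dataflow can be pictured as a step function whose output is either released downstream or fed back to its own input. The sketch below is a single-threaded plain-Scala toy, not the Flink iteration API; `FeedbackLoop.run`, `step`, and `done` are hypothetical names. In a training scenario, `step` would play the role of Train and `done` the role of Evaluate.

```scala
import scala.collection.mutable

object FeedbackLoop {
  // Apply `step` to each record; records that satisfy `done` leave the loop,
  // all others are fed back to the head of the dataflow.
  def run(input: Seq[Double], step: Double => Double, done: Double => Boolean): List[Double] = {
    val queue = mutable.Queue.empty[Double]
    queue ++= input
    val out = List.newBuilder[Double]
    while (queue.nonEmpty) {
      val next = step(queue.dequeue())
      if (done(next)) out += next // leaves the cycle
      else queue.enqueue(next)    // feedback edge
    }
    out.result()
  }
}
```

For instance, run(Seq(8.0, 3.0), _ / 2, _ < 1) keeps halving each value until it drops below 1. The loop only terminates if every record eventually satisfies `done`.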
Exactly-once processing for operator state
• Based on consistent global snapshots
• Low runtime overhead, stateful exactly-once semantics
Checkpointing / Recovery
Detailed algorithm: Lightweight Asynchronous Snapshots for Distributed Dataflows
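The effect of checkpointing on operator state can be shown with a single-operator toy. This is not the snapshot algorithm from the paper named above and not Flink code; it only assumes that a checkpoint stores the operator state together with the input offset it reflects, and that the input can be replayed from that offset. `CheckpointReplay` and its members are illustrative names.

```scala
object CheckpointReplay {
  // A checkpoint pairs the operator state (a running sum) with the number
  // of input records that state already reflects.
  case class Checkpoint(offset: Int, sum: Long)

  // Process `input` starting from `from`, checkpointing every `interval`
  // records. If `failAt` is defined, crash before processing that offset and
  // return the last completed checkpoint with `false`.
  def runOnce(input: Vector[Int], from: Checkpoint, interval: Int,
              failAt: Option[Int]): (Checkpoint, Boolean) = {
    var sum = from.sum
    var last = from
    var i = from.offset
    while (i < input.length) {
      if (failAt.contains(i)) return (last, false) // state since `last` is lost
      sum += input(i)
      i += 1
      if (i % interval == 0) last = Checkpoint(i, sum)
    }
    (Checkpoint(i, sum), true)
  }

  // Run with one simulated failure, then restore the checkpoint and replay.
  def runWithRecovery(input: Vector[Int], interval: Int, failAt: Int): Long = {
    val (cp, finished) = runOnce(input, Checkpoint(0, 0L), interval, Some(failAt))
    if (finished) cp.sum
    else runOnce(input, cp, interval, None)._1.sum
  }
}
```

Because the state is rolled back to the checkpoint and the input is replayed from the matching offset, every record affects the final sum exactly once, even though some records are processed twice.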
Fault tolerance
• Check-pointing and recovery of operator state is very fast
– Data processing does not block
• Executions based on CPU/operator time are not idempotent
• Other execution modes are based on timestamps of input streams (Event/Ingress time)
– Allows idempotent executions
– End-to-End exactly-once semantics
– In Flink version 0.10
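Why timestamp-based execution can be idempotent is easy to see in a small sketch. The code below is plain Scala, not the Flink 0.10 API; `EventTimeWindows`, `Event`, and `sumPerWindow` are hypothetical names. Each event is assigned to a tumbling window using only its own timestamp, so the result does not depend on arrival order or on when the operator happens to run.

```scala
object EventTimeWindows {
  case class Event(timestamp: Long, value: Int)

  // Assign each event to the tumbling window containing its timestamp
  // (keyed by window start), then sum the values per window.
  def sumPerWindow(events: Seq[Event], windowSize: Long): Map[Long, Int] =
    events
      .groupBy(e => e.timestamp / windowSize * windowSize)
      .map { case (start, es) => start -> es.map(_.value).sum }
}
```

Replaying the same events after a failure, even in a different order, yields the same windows. A window defined by the operator's own clock would instead depend on when each record happened to be processed.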
Streaming in Apache Flink
• True streaming over stateful distributed dataflow engine
• Expressive Streaming API in Java/Scala
– Flexible window semantics
– Iterative computation
• Low streaming latency, exactly-once semantics depending on execution mode, and low overhead for recovery