Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and data-driven...

Preview:

Citation preview

Big thanks to everyone!

The convergence ofreal-time analytics and

event-driven applications@StephanEwen

Flink Forward San FranciscoApril 11, 2017

2

3

2016 was the year when streaming technologies became mainstream

2017 is the year to realize the full spectrum

of streaming applications

Some large scale streaming applications

4

5

Detecting fraud in real time

As fraudsters get better, need to update models without downtime

Live 24/7 service

Credit card transactions

Notificationsand alerts

Evolving fraudmodels built bydata scientists

@

6

@ Athena X SQL to define metrics Thresholds and actions to trigger Blends analytics and

actionsStreams from Hadoop, Kafka, etc

SQL, thresholds,

actions

AnalyticsAlerts

Derived streams

7

Route events to Kafka, ES, Hive Complex interaction sessions rules Mix of stateless / small state / large state

Stream Processing as a Service• Launching, monitoring, scaling, updating• DSL to define jobs

@

8

Blink based on Flink A core system in Alibaba Search

• Machine learning, search, recommendations• A/B testing of search algorithms• Online feature updates to boost conversion rate

Alibaba is a major contributor to Flink Contributing many changes back to open source

@

9

@

Complete social network implementedusing event sourcing andCQRS (Command Query Responsibility Segregation)

What can we learn from these?

10

All these applications run on Flink Applications, not just analytics

• Not just finding out what the data means but acting on that at the same time

Workloads going beyond the traditional Hadoop realm• Hadoop is possible deploy, source, and sink• Container engines and other storage systems

increasingly popular with Flink

So, what is data streaming?

11

First wave for streaming was lambda architecture• Aid batch systems to be more real-time

Second wave was analytics (real time and lag-time)• Based on distributed collections, functions, and

windows

The next wave is much broader:A new architecture for event-driven applications

12

Event–driven applications

Events, State, Time, and Snapshots

14

f(a,b)

Event-driven functionexecuted distributedly

Events, State, Time, and Snapshots

15

f(a,b)

Maintain fault tolerant local state similar toany normal application

Events, State, Time, and Snapshots

16

f(a,b)

wall clock

event time clock

Access and react tonotions of time and progress,handle out-of-order events

Events, State, Time, and Snapshots

17

f(a,b)

wall clock

event time clock

Snapshot point-in-timeview for recovery,rollback, cloning,versioning, etc.

Event–driven applications

18

Event-drivenApplications

Stream Processing

Batch Processing

Stateful, event-driven,event-time-aware processing

(event sourcing, CQRS, …)

(streams, windows, …)

(data sets)

The APIs

19

Process Function (events, state, time)

DataStream API (streams, windows)

Table API (dynamic tables)

Stream SQL

Stream- &Batch Processing

Analytics

StatefulEvent-DrivenApplications

Process Function

20

class MyFunction extends ProcessFunction[MyEvent, Result] {

// declare state to use in the program lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)

def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = { // work with event and state (event, state.value) match { … }

out.collect(…) // emit events state.update(…) // modify state

// schedule a timer callback ctx.timerService.registerEventTimeTimer(event.timestamp + 500) }

def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = { // handle callback when event-/processing- time instant is reached }}

Data Stream API

21

val lines: DataStream[String] = env.addSource( new FlinkKafkaConsumer09<>(…))

val events: DataStream[Event] = lines.map((line) => parse(line))

val stats: DataStream[Statistic] = stream .keyBy("sensor") .timeWindow(Time.seconds(5)) .sum(new MyAggregationFunction())

stats.addSink(new RollingSink(path))

Table API & Stream SQL

22

Streaming Architecturefor Event-driven Applications

23

Compute, State, and Storage

24

Classic tiered architecture

Streaming architecture

database

layer

computelayer

application state+ backup

compute+

stream storageand

snapshot storage(backup)

application state

Performance

25

synchronous reads/writesacross tier boundary

asynchronous writesof large blobs

all modificationsare local

Classic tiered architecture

Streaming architecture

Consistency

26

distributed transactions

at scale typicallyat-most / at-least once

exactly onceper state

=1 =1snapshot consistency

across states

Classic tiered architecture

Streaming architecture

Scaling a Service

27

separately provision additionaldatabase capacity

provision computeand state together

Classic tiered architecture

Streaming architecture

provision compute

Rolling out a new Service

28

provision a new database(or add capacity to an existing one)

provision compute

and state together

simply occupies someadditional backup

space

Classic tiered architecture

Streaming architecture

Time, Completeness, Out-of-order

29

?

event time clocksdefine data

completenessevent time timers

handle actions for

out-of-order data

Classic tiered architecture

Streaming architecture

Repair External State

30

Streaming architecture

streams(lets say Kafka etc) live application external state

wrong results

backed up data(HDFS, S3, etc.)

Repair External State

31

Streaming architecture

live application external state

overwritewith correct results

streams(lets say Kafka etc)

backed up data(HDFS, S3, etc.) application on backup

input

Repair External State

32

Streaming architecture

live application external state

overwritewith correct results

streams(lets say Kafka etc)

backed up date(HDFS, S3, etc.)

Each service doubles as a batch job!

application on backup input

33

Streaming has outgrown the Hadoop Stack

Event-driven applications and realtime analytics converge with Apache Flink

Event-driven applications become easierto manage, faster, and more powerful following a

streaming architecture implemented with Flink

Recommended