41
1

Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

  • Upload
    others

  • View
    17

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

1

Page 2: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Modern Stream Processing with Apache Flink®

Till Rohrmann

GOTO Berlin 2017

2

Page 3: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

3

Original creators ofApache Flink®

dA Platform 2Open Source Apache Flink + dA Application Manager

Page 4: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

What changes faster? Data or Query?

4

Data changes slowlycompared to fast changing queries

ad-hoc queries, data exploration, ML training and

(hyper) parameter tuning

Batch ProcessingUse Case

Data changes fastapplication logic

is long-lived

continuous applications,data pipelines, standing queries,

anomaly detection, ML evaluation, …

Stream ProcessingUse Case

Page 5: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Batch Processing

5

Page 6: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Stream Processing

6

Page 7: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

7

Apache Flink in a Nutshell

Page 8: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Apache Flink in a Nutshell

8

Queries

Applications

Devices

etc.

Database

Stream

File / ObjectStorage

Stateful computations over streams real-time and historic

fast, scalable, fault tolerant, in-memory,event time, large state, exactly-once

HistoricData

Streams

Application

Page 9: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

9

Event Streams State (Event) Time Snapshots

The Core Building Blocks

real-time andhindsight

complexbusiness logic

consistency without-of-order data

and late data

forking /versioning /time-travel

Page 10: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Powerful Abstractions

10

Process Function (events, state, time)

DataStream API (streams, windows)

Stream SQL / Tables (dynamic tables)

Stream- & Batch Data Processing

High-levelAnalytics API

Stateful Event-Driven Applications

val stats = stream .keyBy("sensor") .timeWindow(Time.seconds(5)) .sum((a, b) -> a.add(b))

def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = { // work with event and state (event, state.value) match { … }

out.collect(…) // emit events state.update(…) // modify state

// schedule a timer callback ctx.timerService.registerEventTimeTimer(event.timestamp + 500) }

Layered abstractions tonavigate simple to complex use cases

Page 11: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Hardened at scale

11

Athena X Streaming SQLPlatform Service

Streaming Platform as a Service

Fraud detection Streaming Analytics Platform

100s jobs, 1000s nodes, TBs state metrics, analytics, real time ML Streaming SQL as a platform

Page 12: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

12

Distributed application infrastructure

Page 13: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Good old centralized architecture

13

The big meancentral database

$$$

The grumpyDBA

Application Application Application Application

Page 14: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Modern distributed app. architecture

14

Application

Sensor

APIsApplication

Application

Application

The limit to what you can do is how sophisticatedcan you compute over the stream!

Boils down to: How well do you handle state and time

Page 15: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

15

A Flink-favored approach

Page 16: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Event Sourcing + Memory Image

16

event logpersists events (temporarily)

event /command

Process

main memory

update local variables/structures

periodically snapshot the

memory

Page 17: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Event Sourcing + Memory Image

17

Recovery: Restore snapshot and replay events since snapshot

event logpersists events (temporarily)

Process

Page 18: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Distributed Memory Image

18

Distributed application, many memory images. Snapshots are all consistent together.

Page 19: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Stateful Event & Stream Processing

19

Scalable embedded state

Access at memory speed & scales with parallel operators

Page 20: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Compute, State, and Storage

20

Classic tiered architecture Streaming architecture

database layer

compute layer

application state + backup

compute+

stream storage and

snapshot storage (backup)

application state

Page 21: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Performance

21

synchronous reads/writes across tier boundary

asynchronous writes of large blobs

all modificationsare local

Classic tiered architecture Streaming architecture

Page 22: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Consistency

22

distributed transactions

at scale typically at-most / at-least once

exactly onceper state =1 =1

Classic tiered architecture Streaming architecture

Page 23: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Scaling a Service

23separately provision additional

database capacity

provision compute and state together

Classic tiered architecture Streaming architecture

provisioncompute

Page 24: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Rolling out a new Service

24provision a new database

(or add capacity to an existing one)simply occupies some

additional backup space

Classic tiered architecture Streaming architecture

provision compute and state together

Page 25: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

What users built on checkpoints…

▪ Upgrades and Rollbacks ▪ Cross Datacenter Failover ▪ State Archiving ▪ Application Migration ▪ Spot Instance Region Chasing ▪ A/B testing ▪ …

25

Page 26: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

26

What is the next wave of stream processing

applications?

Page 27: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

What changes faster? Data or Query?

27

Data changes slowlycompared to fast changing queries

ad-hoc queries, data exploration, ML training and tuning

Batch ProcessingUse Case

Data changes fastapplication logic

is long-lived

continuous applications,data pipelines, standing queries,

anomaly detection, ML evaluation, …

Stream ProcessingUse Case

Page 28: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Analytics & Business Logic

28

Stream

Source analytics metricsBusiness

Logic

Stream Processor

events

ApplicationDatabase

Page 29: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Blending Analytics & Business Logic

29

Stream

Source StreamingApplication

Stream Processor

events

RPCs

(Actor) Messages

Page 30: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

AthenaX by Uber

30

Page 31: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

31

Can one build an entire sophisticated web application (say a social network)

on a stream processor?

(Yes, we can!™)

Page 32: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

32

Social network implemented using event sourcing and CQRS (Command Query Responsibility Segregation) on Kafka/Flink/Elasticsearch/Redis

More: https://data-artisans.com/blog/drivetribe-cqrs-apache-flink

@

Page 33: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

33

The next wave of stream processing applications…

… is all types of stateful applications that react to

data and time!

Page 34: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

34

Stateful stream applications

versioning, upgrading, rollback, duplicating,

migrating, …

Continuous applications

Page 35: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

The dA Platform Architecture

35

dA Platform 2

Apache Flink Stateful stream processing

Kubernetes Container platform

Logging

Streams from Kafka, Kinesis,

S3, HDFS, Databases, ... dA

Application Manager

Application lifecycle management

Metrics

CI/CD

Real-time Analytics

Anomaly- & Fraud Detection

Real-time Data Integration

Reactive Microservices (and more)

Page 36: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Versioned Applications, not Jobs/Jars

36

Stream ProcessingApplication

Version 3

Version 2

Version 1

Code and Application Snapshot

upgrade

upgrade

New Application

Version 3a

Version 2afork /

duplicate

Page 37: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Deployments, not Flink Clusters

37

Testing / QA Kubernetes Cluster Production Kubernetes Cluster

Threat Metrics App. Testing

Activity Monitor Application

Fraud Detection Application

Fraud Detection App. Testing

Page 38: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

Hooks for CI/CD pipelines

38

Kubernetes Cluster

Application Version 1

CI Service

dA Application

Managerpush

update

triggerCI

upgrade API

stateful upgrade

Application Version 2

Page 39: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

39

Thank you!@stsffap @ApacheFlink @dataArtisans

Page 40: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

We are hiring!

data-artisans.com/careers

40

Page 41: Modern Stream Processing - GOTO Conference · Data or Query? 4 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter

41