23
Till Rohrmann [email protected] @stsffap Dynamic Scaling: How Apache Flink® Adapts to Changing Workloads

Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Embed Size (px)

Citation preview

Page 1: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Till [email protected]

@stsffap

Dynamic Scaling: How Apache Flink® Adapts to Changing Workloads

Page 2: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Changing Workloads And SLAs

2

Page 3: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Resource Adaption

3

+

time

WorkloadResources

time

WorkloadResources

Page 4: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

What Is This Talk About?§ Flink’s approach to dynamic scaling

§ Current state with demo

§ Outlook on next development steps

4

Page 5: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Dynamic Scaling

5

Page 6: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Basic Idea

6

• Spread work across more workers to decrease workload

Page 7: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Scaling Stateless Jobs

7

Scale Up Scale DownSource

Mapper

Sink

• Scale up: Deploy new tasks• Scale down: Cancel running tasks

Page 8: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Scaling Stateful Jobs

8

?

• Problem: Which state to assign to new task?

Page 9: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

State in Apache Flink

9

Page 10: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Keyed vs. Non-keyed State

10

• State bound to a key• E.g. Keyed UDF and window state

• State bound to a subtask• E.g. Source state

Keyed Non-keyed

Page 11: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Repartitioning Keyed State§ Similar to consistent

hashing§ Split key space into

key groups§ Assign key groups to

tasks

11

Key space

Key group #1 Key group #2

Key group #3Key group #4

Page 12: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Repartitioning Keyed State contd.

§ Rescaling changes key group assignment

§ Maximum parallelism defined by #key groups

12

Page 13: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Repartitioning Non-keyed state§ User defined merge and

split functions• Most general approach

§ Breaking non-keyed state up into finer granularity• State has to contain

multiple entries• Automatic repartitioning

wrt granularity

13

#1 #2

#3

Page 14: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Repartitioning Non-keyed State contd.

§ Non-keyed state entries gathered at the job manager

§ Repartitioning schemes• Repartition & send• Union & broadcast

14

Page 15: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Example: Kafka Source

15

partitionId: 1, offset: 42partitionId: 3, offset: 10partitionId: 6, offset: 27

• Store offset for each partition• Individual entries are repartitionable

partitionId: 1, offset: 42partitionId: 3, offset: 10partitionId: 6, offset: 27

Page 16: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Rescaling: Why is That so Hard?§ Handling of state§ Repartitioning of keyed & non-keyed

state§ Unique among open source stream

processors, afaik

16

Page 17: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Demo Time

17

Page 18: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Demo Topology

18

Kafka Source Counter

KeyBy

Page 19: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Current State and next Steps

19

Page 20: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Current State

§ Manual rescaling1. Take savepoint2. Stop the job3. Restart job with adjusted parallelism and

savepoint

20

Page 21: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Next Steps§ Integrate savepoint with stop signal§ Rescaling individual operators w/o restart§ Dynamic container de-/allocation• “Running Flink Everywhere” by Stephan

Ewen, 16:45 at Kesselhaus

21

Page 22: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Auto Scaling Policies

22

• Latency• Throughput• Resource utilization

• Kubernetes on GCE, EC2 and Mesos (marathon-autoscale) already support auto-scaling

Page 23: Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

Conclusion§ Scaling of keyed and non-keyed state§ Flink supports manual rescaling with

restart (WIP branch: https://github.com/tillrohrmann/flink/tree/partitionable-op-state)

§ Future versions might support scaling on the fly and automatic rescaling policies

23