Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink

Fault Tolerance and Job Recovery in Apache Flink™

Till Rohrmann @stsffap

Better be safe than sorry
• Failures will happen
• EMC estimated $1.7 billion in costs due to data loss and system downtime
• Recovery will save you time and costs
• Switch between algorithms
• Live upgrade of your system

Fault Tolerance

Fault tolerance guarantees
• At most once
  • No guarantees at all
• At least once
  • Sufficient for many applications
• Exactly once
• Flink provides all guarantees

Checkpoints
• Consistent snapshots of the distributed data stream and operator state
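Why a consistent snapshot of both the stream position and the operator state matters can be illustrated with a small simulation. This is a minimal sketch in plain Java, not Flink code: the class `CheckpointReplaySketch`, its `run` method, and the parameters `interval` and `failAt` are hypothetical names chosen for illustration. It assumes a single summing operator and a source that can be rewound to a checkpointed offset.

```java
import java.util.List;

/** Hypothetical sketch (not Flink API): a summing operator that recovers from a checkpoint. */
class CheckpointReplaySketch {

    /**
     * Sums the input. A checkpoint (source offset + running sum) is taken every
     * `interval` records. The operator crashes once, just before processing the
     * record at index `failAt` (pass -1 for a run without failure).
     */
    static long run(List<Long> input, int interval, int failAt) {
        long checkpointSum = 0L;   // operator state stored in the last checkpoint
        int checkpointOffset = 0;  // source position stored in the last checkpoint
        long sum = 0L;
        int offset = 0;
        boolean failed = false;

        while (offset < input.size()) {
            if (!failed && offset == failAt) {
                // Simulated crash: restore the state and rewind the source to the
                // position recorded in the same checkpoint, then replay from there.
                failed = true;
                sum = checkpointSum;
                offset = checkpointOffset;
                continue;
            }
            sum += input.get(offset);
            offset++;
            if (offset % interval == 0) {
                // Snapshot state and source position together.
                checkpointSum = sum;
                checkpointOffset = offset;
            }
        }
        return sum;
    }
}
```

Under these assumptions, `run(Arrays.asList(1L, 2L, 3L, 4L, 5L), 2, 3)` returns 15, the same result as a run without failure, because the state and the source offset come from the same snapshot. Restoring only one of the two would either drop or double-count records.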

Barriers
• Markers for checkpoints
• Injected in the data flow

Alignment for multi-input operators
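Barrier alignment can be sketched with a toy operator that sums records from several input channels. This is a simplified illustration in plain Java, not Flink's implementation: the class `BarrierAlignmentSketch`, the `BARRIER` marker value, and the round-robin reading order are all assumptions made for the example. Once a barrier arrives on a channel, that channel is blocked (its later records stay queued) until the barrier has arrived on every channel. Only then is the state snapshotted and the channels released.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch (not Flink API): barrier alignment in a multi-input summing operator. */
class BarrierAlignmentSketch {

    /** Marker value standing in for a checkpoint barrier in a channel. */
    static final long BARRIER = Long.MIN_VALUE;

    /** Reads the channels round-robin and returns the sum captured at each aligned barrier. */
    static List<Long> run(List<Deque<Long>> channels) {
        boolean[] blocked = new boolean[channels.size()];
        int blockedCount = 0;
        long sum = 0L;
        List<Long> snapshots = new ArrayList<>();

        boolean progress = true;
        while (progress) {
            progress = false;
            for (int i = 0; i < channels.size(); i++) {
                Deque<Long> channel = channels.get(i);
                if (blocked[i] || channel.isEmpty()) {
                    continue; // a blocked channel keeps its records queued
                }
                long element = channel.poll();
                progress = true;
                if (element == BARRIER) {
                    blocked[i] = true;
                    blockedCount++;
                    if (blockedCount == channels.size()) {
                        // Barriers from all inputs have arrived: snapshot and unblock.
                        snapshots.add(sum);
                        Arrays.fill(blocked, false);
                        blockedCount = 0;
                    }
                } else {
                    sum += element;
                }
            }
        }
        return snapshots;
    }
}
```

With channel A holding `1, BARRIER, 100` and channel B holding `2, 3, BARRIER, 200`, the snapshot is 6: it contains exactly the records that precede the barrier on both inputs. Without blocking channel A, the record 100 would be processed before B's barrier arrives and would leak into the snapshot.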

Operator State
• Stateless operators
• System state
• User defined state

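// Stateless operator: nothing to checkpoint
ds.filter(_ != 0)

// System state: the window contents are managed by Flink
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))

// User defined state: a counter registered with the runtime
import org.apache.flink.api.common.functions.RichReduceFunction;
import org.apache.flink.api.common.state.OperatorState;
import org.apache.flink.configuration.Configuration;

public class CounterSum extends RichReduceFunction<Long> {
    private OperatorState<Long> counter;

    @Override
    public Long reduce(Long v1, Long v2) throws Exception {
        // Count how often reduce is called; the counter is part of the checkpoint.
        counter.update(counter.value() + 1);
        return v1 + v2;
    }

    @Override
    public void open(Configuration config) throws Exception {
        // Arguments: state name, default value, partitioned flag
        counter = getRuntimeContext().getOperatorState("counter", 0L, false);
    }
}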

Advantages
• Separation of app logic from recovery
  • Checkpointing interval is just a config parameter
• High throughput
  • Controllable checkpointing overhead
• Low impact on latency

Cluster High Availability

Without high availability

[Diagram: JobManager, TaskManager]

With high availability

[Diagram: JobManager, TaskManager, Stand-by JobManager, Apache ZooKeeper™]

KEEP GOING

Persisting jobs

[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]

1. Submit job
2. Persist execution graph
3. Write handle to ZooKeeper
4. Deploy tasks
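The job submission steps above can be sketched as a small simulation. This is a plain-Java illustration, not Flink code: the class `JobPersistenceSketch`, its fields and methods, and the handle naming are hypothetical, and the two maps merely stand in for a durable store and for ZooKeeper. The assumption made here is that the execution graph itself goes to durable storage while ZooKeeper only holds a small handle pointing to it, so that a newly elected JobManager can follow the handle to reload the job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not Flink API): persisting a submitted job for later recovery. */
class JobPersistenceSketch {

    // In-memory stand-ins: handle -> serialized execution graph, job id -> handle.
    final Map<String, String> durableStorage = new HashMap<>();
    final Map<String, String> zooKeeper = new HashMap<>();
    final List<String> deployedJobs = new ArrayList<>();

    /** JobManager side, after the client has submitted the job (step 1). */
    void submit(String jobId, String executionGraph) {
        String handle = "graph-" + jobId;
        durableStorage.put(handle, executionGraph); // 2. persist execution graph
        zooKeeper.put(jobId, handle);               // 3. write handle to ZooKeeper
        deployedJobs.add(jobId);                    // 4. deploy tasks
    }

    /** What a new leader could do after a failure: follow every handle and reload the graph. */
    List<String> recoverJobs() {
        List<String> graphs = new ArrayList<>();
        for (String handle : zooKeeper.values()) {
            graphs.add(durableStorage.get(handle));
        }
        return graphs;
    }
}
```

Writing the handle only after the graph has been persisted means that any handle found in ZooKeeper points to data that is already readable.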

Handling checkpoints

[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]

1. Take snapshots
2. Persist snapshots
3. Send handles to JM
4. Create global checkpoint
5. Persist global checkpoint
6. Write handle to ZooKeeper
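The checkpoint handling steps above can be sketched in the same style. Again this is a plain-Java simulation rather than Flink's implementation: `CheckpointCoordinatorSketch`, its methods, the handle format, and the `latest-checkpoint` key are hypothetical, and the sketch assumes only one checkpoint is in flight at a time. The point it illustrates is that the global checkpoint is created and published only after every task has acknowledged its snapshot.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not Flink API): assembling a global checkpoint from task snapshots. */
class CheckpointCoordinatorSketch {

    // In-memory stand-ins for a durable store and for ZooKeeper.
    final Map<String, String> durableStorage = new HashMap<>();
    final Map<String, String> zooKeeper = new HashMap<>();

    private final int expectedTasks;
    private final Map<String, String> pendingHandles = new TreeMap<>(); // task -> snapshot handle

    CheckpointCoordinatorSketch(int expectedTasks) {
        this.expectedTasks = expectedTasks;
    }

    /** TaskManager side: steps 1 to 3. The snapshot content is passed in as `state`. */
    void taskSnapshot(long checkpointId, String task, String state) {
        String handle = "cp-" + checkpointId + "/" + task; // 1. take snapshot
        durableStorage.put(handle, state);                 // 2. persist snapshot
        acknowledge(checkpointId, task, handle);           // 3. send handle to JM
    }

    /** JobManager side: steps 4 to 6, executed once all tasks have acknowledged. */
    void acknowledge(long checkpointId, String task, String handle) {
        pendingHandles.put(task, handle);
        if (pendingHandles.size() < expectedTasks) {
            return; // checkpoint still incomplete, nothing is published yet
        }
        String global = String.join(",", pendingHandles.values()); // 4. create global checkpoint
        String globalHandle = "global-cp-" + checkpointId;
        durableStorage.put(globalHandle, global);                  // 5. persist global checkpoint
        zooKeeper.put("latest-checkpoint", globalHandle);          // 6. write handle to ZooKeeper
        pendingHandles.clear();
    }
}
```

With two expected tasks, the ZooKeeper stand-in stays empty after the first `taskSnapshot` call and receives the handle `global-cp-1` only after the second, so a recovering JobManager never sees a partially completed checkpoint.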

Conclusion

TL;DL
• Job recovery mechanism with low latency and high throughput
• Exactly-once processing semantics
• No single point of failure

Flink will always keep processing your data

flink.apache.org
@ApacheFlink
