Fault Tolerance and Job Recovery in Apache Flink™

Till Rohrmann, [email protected], @stsffap


Better be safe than sorry

• Failures will happen
• EMC estimated $1.7 billion costs due to data loss and system downtime
• Recovery will save you time and costs
• Switch between algorithms
• Live upgrade of your system

Fault Tolerance

Fault tolerance guarantees

• At most once
  • No guarantees at all
• At least once
  • For many applications sufficient
• Exactly once

Flink provides all guarantees

Checkpoints

• Consistent snapshots of distributed data stream and operator state

Barriers

• Markers for checkpoints
• Injected in the data flow

Alignment for multi-input operators
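
A minimal sketch of what alignment could look like, in plain Java with no Flink dependency. The class `AligningSumOperator` and its methods are hypothetical names for illustration, not Flink's implementation. The idea shown: once a barrier has arrived on one input channel, records from that channel are held back until the barrier has arrived on every channel; only then is the state snapshotted and the barrier forwarded. The sketch assumes a single checkpoint in flight and keeps all held-back records in one queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical multi-input summing operator that aligns checkpoint barriers. */
final class AligningSumOperator {
    private final int numChannels;
    // Channels whose barrier for the current checkpoint has already arrived.
    private final Set<Integer> blocked = new HashSet<>();
    // Records from blocked channels; they belong to the next checkpoint epoch.
    private final Queue<Long> buffered = new ArrayDeque<>();
    private final List<String> output = new ArrayList<>();
    private long sum = 0; // the operator state that gets snapshotted

    AligningSumOperator(int numChannels) {
        this.numChannels = numChannels;
    }

    void onRecord(int channel, long value) {
        if (blocked.contains(channel)) {
            buffered.add(value); // hold back until alignment completes
        } else {
            process(value);
        }
    }

    void onBarrier(int channel, long checkpointId) {
        blocked.add(channel);
        if (blocked.size() == numChannels) {
            // Barriers from all inputs are in: snapshot, forward the barrier, unblock.
            output.add("snapshot#" + checkpointId + " sum=" + sum);
            output.add("barrier#" + checkpointId);
            blocked.clear();
            while (!buffered.isEmpty()) {
                process(buffered.poll());
            }
        }
    }

    private void process(long value) {
        sum += value;
        output.add("sum=" + sum);
    }

    static List<String> demo() {
        AligningSumOperator op = new AligningSumOperator(2);
        op.onRecord(0, 1);
        op.onBarrier(0, 1);  // channel 0 is now blocked
        op.onRecord(0, 10);  // buffered: arrived after the barrier on channel 0
        op.onRecord(1, 2);   // processed: still before the barrier on channel 1
        op.onBarrier(1, 1);  // alignment complete
        return op.output;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In the demo, the snapshot records `sum=3`: the value 10 is excluded because it arrived after the barrier on its channel, and it is processed only once the barrier has been forwarded.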

Operator State

• Stateless operators

    ds.filter(_ != 0)

• System state

    ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))

• User defined state

    import org.apache.flink.api.common.functions.RichReduceFunction;
    import org.apache.flink.api.common.state.OperatorState;
    import org.apache.flink.configuration.Configuration;

    public class CounterSum extends RichReduceFunction<Long> {
        // Counts the calls to reduce(); this state is part of the checkpoint.
        private OperatorState<Long> counter;

        @Override
        public Long reduce(Long v1, Long v2) throws Exception {
            counter.update(counter.value() + 1);
            return v1 + v2;
        }

        @Override
        public void open(Configuration config) throws Exception {
            // Register the state under the name "counter" with default value 0.
            counter = getRuntimeContext().getOperatorState("counter", 0L, false);
        }
    }

Advantages

• Separation of app logic from recovery
  • Checkpointing interval is just a config parameter
• High throughput
  • Controllable checkpointing overhead
• Low impact on latency

Cluster High Availability

Without high availability

[Diagram: JobManager, TaskManager]

With high availability

[Diagram: JobManager, TaskManager, Stand-by JobManager, Apache ZooKeeper™]

KEEP GOING

Persisting jobs

[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™, Job]

1. Submit job
2. Persist execution graph
3. Write handle to ZooKeeper
4. Deploy tasks
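
The four steps above can be sketched in plain Java as follows. Everything here is a stand-in for illustration: `BlobStore` plays the role of durable storage, `Coordination` the role of ZooKeeper, and the `/jobs/<id>` path and `recover` method are assumptions, not Flink's actual layout or API. The point of the sketch is the indirection: the execution graph goes to storage, only a small handle goes to ZooKeeper, and a stand-by JobManager sharing both can presumably follow the handle to get the job back.

```java
import java.util.HashMap;
import java.util.Map;

final class JobPersistenceSketch {

    /** Stand-in for durable storage: maps a handle to the persisted payload. */
    static final class BlobStore {
        private final Map<String, String> blobs = new HashMap<>();
        private int next = 0;

        String persist(String payload) {
            String handle = "blob-" + (next++);
            blobs.put(handle, payload);
            return handle;
        }

        String read(String handle) {
            return blobs.get(handle);
        }
    }

    /** Stand-in for ZooKeeper: stores only small handles, never the job itself. */
    static final class Coordination {
        private final Map<String, String> nodes = new HashMap<>();

        void write(String path, String handle) {
            nodes.put(path, handle);
        }

        String read(String path) {
            return nodes.get(path);
        }
    }

    /** Hypothetical JobManager; active and stand-by instances share store and coordination. */
    static final class JobManager {
        private final BlobStore store;
        private final Coordination zk;

        JobManager(BlobStore store, Coordination zk) {
            this.store = store;
            this.zk = zk;
        }

        // Step 1 is the client calling this method.
        String submit(String jobId, String executionGraph) {
            String handle = store.persist(executionGraph); // 2. persist execution graph
            zk.write("/jobs/" + jobId, handle);            // 3. write handle to ZooKeeper
            return "deployed " + executionGraph;           // 4. deploy tasks (simulated)
        }

        // What a stand-by JobManager could do after taking over.
        String recover(String jobId) {
            String handle = zk.read("/jobs/" + jobId);
            return handle == null ? null : store.read(handle);
        }
    }

    static String demo() {
        BlobStore store = new BlobStore();
        Coordination zk = new Coordination();
        new JobManager(store, zk).submit("job-1", "source -> map -> sink");
        // The active JobManager is gone; a stand-by reads the same ZooKeeper and storage.
        JobManager standBy = new JobManager(store, zk);
        return standBy.recover("job-1");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```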

Handling checkpoints

[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]

1. Take snapshots
2. Persist snapshots
3. Send handles to JM
4. Create global checkpoint
5. Persist global checkpoint
6. Write handle to ZooKeeper
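
A rough sketch of the six steps in plain Java, using in-memory maps as stand-ins for durable storage and ZooKeeper. The class `CheckpointCoordinatorSketch`, the handle naming scheme and the `/checkpoints/latest` path are hypothetical and chosen only for illustration. For brevity, the task side (steps 1 to 3) and the JobManager side (steps 4 to 6) live in one class. The global checkpoint is created only after every task has sent its handle.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class CheckpointCoordinatorSketch {
    private final int numTasks;
    private final Map<String, String> storage = new HashMap<>();   // durable storage stand-in
    private final Map<String, String> zooKeeper = new HashMap<>(); // ZooKeeper stand-in
    // Handles received so far for the pending checkpoint, ordered by task index.
    private final Map<Integer, String> pendingHandles = new TreeMap<>();

    CheckpointCoordinatorSketch(int numTasks) {
        this.numTasks = numTasks;
    }

    /** Task side: 1. take snapshot, 2. persist it, 3. return the handle to send to the JM. */
    String snapshotTask(long checkpointId, int task, String state) {
        String handle = "chk-" + checkpointId + "/task-" + task;
        storage.put(handle, state);
        return handle;
    }

    /** JobManager side: returns true once the global checkpoint is complete. */
    boolean acknowledge(long checkpointId, int task, String handle) {
        pendingHandles.put(task, handle);
        if (pendingHandles.size() < numTasks) {
            return false; // still waiting for other tasks
        }
        String global = String.join(",", pendingHandles.values()); // 4. create global checkpoint
        String globalHandle = "chk-" + checkpointId + "/global";
        storage.put(globalHandle, global);                          // 5. persist global checkpoint
        zooKeeper.put("/checkpoints/latest", globalHandle);         // 6. write handle to ZooKeeper
        pendingHandles.clear();
        return true;
    }

    /** What a recovering JobManager could read back via the handle in ZooKeeper. */
    String latestCheckpoint() {
        String handle = zooKeeper.get("/checkpoints/latest");
        return handle == null ? null : storage.get(handle);
    }

    static String demo() {
        CheckpointCoordinatorSketch jm = new CheckpointCoordinatorSketch(2);
        String h0 = jm.snapshotTask(1, 0, "count=3");
        String h1 = jm.snapshotTask(1, 1, "count=5");
        jm.acknowledge(1, 0, h0);
        String before = jm.latestCheckpoint(); // nothing published yet
        jm.acknowledge(1, 1, h1);
        return before + " -> " + jm.latestCheckpoint();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Publishing the handle to ZooKeeper last means a reader of `/checkpoints/latest` never sees a checkpoint that is only partially acknowledged.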

Conclusion

TL;DL

• Job recovery mechanism with low latency and high throughput
• Exactly-once processing semantics
• No single point of failure

Flink will always keep processing your data

flink.apache.org

@ApacheFlink