Fault Tolerance and Job Recovery in Apache Flink™
Till (@stsffap)
Better be safe than sorry
• Failures will happen
• EMC estimated $1.7 billion in costs due to data loss and system downtime
• Recovery will save you time and costs
• Switch between algorithms
• Live upgrade of your system
Fault Tolerance
Fault tolerance guarantees
• At most once
  • No guarantees at all
• At least once
  • Sufficient for many applications
• Exactly once
Flink provides all guarantees
Checkpoints
• Consistent snapshots of the distributed data stream and operator state
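A minimal sketch of why a checkpoint has to capture the stream position and the operator state together. This is a toy single-process model, not Flink's implementation; `CheckpointSketch` and `run` are hypothetical names introduced only for illustration.

```java
// Toy model: a source replays records from an offset, an operator keeps a running sum.
// All names are hypothetical; this does not use or mirror Flink's API.
class CheckpointSketch {

    // Processes `input`, takes one checkpoint after `checkpointAt` records, fails once
    // after `failAt` records, then recovers from the checkpoint.
    // If restoreState is false, only the source offset is rewound and the operator
    // keeps its current state, so replayed records are counted a second time.
    static long run(long[] input, int checkpointAt, int failAt, boolean restoreState) {
        long sum = 0;          // operator state
        int offset = 0;        // source position
        int snapOffset = 0;    // checkpointed source position
        long snapSum = 0;      // checkpointed operator state
        boolean failed = false;

        while (offset < input.length) {
            if (!failed && offset == failAt) {
                failed = true;
                offset = snapOffset;              // source rewinds to the checkpoint
                if (restoreState) {
                    sum = snapSum;                // operator state rolls back as well
                }
                continue;
            }
            sum += input[offset];
            offset++;
            if (offset == checkpointAt) {         // snapshot position and state together
                snapOffset = offset;
                snapSum = sum;
            }
        }
        return sum;
    }
}
```

With the input 1, 2, 3, 4, 5, a checkpoint after two records and a failure after four, `run(input, 2, 4, true)` returns 15 (every record counted once), while `run(input, 2, 4, false)` returns 22 (records 3 and 4 counted twice, which is at-least-once behaviour).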
Barriers
• Markers for checkpoints
• Injected into the data flow
Alignment for multi-input operators
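Alignment can be sketched as follows: once a barrier arrives on one input, records from that input are held back until the barrier has arrived on the other input as well, and only then is the state snapshotted. This is a simplified model under assumed names (`BarrierAlignmentSketch`, `process`, the `BARRIER` sentinel), not Flink's actual code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of barrier alignment at a two-input operator that sums its records.
// Hypothetical names throughout; not Flink's implementation.
class BarrierAlignmentSketch {
    static final long BARRIER = Long.MIN_VALUE; // sentinel standing in for a barrier

    // events[i] = {channel, value} in arrival order.
    // Returns {state captured in the snapshot, final state}.
    static long[] process(long[][] events) {
        boolean[] blocked = new boolean[2];        // inputs that already delivered the barrier
        Deque<Long> buffered = new ArrayDeque<>(); // post-barrier records held back
        long sum = 0;
        long snapshot = -1;                        // -1 means no snapshot was taken

        for (long[] event : events) {
            int channel = (int) event[0];
            long value = event[1];
            if (value == BARRIER) {
                blocked[channel] = true;
                if (blocked[0] && blocked[1]) {
                    snapshot = sum;                // all pre-barrier records are included
                    blocked[0] = false;
                    blocked[1] = false;
                    while (!buffered.isEmpty()) {  // release the held-back records
                        sum += buffered.poll();
                    }
                }
            } else if (blocked[channel]) {
                buffered.add(value);               // belongs to the next checkpoint
            } else {
                sum += value;
            }
        }
        return new long[] {snapshot, sum};
    }
}
```

For the arrival order (0, 1), (1, 10), (0, BARRIER), (0, 2), (1, 20), (1, BARRIER), (1, 30), the snapshot holds 31: the record 2 arrived after the barrier on input 0 and is excluded, while 20 arrived before the barrier on input 1 and is included. The final state is 63.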
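Operator State
• Stateless operators
    ds.filter(_ != 0)
• System state
    ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
• User defined state
    import org.apache.flink.api.common.functions.RichReduceFunction;
    import org.apache.flink.api.common.state.OperatorState;
    import org.apache.flink.configuration.Configuration;

    public class CounterSum extends RichReduceFunction<Long> {
        // Number of reduce() calls so far; part of the operator's checkpointed state.
        private OperatorState<Long> counter;

        @Override
        public Long reduce(Long v1, Long v2) throws Exception {
            counter.update(counter.value() + 1);
            return v1 + v2;
        }

        @Override
        public void open(Configuration config) throws Exception {
            // Obtain the state named "counter", starting from 0 if nothing is restored.
            counter = getRuntimeContext().getOperatorState("counter", 0L, false);
        }
    }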
Advantages
• Separation of app logic from recovery
  • Checkpointing interval is just a config parameter
• High throughput
  • Controllable checkpointing overhead
• Low impact on latency
Cluster High Availability
Without high availability
Diagram: JobManager, TaskManager
With high availability
Diagram: JobManager, TaskManager, stand-by JobManager, Apache ZooKeeper™
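A rough sketch of the failover idea behind the stand-by JobManager: every JobManager registers with a coordination service, one of them is the leader, and when its registration disappears the next one takes over. The in-memory class below is only a stand-in for ZooKeeper; `LeaderElectionSketch` and its methods are hypothetical names, and the real election protocol is more involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// In-memory stand-in for a coordination service used for leader election.
// Hypothetical names; this is not ZooKeeper's or Flink's API.
class LeaderElectionSketch {
    // Registration order decides who leads: the oldest live registration wins.
    private final Deque<String> candidates = new ArrayDeque<>();

    // A JobManager announces itself as a leader candidate.
    void join(String jobManagerId) {
        candidates.addLast(jobManagerId);
    }

    // The current leader, or null if no JobManager is registered.
    String leader() {
        return candidates.peekFirst();
    }

    // Simulates a crashed JobManager: its registration vanishes.
    void fail(String jobManagerId) {
        candidates.remove(jobManagerId);
    }
}
```

After `join("jm-1")` and `join("jm-2")`, `leader()` is `jm-1`; after `fail("jm-1")` the stand-by `jm-2` becomes leader without any client-side change.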
Persisting jobs
Diagram: Client, Job, JobManager, TaskManagers, Apache ZooKeeper™
1. Submit job
2. Persist execution graph
3. Write handle to ZooKeeper
4. Deploy tasks
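The four steps can be sketched with two maps standing in for durable storage and ZooKeeper. This is an illustration under assumed names (`JobPersistenceSketch`, `submit`, `recover`, the `handle-` prefix), not Flink's internal code; the `recover` method is an assumption about how a newly elected JobManager could use the stored handle.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of job persistence. Hypothetical names; not Flink's implementation.
class JobPersistenceSketch {
    final Map<String, String> durableStorage = new HashMap<>(); // stand-in for durable storage
    final Map<String, String> zooKeeper = new HashMap<>();      // stand-in for ZooKeeper: job id -> handle
    final List<String> deployed = new ArrayList<>();            // jobs whose tasks were deployed

    // Steps 1-4: the client's submission arrives at the JobManager.
    void submit(String jobId, String executionGraph) {          // 1. Submit job
        String handle = "handle-" + jobId;
        durableStorage.put(handle, executionGraph);             // 2. Persist execution graph
        zooKeeper.put(jobId, handle);                           // 3. Write handle to ZooKeeper
        deployed.add(jobId);                                    // 4. Deploy tasks
    }

    // A new leader follows the handle back to the persisted execution graph.
    String recover(String jobId) {
        String handle = zooKeeper.get(jobId);
        return handle == null ? null : durableStorage.get(handle);
    }
}
```

Only the small handle goes to the coordination service in this sketch; the potentially large execution graph stays in durable storage.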
Handling checkpoints
Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™
1. Take snapshots
2. Persist snapshots
3. Send handles to JM
4. Create global checkpoint
5. Persist global checkpoint
6. Write handle to ZooKeeper
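The six steps can be sketched as a small coordinator that waits for a handle from every task before it assembles the global checkpoint. Names such as `CheckpointCoordinatorSketch`, `snapshotTask`, and `acknowledge` are hypothetical, the maps are stand-ins for durable storage and ZooKeeper, and the sketch assumes a single checkpoint in flight at a time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy model of checkpoint handling. Hypothetical names; not Flink's implementation.
class CheckpointCoordinatorSketch {
    final Map<String, String> durableStorage = new HashMap<>(); // stand-in for durable storage
    final Map<Long, String> zooKeeper = new HashMap<>();        // checkpoint id -> global handle
    private final int expectedTasks;
    private final Map<String, String> pendingHandles = new TreeMap<>(); // task id -> handle

    CheckpointCoordinatorSketch(int expectedTasks) {
        this.expectedTasks = expectedTasks;
    }

    // TaskManager side, steps 1-3: snapshot the state, persist it, return the handle.
    String snapshotTask(long checkpointId, String taskId, String state) {
        String handle = "cp-" + checkpointId + "/" + taskId;
        durableStorage.put(handle, state);                 // 2. Persist snapshot
        return handle;                                     // 3. Handle is sent to the JM
    }

    // JobManager side, steps 4-6. Returns true once the global checkpoint is complete.
    boolean acknowledge(long checkpointId, String taskId, String handle) {
        pendingHandles.put(taskId, handle);
        if (pendingHandles.size() < expectedTasks) {
            return false;                                  // still waiting for other tasks
        }
        String global = pendingHandles.toString();         // 4. Create global checkpoint
        String globalHandle = "cp-" + checkpointId + "/global";
        durableStorage.put(globalHandle, global);          // 5. Persist global checkpoint
        zooKeeper.put(checkpointId, globalHandle);         // 6. Write handle to ZooKeeper
        pendingHandles.clear();
        return true;
    }
}
```

Nothing is written to the coordination service until every task has acknowledged, so a recovering JobManager in this model only ever sees complete checkpoints.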
Conclusion
TL;DL
• Job recovery mechanism with low latency and high throughput
• Exactly-once processing semantics
• No single point of failure
Flink will always keep processing your data
flink.apache.org
@ApacheFlink