Apache Flink Streaming: Resiliency and Consistency
Paris Carbone - PhD Candidate, KTH Royal Institute of Technology
<[email protected], [email protected]>
Technical Talk @ Google
Overview
• The Flink Execution Engine Architecture
• Exactly-once processing challenges
• Using Snapshots for Recovery
• The ABS Algorithm for DAGs and Cyclic Dataflows
• Recovery, Cost and Performance
• Job Manager High Availability
Current Focus
Libraries: Table | ML | Gelly
Streaming API | Batch API
Flink Optimiser
Flink Runtime
State Management
Executing Pipelines
• Streaming pipelines translate into job graphs
User Pipeline → Flink Client → Flink Job Graph Builder/Optimiser → Job Graph → Flink Runtime
Unbounded Data Processing Architectures
1) Streaming (Distributed Data Flow): long-lived task execution, state is kept inside tasks
2) Micro-Batch (Hadoop, Spark → Spark Streaming)
The Flink Runtime Engine
• Tasks run operator logic in a pipelined fashion; task execution is long-lived
• Tasks are scheduled among workers by the Job Manager (scheduling, monitoring)
• State is kept within tasks
Task Failures
• Task failures are guaranteed to occur
• We need to make them transparent
• Can we simply recover a task from scratch?
Requirements: consistent state, no duplicates, no loss
Record Acknowledgements
• We can monitor the consumption of every event (bookkeeping)
• This guarantees no loss
• Duplicates and inconsistent state are not handled
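The bookkeeping above can be sketched as ack counting in the spirit of Storm-style acking. This is a simplified illustration (class and method names are ours, not a real API): a source record stays pending until it and every record derived from it are acknowledged, and pending records are replayed on failure, which is exactly why duplicates are not handled.

```python
class AckTracker:
    """Track outstanding acknowledgements per source record."""

    def __init__(self):
        self.pending = {}             # source record id -> outstanding count

    def emit_source(self, rec_id):
        self.pending[rec_id] = 1      # one ack owed for the source record

    def emit_derived(self, rec_id):
        self.pending[rec_id] += 1     # a downstream task emitted a child record

    def ack(self, rec_id):
        self.pending[rec_id] -= 1
        if self.pending[rec_id] == 0:
            del self.pending[rec_id]  # fully processed: no loss

    def to_replay(self):
        return set(self.pending)      # replaying these may cause duplicates
```

Note how a failure before the final ack forces a replay of the whole source record, with no way to tell which effects already reached downstream state.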
Atomic Transactions per Update (e.g. Trident)
• Guarantees no loss and consistent state
• No duplication can be achieved, e.g. with bloom filters or caching
• Fine-grained recovery
• Non-constant load on external distributed storage can lead to unsustainable execution
Lessons Learned from Batch
• If a batch computation fails, simply repeat the computation as a transaction (batch-1, batch-2, …)
• Transaction rate is constant
• Can we apply these principles to a true streaming execution?
Distributed Snapshots
"A collection of operator states and records in transit (channels) that reflects a moment at a valid execution"
Processes p1, p2, p3, p4 exchange records. A cut across their timelines is consistent if every received record it contains was also sent within the cut; a cut that captures a receive without its send is an inconsistent cut with a causality violation.
Idea: We can resume from a snapshot that defines a consistent cut
Assumptions: • repeatable sources • reliable FIFO channels
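The consistency condition on a cut can be made concrete with a minimal sketch (our own illustration, not from the talk): a cut is inconsistent exactly when it records the receipt of a message whose send falls outside the cut.

```python
def is_consistent_cut(cut, messages):
    """cut: process name -> number of local events included in the cut.
    messages: (sender, send_event_idx, receiver, recv_event_idx) tuples,
    with event indices counted per process starting at 0."""
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_cut = recv_idx < cut[receiver]
        sent_in_cut = send_idx < cut[sender]
        if received_in_cut and not sent_in_cut:
            return False  # effect (receive) recorded without its cause (send)
    return True
```

For a message sent as p1's event 2 and received as p2's event 5, a cut including 3 events of p1 and 6 of p2 is consistent, while one including only 1 event of p1 violates causality.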
Distributed Snapshots
• Snapshots are taken periodically at t1, t2, t3, … (snap-t1, snap-t2, …)
• On a failure after t2, the execution resets from snap-t2 and replays from the repeatable sources
Assumptions: • repeatable sources • reliable FIFO channels
Taking Snapshots
Initial approach (see Naiad):
• Pause execution at t1, t2, …
• Collect state
• Restore execution
Lamport to the Rescue
"The global-state-detection algorithm is to be superimposed on the underlying computation: it must run concurrently with, but not alter, this underlying computation"
Chandy-Lamport Snapshots
Using Markers/Barriers:
• Triggers snapshots
• Separates preshot/postshot events
• Leads to a consistent execution cut
Markers propagate the snapshot execution under continuous ingestion
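The marker rule for a single process can be sketched as follows — a hedged simplification of the Chandy-Lamport protocol (class and method names are ours): on the first marker, record local state and start recording each input channel; records arriving on a channel after the local snapshot but before that channel's marker are in transit and belong to the snapshot.

```python
class Process:
    """One process in a Chandy-Lamport snapshot (illustrative sketch)."""

    def __init__(self, name, in_channels):
        self.name = name
        self.state = 0                                  # toy local state: a sum
        self.snapshot = None                            # recorded local state
        self.recording = {c: [] for c in in_channels}   # in-transit records per channel
        self.marker_seen = {c: False for c in in_channels}

    def on_record(self, channel, record):
        self.state += record                            # apply the record
        # after snapshotting, records that precede the marker on this
        # channel were in flight at snapshot time: record them
        if self.snapshot is not None and not self.marker_seen[channel]:
            self.recording[channel].append(record)

    def on_marker(self, channel):
        if self.snapshot is None:                       # first marker: snapshot now
            self.snapshot = self.state
        self.marker_seen[channel] = True                # channel recording closes

    def done(self):
        return self.snapshot is not None and all(self.marker_seen.values())
```

Unlike the pause-and-collect approach, the process keeps applying records throughout; the snapshot is superimposed on the running computation.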
Asynchronous Snapshots
• snap-t1 and snap-t2 are taken while the execution keeps running
• Propagating markers trigger the snapshotting at each task
Asynchronous Barrier Snapshotting
• Barriers are injected at the sources and flow downstream with the data
• An input channel on which a barrier has arrived is marked pending and blocked; the task keeps aligning until barriers have arrived on all of its inputs, then snapshots its state and forwards the barrier
Asynchronous Barrier Snapshotting: Benefits
• Takes advantage of the execution graph structure
• No records in transit are included in the snapshot
• Aligning has lower impact than halting
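The alignment step on a multi-input task can be sketched as below — an illustrative simplification of ABS (our own class, not Flink internals): once a barrier arrives on a channel, records from that channel are buffered rather than processed, so the eventual snapshot reflects exactly the pre-barrier stream on every input and no in-transit records need to be stored.

```python
class AligningTask:
    """ABS barrier alignment on a task with several input channels (sketch)."""

    def __init__(self, inputs):
        self.inputs = set(inputs)
        self.blocked = set()            # channels whose barrier has arrived
        self.buffered = []              # post-barrier records held back
        self.state = 0                  # toy operator state: a sum
        self.snapshots = []
        self.forwarded_barriers = 0

    def on_record(self, channel, record):
        if channel in self.blocked:
            self.buffered.append((channel, record))  # aligning: hold back
        else:
            self.state += record

    def on_barrier(self, channel):
        self.blocked.add(channel)
        if self.blocked == self.inputs:              # aligned on all inputs
            self.snapshots.append(self.state)        # snapshot, no in-transit records
            self.forwarded_barriers += 1             # barrier flows downstream
            self.blocked.clear()
            for _ch, rec in self.buffered:           # resume: apply held-back records
                self.state += rec
            self.buffered.clear()
```

Only the lagging channels keep flowing during alignment; the task never halts entirely, which is why aligning costs less than pausing the whole execution.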
Snapshots on Cyclic Dataflows
• Naive barrier alignment on a cycle never completes: waiting for a barrier on the back-edge causes deadlock
• Checkpointing should eventually terminate
• Records in transit within loops should be included in the checkpoint
• Solution: downstream backup — log records arriving on the back-edge until the forwarded barrier returns, and include the log in the snapshot
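The back-edge handling can be sketched as follows — assumed semantics based on the description above, not Flink code: the loop head snapshots its state as soon as barriers arrive on its regular inputs (without blocking the back-edge), then logs back-edge records until the barrier it forwarded comes back around the loop; state plus log form the checkpoint.

```python
class LoopHead:
    """Downstream backup at the head of a dataflow cycle (sketch)."""

    def __init__(self):
        self.state = 0               # toy operator state: a sum
        self.logging = False
        self.backedge_log = []       # downstream backup of in-loop records
        self.snapshot = None

    def on_regular_barrier(self):
        self.snapshot = self.state   # snapshot now; do not block the back-edge
        self.logging = True          # start logging in-transit loop records

    def on_backedge_record(self, record):
        self.state += record         # processing continues throughout
        if self.logging:
            self.backedge_log.append(record)

    def on_backedge_barrier(self):
        self.logging = False         # loop drained: checkpoint = state + log
```

Because the back-edge is never blocked, the barrier eventually circulates and the checkpoint terminates, avoiding the deadlock of naive alignment.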
Implementation in Flink
• Coordinator/Driver actor per job in the JM
  • sends periodic markers to the sources
  • collects acknowledgements with state handles and registers complete snapshots
  • injects references to the latest consistent state handles and back-edge records into pending execution graph tasks
• Tasks
  • snapshot state transparently in a given backend upon receiving a barrier and send an ack to the coordinator
  • propagate barriers further, like normal records
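The coordinator's bookkeeping can be sketched as below — a hypothetical simplification (names are ours, not Flink's actual classes): a snapshot is registered as complete once every task has acked with its state handle.

```python
class SnapshotCoordinator:
    """Per-job snapshot bookkeeping (illustrative sketch, not Flink's API)."""

    def __init__(self, tasks):
        self.tasks = set(tasks)
        self.pending = {}        # snapshot_id -> {task: state_handle} so far
        self.completed = {}      # snapshot_id -> {task: state_handle}

    def trigger(self, snapshot_id):
        # a real coordinator would also send markers to the sources here
        self.pending[snapshot_id] = {}

    def ack(self, snapshot_id, task, state_handle):
        acks = self.pending.setdefault(snapshot_id, {})
        acks[task] = state_handle
        if set(acks) == self.tasks:                  # all tasks acked
            self.completed[snapshot_id] = self.pending.pop(snapshot_id)

    def latest_complete(self):
        # the snapshot recovery would restore from (None if none complete)
        return max(self.completed) if self.completed else None
```

On recovery, only handles from the latest *complete* snapshot are injected; partially acked snapshots are discarded.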
Recovery
• Full rescheduling of the execution graph from the latest checkpoint: simple, no further modifications
• Partial rescheduling of the execution graph for all upstream dependencies: trickier, duplicate elimination should be guaranteed
Performance Impact
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
Node Failures
Client → job graph → Job Manager (optimiser) → tasks on Task Managers
JM State: • Pending Execution Graphs • Snapshots (State Handles)
The Job Manager is a Single Point of Failure
Passive Standby JMs
• Multiple JMs; the Client connects to the current leader
• Zookeeper State: Leader JM Address, Pending Execution Graphs, State Snapshots
Eventual Leader Election
• t0: three JMs, one elected leader
• t1: the leader fails
• t2: …recovering
• t3: a new leader is eventually elected
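The failover behaviour can be illustrated with a toy simulation — not the real ZooKeeper API, just the recipe's idea: candidates register ephemeral sequence numbers, the lowest live sequence is the leader, and when it disappears the next candidate eventually takes over.

```python
class LeaderElection:
    """Toy model of ZooKeeper-style leader election (not a real client)."""

    def __init__(self):
        self.seq = 0
        self.members = {}    # name -> sequence number (ephemeral node)

    def join(self, name):
        self.members[name] = self.seq   # create ephemeral sequential node
        self.seq += 1

    def fail(self, name):
        self.members.pop(name, None)    # ephemeral node vanishes on failure

    def leader(self):
        # lowest surviving sequence number wins
        return min(self.members, key=self.members.get) if self.members else None
```

The election is only *eventual*: between the leader's failure and the session timeout that removes its ephemeral node, the system is in the "recovering" state shown above.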
Scheduling Jobs
• The Client submits job-1; the Job Manager's optimiser builds the execution graph
• The Job Manager performs a blocking write of the execution graph in Zookeeper
• Tasks are then deployed to the Task Managers
Logging Snapshots
• Tasks checkpoint their states to the State Backend
• Each task sends its state handle to the Job Manager
• The Job Manager logs state handles with asynchronous writes in Zookeeper
• A complete set of handles forms a global snapshot
HA in Zookeeper
• Task Managers query ZK for the current leader before connecting (with lease)
• Upon standby failover, restart pending execution graphs
• Inject state handles from the last global snapshot
Summary
• The Flink execution engine monitors long-running tasks
• Exactly-once processing guarantees can be established with snapshots without halting the execution
• The ABS algorithm persists the minimal state at a low cost
• Master HA is achieved through Zookeeper and passive failover