Tech Talk @ Google on Flink Fault Tolerance and HA

Apache Flink Streaming: Resiliency and Consistency

Paris Carbone - PhD Candidate, KTH Royal Institute of Technology

Technical Talk @ Google

Overview

• The Flink Execution Engine Architecture

• Exactly-once-processing challenges

• Using Snapshots for Recovery

• The ABS Algorithm for DAGs and Cyclic Dataflows

• Recovery, Cost and Performance

• Job Manager High Availability

Current Focus

[Figure: the Flink stack. Layers: Streaming API, Batch API, Flink Optimiser, Flink Runtime. Libraries: Table, ML, Gelly. Component: State Management.]

Executing Pipelines

• Streaming pipelines translate into job graphs

[Figure: a User Pipeline in the Flink Client passes through the Flink Job Graph Builder/Optimiser; the resulting Job Graph is handed to the Flink Runtime.]

Unbounded Data Processing Architectures

1) Streaming (Distributed Data Flow)

• Long-lived task execution

• State is kept inside tasks

2) Micro-Batch

[Figure labels: (Hadoop, Spark), (Spark Streaming)]

The Flink Runtime Engine

• Tasks run operator logic in a pipelined fashion

• They are scheduled among workers

• State is kept within tasks

[Figure: long-lived task execution]

Job Manager

• scheduling

• monitoring

Task Failures

• Task failures are guaranteed to occur

• We need to make them transparent

• Can we simply recover a task from scratch?

[Figure: a task and its state. Requirements: consistent state, no duplicates, no loss.]

Record Acknowledgements

• We can monitor the consumption of every event

• It guarantees no loss

• Duplicates and inconsistent state are not handled

[Figure: a task and its state, with per-record bookkeeping. Property shown: consistent state.]
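
A minimal sketch of why acknowledgements alone give no loss but still allow duplicates. The function `run_with_acks` and its `lose_ack_for` parameter are hypothetical names for illustration, not part of any real system: a record whose acknowledgement is lost is replayed, and the replay updates the state a second time.

```python
from collections import Counter


def run_with_acks(records, lose_ack_for=None):
    """Count records under at-least-once delivery driven by per-record acks.

    A record stays pending until it is acknowledged. If the ack for a record
    is lost once, the record is replayed and applied to the state again.
    """
    state = Counter()
    pending = list(records)      # unacknowledged records, in arrival order
    ack_lost = False
    while pending:
        record = pending[0]
        state[record] += 1       # state update happens before the ack
        if record == lose_ack_for and not ack_lost:
            ack_lost = True      # simulated failure: the ack never arrives
            continue             # record is still pending, so it is replayed
        pending.pop(0)           # ack received: record is never replayed
    return state
```

Every record is counted at least once, but the record with the lost ack is counted twice, which is the inconsistent state the slide refers to.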

Atomic transactions per update

• It guarantees no loss and consistent state.

• Duplicates can be eliminated, e.g. with Bloom filters or caching

• Fine grained recovery

• Non-constant load in external storage can lead to unsustainable execution

[Figure: tasks commit their state updates to distributed storage. Example system: Trident.]

Lessons Learned from Batch

[Figure: batch-1, batch-2]

• If a batch computation fails, simply repeat computation as a transaction

• Transaction rate is constant

• Can we apply these principles to a true streaming execution?

Distributed Snapshots

“A collection of operator states and records in transit (channels) that reflects a moment at a valid execution”

[Figure: timelines of processes p1, p2, p3, p4 with a consistent cut and an inconsistent cut; the inconsistent cut contains a causality violation.]

Idea: We can resume from a snapshot that defines a consistent cut

Assumptions

• repeatable sources

• reliable FIFO channels
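
A small sketch of the consistent-cut condition, under assumed data shapes: a cut maps each process to the number of local events it includes, and each message is a hypothetical tuple `(sender, send_index, receiver, recv_index)` with 1-based local event indices. A cut violates causality when it contains the receipt of a message but not its send.

```python
def is_consistent_cut(cut, messages):
    """Return True if no message is received inside the cut but sent outside it.

    cut:      dict mapping process name -> number of local events in the cut
    messages: iterable of (sender, send_index, receiver, recv_index)
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_cut = recv_idx <= cut[receiver]
        sent_in_cut = send_idx <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False         # causality violation: effect without cause
    return True
```

A message that is sent inside the cut but not yet received is allowed: it is a record in transit and belongs to the channel state.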


[Figure sequence: execution timeline t1, t2, t3 with snapshots snap - t1 and snap - t2; at t3 the execution is reset from snap t2.]
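
Resetting from a snapshot can be sketched as follows, assuming a repeatable source modelled as a Python list that can be re-read from any offset. `CountingTask`, `run`, `snapshot_every` and `fail_at` are hypothetical names; the snapshot pairs the source offset with a copy of the task state.

```python
import copy


class CountingTask:
    """Toy stateful task: counts occurrences of each key."""

    def __init__(self):
        self.counts = {}

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1


def run(source, snapshot_every, fail_at=None):
    """Process `source`, snapshotting every `snapshot_every` records.

    If `fail_at` is given, one failure is simulated just before the record at
    that offset; the task is then reset from the latest snapshot and the
    source is replayed from the snapshotted offset.
    """
    task = CountingTask()
    snapshot = (0, {})                       # (source offset, task state)
    offset = 0
    failed = False
    while offset < len(source):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            offset, state = snapshot         # rewind the repeatable source
            task.counts = copy.deepcopy(state)
            continue
        task.process(source[offset])
        offset += 1
        if offset % snapshot_every == 0:
            snapshot = (offset, copy.deepcopy(task.counts))
    return task.counts
```

Because the offset and the state are restored together, the run with a failure ends in the same state as the failure-free run.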

Taking Snapshots

[Figure: execution snapshots taken at t1, t2]

Initial approach (see Naiad)

• Pause execution at t1, t2, ...

• Collect state

• Resume execution

Lamport to the Rescue

“The global-state-detection algorithm is to be superimposed on the underlying computation: it must run concurrently with, but not alter, this underlying computation”

Chandy-Lamport Snapshots

[Figure: processes p1 and p2 exchanging markers]

Using Markers/Barriers

• Trigger snapshots

• Separate pre-shot from post-shot events

• Lead to a consistent execution cut

Markers propagate the snapshot execution under continuous ingestion.

Asynchronous Snapshots

[Figure: timeline t1, t2; snapshots snap - t1 and snap - t2 are taken while markers propagate.]

Asynchronous Barrier Snapshotting

[Figure sequence: barriers flow through the dataflow graph; tasks are labelled pending and aligning as barriers arrive on their inputs.]

Asynchronous Barrier Snapshotting Benefits

• Taking advantage of the execution graph structure

• No records in transit included in the snapshot

• Aligning has lower impact than halting
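
Barrier alignment can be sketched as below. This is an illustrative model under stated assumptions, not Flink's implementation: `AligningTask` and `BARRIER` are hypothetical names, records are integers, and the task state is a running sum. An input that has delivered its barrier is blocked and buffered until the barrier has arrived on every input; only then is the state snapshotted, so no records in transit need to be stored.

```python
from collections import deque

BARRIER = object()   # hypothetical marker value separating snapshot epochs


class AligningTask:
    """Task with several FIFO inputs that aligns barriers before snapshotting."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.blocked = set()                              # inputs past the barrier
        self.buffers = [deque() for _ in range(num_inputs)]
        self.state = 0
        self.snapshots = []
        self.output = []

    def receive(self, channel, item):
        if channel in self.blocked:
            self.buffers[channel].append(item)            # post-barrier: hold back
            return
        if item is BARRIER:
            self.blocked.add(channel)                     # start aligning
            if len(self.blocked) == self.num_inputs:
                self._snapshot_and_release()
        else:
            self.state += item                            # pre-barrier record
            self.output.append(item)

    def _snapshot_and_release(self):
        self.snapshots.append(self.state)                 # operator state only
        self.output.append(BARRIER)                       # forward the barrier
        self.blocked.clear()
        held = [(ch, list(buf)) for ch, buf in enumerate(self.buffers)]
        for buf in self.buffers:
            buf.clear()
        for ch, items in held:                            # replay held records
            for item in items:
                self.receive(ch, item)
```

Only the blocked inputs wait; the other inputs keep being processed, which is why aligning has a lower impact than halting the whole execution.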

Snapshots on Cyclic Dataflows

[Figure sequence: barriers entering a cyclic dataflow end in deadlock.]


• Checkpointing should eventually terminate

• Records in transit within loops should be included in the checkpoint


[Figure sequence: snapshot on a cyclic dataflow using downstream backup.]
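
One way to meet both requirements for loops, sketched under assumptions: a loop-head task snapshots its state as soon as the barrier arrives on its regular input, forwards the barrier into the loop, and logs every record that arrives on the back-edge until the barrier returns. `LoopHead` and `BARRIER` are hypothetical names, and records are integers.

```python
BARRIER = "BARRIER"   # hypothetical marker value


class LoopHead:
    """Task with one regular input and one back-edge input."""

    def __init__(self):
        self.state = 0
        self.pending_state = None
        self.logging = False
        self.backedge_log = []
        self.snapshots = []      # list of (state, logged back-edge records)

    def on_regular(self, item):
        if item == BARRIER:
            self.pending_state = self.state   # snapshot without waiting on the loop
            self.logging = True
            self.backedge_log = []
            return BARRIER                    # forward the barrier into the loop
        self.state += item
        return item

    def on_backedge(self, item):
        if item == BARRIER:
            # The barrier has travelled the whole loop: the checkpoint terminates.
            self.snapshots.append((self.pending_state, list(self.backedge_log)))
            self.logging = False
            return None
        if self.logging:
            self.backedge_log.append(item)    # record in transit within the loop
        self.state += item
        return item
```

The task never blocks the back-edge, so the deadlock is avoided, and the logged records are what must be replayed into the loop on recovery.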

Implementation in Flink

• Coordinator/Driver Actor per job in the JM

    • sends periodic markers to sources

    • collects acknowledgements with state handles and registers complete snapshots

    • injects references to the latest consistent state handles and back-edge records into pending execution graph tasks

• Tasks

    • snapshot state transparently in the given backend upon receiving a barrier and send an ack to the coordinator

    • propagate barriers further like normal records
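
The coordinator's bookkeeping can be sketched as follows. This is a simplified model, not Flink's actual classes: `CheckpointCoordinator` and its methods are hypothetical, and state handles are plain strings. A snapshot is registered as complete only once every task has acknowledged it.

```python
class CheckpointCoordinator:
    """Tracks acknowledgements per checkpoint and registers complete snapshots."""

    def __init__(self, task_ids):
        self.task_ids = set(task_ids)
        self.pending = {}        # checkpoint id -> {task id: state handle}
        self.completed = {}      # checkpoint id -> {task id: state handle}
        self.next_id = 1

    def trigger(self):
        """Start a new checkpoint; the id would travel with the markers."""
        cid = self.next_id
        self.next_id += 1
        self.pending[cid] = {}
        return cid

    def acknowledge(self, cid, task_id, state_handle):
        """Record one task's ack; return True when the snapshot is complete."""
        if cid not in self.pending or task_id not in self.task_ids:
            return False
        self.pending[cid][task_id] = state_handle
        if set(self.pending[cid]) == self.task_ids:
            self.completed[cid] = self.pending.pop(cid)
            return True
        return False

    def latest_complete(self):
        """Return (checkpoint id, state handles) of the newest complete snapshot."""
        if not self.completed:
            return None
        cid = max(self.completed)
        return cid, self.completed[cid]
```

On recovery, the handles returned by `latest_complete` are what would be injected into the tasks of the pending execution graph.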

Recovery

• Full rescheduling of the execution graph from the latest checkpoint - simple, no further modifications

• Partial rescheduling of the execution graph for all upstream dependencies - trickier, duplicate elimination should be guaranteed
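
For partial rescheduling, the set of tasks to restart is the failed task plus all of its transitive upstream dependencies. A sketch, assuming the execution graph is given as a list of hypothetical `(upstream, downstream)` edges:

```python
def upstream_closure(edges, failed):
    """Return the failed task together with every task upstream of it."""
    parents = {}
    for up, down in edges:
        parents.setdefault(down, set()).add(up)

    to_restart = set()
    stack = [failed]
    while stack:
        task = stack.pop()
        if task in to_restart:
            continue                 # already visited
        to_restart.add(task)
        stack.extend(parents.get(task, ()))
    return to_restart
```

Tasks downstream of the failure keep running here, which is why they may see replayed records and duplicate elimination has to be guaranteed.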

Performance Impact

http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/

Node Failures

[Figure sequence: the Client sends a job graph through the optimiser to the Job Manager, which runs tasks on three Task Managers.]

JM State

• Pending Execution Graphs

• Snapshots (State Handles)

Single Point Of Failure: the Job Manager

Passive Standby JM

[Figure: a Client and three JM instances coordinated through Zookeeper]

Zookeeper State

• Leader JM Address

• Pending Execution Graphs

• State Snapshots

Eventual Leader Election

[Figure sequence: timeline t0 to t3 over three JM instances, including a "…recovering" phase.]

Scheduling Jobs

[Figure sequence: the Client submits job-1 through the optimiser to the Job Manager; the execution graph is stored with a blocking write in Zookeeper; tasks are then deployed to the Task Managers.]

Logging Snapshots

[Figure sequence: tasks checkpoint their states to the State Backend and return state handles to the Job Manager; the state handles form the global snapshot, which is logged with asynchronous writes in Zookeeper.]

HA in Zookeeper

• Task Managers query ZK for the current leader before connecting (with lease)

• Upon standby failover, restart pending execution graphs

• Inject state handles from the last global snapshot
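
The failover steps can be sketched with an in-memory stand-in for the state kept in Zookeeper. Nothing here uses a real Zookeeper client: `CoordinationStore` and `JobManager` are hypothetical classes, and leader election is reduced to "first to claim an empty leader slot wins".

```python
class CoordinationStore:
    """In-memory stand-in for the Zookeeper state (illustration only)."""

    def __init__(self):
        self.leader = None           # leader JM address
        self.pending_graphs = {}     # job id -> execution graph
        self.snapshots = {}          # job id -> state handles of last global snapshot


class JobManager:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.running = {}            # job id -> (execution graph, state handles)

    def try_lead(self):
        """Claim leadership if the slot is free, then recover pending jobs."""
        if self.store.leader is not None:
            return False
        self.store.leader = self.name
        self.recover()
        return True

    def submit(self, job_id, graph):
        self.store.pending_graphs[job_id] = graph    # written before scheduling
        self.running[job_id] = (graph, None)

    def log_snapshot(self, job_id, handles):
        self.store.snapshots[job_id] = handles

    def recover(self):
        """Restart pending graphs with handles from the last global snapshot."""
        for job_id, graph in self.store.pending_graphs.items():
            self.running[job_id] = (graph, self.store.snapshots.get(job_id))
```

A standby that wins the election finds everything it needs in the store: the pending execution graphs and the state handles to inject into their tasks.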

Summary

• The Flink execution engine monitors long-running tasks

• Establishing exactly-once-processing guarantees with snapshots can be achieved without halting the execution

• The ABS algorithm persists the minimal state at a low cost

• Master HA is achieved through Zookeeper and passive failover
