Tech Talk @ Google on Flink Fault Tolerance and HA


Apache Flink Streaming: Resiliency and Consistency

Paris Carbone - PhD Candidate, KTH Royal Institute of Technology
<parisc@kth.se, senorcarbone@apache.org>

Technical Talk @ Google

Overview

• The Flink Execution Engine Architecture

• Exactly-once processing challenges

• Using Snapshots for Recovery

• The ABS Algorithm for DAGs and Cyclic Dataflows

• Recovery, Cost and Performance

• Job Manager High Availability


Current Focus

[Diagram: the Flink stack - Streaming API and Batch API with the Table, ML and Gelly libraries, on top of the Flink Optimiser and the Flink Runtime, which handles state management]

Executing Pipelines

• Streaming pipelines translate into job graphs

[Diagram: the user pipeline is submitted through the Flink Client to the Flink Job Graph Builder/Optimiser, which produces the Job Graph executed by the Flink Runtime]
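For concreteness (my own sketch against the public DataStream API, not code from the talk; the operators and job name are illustrative), a pipeline like the following is what the client translates into a job graph of sources and operators:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("to be", "or not", "to be")
           // Each API call adds an operator to the pipeline the client will translate.
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.split(" ")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               }
           })
           .keyBy(t -> t.f0)  // partition by word
           .sum(1)            // rolling count per word
           .print();

        // execute() triggers job graph construction and submission to the Job Manager.
        env.execute("streaming word count");
    }
}
```

The chained calls only declare the dataflow; env.execute() is what builds the job graph and hands it to the runtime.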

Unbounded Data Processing Architectures

1) Streaming (distributed dataflow): long-lived task execution, state is kept inside the tasks

2) Micro-batch (e.g. Spark Streaming, built on batch engines such as Hadoop and Spark)

The Flink Runtime Engine

• Tasks run operator logic in a pipelined fashion

• They are scheduled among workers

• State is kept within tasks

[Diagram: long-lived task execution across Task Managers; the Job Manager handles scheduling and monitoring]
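To make "state is kept within tasks" concrete, here is a hedged sketch (not from the slides; names are illustrative) of keyed state held inside an operator through Flink's ValueState API:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps a running sum per key inside the task; this is exactly the state
// that the snapshotting machinery described later has to protect.
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(
            new ValueStateDescriptor<>("running-sum", Types.LONG));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        long current = sum.value() == null ? 0L : sum.value();
        current += in.f1;
        sum.update(current);                  // task-local state, checkpointed by the runtime
        out.collect(Tuple2.of(in.f0, current));
    }
}
```

Applied on a keyed stream, e.g. stream.keyBy(t -> t.f0).flatMap(new RunningSum()), each parallel task instance holds the sums for its share of the keys.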

Task Failures

• Task failures are guaranteed to occur

• We need to make them transparent

• Can we simply recover a task from scratch?

[Diagram: a task with its local state; the desired guarantees are consistent state, no duplicates and no loss]

Record Acknowledgements

• We can monitor the consumption of every event

• It guarantees no loss

• Duplicates and inconsistent state are not handled

[Diagram: per-record acknowledgement bookkeeping next to the task; of the desired guarantees it provides no loss, but not no-duplicates or consistent state]
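As a rough, hypothetical illustration of the bookkeeping this implies (in the spirit of Storm-style acking, not Flink code): every emitted record is tracked until it is acknowledged, and records that time out are replayed from the source.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical acker: tracks pending records and replays those that time out.
// Replay gives at-least-once (no loss) but can produce duplicates and leave
// operator state inconsistent, as the slide points out.
public class RecordAcker {
    private final Map<Long, Long> pendingSince = new ConcurrentHashMap<>();

    public void onEmit(long recordId) {
        pendingSince.put(recordId, System.currentTimeMillis());
    }

    public void onAck(long recordId) {
        pendingSince.remove(recordId);          // fully consumed downstream
    }

    public void replayTimedOut(long timeoutMillis, java.util.function.LongConsumer replay) {
        long now = System.currentTimeMillis();
        pendingSince.forEach((id, since) -> {
            if (now - since > timeoutMillis) {
                replay.accept(id);              // re-emit from the source; may duplicate
                pendingSince.put(id, now);
            }
        });
    }
}
```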

Atomic transactions per update

• It guarantees no loss and consistent state.

• No-duplication can be achieved, e.g. with Bloom filters or caching

• Fine-grained recovery

• Non-constant load in external storage can lead to unsustainable execution

[Diagram: every state update written as a transaction to distributed storage, as in Storm Trident; this provides no loss and consistent state]
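A hypothetical sketch of the idea (not Flink or Trident code): each update is applied to external storage together with the ID of the record that produced it, so replayed records can be detected and skipped.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical "external store" that applies each update as a small transaction
// and remembers which record IDs it has already applied (dedup via caching;
// a Bloom filter could approximate the seen-set at lower memory cost).
public class TransactionalStore {
    private final Map<String, Long> state = new HashMap<>();
    private final Set<Long> appliedRecordIds = new HashSet<>();

    // Returns false if the record was already applied (duplicate from a replay).
    public synchronized boolean applyUpdate(long recordId, String key, long delta) {
        if (!appliedRecordIds.add(recordId)) {
            return false;                      // duplicate: skip, keep state consistent
        }
        state.merge(key, delta, Long::sum);    // the actual state update
        return true;
    }
}
```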

Lessons Learned from Batch

[Diagram: the input stream divided into batch-1, batch-2, ...]

• If a batch computation fails, simply repeat computation as a transaction

• Transaction rate is constant

• Can we apply these principles to a true streaming execution?

Distributed Snapshots

“A collection of operator states and records in transit (channels) that reflects a moment at a valid execution”

Assumptions
• repeatable sources
• reliable FIFO channels

[Diagram: processes p1-p4 exchanging messages, with an example of a consistent cut and an inconsistent cut that contains a causality violation]

Idea: We can resume from a snapshot that defines a consistent cut
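To pin down "consistent cut" (my addition, using standard Chandy-Lamport/Lamport terminology rather than a formula from the slides):

```latex
% A cut C is a set of events, one prefix of history per process p1..p4.
% C is consistent iff it is left-closed under Lamport's happened-before relation:
% if a receive event is in the cut, the matching send event must be too.
\[
  C \text{ is consistent} \iff \forall e, e' :\;
  \big(e' \in C \;\wedge\; e \rightarrow e'\big) \Rightarrow e \in C
\]
% A snapshot taken along a consistent cut (operator states plus the records
% in transit on the crossed channels) is a global state the execution could
% actually have been in, so resuming from it never violates causality.
```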

Distributed Snapshots

[Diagram: snapshots snap-t1 and snap-t2 taken at times t1 and t2 while the execution continues; at t3 the execution is reset from snap-t2]

Taking Snapshots

[Diagram: execution snapshots taken at t1 and t2]

Initial approach (see Naiad)
• Pause execution at t1, t2, ...
• Collect state
• Restore execution

Lamport to the Rescue

“The global-state-detection algorithm is to be superimposed on the underlying computation: it must run concurrently with, but not alter, this underlying computation”

Chandy-Lamport Snapshots

[Diagram: markers flowing between p1 and p2; the markers propagate the snapshot through the execution under continuous ingestion]

Using Markers/Barriers
• Trigger snapshots
• Separate pre-shot from post-shot events
• Lead to a consistent execution cut
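A self-contained sketch of the classic Chandy-Lamport rule at a single process (illustrative Java, not Flink's implementation): on the first marker the process records its own state, forwards the marker, and keeps recording each input channel until that channel's marker arrives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative Chandy-Lamport participant: records its own state plus the
// records "in transit" on each input channel between the snapshot point and
// that channel's marker, capturing a consistent cut.
public class ChandyLamportProcess {
    private final Set<String> inputChannels;
    private final Set<String> recording = new HashSet<>();
    private final Map<String, List<Object>> inTransit = new HashMap<>();
    private boolean snapshotTaken = false;
    private Object recordedState;

    public ChandyLamportProcess(Set<String> inputChannels) {
        this.inputChannels = inputChannels;
    }

    public void onMarker(String channel) {
        if (!snapshotTaken) {
            snapshotTaken = true;
            recordedState = snapshotLocalState();   // record operator state first
            forwardMarkerToAllOutputs();            // then propagate the marker downstream
            recording.addAll(inputChannels);        // start recording the other inputs
        }
        recording.remove(channel);                  // this channel's in-transit set is now final
    }

    public void onRecord(String channel, Object record) {
        if (recording.contains(channel)) {
            inTransit.computeIfAbsent(channel, c -> new ArrayList<>()).add(record);
        }
        process(record);                            // normal processing continues throughout
    }

    // Placeholders standing in for the real operator logic and transport layer.
    private Object snapshotLocalState() { return new Object(); }
    private void forwardMarkerToAllOutputs() { }
    private void process(Object record) { }
}
```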

Asynchronous Snapshots

[Diagram: markers propagating through the dataflow while snap-t1 and snap-t2 are written out asynchronously by the snapshotting tasks]

Asynchronous Barrier Snapshotting

[Animation: barriers are injected at the sources and flow through the DAG; at operators with multiple inputs, channels that have already delivered the barrier are held pending while the operator is aligning, until the barrier has arrived on every input and the operator snapshots its state and forwards the barrier]

Asynchronous Barrier Snapshotting Benefits


• Taking advantage of the execution graph structure

• No records in transit included in the snapshot

• Aligning has lower impact than halting
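A hedged sketch of the alignment step at a task with several inputs (illustrative, not Flink's actual classes): channels that have already delivered the current barrier are buffered until the rest catch up, which is why no in-transit records need to be stored in the snapshot.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Illustrative ABS alignment: once a channel delivers the barrier it is
// "aligned" and its further records are buffered; when all channels are
// aligned the task snapshots its state, forwards the barrier, and replays
// the buffered records.
public class AbsAligningTask {
    private final Set<String> inputChannels;
    private final Set<String> aligned = new HashSet<>();
    private final Map<String, Queue<Object>> buffered = new HashMap<>();

    public AbsAligningTask(Set<String> inputChannels) {
        this.inputChannels = inputChannels;
    }

    public void onBarrier(String channel) {
        aligned.add(channel);
        if (aligned.containsAll(inputChannels)) {
            snapshotState();                         // no channel contents in the snapshot
            forwardBarrier();
            aligned.clear();
            buffered.values().forEach(q -> { q.forEach(this::process); q.clear(); });
        }
    }

    public void onRecord(String channel, Object record) {
        if (aligned.contains(channel)) {
            // This channel is ahead of the barrier on the others: hold its records back.
            buffered.computeIfAbsent(channel, c -> new ArrayDeque<>()).add(record);
        } else {
            process(record);
        }
    }

    // Placeholders for the real state backend and transport.
    private void snapshotState() { }
    private void forwardBarrier() { }
    private void process(Object record) { }
}
```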

Snapshots on Cyclic Dataflows

[Animation: applying plain barrier alignment to a dataflow with a loop; waiting for a barrier to arrive over the back-edge leads to a deadlock]

Snapshots on Cyclic Dataflows

• Checkpointing should eventually terminate

• Records in transit within loops should be included in the checkpoint

Snapshots on Cyclic Dataflows

[Animation: the task at the head of the loop snapshots without aligning on the back-edge and instead performs a downstream backup - it logs the records arriving over the back-edge until its barrier comes back around the loop, and the log is included in the snapshot]
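A hedged sketch of that back-edge handling at the loop head (illustrative only): the task snapshots and forwards its barrier immediately, then logs everything arriving over the back-edge until the barrier returns; the log is this task's extra contribution to the snapshot.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative loop-head behaviour for ABS on cyclic dataflows: records in
// transit on the back-edge are logged (downstream backup) so that the
// checkpoint terminates without waiting for the cycle to drain.
public class LoopHeadTask {
    private boolean loggingBackEdge = false;
    private final List<Object> backEdgeLog = new ArrayList<>();

    public void onBarrierFromDag() {
        snapshotState();                      // snapshot without aligning on the back-edge
        forwardBarrier();                     // the barrier eventually returns via the loop
        loggingBackEdge = true;
    }

    public void onBackEdgeRecord(Object record) {
        if (loggingBackEdge) {
            backEdgeLog.add(record);          // part of the snapshot: records inside the loop
        }
        process(record);
    }

    public void onBarrierFromBackEdge() {
        loggingBackEdge = false;
        persistBackEdgeLog(backEdgeLog);      // completes this task's snapshot
        backEdgeLog.clear();
    }

    // Placeholders for state backend, transport and operator logic.
    private void snapshotState() { }
    private void forwardBarrier() { }
    private void persistBackEdgeLog(List<Object> log) { }
    private void process(Object record) { }
}
```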

Implementation in Flink

• Coordinator/Driver Actor per job in the JM
  • sends periodic markers to the sources
  • collects acknowledgements with state handles and registers complete snapshots
  • injects references to the latest consistent state handles and back-edge records into pending execution graph tasks

• Tasks
  • snapshot state transparently in the given backend upon receiving a barrier and send an ack to the coordinator
  • propagate barriers further like normal records
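From the user's perspective this machinery is switched on roughly as follows (a sketch against the public DataStream API; exact class and method names, such as FsStateBackend, vary across Flink versions, and the checkpoint path is a placeholder):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnableAbsCheckpoints {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Ask the checkpoint coordinator to inject a barrier at the sources every 10 s,
        // with exactly-once (aligned) semantics.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Where task snapshots (and hence the state handles) are persisted;
        // backend class and path are illustrative and version-dependent.
        env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

        // A trivial pipeline so the job can actually run.
        env.fromElements(1, 2, 3).print();

        env.execute("job with ABS checkpoints");
    }
}
```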

Recovery


• Full rescheduling of the execution graph from the latest checkpoint - simple, no further modifications

• Partial rescheduling of the execution graph for all upstream dependencies - trickier, duplicate elimination should be guaranteed

Performance Impact


http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/

Node Failures

[Animation: the Client submits the job graph through the optimiser to the Job Manager, which deploys tasks on the Task Managers; JM state consists of the pending execution graphs and the snapshots (state handles). A failed task can be redeployed on another Task Manager, but the Job Manager itself remains a Single Point Of Failure]

Passive Standby JM

[Diagram: one leading JM with passive standby JMs; the Client connects to the current leader]

Zookeeper State
• Leader JM Address
• Pending Execution Graphs
• State Snapshots

Eventual Leader Election

[Timeline: at t0 one JM is the leader; at t1 the leader fails; at t2 the remaining JMs are recovering and electing a new leader; at t3 another JM has taken over as leader]
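Flink delegates the election itself to Zookeeper (internally via Apache Curator); purely as an illustration of the eventual-leader behaviour, and not Flink code, a Curator LeaderLatch works like this (quorum address and latch path are placeholders):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustrative leader election: all standby JMs contend on the same latch path;
// when the current leader's session dies, Zookeeper eventually grants
// leadership to one of the remaining contenders.
public class JobManagerLeaderElection {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",              // placeholder quorum address
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderLatch latch = new LeaderLatch(client, "/flink/leader"); // placeholder path
        latch.start();

        latch.await();                                     // blocks until this JM is elected
        System.out.println("This JobManager is now the leader; restoring pending jobs...");

        // ... run as leader; closing the latch or losing the session triggers re-election.
        latch.close();
        client.close();
    }
}
```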

Scheduling Jobs

[Animation: the Client submits job-1 through the optimiser to the Job Manager; the JM performs a blocking write of the execution graph to Zookeeper and only then deploys the tasks on the Task Managers]

Logging Snapshots

[Animation: tasks checkpoint their state to the State Backend; the resulting state handles are sent to the Job Manager, which logs them with asynchronous writes to Zookeeper; once the handles of all tasks are logged they form a global snapshot]

HA in Zookeeper

• Task Managers query ZK for the current leader before connecting (with lease)

• Upon standby failover, restart the pending execution graphs

• Inject state handles from the last global snapshot

Summary

• The Flink execution engine monitors long-running tasks

• Exactly-once processing guarantees can be established with snapshots without halting the execution

• The ABS algorithm persists the minimal state at a low cost

• Master HA is achieved through Zookeeper and passive failover