97
Apache Flink Streaming Resiliency and Consistency Paris Carbone - PhD Candidate KTH Royal Institute of Technology <[email protected], [email protected]> 1 Technical Talk @ Google

Tech Talk @ Google on Flink Fault Tolerance and HA

Embed Size (px)

Citation preview

Page 1: Tech Talk @ Google on Flink Fault Tolerance and HA

Apache Flink StreamingResiliency and Consistency

Paris Carbone - PhD Candidate KTH Royal Institute of Technology

<[email protected], [email protected]>1

Technical Talk @ Google

Page 2: Tech Talk @ Google on Flink Fault Tolerance and HA

Overview• The Flink Execution Engine Architecture

• Exactly-once-processing challenges

• Using Snapshots for Recovery

• The ABS Algorithm for DAGs and Cyclic Dataflows

• Recovery, Cost and Performance

• Job Manager High Availability

2

Page 3: Tech Talk @ Google on Flink Fault Tolerance and HA

Current Focus

3

Streaming APIBatch API

Flink Optimiser

Flink Runtime

Tabl

e

ML

Gel

ly

ML

Gel

ly

Stat

e M

anag

emen

t

Page 4: Tech Talk @ Google on Flink Fault Tolerance and HA

Executing Pipelines

• Streaming pipelines translate into job graphs

4

Flink Runtime

Flink Job Graph Builder/Optimiser

Flink Client User Pipeline

Job Graph

Page 5: Tech Talk @ Google on Flink Fault Tolerance and HA

Unbounded Data Processing Architectures

5(Hadoop, Spark) (Spark Streaming)

1) Streaming (Distributed Data Flow)

LONG-LIVED TASK EXECUTION STATE IS KEPT INSIDE TASKS

2) Micro-Batch

Page 6: Tech Talk @ Google on Flink Fault Tolerance and HA

The Flink Runtime Engine

• Tasks run operator logic in a pipelined fashion

• They are scheduled among workers

• State is kept within tasks

6

LONG-LIVED TASK EXECUTION

Job Manager• scheduling • monitoring

Page 7: Tech Talk @ Google on Flink Fault Tolerance and HA

Task Failures • Task failures are guaranteed to occur

• We need to make them transparent

• Can we simply recover a task from scratch?

7

task

state

consistent state

no duplicatesno loss

Page 8: Tech Talk @ Google on Flink Fault Tolerance and HA

Record Acknowledgements

• We can monitor the consumption of every event

• It guarantees no loss

• Duplicates and inconsistent state are not handled

8

task

state

consistent state

no duplicatesno loss

bookkeeping

Page 9: Tech Talk @ Google on Flink Fault Tolerance and HA

Atomic transactions per update

• It guarantees no loss and consistent state.

• No duplication can be achieved e.g. with bloom filters or caching

• Fine grained recovery

• Non-constant load in external storage can lead to unsustainable execution

9

task

state

consistent state

no duplicatesno loss

state

statestate

DistributedStorage

Trident

Page 10: Tech Talk @ Google on Flink Fault Tolerance and HA

Lessons Learned from Batch

10

batch-1batch-2

• If a batch computation fails, simply repeat computation as a transaction

• Transaction rate is constant

• Can we apply these principles to a true streaming execution?

Page 11: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

Page 12: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

“A collection of operator states and records in transit (channels) that reflects a

moment at a valid execution”

Page 13: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

“A collection of operator states and records in transit (channels) that reflects a

moment at a valid execution”

p1

p2

p3

p4

Page 14: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

“A collection of operator states and records in transit (channels) that reflects a

moment at a valid execution”

p1

p2

p3

p4

Page 15: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

“A collection of operator states and records in transit (channels) that reflects a

moment at a valid execution”

p1

p2

p3

p4

consistent cut

Page 16: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

“A collection of operator states and records in transit (channels) that reflects a

moment at a valid execution”

p1

p2

p3

p4

consistent cut

Page 17: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

“A collection of operator states and records in transit (channels) that reflects a

moment at a valid execution”

p1

p2

p3

p4

consistent cut

causality violation

Page 18: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

“A collection of operator states and records in transit (channels) that reflects a

moment at a valid execution”

p1

p2

p3

p4

consistent cut

inconsistent cut

causality violation

Page 19: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

“A collection of operator states and records in transit (channels) that reflects a

moment at a valid execution”

p1

p2

p3

p4

consistent cut

inconsistent cut

causality violation

Idea: We can resume from a snapshot that defines a consistent cut

Page 20: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

11

Assumptions• repeatable sources • reliable FIFO channels

“A collection of operator states and records in transit (channels) that reflects a

moment at a valid execution”

p1

p2

p3

p4

consistent cut

inconsistent cut

causality violation

Idea: We can resume from a snapshot that defines a consistent cut

Page 21: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

12

Page 22: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

12

Page 23: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

12

t1

snap - t1

Page 24: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

12

t1

snap - t1

Page 25: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

12

t2t1

snap - t1 snap - t2

Page 26: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

12

t2t1

snap - t1 snap - t2

Page 27: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

12

t3t2t1

snap - t1 snap - t2

Page 28: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

12

t3t2t1

reset from snap t2snap - t1 snap - t2

Page 29: Tech Talk @ Google on Flink Fault Tolerance and HA

Distributed Snapshots

12

t3t2t1

reset from snap t2snap - t1 snap - t2

Assumptions• repeatable sources • reliable FIFO channels

Page 30: Tech Talk @ Google on Flink Fault Tolerance and HA

Taking Snapshots

13

t2t1

execution snapshots

Initial approach (see Naiad)• Pause execution on t1,t2,.. • Collect state • Restore execution

Page 31: Tech Talk @ Google on Flink Fault Tolerance and HA

Lamport On the Rescue

14

“The global-state-detection algorithm is to be superimposed on the underlying computation:

it must run concurrently with, but not alter, this underlying computation”

Page 32: Tech Talk @ Google on Flink Fault Tolerance and HA

Chandy-Lamport Snapshots

15

p1 p2

p1

Using Markers/Barriers• Triggers Snapshots • Separates preshot-postshot events • Leads to a consistent execution cut

markers propagate the snapshot execution under continuous ingestion markers

Page 33: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Snapshots

16

t2t1

snap - t1 snap - t2

snapshotting snapshotting

propagating markers

Page 34: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

17

barriers

Page 35: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

18

Page 36: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

19

Page 37: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

19

Page 38: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

20

Page 39: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

21

Page 40: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

21

Page 41: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

22

Page 42: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

22

Page 43: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

22

pending

Page 44: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

22

aligning

pending

Page 45: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

23

aligning

aligning

Page 46: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

24

aligning

Page 47: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

24

aligning

Page 48: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

25

aligning

Page 49: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

25

aligning

Page 50: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting

26

Page 51: Tech Talk @ Google on Flink Fault Tolerance and HA

Asynchronous Barrier Snapshotting Benefits

27

• Taking advantage of the execution graph structure

• No records in transit included in the snapshot

• Aligning has lower impact than halting

Page 52: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

28

Page 53: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

29

Page 54: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

30

Page 55: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

30

Page 56: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

31

Page 57: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

31

deadlock

Page 58: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

• Checkpointing should eventually terminate

• Records in transit within loops should be included in the checkpoint

Page 59: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

33

Page 60: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

33

Page 61: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

34

Page 62: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

34

Page 63: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

35

downstream backup

Page 64: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

36

downstream backup

Page 65: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

36

downstream backup

Page 66: Tech Talk @ Google on Flink Fault Tolerance and HA

Snapshots on Cyclic Dataflows

37

Page 67: Tech Talk @ Google on Flink Fault Tolerance and HA

Implementation in Flink

38

• Coordinator/Driver Actor per job in JM

• sends periodic markers to sources

• collects acknowledgements with state handles and registers complete snapshots

• injects references to latest consistent state handles and back-edge records in pending execution graph tasks

• Tasks

• snapshot state transparently in given backend upon receiving a barrier and send ack to coordinator

• propagate barriers further like normal records

Page 68: Tech Talk @ Google on Flink Fault Tolerance and HA

Recovery

39

• Full rescheduling of the execution graph from the latest checkpoint - simple, no further modifications

• Partial rescheduling of the execution graph for all upstream dependencies - trickier, duplicate elimination should be guaranteed

Page 69: Tech Talk @ Google on Flink Fault Tolerance and HA

Performance Impact

40

http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/

Page 70: Tech Talk @ Google on Flink Fault Tolerance and HA

Node Failures

41

Job Manager

Task Manager

Task Manager

Task Manager

task task task

Clientjob graph

optimiser

JM State • Pending Execution Graphs • Snapshots (State Handles)

Page 71: Tech Talk @ Google on Flink Fault Tolerance and HA

Node Failures

41

Job Manager

Task Manager

Task Manager

Task Manager

task task task

Clientjob graph

optimiser

JM State • Pending Execution Graphs • Snapshots (State Handles)

Page 72: Tech Talk @ Google on Flink Fault Tolerance and HA

Node Failures

41

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Clientjob graph

optimiser

JM State • Pending Execution Graphs • Snapshots (State Handles)

Page 73: Tech Talk @ Google on Flink Fault Tolerance and HA

Node Failures

41

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Clientjob graph

optimiser

JM State • Pending Execution Graphs • Snapshots (State Handles)

Page 74: Tech Talk @ Google on Flink Fault Tolerance and HA

Node Failures

41

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Clientjob graph

optimiser

JM State • Pending Execution Graphs • Snapshots (State Handles)

Single Point Of Failure

Page 75: Tech Talk @ Google on Flink Fault Tolerance and HA

Passive Standby JM

42

JM JM JM

Zookeeper State • Leader JM Address • Pending Execution Graphs • State Snapshots

Client

Page 76: Tech Talk @ Google on Flink Fault Tolerance and HA

Eventual Leader Election

43

time

Page 77: Tech Talk @ Google on Flink Fault Tolerance and HA

Eventual Leader Election

43

JM JM JMt0time

Page 78: Tech Talk @ Google on Flink Fault Tolerance and HA

Eventual Leader Election

43

JM JM JMt0

t1 JM JM JM

time

Page 79: Tech Talk @ Google on Flink Fault Tolerance and HA

Eventual Leader Election

43

JM JM JMt0

t1

t2

JM JM JM

JM JM JM

…recovering

time

Page 80: Tech Talk @ Google on Flink Fault Tolerance and HA

Eventual Leader Election

43

JM JM JMt0

t1

t2

t3

JM JM JM

JM JM JM

…recovering

JM JMJM

time

Page 81: Tech Talk @ Google on Flink Fault Tolerance and HA

Scheduling Jobs

44

Job Manager

Task Manager

Task Manager

Task Manager

Client

optimiser

Zookeeper

Page 82: Tech Talk @ Google on Flink Fault Tolerance and HA

Scheduling Jobs

44

Job Manager

Task Manager

Task Manager

Task Manager

Client

optimiser

Zookeeperjob-1

Page 83: Tech Talk @ Google on Flink Fault Tolerance and HA

Scheduling Jobs

44

Job Manager

Task Manager

Task Manager

Task Manager

Client

optimiser

Zookeeperjob-1execution

graph

Page 84: Tech Talk @ Google on Flink Fault Tolerance and HA

Scheduling Jobs

44

Job Manager

Task Manager

Task Manager

Task Manager

Client

optimiser

Zookeeperjob-1execution

graph

blocking write in Zookeeper

Page 85: Tech Talk @ Google on Flink Fault Tolerance and HA

Scheduling Jobs

44

Job Manager

Task Manager

Task Manager

Task Manager

Client

optimiser

Zookeeperjob-1execution

graph

task

blocking write in Zookeeper

Page 86: Tech Talk @ Google on Flink Fault Tolerance and HA

Scheduling Jobs

44

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Client

optimiser

Zookeeperjob-1execution

graph

task

blocking write in Zookeeper

Page 87: Tech Talk @ Google on Flink Fault Tolerance and HA

Logging Snapshots

45

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Zookeeper

State Backend

asynchronous writes in Zookeeper

Page 88: Tech Talk @ Google on Flink Fault Tolerance and HA

Logging Snapshots

45

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Zookeeper

State Backend

asynchronous writes in Zookeeper

Checkpointing states

Page 89: Tech Talk @ Google on Flink Fault Tolerance and HA

Logging Snapshots

45

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Zookeeper

State Backend

asynchronous writes in Zookeeper

Checkpointing states

Page 90: Tech Talk @ Google on Flink Fault Tolerance and HA

Logging Snapshots

45

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Zookeeper

State Backend

asynchronous writes in Zookeeper

state handle

Checkpointing states

Page 91: Tech Talk @ Google on Flink Fault Tolerance and HA

Logging Snapshots

45

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Zookeeper

State Backend

asynchronous writes in Zookeeper

state handle

Checkpointing states

Page 92: Tech Talk @ Google on Flink Fault Tolerance and HA

Logging Snapshots

45

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Zookeeper

State Backend

asynchronous writes in Zookeeper

state handle

Checkpointing states

Page 93: Tech Talk @ Google on Flink Fault Tolerance and HA

Logging Snapshots

45

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Zookeeper

State Backend

asynchronous writes in Zookeeper

state handle

Checkpointing states

Page 94: Tech Talk @ Google on Flink Fault Tolerance and HA

Logging Snapshots

45

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Zookeeper

State Backend

asynchronous writes in Zookeeper

state handle

Checkpointing states

Page 95: Tech Talk @ Google on Flink Fault Tolerance and HA

Logging Snapshots

45

Job Manager

Task Manager

Task Manager

Task Manager

task task task task

Zookeeper

State Backend

asynchronous writes in Zookeeper

global snapshot

state handle

Checkpointing states

Page 96: Tech Talk @ Google on Flink Fault Tolerance and HA

HA in Zookeeper

• Task Managers query ZK for the current leader before connecting (with lease)

• Upon standby failover restart pending execution graphs

• Inject state handles from the last global snapshot

Page 97: Tech Talk @ Google on Flink Fault Tolerance and HA

Summary• The Flink execution monitors long running tasks

• Establishing exactly-once-processing guarantees with snapshots can be achieved without halting the execution

• The ABS algorithm persists the minimal state at a low cost

• Master HA is achieved through Zookeeper and passive failover