Tech Talk @ Google on Flink Fault Tolerance and HA


Apache Flink Streaming: Resiliency and Consistency

Paris Carbone - PhD Candidate, KTH Royal Institute of Technology
<parisc@kth.se, senorcarbone@apache.org>

Technical Talk @ Google

Overview

• The Flink Execution Engine Architecture

• Exactly-once processing challenges

• Using Snapshots for Recovery

• The ABS Algorithm for DAGs and Cyclic Dataflows

• Recovery, Cost and Performance

• Job Manager High Availability


Current Focus

[Diagram: the Flink stack - Streaming API and Batch API with the Table, ML and Gelly libraries, on top of the Flink Optimiser and the Flink Runtime, which handles state management]

Executing Pipelines

• Streaming pipelines translate into job graphs

[Diagram: the user pipeline is submitted through the Flink Client to the Flink Job Graph Builder/Optimiser, which produces the Job Graph executed by the Flink Runtime]
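For concreteness (my own sketch against the public DataStream API, not code from the talk; the operators and job name are illustrative), a pipeline like the following is what the client translates into a job graph of sources and operators:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("to be", "or not", "to be")
           // Each API call adds an operator to the pipeline the client will translate.
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.split(" ")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               }
           })
           .keyBy(t -> t.f0)  // partition by word
           .sum(1)            // rolling count per word
           .print();

        // execute() triggers job graph construction and submission to the Job Manager.
        env.execute("streaming word count");
    }
}
```

The chained calls only declare the dataflow; env.execute() is what builds the job graph and hands it to the runtime.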

Unbounded Data Processing Architectures

1) Streaming (distributed dataflow): long-lived task execution, state is kept inside the tasks

2) Micro-batch (e.g. Spark Streaming, built on batch engines such as Hadoop and Spark)

The Flink Runtime Engine

• Tasks run operator logic in a pipelined fashion

• They are scheduled among workers

• State is kept within tasks

[Diagram: long-lived task execution across Task Managers; the Job Manager handles scheduling and monitoring]
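To make "state is kept within tasks" concrete, here is a hedged sketch (not from the slides; names are illustrative) of keyed state held inside an operator through Flink's ValueState API:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps a running sum per key inside the task; this is exactly the state
// that the snapshotting machinery described later has to protect.
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(
            new ValueStateDescriptor<>("running-sum", Types.LONG));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        long current = sum.value() == null ? 0L : sum.value();
        current += in.f1;
        sum.update(current);                  // task-local state, checkpointed by the runtime
        out.collect(Tuple2.of(in.f0, current));
    }
}
```

Applied on a keyed stream, e.g. stream.keyBy(t -> t.f0).flatMap(new RunningSum()), each parallel task instance holds the sums for its share of the keys.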

Task Failures

• Task failures are guaranteed to occur

• We need to make them transparent

• Can we simply recover a task from scratch?

[Diagram: a task with its local state; the desired guarantees are consistent state, no duplicates and no loss]

Record Acknowledgements

• We can monitor the consumption of every event

• It guarantees no loss

• Duplicates and inconsistent state are not handled

[Diagram: per-record acknowledgement bookkeeping next to the task; of the desired guarantees it provides no loss, but not no-duplicates or consistent state]
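As a rough, hypothetical illustration of the bookkeeping this implies (in the spirit of Storm-style acking, not Flink code): every emitted record is tracked until it is acknowledged, and records that time out are replayed from the source.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical acker: tracks pending records and replays those that time out.
// Replay gives at-least-once (no loss) but can produce duplicates and leave
// operator state inconsistent, as the slide points out.
public class RecordAcker {
    private final Map<Long, Long> pendingSince = new ConcurrentHashMap<>();

    public void onEmit(long recordId) {
        pendingSince.put(recordId, System.currentTimeMillis());
    }

    public void onAck(long recordId) {
        pendingSince.remove(recordId);          // fully consumed downstream
    }

    public void replayTimedOut(long timeoutMillis, java.util.function.LongConsumer replay) {
        long now = System.currentTimeMillis();
        pendingSince.forEach((id, since) -> {
            if (now - since > timeoutMillis) {
                replay.accept(id);              // re-emit from the source; may duplicate
                pendingSince.put(id, now);
            }
        });
    }
}
```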

Atomic transactions per update

• It guarantees no loss and consistent state.

• No-duplication can be achieved, e.g. with Bloom filters or caching

• Fine-grained recovery

• Non-constant load in external storage can lead to unsustainable execution

[Diagram: every state update written as a transaction to distributed storage, as in Storm Trident; this provides no loss and consistent state]
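A hypothetical sketch of the idea (not Flink or Trident code): each update is applied to external storage together with the ID of the record that produced it, so replayed records can be detected and skipped.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical "external store" that applies each update as a small transaction
// and remembers which record IDs it has already applied (dedup via caching;
// a Bloom filter could approximate the seen-set at lower memory cost).
public class TransactionalStore {
    private final Map<String, Long> state = new HashMap<>();
    private final Set<Long> appliedRecordIds = new HashSet<>();

    // Returns false if the record was already applied (duplicate from a replay).
    public synchronized boolean applyUpdate(long recordId, String key, long delta) {
        if (!appliedRecordIds.add(recordId)) {
            return false;                      // duplicate: skip, keep state consistent
        }
        state.merge(key, delta, Long::sum);    // the actual state update
        return true;
    }
}
```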

Lessons Learned from Batch

[Diagram: the input stream divided into batch-1, batch-2, ...]

• If a batch computation fails, simply repeat computation as a transaction

• Transaction rate is constant

• Can we apply these principles to a true streaming execution?

Distributed Snapshots

“A collection of operator states and records in transit (channels) that reflects a moment at a valid execution”

Assumptions
• repeatable sources
• reliable FIFO channels

[Diagram: processes p1-p4 exchanging messages, with an example of a consistent cut and an inconsistent cut that contains a causality violation]

Idea: We can resume from a snapshot that defines a consistent cut
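To pin down "consistent cut" (my addition, using standard Chandy-Lamport/Lamport terminology rather than a formula from the slides):

```latex
% A cut C is a set of events, one prefix of history per process p1..p4.
% C is consistent iff it is left-closed under Lamport's happened-before relation:
% if a receive event is in the cut, the matching send event must be too.
\[
  C \text{ is consistent} \iff \forall e, e' :\;
  \big(e' \in C \;\wedge\; e \rightarrow e'\big) \Rightarrow e \in C
\]
% A snapshot taken along a consistent cut (operator states plus the records
% in transit on the crossed channels) is a global state the execution could
% actually have been in, so resuming from it never violates causality.
```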

Distributed Snapshots

[Diagram: snapshots snap-t1 and snap-t2 taken at times t1 and t2 while the execution continues; at t3 the execution is reset from snap-t2]

Taking Snapshots

[Diagram: execution snapshots taken at t1 and t2]

Initial approach (see Naiad)
• Pause execution at t1, t2, ...
• Collect state
• Restore execution

Lamport to the Rescue

“The global-state-detection algorithm is to be superimposed on the underlying computation: it must run concurrently with, but not alter, this underlying computation”

Chandy-Lamport Snapshots

[Diagram: markers flowing between p1 and p2; the markers propagate the snapshot through the execution under continuous ingestion]

Using Markers/Barriers
• Trigger snapshots
• Separate pre-shot from post-shot events
• Lead to a consistent execution cut
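A self-contained sketch of the classic Chandy-Lamport rule at a single process (illustrative Java, not Flink's implementation): on the first marker the process records its own state, forwards the marker, and keeps recording each input channel until that channel's marker arrives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative Chandy-Lamport participant: records its own state plus the
// records "in transit" on each input channel between the snapshot point and
// that channel's marker, capturing a consistent cut.
public class ChandyLamportProcess {
    private final Set<String> inputChannels;
    private final Set<String> recording = new HashSet<>();
    private final Map<String, List<Object>> inTransit = new HashMap<>();
    private boolean snapshotTaken = false;
    private Object recordedState;

    public ChandyLamportProcess(Set<String> inputChannels) {
        this.inputChannels = inputChannels;
    }

    public void onMarker(String channel) {
        if (!snapshotTaken) {
            snapshotTaken = true;
            recordedState = snapshotLocalState();   // record operator state first
            forwardMarkerToAllOutputs();            // then propagate the marker downstream
            recording.addAll(inputChannels);        // start recording the other inputs
        }
        recording.remove(channel);                  // this channel's in-transit set is now final
    }

    public void onRecord(String channel, Object record) {
        if (recording.contains(channel)) {
            inTransit.computeIfAbsent(channel, c -> new ArrayList<>()).add(record);
        }
        process(record);                            // normal processing continues throughout
    }

    // Placeholders standing in for the real operator logic and transport layer.
    private Object snapshotLocalState() { return new Object(); }
    private void forwardMarkerToAllOutputs() { }
    private void process(Object record) { }
}
```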

Asynchronous Snapshots

[Diagram: markers propagating through the dataflow while snap-t1 and snap-t2 are written out asynchronously by the snapshotting tasks]

Asynchronous Barrier Snapshotting

[Animation: barriers are injected at the sources and flow through the DAG; at operators with multiple inputs, channels that have already delivered the barrier are held pending while the operator is aligning, until the barrier has arrived on every input and the operator snapshots its state and forwards the barrier]

Asynchronous Barrier Snapshotting Benefits


• Taking advantage of the execution graph structure

• No records in transit included in the snapshot

• Aligning has lower impact than halting
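A hedged sketch of the alignment step at a task with several inputs (illustrative, not Flink's actual classes): channels that have already delivered the current barrier are buffered until the rest catch up, which is why no in-transit records need to be stored in the snapshot.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Illustrative ABS alignment: once a channel delivers the barrier it is
// "aligned" and its further records are buffered; when all channels are
// aligned the task snapshots its state, forwards the barrier, and replays
// the buffered records.
public class AbsAligningTask {
    private final Set<String> inputChannels;
    private final Set<String> aligned = new HashSet<>();
    private final Map<String, Queue<Object>> buffered = new HashMap<>();

    public AbsAligningTask(Set<String> inputChannels) {
        this.inputChannels = inputChannels;
    }

    public void onBarrier(String channel) {
        aligned.add(channel);
        if (aligned.containsAll(inputChannels)) {
            snapshotState();                         // no channel contents in the snapshot
            forwardBarrier();
            aligned.clear();
            buffered.values().forEach(q -> { q.forEach(this::process); q.clear(); });
        }
    }

    public void onRecord(String channel, Object record) {
        if (aligned.contains(channel)) {
            // This channel is ahead of the barrier on the others: hold its records back.
            buffered.computeIfAbsent(channel, c -> new ArrayDeque<>()).add(record);
        } else {
            process(record);
        }
    }

    // Placeholders for the real state backend and transport.
    private void snapshotState() { }
    private void forwardBarrier() { }
    private void process(Object record) { }
}
```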

Snapshots on Cyclic Dataflows

[Animation: applying plain barrier alignment to a dataflow with a loop; waiting for a barrier to arrive over the back-edge leads to a deadlock]

Snapshots on Cyclic Dataflows

• Checkpointing should eventually terminate

• Records in transit within loops should be included in the checkpoint

Snapshots on Cyclic Dataflows

[Animation: the task at the head of the loop snapshots without aligning on the back-edge and instead performs a downstream backup - it logs the records arriving over the back-edge until its barrier comes back around the loop, and the log is included in the snapshot]
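A hedged sketch of that back-edge handling at the loop head (illustrative only): the task snapshots and forwards its barrier immediately, then logs everything arriving over the back-edge until the barrier returns; the log is this task's extra contribution to the snapshot.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative loop-head behaviour for ABS on cyclic dataflows: records in
// transit on the back-edge are logged (downstream backup) so that the
// checkpoint terminates without waiting for the cycle to drain.
public class LoopHeadTask {
    private boolean loggingBackEdge = false;
    private final List<Object> backEdgeLog = new ArrayList<>();

    public void onBarrierFromDag() {
        snapshotState();                      // snapshot without aligning on the back-edge
        forwardBarrier();                     // the barrier eventually returns via the loop
        loggingBackEdge = true;
    }

    public void onBackEdgeRecord(Object record) {
        if (loggingBackEdge) {
            backEdgeLog.add(record);          // part of the snapshot: records inside the loop
        }
        process(record);
    }

    public void onBarrierFromBackEdge() {
        loggingBackEdge = false;
        persistBackEdgeLog(backEdgeLog);      // completes this task's snapshot
        backEdgeLog.clear();
    }

    // Placeholders for state backend, transport and operator logic.
    private void snapshotState() { }
    private void forwardBarrier() { }
    private void persistBackEdgeLog(List<Object> log) { }
    private void process(Object record) { }
}
```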

Implementation in Flink

• Coordinator/Driver Actor per job in the JM
  • sends periodic markers to the sources
  • collects acknowledgements with state handles and registers complete snapshots
  • injects references to the latest consistent state handles and back-edge records into pending execution graph tasks

• Tasks
  • snapshot state transparently in the given backend upon receiving a barrier and send an ack to the coordinator
  • propagate barriers further like normal records
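From the user's perspective this machinery is switched on roughly as follows (a sketch against the public DataStream API; exact class and method names, such as FsStateBackend, vary across Flink versions, and the checkpoint path is a placeholder):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnableAbsCheckpoints {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Ask the checkpoint coordinator to inject a barrier at the sources every 10 s,
        // with exactly-once (aligned) semantics.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Where task snapshots (and hence the state handles) are persisted;
        // backend class and path are illustrative and version-dependent.
        env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

        // A trivial pipeline so the job can actually run.
        env.fromElements(1, 2, 3).print();

        env.execute("job with ABS checkpoints");
    }
}
```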

Recovery


• Full rescheduling of the execution graph from the latest checkpoint - simple, no further modifications

• Partial rescheduling of the execution graph for all upstream dependencies - trickier, duplicate elimination should be guaranteed

Performance Impact


http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/

Node Failures

[Animation: the Client submits the job graph through the optimiser to the Job Manager, which deploys tasks on the Task Managers; JM state consists of the pending execution graphs and the snapshots (state handles). A failed task can be redeployed on another Task Manager, but the Job Manager itself remains a Single Point Of Failure]

Passive Standby JM

[Diagram: one leading JM with passive standby JMs; the Client connects to the current leader]

Zookeeper State
• Leader JM Address
• Pending Execution Graphs
• State Snapshots

Eventual Leader Election

[Timeline: at t0 one JM is the leader; at t1 the leader fails; at t2 the remaining JMs are recovering and electing a new leader; at t3 another JM has taken over as leader]
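Flink delegates the election itself to Zookeeper (internally via Apache Curator); purely as an illustration of the eventual-leader behaviour, and not Flink code, a Curator LeaderLatch works like this (quorum address and latch path are placeholders):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustrative leader election: all standby JMs contend on the same latch path;
// when the current leader's session dies, Zookeeper eventually grants
// leadership to one of the remaining contenders.
public class JobManagerLeaderElection {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",              // placeholder quorum address
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderLatch latch = new LeaderLatch(client, "/flink/leader"); // placeholder path
        latch.start();

        latch.await();                                     // blocks until this JM is elected
        System.out.println("This JobManager is now the leader; restoring pending jobs...");

        // ... run as leader; closing the latch or losing the session triggers re-election.
        latch.close();
        client.close();
    }
}
```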

Scheduling Jobs

[Animation: the Client submits job-1 through the optimiser to the Job Manager; the JM performs a blocking write of the execution graph to Zookeeper and only then deploys the tasks on the Task Managers]

Logging Snapshots

[Animation: tasks checkpoint their state to the State Backend; the resulting state handles are sent to the Job Manager, which logs them with asynchronous writes to Zookeeper; once the handles of all tasks are logged they form a global snapshot]

HA in Zookeeper

• Task Managers query ZK for the current leader before connecting (with lease)

• Upon standby failover, restart the pending execution graphs

• Inject state handles from the last global snapshot

Summary

• The Flink execution engine monitors long-running tasks

• Exactly-once processing guarantees can be established with snapshots without halting the execution

• The ABS algorithm persists the minimal state at a low cost

• Master HA is achieved through Zookeeper and passive failover