Apache Flink Deep Dive


Apache Flink Deep Dive

Vasia Kalavri, Flink Committer & KTH PhD student

vasia@apache.org

1st Apache Flink Meetup Stockholm, May 11, 2015

Flink Internals

● Job Life-Cycle
○ what happens after you submit a Flink job?

● The Batch Optimizer
○ how are execution plans chosen?

● Delta Iterations
○ how are Flink iterations special for Graph and ML apps?

what happens after you submit a Flink job?

The Flink Stack

[Diagram: the Flink stack. Libraries on top: Python, Gelly, Table, Flink ML, SAMOA. APIs: DataSet (Java/Scala), DataStream (Java/Scala), Hadoop M/R. The Batch Optimizer and Streaming Optimizer sit between the APIs and the Flink Runtime (Dataflow). Deployment modes: Local, Remote, YARN, Tez, Embedded.]

*current Flink master + few PRs

DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> result = text
    .flatMap((String str, Collector<Tuple2<String, Integer>> out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0)
    .aggregate(Aggregations.SUM, 1);

Program Life-Cycle

[Diagram: Flink Client & Optimizer → Job Manager → Task Managers]

● Flink Client & Optimizer: creates and submits the job graph
● Job Manager: creates the execution graph and deploys tasks
● Task Managers: execute tasks and send status updates

Example (WordCount): the input "O Romeo, Romeo, wherefore art thou Romeo? / Nor arm, nor face, nor any other part" produces counts such as (O, 1), (Romeo, 3), (wherefore, 1), (art, 1), (thou, 1), (nor, 3), (arm, 1), (face, 1), (any, 1), (other, 1), (part, 1).

Series of Transformations

[Diagram: Input → Operator X (First) → Operator Y (Second)]

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);

DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first.filter(str -> str.length() > 40);

second.print();
env.execute();

DataSet Abstraction

Think of it as a collection of data elements that can be produced/recovered in several ways:

● … like a Java collection
● … like an RDD
● … perhaps it is never fully materialized (because the program does not need it to be)
● … implicitly updated in an iteration

→ this is transparent to the user

Example: grep

[Diagram: input "Romeo, Romeo, where art thou Romeo?" → Load Log → three parallel operators: Grep 1 (search for str1), Grep 2 (search for str2), Grep 3 (search for str3)]

Staged (batch) execution

[Same grep dataflow: Load Log → Grep 1/2/3, executed stage by stage]

● Stage 1: create/cache the Log
● Subsequent stages: grep the log for matches
● Caching in memory, spilling to disk if needed

Pipelined execution

[Same grep dataflow: Load Log → Grep 1/2/3, with records streaming through the operators]

● Stage 1: deploy and start all operators
● Data transfer in memory, spilling to disk if needed
● Note: the Log DataSet is never "created"!
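The contrast between the two modes can be sketched in plain Java (illustrative only, not Flink's runtime; the class and method names are hypothetical): staged execution materializes the intermediate "log" data set before grepping it, while pipelined execution streams each record straight through, so the intermediate set never exists as a whole.

```java
import java.util.*;
import java.util.stream.*;

public class Execution {
    // Staged: Stage 1 creates/caches the log, Stage 2 greps the cached copy.
    static List<String> staged(List<String> input, String pattern) {
        List<String> log = new ArrayList<>(input);    // intermediate set materialized
        List<String> matches = new ArrayList<>();
        for (String line : log)
            if (line.contains(pattern)) matches.add(line);
        return matches;
    }

    // Pipelined: each record flows load -> grep -> collect lazily;
    // no intermediate list is ever built.
    static List<String> pipelined(List<String> input, String pattern) {
        return input.stream()
                .filter(line -> line.contains(pattern))
                .collect(Collectors.toList());
    }
}
```

Both variants produce the same result; they differ only in whether the intermediate data set is materialized.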


how are execution plans chosen?

Flink Batch Optimizer

Inspired by database optimizers, it creates and selects the execution plan for a user program


A Simple Program

DataSet<Tuple5<Integer, String, String, String, Integer>> orders = ...
DataSet<Tuple2<Integer, Double>> lineitems = ...

DataSet<Tuple2<Integer, Integer>> filteredOrders = orders
    .filter(...)
    .project(0, 4).types(Integer.class, Integer.class);

DataSet<Tuple3<Integer, Integer, Double>> lineitemsOfOrders = filteredOrders
    .join(lineitems)
    .where(0).equalTo(0)
    .projectFirst(0, 1).projectSecond(1)
    .types(Integer.class, Integer.class, Double.class);

DataSet<Tuple3<Integer, Integer, Double>> priceSums = lineitemsOfOrders
    .groupBy(0, 1).aggregate(Aggregations.SUM, 2);

priceSums.writeAsCsv(outputPath);

Alternative Execution Plans

[Diagram: two physical plans for the same program.

Plan A (broadcast-forward): DataSource orders.tbl → Filter/Map; DataSource lineitem.tbl; the filtered orders are broadcast to a Hybrid Hash Join (build side: orders, probe side: lineitems), followed by a Combine and a sort-based GroupReduce.

Plan B (repartition): both inputs are hash-partitioned on field [0] before the Hybrid Hash Join; the join output is hash-partitioned on fields [0,1] and forwarded to the sort-based GroupReduce.]

The best plan depends on the relative sizes of the input files.

Optimization Examples

● Evaluates physical execution strategies
○ e.g. hash join vs. sort-merge join
● Chooses data shipping strategies
○ e.g. broadcast vs. partition
● Reuses partitioning and sort orders
● Decides to cache loop-invariant data in iterations

Example: Distributed Joins

The join operator needs to create all pairs of elements from the two inputs for which the join condition evaluates to true.

case class PageVisit(url: String, ip: String, userId: Long)
case class User(id: Long, name: String, email: String, country: String)

// get your data from somewhere
val visits: DataSet[PageVisit] = ...
val users: DataSet[User] = ...

// filter the users data set
val germanUsers = users.filter((u) => u.country.equals("de"))

// join data sets
// equi-join condition (PageVisit.userId = User.id)
val germanVisits: DataSet[(PageVisit, User)] =
  visits.join(germanUsers).where("userId").equalTo("id")

Example: Distributed Joins

● Ship Strategy: the input data is distributed across all parallel instances that participate in the join
● Local Strategy: each parallel instance performs a join algorithm on its local partition

For both steps, there are multiple valid strategies that are favorable in different situations.

Repartition-Repartition Strategy

Partitions both inputs using the same partitioning function.

All elements that share the same join key are shipped to the same parallel instance and can be locally joined.
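As a rough illustration of the two phases, here is a minimal plain-Java sketch (the `RepartitionJoin` class is hypothetical, not Flink's runtime code): records are `int[]` tuples joined on field 0, shipped by a shared hash-partitioning function, then joined locally per partition with a hash join.

```java
import java.util.*;

public class RepartitionJoin {
    // Same partitioning function for both inputs, so equal keys always meet.
    static int partitionOf(int key, int numPartitions) {
        return Math.floorMod(Integer.hashCode(key), numPartitions);
    }

    static List<int[]> join(List<int[]> left, List<int[]> right, int numPartitions) {
        // Ship phase: route each record to its partition by join key (field 0).
        List<List<int[]>> leftParts = partition(left, numPartitions);
        List<List<int[]>> rightParts = partition(right, numPartitions);
        // Local phase: each partition is joined independently (hash join here).
        List<int[]> result = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            Map<Integer, List<int[]>> table = new HashMap<>();
            for (int[] l : leftParts.get(p))
                table.computeIfAbsent(l[0], k -> new ArrayList<>()).add(l);
            for (int[] r : rightParts.get(p))
                for (int[] l : table.getOrDefault(r[0], List.of()))
                    result.add(new int[]{l[0], l[1], r[1]});
        }
        return result;
    }

    static List<List<int[]>> partition(List<int[]> data, int n) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) parts.add(new ArrayList<>());
        for (int[] rec : data) parts.get(partitionOf(rec[0], n)).add(rec);
        return parts;
    }
}
```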


Broadcast-Forward Strategy

Sends one complete data set to each parallel instance that holds a partition of the other data.

The other data set remains local and is not shipped at all.
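A plain-Java sketch of the same idea (the `BroadcastJoin` class is hypothetical, not Flink's runtime): the full small input is replicated to every parallel instance, each of which builds its own hash table over the broadcast copy and probes it with its local partition of the large input.

```java
import java.util.*;

public class BroadcastJoin {
    // largePartitions: the big input, already split across parallel instances.
    // smallBroadcast: the full small input, sent to every instance.
    static List<int[]> join(List<List<int[]>> largePartitions, List<int[]> smallBroadcast) {
        List<int[]> result = new ArrayList<>();
        for (List<int[]> localPartition : largePartitions) {
            // Every instance builds its own table over the broadcast copy...
            Map<Integer, List<int[]>> table = new HashMap<>();
            for (int[] s : smallBroadcast)
                table.computeIfAbsent(s[0], k -> new ArrayList<>()).add(s);
            // ...and probes it with its local partition only (no shipping).
            for (int[] l : localPartition)
                for (int[] s : table.getOrDefault(l[0], List.of()))
                    result.add(new int[]{l[0], l[1], s[1]});
        }
        return result;
    }
}
```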


How does the Optimizer choose?

The optimizer computes cost estimates for the execution plans and picks the "cheapest" one, considering:
● the amount of data shipped over the network
● whether the data of one input is already partitioned

R-R cost: a full shuffle of both data sets over the network.
B-F cost: depends on the size of the broadcast data set and the number of parallel instances.

Read more: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
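The comparison above boils down to simple arithmetic, which can be sketched as a toy cost function (the `JoinCostModel` class is hypothetical and illustrative only; Flink's real estimator considers more factors, such as existing partitioning):

```java
public class JoinCostModel {
    static String cheaperStrategy(long bytesR, long bytesS, int parallelism) {
        // Repartition-Repartition ships (almost) all of both inputs once.
        long rrCost = bytesR + bytesS;
        // Broadcast-Forward ships the smaller input to every parallel instance.
        long bfCost = Math.min(bytesR, bytesS) * parallelism;
        return rrCost <= bfCost ? "repartition-repartition" : "broadcast-forward";
    }
}
```

With one tiny input, broadcasting wins; with two large inputs and high parallelism, repartitioning wins.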

how are Flink iterations special?

Iterate by unrolling

● for/while loop in the client submits one job per iteration step
● data reuse by caching in memory and/or disk

[Diagram: Client → Step → Step → Step → Step → Step]

Native Iterations

● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically

[Diagram annotations: caching loop-invariant data; pushing work "out of the loop"; maintaining state as an index]

Flink Iteration Operators

[Diagram: the two iteration operators.

Iterate: Input → Iterative Update Function → Result; the entire partial solution is replaced in each step.

IterateDelta: a Workset feeds the Iterative Update Function, which produces the Result and incrementally updates the Solution Set (the state).]

Delta Iteration

● Not all elements of the state are updated in each iteration.
● The elements that require an update are stored in the workset.
● The step function is applied only to the workset elements.

Example: Connected Components

Partition a graph into components by iteratively propagating the min vertex ID among neighbors.
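Under the delta-iteration model described above, min-ID propagation can be sketched in plain Java (the `DeltaConnectedComponents` class is hypothetical, not Flink's Gelly operator; it assumes an undirected graph where every vertex appears as an adjacency key): the solution set holds each vertex's current component ID, and the workset holds only the vertices whose ID changed in the previous step.

```java
import java.util.*;

public class DeltaConnectedComponents {
    static Map<Integer, Integer> run(Map<Integer, List<Integer>> adjacency) {
        Map<Integer, Integer> solution = new HashMap<>(); // vertex -> component ID
        for (int v : adjacency.keySet()) solution.put(v, v); // initially own ID
        Set<Integer> workset = new HashSet<>(adjacency.keySet());
        while (!workset.isEmpty()) {              // converged when no IDs change
            Set<Integer> next = new HashSet<>();
            for (int v : workset)                 // step function sees workset only
                for (int n : adjacency.get(v))
                    if (solution.get(v) < solution.get(n)) {
                        solution.put(n, solution.get(v)); // propagate the min ID
                        next.add(n);              // changed vertex re-enters workset
                    }
            workset = next;
        }
        return solution;
    }
}
```

IDs only ever decrease, so the loop terminates; vertices whose IDs stop changing simply drop out of the workset, which is exactly what the delta iteration exploits.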

Delta-Connected Components


Performance


Want to learn more?

Read the documentation and our blog posts!
● Memory Management
● Serialization and Type Extraction
● Streaming Optimizations
● Fault-Tolerance
