Mikio Braun – Data flow vs. procedural programming


Flink Forward 2015

Data flow vs. procedural programming: How to put your algorithms into Flink

October 13, 2015

Mikio L. Braun, Zalando SE

@mikiobraun

Python vs Flink

● Coming from Python, what are the differences in programming style I have to know to get started in Flink?

Programming how we're used to

● Computing a sum

● Tools at our disposal:

– variables

– control flow (loops, if)

– function calls as the basic unit of abstraction

def computeSum(a):
    sum = 0
    for i in range(len(a)):
        sum += a[i]
    return sum

Data Analysis Algorithms

Let's consider centering: subtracting the mean from every point, x_i ← x_i − (1/n) ∑_j x_j.

This becomes

def centerPoints(xs):
    sum = xs[0].copy()
    for i in range(1, len(xs)):
        sum += xs[i]
    mean = sum / len(xs)
    for i in range(len(xs)):
        xs[i] -= mean
    return xs

or, with NumPy, even just

xs - xs.mean(axis=0)

Don't use for-loops

● Put your data into a matrix

● Don't use for loops

Least Squares Regression

● Compute w = (X'X + λI)⁻¹ X'y

● Becomes

import numpy as np

def lsr(X, y, lam):
    d = X.shape[1]
    C = X.T.dot(X) + lam * np.eye(d)
    w = np.linalg.solve(C, X.T.dot(y))
    return w

What you learn is thinking in matrices, breaking down computations in terms of matrix algebra.

Basic tools

● Basic procedural programming paradigm

● Variables

● Ordered arrays and efficient functions on those

Advantages:

– very familiar

– close to math

Disadvantage:

– hard to scale

Parallel Data Flow

Often you have stuff like

for i in someSet:
    map x[i] to y[i]

which is inherently easy to scale.
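In data-flow terms that loop is just a parallel map over an unordered set. A minimal sketch (xs is assumed to be a Flink DataSet and f some element-wise function, both placeholders):

// each element is processed independently, so the work can be spread over nodes
val ys = xs.map(x => f(x))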

New Paradigm

● Basic building block is an (unordered) set.

● Basic operations inherently parallel

Computing, Data Flow Style

Computing a sum:

sum(xs) = xs.reduce((x, y) => x + y)

Computing a mean:

mean(xs) = xs.map(x => (x, 1))
             .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
             .map(xc => xc._1 / xc._2)

Apache Flink

● Data Flow system

● Basic building block is a DataSet[X]

● For execution, it sets up all computing nodes and streams the data through them

Apache Flink: Getting Started

● Use Scala API

● Minimal project with Maven (build tool) or Gradle

● Use an IDE like IntelliJ

● Always import org.apache.flink.api.scala._
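A minimal, hedged sketch of such a starter program (the object name and example values are made up; it assumes a standard Flink Scala project and the import above), computing the mean from the previous slide:

import org.apache.flink.api.scala._

object GettingStarted {
  def main(args: Array[String]): Unit = {
    // entry point for a batch (DataSet) program
    val env = ExecutionEnvironment.getExecutionEnvironment

    // a small example data set; the values are illustrative
    val xs: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0, 4.0)

    // the mean, data-flow style, as on the previous slide
    val mean = xs
      .map(x => (x, 1))
      .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      .map(xc => xc._1 / xc._2)

    mean.print()   // in recent Flink versions print() also triggers execution
  }
}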

Centering (First Try)

def computeMean(xs: DataSet[DenseVector]) =
  xs.map(x => (x, 1))
    .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
    .map(xc => xc._1 / xc._2)

def centerPoints(xs: DataSet[DenseVector]) = {
  val mean = computeMean(xs)
  xs.map(x => x - mean)   // does not work: mean is itself a DataSet
}

You cannot nest DataSet operations!

Sorry, restrictions apply.

● Variables hold (lazy) computations

● You can't work with sets within the operations

● Even if the result is just a single element, it's a DataSet[Elem].

● So what to do?

– cross joins

– broadcast variables

Centering (Second Try)

Works, but seems excessive because the mean is copied to each data element.

def computeMean(xs: DataSet[DenseVector]) =
  xs.map(x => (x, 1))
    .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
    .map(xc => xc._1 / xc._2)

def centerPoints(xs: DataSet[DenseVector]) = {
  val mean = computeMean(xs)
  xs.crossWithTiny(mean).map(xm => xm._1 - xm._2)
}

Broadcast Variables

● Side information sent to all worker nodes

● Can be a DataSet

● Gets accessed as a Java collection

Broadcast Variables

class BroadcastSingleElementMapper[T, B, O](fun: (T, B) => O)
    extends RichMapFunction[T, O] {

  var broadcastVariable: B = _

  @throws(classOf[Exception])
  override def open(configuration: Configuration): Unit = {
    broadcastVariable = getRuntimeContext
      .getBroadcastVariable[B]("broadcastVariable")
      .get(0)
  }

  override def map(value: T): O = {
    fun(value, broadcastVariable)
  }
}
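The mapWithBcVar helper used on the following slides is not defined in the deck; below is one possible wiring, purely as an assumption, that attaches the mapper above to a broadcast DataSet via withBroadcastSet (names and the exact signature are illustrative and may differ from the author's actual helper):

import scala.reflect.ClassTag

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._

object BroadcastSyntax {
  // hypothetical extension method: map over ds with fun, where fun's second
  // argument comes from the single-element broadcast DataSet bcVar
  implicit class DataSetWithBcVar[T](ds: DataSet[T]) {
    def mapWithBcVar[B, O: TypeInformation: ClassTag](bcVar: DataSet[B])(fun: (T, B) => O): DataSet[O] =
      ds.map(new BroadcastSingleElementMapper[T, B, O](fun))
        .withBroadcastSet(bcVar, "broadcastVariable")
  }
}

With something like this in scope, the centering and least-squares code on the next slides reads as written, e.g. xs.mapWithBcVar(mean)((x, m) => x - m).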

Centering (Third Try)

def computeMean(xs: DataSet[DenseVector]) =
  xs.map(x => (x, 1))
    .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
    .map(xc => xc._1 / xc._2)

def centerPoints(xs: DataSet[DenseVector]) = {
  val mean = computeMean(xs)
  xs.mapWithBcVar(mean)((x, m) => x - m)
}

Intermediate Results pattern

Procedural style:

x = someComputation()
y = someOtherComputation()
z = someComputationOn(dataSet, x)
result = moreComputationOn(y, z)

Data flow style:

val x = someDataSetComputation()
val y = someOtherDataSetComputation()

val z = dataSet.mapWithBcVar(x)((d, x) => …)

val result = anotherDataSet.mapWithBcVar((y, z)) { (d, yz) =>
  val (y, z) = yz
  …
}
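A concrete, hedged instance of this pattern (it assumes the mapWithBcVar sketch from earlier, the usual import org.apache.flink.api.scala._, an ExecutionEnvironment named env, and made-up values): two intermediate results are combined and broadcast into a map over the full data set.

// rescale every element to [0, 1] using two intermediate results (min and max)
val xs: DataSet[Double] = env.fromElements(3.0, 7.0, 1.0, 9.0)

val minX = xs.reduce((a, b) => math.min(a, b))
val maxX = xs.reduce((a, b) => math.max(a, b))

// combine the two single-element DataSets into one broadcast variable
val bounds = minX.cross(maxX)

val rescaled = xs.mapWithBcVar(bounds) { (x, b) =>
  val (lo, hi) = b
  (x - lo) / (hi - lo)
}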

Matrix Algebra

● No ordered sets per se in the data flow context.

Vector operations by explicit joins

● Encode vector (a1, a2, …, an) with

{(1, a1), (2, a2), … (n, an)}

● Addition:

– a.join(b).where(0).equalTo(0)
   .map(ab => (ab._1._1, ab._1._2 + ab._2._2))

after the join: {((1, a1), (1, b1)), ((2, a2), (2, b2)), …}
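A runnable, hedged sketch of this index-encoded addition (the vector contents and object name are made up):

import org.apache.flink.api.scala._

object VectorAddByJoin {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // (index, value) encoding of a = (1, 2, 3) and b = (10, 20, 30)
    val a = env.fromElements((1, 1.0), (2, 2.0), (3, 3.0))
    val b = env.fromElements((1, 10.0), (2, 20.0), (3, 30.0))

    // join on the index field, then add the matching components
    val sum = a.join(b).where(0).equalTo(0)
      .map(ab => (ab._1._1, ab._1._2 + ab._2._2))

    sum.print()   // (1,11.0), (2,22.0), (3,33.0), in no particular order
  }
}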

Back to Least Squares Regression

Two operations: computing X'X and X'Y

def lsr(xys: DataSet[(DenseVector, Double)]) = {
  val XTX = xys.map(xy => xy._1.outer(xy._1)).reduce(_ + _)
  val XTY = xys.map(xy => xy._1 * xy._2).reduce(_ + _)

  // X'Y is broadcast into the single-element X'X data set;
  // \ solves the linear system (X'X) w = X'Y
  XTX.mapWithBcVar(XTY)((xtx, xty) => xtx \ xty)
}

Summary and Outlook

● Procedural vs. Data Flow

– basic building blocks: elementwise operations on unordered sets

– can't be nested

– combine intermediate results via broadcast vars

● Iterations

● Beware of TypeInformation implicits.
