StratoSphere: Above the Clouds
Stratosphere: Massively Parallel Analytics
Alexander Alexandrov, Stephan Ewen, Joseph Harjung, Fabian Hüske, Moritz Kaufmann, Aljoscha Krettek, Volker Markl, Kostas Tzoumas, Sebastian Schelter

Big Data Ecosystem & The Stratosphere Project


Page 1: Big Data Ecosystem & The Stratosphere Project

StratoSphere: Above the Clouds

Stratosphere

Massively Parallel Analytics

Alexander Alexandrov, Stephan Ewen, Joseph Harjung, Fabian Hüske, Moritz Kaufmann, Aljoscha Krettek, Volker Markl, Kostas Tzoumas, Sebastian Schelter

Page 2: Big Data Ecosystem & The Stratosphere Project

Stratosphere – Parallel Analytics Beyond MapReduce

The Big Data Context

Large Quantities of Data

Diverse Data Structures

Complex Analysis Tasks

Page 3: Big Data Ecosystem & The Stratosphere Project

SQL

?

Page 4: Big Data Ecosystem & The Stratosphere Project

SQL NoSQL

?

Page 5: Big Data Ecosystem & The Stratosphere Project

NoMapReduce

SQL NoSQL

?

Page 6: Big Data Ecosystem & The Stratosphere Project

NoMapReduce

SQL NoSQL

SQL--

?

Page 7: Big Data Ecosystem & The Stratosphere Project

NoMapReduce

SQL NoSQL

SQL--

?

?

Page 8: Big Data Ecosystem & The Stratosphere Project

NoMapReduce

SQL NoSQL

SQL--

?

?

Question 1: Is it faster to add a HiveQL parser and an HDFS adapter to your favorite parallel database, or develop a parallel engine from scratch?

Page 9: Big Data Ecosystem & The Stratosphere Project

NoMapReduce

SQL NoSQL

SQL--

?

?

Question 1: Is it faster to add a HiveQL parser and an HDFS adapter to your favorite parallel database, or develop a parallel engine from scratch?

Question 2: Have we closed the circle (“we want SQL!”) or is there more in analytics?

Page 10: Big Data Ecosystem & The Stratosphere Project


Page 11: Big Data Ecosystem & The Stratosphere Project


scripting

Page 12: Big Data Ecosystem & The Stratosphere Project


scripting

SQL--

Page 13: Big Data Ecosystem & The Stratosphere Project


scripting

SQL--

XQuery+/-

Page 14: Big Data Ecosystem & The Stratosphere Project


scripting

SQL--

scalable parallel sort

XQuery+/-

Page 15: Big Data Ecosystem & The Stratosphere Project

scripting

SQL--

scalable parallel sort

XQuery+/-

not a sorting problem!

Page 16: Big Data Ecosystem & The Stratosphere Project

scripting

SQL--

columnstore--

scalable parallel sort

XQuery+/-

not a sorting problem!

Page 17: Big Data Ecosystem & The Stratosphere Project

scripting

SQL--

columnstore--

scalable parallel sort

a query plan

XQuery+/-

not a sorting problem!

Page 18: Big Data Ecosystem & The Stratosphere Project

scripting

SQL--

columnstore--

scalable parallel sort

a query plan

XQuery+/-

not a sorting problem!

Question 3: How do we architect systems for the next wave of rich data analysis?

Page 19: Big Data Ecosystem & The Stratosphere Project


Page 20: Big Data Ecosystem & The Stratosphere Project

10 commandments for Big Data Analytics

Page 21: Big Data Ecosystem & The Stratosphere Project

case class Vertex(id: Int, component: Int)
case class Edge(from: Int, to: Int)

val vertices = hdfsFile(…)
val edges = hdfsFile(…)

val result = step iterate (vertices distinctBy {_.id}, vertices)

def step = (s: Data[Vertex], ws: Data[Vertex]) => {
  // Propose each workset vertex's component to its neighbors
  val allNeighbors = ws join edges on {_.id} isEqualTo {_.from} using {(v, e) => Vertex(e.to, v.component)}
  // Keep the smallest proposed component per vertex
  val minNeighbors = allNeighbors reduceBy {_.id} (minBy _.component)
  // Keep only proposals that improve on the current solution
  val s1 = minNeighbors join s on {_.id} isEqualTo {_.id} using {(c, o) => if (c.component < o.component) Some(c) else None}
  (s1, s1)
}

(I) Thou shalt… use declarative languages!

Page 22: Big Data Ecosystem & The Stratosphere Project


(I) Thou shalt…

… use declarative languages!

Executive Summary

Connected components of a graph.

- Joins and aggregations on custom data types

- Incremental / Delta Iterations

- Mixture of operators and UDFs

Page 23: Big Data Ecosystem & The Stratosphere Project


(II) Thou shalt…

… accept external (dynamic) sources! “In situ” data - no load

Page 24: Big Data Ecosystem & The Stratosphere Project


(III) Thou shalt…

… use rich primitives! (beyond MapReduce)

Page 25: Big Data Ecosystem & The Stratosphere Project


(III) Thou shalt…

… use rich primitives! (beyond MapReduce)

Map

Reduce

Cross

Match

CoGroup

Page 26: Big Data Ecosystem & The Stratosphere Project


(IV) Thou shalt…

… define queries and UDFs in the same language!

UDF

Query definition

Page 27: Big Data Ecosystem & The Stratosphere Project


(V) Thou shalt…

… use an algebraic but rich data model!

Custom Object Oriented and Functional Data Types

Use functions as references to fields/attributes

Page 28: Big Data Ecosystem & The Stratosphere Project


(VI) Thou shalt…

… optimize! Auto-parallelization and optimization à la relational databases.

Page 29: Big Data Ecosystem & The Stratosphere Project


(VII) Thou shalt…

… not treat UDFs as black boxes!

Static code analysis of UDFs to determine field accesses and modifications

Vastly increases optimization potential

Page 30: Big Data Ecosystem & The Stratosphere Project


(VIII) Thou shalt…

… iterate/recurse!

Step function

Needed for most interesting analysis cases

Page 31: Big Data Ecosystem & The Stratosphere Project


(IX) Thou shalt…

… exploit dynamic computation!

[Figure: # Vertices (thousands) per superstep, Naïve (Bulk) vs. Incremental]

Pregel as a Stratosphere plan with comparable performance.

Page 32: Big Data Ecosystem & The Stratosphere Project


(X) Thou shalt…

… use a scalable and efficient execution engine!

Pipeline and data parallelism, flexible checkpointing, optimized network data transfers

Page 33: Big Data Ecosystem & The Stratosphere Project

Conclusion

Write like a programming language

Execute like a Database

Page 34: Big Data Ecosystem & The Stratosphere Project

Conclusion

Write like a programming language

Execute like a Database

Add a bit of "languages and compilers" sauce to the database stack…

Page 35: Big Data Ecosystem & The Stratosphere Project

Stratosphere Programming Stack

Nephele Dataflow Engine

Runtime Operators

SOPREMO Compiler

Meteor Script

Scala

Scala-Compiler Plugin

Stratosphere Optimizer

Nephele Parallel Dataflow

PACT Program

Layered approach – several entry points to the system

Page 36: Big Data Ecosystem & The Stratosphere Project


Page 37: Big Data Ecosystem & The Stratosphere Project

Pact program

Scala program

Scala compiler plug-in

Runtime: Hash- and sort-based out-of-core operator implementations, memory management

Stratosphere optimizer: Picks data shipping and local strategies, operator order

Execution plan

Nephele Execution Engine: Task scheduling, network data transfers, resource allocation, checkpointing

Job graph

Execution graph

Page 38: Big Data Ecosystem & The Stratosphere Project

Pact program

Scala program

Scala compiler plug-in

Runtime: Hash- and sort-based out-of-core operator implementations, memory management

Stratosphere optimizer: Picks data shipping and local strategies, operator order

Execution plan

Nephele Execution Engine: Task scheduling, network data transfers, resource allocation, checkpointing

Job graph

Execution graph

1

2

3

Page 39: Big Data Ecosystem & The Stratosphere Project

Part 1

PARALLEL PROGRAMMING MODEL

Page 40: Big Data Ecosystem & The Stratosphere Project

Background: PACTs

Data

Second-order function

First-order function (UDF)

Data

Map, Reduce, Cross, Match, CoGroup

D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Page 41: Big Data Ecosystem & The Stratosphere Project

Background: PACTs

■ Data flow operators (UDFs) are first-order functions
■ Application of UDFs to the data through second-order functions that define parallel semantics
■ Declarative, as execution strategies are not fixed

Reduce (on A): sum(B), avg(C)

Match (A = D): if (A>3) emit

Map: C := max(A,B)

Map: if (D>4) emit

Sink 1

Source 1: Extract (A,B)

Source 2: Extract (D,E)

D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
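To make the second-order functions concrete, here is a minimal sequential sketch over plain Scala collections. It only illustrates when each function calls the UDF. The object name `PactSketch` and all signatures are assumptions made for this illustration and are not the Stratosphere API; a real engine would run the UDF calls in parallel and choose the execution strategy itself.

```scala
// Sketch only: sequential stand-ins for the five PACT second-order functions.
// Names and signatures are illustrative assumptions, not the Stratosphere API.
object PactSketch {
  // Map: the UDF is called once per record.
  def map[A, B](in: Seq[A])(udf: A => Seq[B]): Seq[B] =
    in.flatMap(udf)

  // Reduce: the UDF is called once per group of records sharing a key.
  def reduce[A, K, B](in: Seq[A])(key: A => K)(udf: (K, Seq[A]) => Seq[B]): Seq[B] =
    in.groupBy(key).toSeq.flatMap { case (k, group) => udf(k, group) }

  // Cross: the UDF is called once per pair of the Cartesian product.
  def cross[A, B, C](left: Seq[A], right: Seq[B])(udf: (A, B) => Seq[C]): Seq[C] =
    for (l <- left; r <- right; out <- udf(l, r)) yield out

  // Match: the UDF is called once per pair of records with equal keys.
  def matchOn[A, B, K, C](left: Seq[A], right: Seq[B])(lk: A => K, rk: B => K)(
      udf: (A, B) => Seq[C]): Seq[C] =
    for (l <- left; r <- right if lk(l) == rk(r); out <- udf(l, r)) yield out

  // CoGroup: the UDF is called once per key with both groups for that key.
  def coGroup[A, B, K, C](left: Seq[A], right: Seq[B])(lk: A => K, rk: B => K)(
      udf: (K, Seq[A], Seq[B]) => Seq[C]): Seq[C] = {
    val keys = (left.map(lk) ++ right.map(rk)).distinct
    keys.flatMap(k => udf(k, left.filter(l => lk(l) == k), right.filter(r => rk(r) == k)))
  }
}
```

The UDFs stay ordinary first-order functions; only the second-order function decides how records are grouped or paired before each call.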

Page 42: Big Data Ecosystem & The Stratosphere Project

Iterative Programs

[Figure: two iterative data flows. Bulk Iteration (Page Rank): Match (on pid) (tid, k=r*p) joins P and A, Reduce (on tid) (pid=tid, r=∑ k) sums up partial ranks. Incremental Iteration (Connected Components): Match and CoGroup over Wi, Si and Edges, producing Wi+1 and Di+1.]

S. Ewen, K. Tzoumas, M. Kaufmann, V. Markl: Spinning Fast Iterative Data Flows. PVLDB 5(11), 2012

Page 43: Big Data Ecosystem & The Stratosphere Project

How does it look in code?

val result = step iterate (vertices distinctBy {_.id}, messages)

def step = (s: Data[Vertex], ws: Data[Message]) => {
  val sNext = ws join s on {…} isEqualTo {…} using {…}
  val wNext = sNext join edges on …
  (sNext, wNext)
}

Java

Scala
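The shape of such an incremental iteration can be tried out without a cluster. The following is a sequential sketch over plain Scala collections, assuming a solution set keyed by vertex id and a workset of changed vertices. `DeltaIterationSketch`, `iterate` and `connectedComponents` are hypothetical names for this illustration, not Stratosphere calls.

```scala
// Sketch only: a sequential workset iteration for Connected Components.
// All names are illustrative assumptions, not the Stratosphere API.
object DeltaIterationSketch {
  case class Vertex(id: Int, component: Int)
  case class Edge(from: Int, to: Int)

  // Applies `step` until the workset is empty. `step` returns the delta to merge
  // into the solution set and the next workset. Also returns the superstep count.
  def iterate(solution: Map[Int, Vertex], workset: Seq[Vertex])(
      step: (Map[Int, Vertex], Seq[Vertex]) => (Seq[Vertex], Seq[Vertex])): (Map[Int, Vertex], Int) = {
    var s = solution
    var ws = workset
    var supersteps = 0
    while (ws.nonEmpty) {
      val (delta, next) = step(s, ws)
      s = s ++ delta.map(v => v.id -> v)
      ws = next
      supersteps += 1
    }
    (s, supersteps)
  }

  // Edges are directed; for an undirected graph pass both directions.
  def connectedComponents(vertexIds: Seq[Int], edges: Seq[Edge]): Map[Int, Int] = {
    val initial = vertexIds.distinct.map(id => Vertex(id, id))
    val step = (s: Map[Int, Vertex], ws: Seq[Vertex]) => {
      // Join workset with edges: propose each changed component to the neighbors.
      val candidates = for (v <- ws; e <- edges if e.from == v.id) yield Vertex(e.to, v.component)
      // Keep the minimum proposed component per target vertex.
      val minCandidates = candidates.groupBy(_.id).values.map(_.minBy(_.component)).toSeq
      // Join with the solution set and keep only real improvements.
      val changed = minCandidates.filter(c => s.get(c.id).exists(o => c.component < o.component))
      (changed, changed)
    }
    val (solution, _) = iterate(initial.map(v => v.id -> v).toMap, initial)(step)
    solution.map { case (id, v) => id -> v.component }
  }
}
```

Only vertices whose component changed in a superstep are fed into the next one, so the workset shrinks as the result converges.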

Page 44: Big Data Ecosystem & The Stratosphere Project

Incremental Iterations matter…

[Figure: # Vertices (thousands) per superstep (0 to 33), Naïve (Bulk) vs. Incremental]

Changes to the iteration's result for Connected Components in each superstep…

[Figure: bar chart for the Twitter and Webbase (20) data sets]

… and runtime.

Page 45: Big Data Ecosystem & The Stratosphere Project

Pregel as a Pact program

Page 46: Big Data Ecosystem & The Stratosphere Project

Part 2

THE PROGRAM COMPILER AND OPTIMIZER

Page 47: Big Data Ecosystem & The Stratosphere Project

Why an Optimizer for such Programs?

Do you want to hand-optimize that?

Page 48: Big Data Ecosystem & The Stratosphere Project

Pact Optimizer Overview

■ Cost-based optimizer produces physical execution plan given PACT program
□ Annotates data channels with distribution patterns, e.g., broadcast, partition
□ Chooses physical execution strategies (e.g., hash/sort)
□ Reorders PACT functions
Deeply embeds MapReduce-style UDFs in the optimization

■ Optimization of iterative programs
□ Passing data between super-steps
□ Loop-invariant data
□ Efficient state maintenance in partitioned indexes

■ Challenge: Semantics of user-defined functions unknown

Page 49: Big Data Ecosystem & The Stratosphere Project

Current architecture

1) Analyze

2) Reorder

3) Parallelize

Page 50: Big Data Ecosystem & The Stratosphere Project

1) Opening the Black Boxes …

Analyze user code to discover:

■ Read set Rf: Attributes of the input record(s) that might influence output
■ Write set Wf: Attributes of the output record(s) that might have different values from respective input attributes
■ Emit cardinality Ef: Bounds on records emitted per call (1, >1, …)

PACT f

(Rf, Wf, Ef)

Page 51: Big Data Ecosystem & The Stratosphere Project

… via Static Code Analysis

void match(Record left, Record right, Collector col) {
  Record out = copy(left);
  if (left.get(0) > 3) {
    double a = right.get(2);
    out.set(2, 1.0 / a);
  }
  out.set(1, 42);
  out.set(3, right.get(0));
  out.set(4, right.get(1));
  out.set(5, right.get(2));
  col.emit(out);
}

Feasible:
1. No control flow between operators
2. Record data model, fixed API

Correct:
■ Difficulty comes from different code paths
■ Correctness guaranteed through conservatism
■ Add to R, W when in doubt
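The "add to R, W when in doubt" rule can be sketched on a toy representation of a UDF body. The statement types below (`GetField`, `SetField`, `Branch`) are a hypothetical mini-IR invented for this illustration; the real analysis works on compiled user code and handles two inputs, which this single-input sketch does not attempt.

```scala
// Sketch only: conservative read/write-set estimation on a toy statement list.
// The mini-IR is an assumption for illustration, not Stratosphere's analyzer.
object ConservativeAnalysisSketch {
  sealed trait Stmt
  case class GetField(field: Int) extends Stmt // the UDF reads an input field
  case class SetField(field: Int) extends Stmt // the UDF writes an output field
  case class Branch(thenPart: Seq[Stmt], elsePart: Seq[Stmt]) extends Stmt

  case class Sets(read: Set[Int], write: Set[Int]) {
    def union(o: Sets): Sets = Sets(read ++ o.read, write ++ o.write)
  }

  // A field is in the read or write set if ANY code path touches it.
  // Taking the union over both branches is what keeps the result conservative.
  def analyze(body: Seq[Stmt]): Sets =
    body.foldLeft(Sets(Set.empty, Set.empty)) {
      case (acc, GetField(f))  => acc.union(Sets(Set(f), Set.empty))
      case (acc, SetField(f))  => acc.union(Sets(Set.empty, Set(f)))
      case (acc, Branch(t, e)) => acc.union(analyze(t)).union(analyze(e))
    }
}
```

Over-approximating the sets can only block a reordering that would have been safe; it never permits an unsafe one.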

Page 52: Big Data Ecosystem & The Stratosphere Project

Conditions for reordering UDFs

Enabled optimizations:
■ Selection push-down
■ (Bushy) join reordering
■ Aggregation push-down
□ Equivalent to invariant grouping transformation [Chaudhuri & Shim 1994]
■ Reordering of non-relational Reduce functions

Theorem 1: Two Map operators can be reordered if their UDFs have only read-read conflicts.

Theorem 2: For a Map and a Reduce, we need in addition the Reduce key groups to be preserved.
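As a rough sketch, the two conditions can be written as set checks over read and write sets. `UdfProps` and the function names are assumptions for this illustration. Whether a Map preserves the Reduce key groups is passed in as a flag here, because the slides do not spell out how that property is derived.

```scala
// Sketch only: reordering checks based on read and write sets.
// Types and names are illustrative assumptions, not the optimizer's API.
object ReorderSketch {
  case class UdfProps(read: Set[String], write: Set[String])

  // True if the two UDFs conflict at most on attributes that both only read.
  def onlyReadReadConflicts(f: UdfProps, g: UdfProps): Boolean =
    (f.write & g.read).isEmpty && // no write-read conflict
    (f.read & g.write).isEmpty && // no read-write conflict
    (f.write & g.write).isEmpty   // no write-write conflict

  // Theorem 1: two Map operators.
  def canReorderMaps(f: UdfProps, g: UdfProps): Boolean =
    onlyReadReadConflicts(f, g)

  // Theorem 2: a Map and a Reduce. The key-group condition is supplied by the caller.
  def canReorderMapReduce(map: UdfProps, reduce: UdfProps, preservesKeyGroups: Boolean): Boolean =
    onlyReadReadConflicts(map, reduce) && preservesKeyGroups
}
```

For example, a filter that reads D and writes nothing can swap places with a Map that reads A and B and writes C, while a Map that reads C cannot move ahead of the Map that writes C.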

Page 53: Big Data Ecosystem & The Stratosphere Project

Optimizer Architecture (I)

■ Simple enumeration algorithm that checks pairwise reordering for all neighboring operators
■ Current problem: Walking all points in the search space
■ Next: Deduce join-graph-like information from reordering degrees-of-freedom
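A simple enumeration of this kind might look as follows for a linear chain of operators. This is a sketch under stated assumptions: `enumerate` and `canSwap` are hypothetical names, `canSwap` is assumed to be symmetric, and real plans are trees or DAGs rather than chains.

```scala
// Sketch only: enumerate all operator orders reachable by swapping neighbors.
// Assumes a linear operator chain and a symmetric pairwise reordering check.
object EnumerationSketch {
  def enumerate[Op](plan: Vector[Op])(canSwap: (Op, Op) => Boolean): Set[Vector[Op]] = {
    var seen = Set(plan)
    var frontier = List(plan)
    while (frontier.nonEmpty) {
      val current = frontier.head
      frontier = frontier.tail
      // Try every pair of neighboring operators in the current order.
      for (i <- 0 until current.length - 1 if canSwap(current(i), current(i + 1))) {
        val swapped = current.updated(i, current(i + 1)).updated(i + 1, current(i))
        if (!seen(swapped)) {
          seen += swapped
          frontier = swapped :: frontier
        }
      }
    }
    seen
  }
}
```

With n freely reorderable operators this visits all n! orders, which is the search-space problem the slide points out.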

Page 54: Big Data Ecosystem & The Stratosphere Project

Optimizer Architecture (II)

■ Operators are defined in terms of possible global data properties (partitioning/replication/...) and local data properties (order/grouping/uniqueness/...)

■ Nodes propagate requested properties top-down
□ Filtered by UDF's field modification
□ Filtered by incompatibility
□ Every data flow edge has a set of possible requested properties

■ Requested properties are instantiated at each point
□ Global properties by exchange strategies
□ Local properties by local operators

■ Requested properties used for pruning candidates (as with interesting properties)

Page 55: Big Data Ecosystem & The Stratosphere Project

Optimizer Architecture (III)

■ Determine static and dynamic data flow paths for iterations
□ Static path contains data that is loop-invariant

■ Use heuristics to place caches such that loop-invariant computations are not repeated
□ Cache loop-invariant data also in ordered form, or as hash tables

■ Weigh costs for static and dynamic path differently
□ Optimizer favors plans that "push" work into static path
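The effect of caching loop-invariant data can be sketched with a simplified PageRank loop over plain Scala collections. The names (`Transition`, `pageRank`, `cachedByPid`) are assumptions for this illustration, and the sketch leaves out damping and dangling pages. The point is only that the table over the transition data A is built once, outside the loop.

```scala
// Sketch only: the hash table over the loop-invariant input A is built once
// (static path); only the rank vector changes per iteration (dynamic path).
object LoopInvariantCacheSketch {
  case class Transition(pid: Int, tid: Int, p: Double)

  def pageRank(ranks: Map[Int, Double], a: Seq[Transition], iterations: Int): Map[Int, Double] = {
    // Static path: group A by pid once and reuse it in every iteration.
    val cachedByPid: Map[Int, Seq[Transition]] = a.groupBy(_.pid)
    var r = ranks
    for (_ <- 1 to iterations) {
      // Match (on pid): probe the cached table with the current ranks, k = r * p.
      val partial = for {
        (pid, rank) <- r.toSeq
        t <- cachedByPid.getOrElse(pid, Seq.empty)
      } yield (t.tid, rank * t.p)
      // Reduce (on tid): sum up the partial ranks.
      r = partial.groupBy(_._1).map { case (tid, ks) => tid -> ks.map(_._2).sum }
    }
    r
  }
}
```

Moving the `groupBy` inside the loop would give the same result but repeat the loop-invariant work in every iteration.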

Page 56: Big Data Ecosystem & The Stratosphere Project

PageRank: Two Optimizer Plans

[Figure: two execution plans for the PageRank iteration. Both join P and A with Match (on pid) (tid, k=r*p) and sum up partial ranks with Reduce (on tid) (pid=tid, r=∑ k) over inputs p (pid, r) and A (pid, tid, p). Strategy labels in the plans: broadcast, partition (pid), part./sort (tid), buildHashTable (pid), probeHashTable (pid), CACHE, fifo.]

Page 57: Big Data Ecosystem & The Stratosphere Project

Part 3

THE FUNCTIONAL LANGUAGE COMPILATION

Page 58: Big Data Ecosystem & The Stratosphere Project

The Compiler Mismatch

The Database Approach (Query Compiler):
Parser/Checker, Optimizer, Code Generation, Runtime
Code Generation AFTER context of operation is fixed.

UDF Systems: MapReduce & Stratosphere (original) (Language Compiler):
Parser/Checker, Code Generation, Optimizer, Runtime
Code Generation BEFORE context of operation is fixed.

Page 59: Big Data Ecosystem & The Stratosphere Project

The Program Compilation Pipeline

Program Code

Parser/Checker

ByteCode Generator

Analyzer and Code Generator

Global Schema Generator

Pact Optimizer

Program Instantiation

Schema and Code Finalization

Parallel Data Flow Generator

Parallel Data Flow

Language Compiler

Page 60: Big Data Ecosystem & The Stratosphere Project

1) Analyzing Data Types

■ Supported Types
□ Primitive (Integers, Floating-Point, Strings, …), Lists, Tuples, Product Types (classes), Summation Types (class hierarchies), Recursive Types

■ Data types are logically flattened
□ Some fields are transparent members of the flat model, some are black box members

■ Transparent members may be referenced in selector functions

■ Selector Functions are likewise analyzed and translated into logical positions
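Logical flattening and the translation of selectors into positions can be sketched with a small type descriptor. The descriptor ADT (`Primitive`, `Opaque`, `Record`) and the dotted selector strings are assumptions made for this illustration. The compiler plug-in derives this information from the Scala types and selector functions themselves.

```scala
// Sketch only: flatten a nested type description and resolve a selector path
// to a logical position. The descriptor ADT is an illustrative assumption.
object FlattenSketch {
  sealed trait TypeDesc
  case object Primitive extends TypeDesc // transparent leaf field
  case object Opaque extends TypeDesc    // black-box member
  case class Record(fields: Seq[(String, TypeDesc)]) extends TypeDesc

  // Returns (path, isTransparent) for every leaf, in logical position order.
  def flatten(t: TypeDesc, prefix: String = ""): Seq[(String, Boolean)] = t match {
    case Primitive => Seq(prefix -> true)
    case Opaque    => Seq(prefix -> false)
    case Record(fs) =>
      fs.flatMap { case (name, ft) => flatten(ft, if (prefix.isEmpty) name else prefix + "." + name) }
  }

  // Only transparent members can be referenced by a selector.
  def position(t: TypeDesc, selector: String): Option[Int] = {
    val idx = flatten(t).indexWhere { case (path, transparent) => transparent && path == selector }
    if (idx >= 0) Some(idx) else None
  }
}
```

A nested field such as `pos.y` ends up as one position in the flat model, while a black-box member occupies a position that selectors cannot reference.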

Page 61: Big Data Ecosystem & The Stratosphere Project

2) Generating Glue Code

■ User Code is pure Scala, no Stratosphere-specific types or interfaces
■ Wrapper code necessary to run it as a UDF in Stratosphere
■ Serializer/Comparator Code is generated as a template (omitting exact field positions, storing logical positions)
■ Code is inserted by modifying the program's Abstract-Syntax-Tree

Page 62: Big Data Ecosystem & The Stratosphere Project

3) Schema Generation

■ Schema generated from logical flattened model
■ Each field in every operator's result gets a unique name
□ Unless exact copy of an input field (info from code analysis)

■ Run Stratosphere optimizer
□ Potentially reorders functions

■ Prune unused fields early
□ Information whether fields are accessed by UDF from code analysis

■ Create physical data layout
■ Finalize serializer / comparator code

Page 63: Big Data Ecosystem & The Stratosphere Project

Some preliminary results...

Page 64: Big Data Ecosystem & The Stratosphere Project

Related Work

■ MapReduce
■ Pig, JAQL, Hive
■ AQL
■ Scope
■ Datalog for Machine Learning
■ BOOM
■ Twister / HaLoop
■ Spark
■ Naiad
■ Flume Java / Plume Java
■ Scalops
■ Jet
■ LINQ