58
Jet: An embedded DSL for Distributed Data Parallel Compu9ng Master Thesis Project EPFL 2012 Stefan Ackermann (ETHZ) Supervisors: MSc. Vojin Jovanovic (EPFL) Prof. Mar9n Odersky (EPFL)

Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Jet:  An  embedded  DSL  for  Distributed  Data  Parallel  Compu9ng  

Master  Thesis  Project  EPFL  2012  

Stefan  Ackermann  (ETHZ)    

Supervisors:  MSc.  Vojin  Jovanovic  (EPFL)  Prof.  Mar9n  Odersky  (EPFL)  

Page 2: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Intro:  Big  Data  

•  Huge  data  sets:  Order  of  terabytes  •  Data  does  not  fit  on  one  machine  •  Variety  of  input  formats  

         =>  Databases  not  suitable    •  Processing  on  commodity  hardware  •  Fault  tolerant  computa9ons  

2  

Page 3: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Intro:  Big  Data  in  industry  

•  Pat  Gelsinger  (CEO  of  EMC):  Big  Data  is  a  business  of  70  Billion  $  up,  with  an  annual  growth  of  15%  

•  Big  internet  companies  are  all  invested:  Google  (MapReduce,  FlumeJava,  …),  Facebook  (Hive),  Yahoo!  (Pig)  and  Microso]  /  Bing  (Dryad,  DryadLINQ,  Scope)  

3  

Page 4: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Intro:  MapReduce  

4  

Map   Reduce  

Page 5: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Intro:  Hadoop  

•  Opensource  MapReduce  implementa9on  •  Scalable  •  Fault  tolerant  •  But:  

– Low  level.  Just  one  map  and  one  reduce  phase  per  Job.  No  joins.  No  sor9ng.  Needs  serializa9on  

– Wordcount:  58  lines  

5  

Page 6: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Intro:  Pig  

•  DSL  for  Hadoop  •  Has  SQL  like  syntax,  with  assignments  

–  Joins,  sor9ng,  ...  •  Performs  rela9onal  op9miza9ons  •  Wordcount:  5  lines  

6  

Page 7: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Intro:  Pig  downsides  

•  I  wanted  a  Wordcount  using  a  different  paiern  to  split  on  – 2  days  of  effort  – needs  external  func9on  (~  100  lines  of  code)  

•  Pig  La9n  does  not  have  func9ons  or  classes  •  Pig  La9n  does  not  have  loops  •  User  defined  func9ons  must  be  in  other  language  and  break  op9miza9ons  

7  

Page 8: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Cascading  

Scalding  

Cascalog  

Intro:  Frameworks  •  High  level  •  Automa9c  Serializa9on  

•  Projec9on  Inser9on  

•  Itera9ve  jobs  •  Language  Embedding  

•  Extensibility  •  Code  portability  

Jet   Pig  

Crunch  

Scoobi  

Scrunch  

Spark  

Hadoop   Hive  

Pangool  8  

Page 9: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Jet  

9  

DList("hdfs://..." + input) .flatMap(_.split("\\s")) .map(x => (x, 1)) .groupByKey() .reduce(_ + _) .save("hdfs://..." + output)

Wordcount  in  Jet  

User  Defined    Func9on  in  Jet  

def parse(x: Rep[String]): Rep[String] = { x.trim().split("\\s+").apply(2) }

Page 10: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Jet  

•  Applies  compile  9me  op9miza9ons  •  Extensible  /  Modular    •  General:  Loops,  condi9onals  •  Portable:  Compiles  to  Scala  code  for  Crunch  (Hadoop)  and  Spark  – Some  opera9ons  specific  to  one  backend  

 

10  

Page 11: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Jet  Modularity  

•  Code  genera9on  is  completely  separated  from  the  op9miza9ons  

•  Code  genera9on  is  small:  400  Lines  of  code  per  backend  

•  Crunch  backend:  One  week  of  effort  

11  

Page 12: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Outline  

•  Background  •  Op9miza9ons  •  Evalua9on  •  Conclusion  

12  

Page 13: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Background:  Frameworks  

•  All  offer  a  collec9on  like  interface  •  Hadoop  

– Crunch:  Java  based  – Scoobi:  Scala  based  

•  Spark  – Spark:  Scala  based  –  Inspired  by  Hadoop  – Keeps  objects  in  memory  by  default  

13  

Page 14: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Background:  LMS  

•  Framework  for  wri9ng  DSL’s  •  Basis  for  Jet  •  Deeply  embedded  in  Scala  •  Modular  /  Extensible  •  Effects  tracking  •  Code  genera9on  for  mul9ple  languages                  (C,  CUDA,  Scala)  

14  

Page 15: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Background:  LMS  Op9miza9ons  

•  Inline  – Removes  method  calls  

•  Loop  Fusion  (ver9cal  &  horizontal)  •  Code  Mo9on  •  Dead  Code  Elimina9on  •  Structs  

15  

Page 16: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Background:  Structs  in  LMS  

•  Assume:  No  subtyping  •  With  inlining  

Idea:  Work  with  Fields  directly  

16  

Object  

Fields:    Map[String,  (Type,  Value)]  

Methods:  Map[Signature,  

Method]  

Page 17: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Background:  Field  Read  Shortcut  

17  

becomes  

val re = 1

val complex = new Complex(re = 1, im = -1) val re = complex.re

No  object  required  =>  No  object  created  

Page 18: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Background:  Decomposi9on  

complex.map{ c => if (c.im > 0) c else Complex(-1*c.re, -1*c.im) }

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } val imOut = if (cond) { in.im } else { -1.0 * in.im } val out = new Complex(reOut, imOut) out: Complex }

if  copied  for  each  field  

Constructor  Invoca9on  last  

Page 19: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Outline  

•  Background  •  Op9miza9ons  

– Code  Mo9on  – Loop  Fusion  – Projec9on  Inser9on  

•  Evalua9on  •  Conclusion  

19  

Page 20: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Op9miza9ons  in  MapReduce  

20  

Reduce  CPU  Time  Reduce  CPU  Time  

Page 21: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Op9miza9ons:  Code  Mo9on  

becomes  

val pattern = Pattern.compile("wiki") in.filter(s: String => pattern.matcher(s).matches() )

in.filter(s: String => s.matches("wiki"))

21  

Page 22: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Op9miza9ons:  Regular  Expressions  

RegexFrontend  

Java  Regex   Automaton   Fast  Spliier  

Regex  Paiern  

matches  /  split  /  replaceAll  

create  

22  

Page 23: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Op9miza9ons:  Loop  Fusion  

map  (parse)  

filter  

map  (tuple)  

flatMap  (parse,  filter,  

tuple)  

String  

LogEvent  

LogEvent  

(Long,  LogEvent)  

23  

Page 24: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Op9miza9ons  in  MapReduce  

24  

Less  Data  

Page 25: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Goals  

•  Remove  unneeded  fields  •  Reduce  network  traffic  &  disk  writes    •  Reduce  CPU  9me,  only  parse  necessary  fields  •  Spark:  Reduce  memory  usage  

25  

Page 26: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Classes  

26  

Page 27: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on  

27  

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } val imOut = if (cond) { in.im } else { -1.0 * in.im } val out = new Complex_0_1(reOut, imOut) out: Complex }

def project(in: Complex) = { Complex_0(in.re) }

We  know:  Only  field  «re»  is  needed  a]erwards.  

Page 28: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Step  1  

28  

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } val imOut = if (cond) { in.im } else { -1.0 * in.im } val out = new Complex_0_1(reOut, imOut) out: Complex } def project(in: Complex) = { Complex_0(in.re) }

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } val imOut = if (cond) { in.im } else { -1.0 * in.im } val out = new Complex_0_1(reOut, imOut) Complex_0(out.re) }

Loop  Fusion  

Page 29: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Step  2  

29  

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } val imOut = if (cond) { in.im } else { -1.0 * in.im } val out = new Complex_0_1(reOut, imOut) Complex_0(out.re) }

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } val imOut = if (cond) { in.im } else { -1.0 * in.im } val out = new Complex_0_1(reOut, imOut) Complex_0(reOut) }

Field  Read  Shortcut  

Page 30: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Step  3  

30  

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } val imOut = if (cond) { in.im } else { -1.0 * in.im } val out = new Complex_0_1(reOut, imOut) Complex_0(reOut) }

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } val imOut = if (cond) { in.im } else { -1.0 * in.im } Complex_0(reOut) }

Dead  Code  Elimina9on  

Page 31: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Step  4  

31  

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } val imOut = if (cond) { in.im } else { -1.0 * in.im } Complex_0(reOut) }

def map1(in: Complex) = { val cond = in.im > 0.0 val reOut = if (cond) { in.re } else { -1.0 * in.re } Complex_0(reOut) }

Dead  Code  Elimina9on  

Page 32: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Analysis  

32  

def project(in: Complex) = { Complex_0(in.re) }

How  do  we  know  which  fields  to  keep?  

Page 33: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Analysis  

33  

     def  map1(in:  Complex)  =  {            val  cond  =  in.im  >  0.0            val  reOut  =  if  (cond)  {                        in.re            }  else  {                  -­‐1.0  *  in.re            }            Complex_0(reOut)        }  

{complex.re,  complex.im}  

Page 34: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Propaga9on  

join  

map  (parse)  

map  

(Long,  (LogEvent,  User))  

String  

(Long,  LogEvent)   (Long,  User)  

String   String  

34  

{string}  

map  (parse)  

Page 35: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Propaga9on  

join  

map  (parse)  

map  

(Long,  (LogEvent,  User))  

String  

(Long,  LogEvent)   (Long,  User)  

String   String  

35  

{string}  

map  (parse)  

Page 36: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Propaga9on  

join  

map  (parse)  

map  

(Long,  (LogEvent,  User))  

String  

(Long,  LogEvent)   (Long,  User)  

String   String  

36  

{string}  

{logevent.date,  user.name}  

map  (parse)  

Page 37: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Propaga9on  

join  

map  (parse)  

map  

(Long,  (LogEvent,  User))  

String  

(Long,  LogEvent)   (Long,  User)  

String   String  

37  

{string}  

{logevent.date,  user.name}  

map  (parse)  

Page 38: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Propaga9on  

join  

map  (parse)  

map  

(Long,  (LogEvent,  User))  

String  

(Long,  LogEvent)   (Long,  User)  

String   String  

38  

{string}  

{logevent.date,  user.name}  

{long,  logevent.date}   {long,  user.name}  

map  (parse)  

Page 39: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on:  Propaga9on  

join  

map  (parse)  

map  

(Long,  (LogEvent,  User))  

String  

(Long,  LogEvent)   (Long,  User)  

String   String  

39  

{string}  

{logevent.date,  user.name}  

{long,  logevent.date}   {long,  user.name}  

map  (parse)  

Page 40: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

map  (parse,  project)  

map  (parse,  project)  

Projec9on  Inser9on:  Fusion  

join  

map  (project)  

map  

(Long,  (LogEvent,  User))  

String  

(Long,  LogEvent)   (Long,  User)  

String   String  

40  

{string}  

map  (parse)  

(Long,  LogEvent)  

{logevent.date,  user.name}  

{long,  logevent.date}   {long,  user.name}  

map  (project)  

map  (parse)  

(Long,  User)  

Page 41: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Op9miza9ons:  Mapper  of  TPCH  Q12  

41  

map  (tuple)  

filter  

filter  

map  (parse)  

(3  more  filters)  

(Long,  LineItem)  

String  

LineItem  

LineItem  

LineItem  

Page 42: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Op9miza9ons:  Mapper  of  TPCH  Q12  

42  

With  Projec9on   With  Loop  Fusion  Unop9mized  

Fields:  16  /  16                5  /  1              1  –  5  /  1  

Parsing  Filtering  

Crea9ng  Tuple  

Page 43: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Outline  

•  Background  •  Op9miza9ons  •  Evalua9on  

– WordCount  – TPCH  Q12  – KMeans  –  Jet  vs  Pig  

•  Conclusion  

43  

Page 44: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Results:  Setup  

•  Amazon  EC2  Cloud  •  21  EC2  m1.large  nodes  (1  master,  20  slaves)  

– 7.5  Gb  Ram  – 2  Cores  – 2  Hard  disks  – Gbit  connec9ons  

44  

Page 45: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Results:  Wordcount  

•  Program  has  only  one  map  and  one  reduce  phase  

•  Uses  5  regular  expressions  •  Input:  62  Gb  Wikipedia  ar9cles  

45  

Page 46: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Results:  Wordcount  

913  s   512  s   220  s  

0  

0.2  

0.4  

0.6  

0.8  

1  

1.2  

Scoobi   Crunch   Spark  

Original  

1)  Fusion  +  Projec9on    

2)  1  +  Code  Mo9on      

3)  2  +  Opt.  Split  

4)  3  +  Opt.  Automaton  

46  

Page 47: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Results:  TPCH  Q12  

•  TPCH  Q12  reads  from  two  collec9ons,  performs  a  join,  and  then  reduces  the  output  to  two  values  (2  mapreduce  jobs)  

•  Projec9on  Inser9on  can  remove  most  of  the  fields  

•  Input:  dbgen  with  scaling  factor  100  (~  100Gb)  

47  

Page 48: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Results:  TPCH  Q12  

48  

663  s   508  s   481  s  

0  

0.2  

0.4  

0.6  

0.8  

1  

1.2  

Scoobi   Crunch   Spark  

Original  

Opt.  Split  

Projec9on  

Fusion  

Combined  

Page 49: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Results:  KMeans  

•  KMeans  is  an  itera9ve  clustering  algorithm  •  Only  tested  in  Spark,  as  it  is  30x  faster  than  Hadoop  for  this  job  

•  Input  data:  20  Gb,  50  Centers,  10  –  1000  dimensions  

49  

Page 50: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Results:  KMeans  

•  Implementa9on  taken  from  Spark  repository  •  Ported  to  Jet  •  Extended  Jet  with  an  abstrac9on  for  mul9-­‐dimensional  points,  which  generates  arrays  and  while  loops  (no  iterators)  

50  

Page 51: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Results:  KMeans  

0  

50  

100  

150  

200  

250  

Spark   Jet  

Second

s  

10  

100  

1000  

51  

Page 52: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Jet  vs  Pig  

•  Pig’s  goals  are  similar  to  ours  •  Op9miza9ons  are  similar  

– Projec9on  Inser9on  – Lazy  parsing  

•  Pig  only  uses  Hadoop  

52  

Page 53: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Jet  vs  Pig  

0  

0.2  

0.4  

0.6  

0.8  

1  

1.2  

Word  Count   TPCH  Q12  

Pig  

Scoobi  

Crunch  

Spark  

53  

Page 54: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Outline  

•  Background  •  Op9miza9ons  •  Evalua9on  •  Conclusion  

54  

Page 55: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Future  Work  

•  Add  other  op9miza9ons  – Rela9onal  op9miza9ons  (Reorder  joins  etc)  – Move  filters  before  joins  

•  Integrate  with  other  LMS  DSL’s  – Use  GPU’s  – Regular  Expressions  

55  

Page 56: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Projec9on  Inser9on  

56  

def  project(in:  Complex)  =  {                Complex_0(in.re)  }  

Page 57: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

57  

Jet  

Fast   Concise  

Extensible  Portable  

Page 58: Jet:%An%embedded%DSL%for% … › labs › lamp › wp-content › uploads › ...Intro:%Big%Data • Huge%datasets:%Order%of%terabytes% • Datadoes%notfiton%one%machine% • Variety%of%inputformats%!!!!!=>Databases!not!suitable!

Backup  

•  Parsing  – How  to  define  class,  parsing  method,  etc  

•  Generated  Writables  – Bitset  usage,  switch  

•  Why  not  AoS  to  SoA  

58