val ScAlH2O = Scala ++ H2O San Francisco Data Science

Michal Malohlava presents: Open Source H2O and Scala


DESCRIPTION

Michal Malohlava discusses the magic behind the math - exposing the way that open source big data analysis H2O uses Scala to get work done, and demos how users can interact with Scala to get the most out of data analysis.


Page 1: Michal Malohlava presents: Open Source H2O and Scala

val ScAlH2O = Scala ++ H2O San Francisco Data Science

Page 2: Michal Malohlava presents: Open Source H2O and Scala

Why Scala & H2O ?

● H2O ~ fast, distributed, large-scale computation platform providing a rich Java API
– But low-level and, for many users, too complicated

public class ShuffleTask extends MRTask2<ShuffleTask> {
  @Override public void map(Chunk ic, Chunk oc) {
    if (ic._len==0) return;
    // Each vector is shuffled in the same way
    Random rng = Utils.getRNG(0xe031e74f321f7e29L + (ic.cidx() << 32L));
    oc.set0(0,ic.at0(0));
    for (int row=1; row<ic._len; row++) {
      int j = rng.nextInt(row+1); // inclusive upper bound <0,row>
      if (j!=row) oc.set0(row, oc.at0(j));
      oc.set0(j, ic.at0(row));
    }
  }
}
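The loop in the map method above is an "inside-out" Fisher-Yates shuffle applied to one chunk. As a rough illustration only, the same loop can be written against a plain array with no H2O types. The names ShuffleSketch and shuffled are hypothetical, and java.util.Random stands in for Utils.getRNG.

```scala
import java.util.Random

// Hypothetical stand-alone sketch; not part of H2O or ScAlH2O.
object ShuffleSketch {
  // Inside-out Fisher-Yates: builds a shuffled copy of `in` without modifying it.
  def shuffled(in: Array[Double], seed: Long): Array[Double] = {
    val out = new Array[Double](in.length)
    if (in.nonEmpty) {
      val rng = new Random(seed)
      out(0) = in(0)
      for (row <- 1 until in.length) {
        val j = rng.nextInt(row + 1)     // uniform in [0, row]
        if (j != row) out(row) = out(j)  // move the current occupant of slot j to the end
        out(j) = in(row)                 // place the new element in slot j
      }
    }
    out
  }
}
```

Because the generator is seeded, calling shuffled twice with the same seed yields the same permutation, which mirrors the slide's comment that each vector is shuffled in the same way.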

Page 3: Michal Malohlava presents: Open Source H2O and Scala

What we provide

● ScAlH2O - Scala library providing a DSL
– Abstraction of the H2O low-level API
– Easy data manipulation and distributed computation
– BUT still inside the JVM

● Scala REPL integration into H2O
– Console for experimenting with ScAlH2O

Page 4: Michal Malohlava presents: Open Source H2O and Scala

Basic concepts

● First-class entities
– Scalars
– Frames

● Scala expressions

● Access to H2O algos
– While still preserving access to the low-level H2O API

Page 5: Michal Malohlava presents: Open Source H2O and Scala

Frame operations

● Parse data
● Basic slicing
– Column/row selectors, append
● Scalar operations
● Support for head/tail/ncols/nrows/...
● Cooperation with H2O distributed KV store
– Load/save operations

val f = parse("smalldata/cars.csv")

val f1 = f("name") ++ f(*, 5 to 7)

val f2 = f1("year") + 1900

val g = load("cars.hex")
val g1 = g ++ g("year") > 80
save("cars.hex", g1)
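To make the slicing, append, and scalar operations above concrete, here is a small in-memory model. It is a sketch under assumptions, not the ScAlH2O implementation: ToyFrame and its members are hypothetical names, the data is held locally in column-major vectors, and column ranges are zero-based.

```scala
// Hypothetical in-memory stand-in for a frame; not the ScAlH2O implementation.
final case class ToyFrame(names: Vector[String], cols: Vector[Vector[Double]]) {
  require(names.length == cols.length, "one name per column")

  // Select a single column by name.
  def apply(name: String): ToyFrame = {
    val i = names.indexOf(name)
    require(i >= 0, s"unknown column: $name")
    ToyFrame(Vector(names(i)), Vector(cols(i)))
  }

  // Select columns by zero-based index range.
  def apply(range: Range): ToyFrame =
    ToyFrame(range.map(i => names(i)).toVector, range.map(i => cols(i)).toVector)

  // Append the columns of another frame.
  def ++(other: ToyFrame): ToyFrame =
    ToyFrame(names ++ other.names, cols ++ other.cols)

  // Add a scalar to every value.
  def +(x: Double): ToyFrame = copy(cols = cols.map(_.map(_ + x)))

  def ncols: Int = names.length
  def nrows: Int = cols.headOption.map(_.length).getOrElse(0)
}
```

With this model, f("name") ++ f(1 to 2) builds a three-column frame, and adding 1900.0 to its "year" column yields a one-column frame of shifted values. The real DSL works on distributed frames rather than local vectors.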

Page 6: Michal Malohlava presents: Open Source H2O and Scala

Map/filter/collect operations

● Map
– Per value/row

● Filter

● Collect

// Collect all cars with more than 4 cylinders
val ff = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});

// Returns a boolean vector
val fm = f map ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});

// Compute the sum of column 2
val fc = f collect ( 0.0, new CDOp() {
  def apply(acc:scala.Double, rhs:Array[scala.Double]) = acc + rhs(2)
  def reduce(l:scala.Double,r:scala.Double) = l+r
} )
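The intended semantics of the three operations can be modelled on local collections. The sketch below is an assumption-laden illustration, not ScAlH2O code: RowOps and its methods are hypothetical, a row is an Array[Double], the boolean vector is represented as 0.0/1.0 values, and collect is modelled as a per-chunk fold followed by a reduce across chunks.

```scala
// Hypothetical local model of map/filter/collect on rows of doubles; not ScAlH2O code.
object RowOps {
  type Row = Array[Double]

  // filter: keep the rows for which the predicate holds.
  def filter(rows: Seq[Row])(p: Row => Boolean): Seq[Row] = rows.filter(p)

  // map: one output value per row (1.0 for true, 0.0 for false).
  def mapToFlags(rows: Seq[Row])(p: Row => Boolean): Seq[Double] =
    rows.map(r => if (p(r)) 1.0 else 0.0)

  // collect: fold each chunk starting from `zero`, then combine the per-chunk results.
  def collect(chunks: Seq[Seq[Row]], zero: Double)(
      step: (Double, Row) => Double,
      reduce: (Double, Double) => Double): Double =
    chunks.map(_.foldLeft(zero)(step)).reduceOption(reduce).getOrElse(zero)
}
```

In this model the start value is used once per chunk, so it should be an identity for the reduce function (0.0 for a sum); otherwise the result would depend on how the rows are split into chunks.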

Page 7: Michal Malohlava presents: Open Source H2O and Scala

Internals

It's magic

Page 8: Michal Malohlava presents: Open Source H2O and Scala

Internals

● No magic, BUT there are key tricks
– connect H2O classloaders with the Scala ecosystem
  ● Make sure that all distrib. objects are correctly iced
– make translation of Scala code into calls of Java API
  ● Create H2O MR tasks
  ● Pass operations around the cloud
  ● Create new frames
– preserve primitive types
  ● do not introduce overhead of boxing/unboxing

Page 9: Michal Malohlava presents: Open Source H2O and Scala

Internals – translation to H2O MR tasks

def filter(af: T_A2B_Transf[scala.Double]):T = {
  val f = frame()
  val mrt = new MRTask2() {
    override def map(in:Array[Chunk], out:Array[NewChunk]) = {
      val rlen = in(0)._len
      val tmprow = new Array[scala.Double](in.length)
      for (row:Int <- 0 until rlen ) {
        if (af(Utils.readRow(in,row,tmprow))) {
          for (i:Int <- 0 until in.length) out(i).addNum(tmprow(i))
        }
      }
    }
  }
  mrt.doAll(f.numCols(), f)
  val result = mrt.outputFrame(f.names(), f.domains())
  apply(result) // return the DFrame
}

// Collect all cars with more than 4 cylinders
val f5 = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});

T_A2B_Transf has to be water.Freezable
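The shape of this translation (a per-chunk map step that reuses one row buffer, followed by assembling the outputs) can be imitated without H2O. The following is a minimal sketch under stated assumptions: ChunkFilterSketch, mapChunk, and filter are hypothetical names, a chunk is modelled as a column-major Array[Array[Double]], and concatenating results in chunk order stands in for doAll and outputFrame.

```scala
// Hypothetical local model of the chunked filter; not H2O or ScAlH2O code.
object ChunkFilterSketch {
  type Chunk = Array[Array[Double]] // indexed as chunk(col)(row)

  // Per-chunk "map" step: copy rows that satisfy `p` into growable output columns.
  def mapChunk(in: Chunk, p: Array[Double] => Boolean): Vector[Vector[Double]] = {
    val ncols = in.length
    val rlen = if (ncols == 0) 0 else in(0).length
    val out = Array.fill(ncols)(Vector.newBuilder[Double])
    val tmprow = new Array[Double](ncols) // one buffer reused for every row
    for (row <- 0 until rlen) {
      for (c <- 0 until ncols) tmprow(c) = in(c)(row)
      if (p(tmprow)) {
        for (c <- 0 until ncols) out(c) += tmprow(c)
      }
    }
    out.toVector.map(_.result())
  }

  // Run the map step on every chunk and concatenate the output columns in chunk order.
  def filter(chunks: Seq[Chunk], ncols: Int)(p: Array[Double] => Boolean): Vector[Vector[Double]] =
    chunks.map(chunk => mapChunk(chunk, p))
      .foldLeft(Vector.fill(ncols)(Vector.empty[Double])) { (acc, part) =>
        acc.zip(part).map { case (a, b) => a ++ b }
      }
}
```

Because the row buffer is reused, the predicate must not keep a reference to the array it receives. In the real setting the predicate object also travels to other nodes, which is why the slide requires it to be water.Freezable.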

Page 10: Michal Malohlava presents: Open Source H2O and Scala

Party demo time!

Page 11: Michal Malohlava presents: Open Source H2O and Scala

Towards Scalding-like API

● Vision is to provide Scalding-like syntax

● But so far the DSL is still ugly

f map ( ('name, 'cylinders) -> ('name, 'moreThan4) ) { (n:String, c:Int) => (n, if (c>4) 1 else 0) }

(Slide callouts on the code above: input scheme, output scheme, transformation)

f map (f, ('name, 'cylinders) -> ('name, 'moreThan4) ) {
  new IcedFunctor2to2[Double,Int,Double,Int] {
    def apply(n:Double, c:Int) = (n, if (c>4) 1 else 0)
  }
}
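The idea of mapping an input scheme to an output scheme through a plain function can be sketched on a local, row-oriented structure. Everything here is hypothetical and only illustrates the shape of such an API: NamedRows and mapFields are invented names, field names are plain strings rather than symbols, and values are stored as Any, so the casts are unchecked.

```scala
// Hypothetical row-oriented model of a scheme-to-scheme map; not the ScAlH2O API.
final case class NamedRows(names: Vector[String], rows: Vector[Vector[Any]]) {
  // Read the two input fields of each row, apply `fn`, and emit rows with the two output fields.
  def mapFields[A, B, C, D](scheme: ((String, String), (String, String)))(
      fn: (A, B) => (C, D)): NamedRows = {
    val ((in1, in2), (out1, out2)) = scheme
    val i1 = names.indexOf(in1)
    val i2 = names.indexOf(in2)
    require(i1 >= 0 && i2 >= 0, "unknown input field")
    val mapped = rows.map { r =>
      // Unchecked casts: the caller is trusted to name fields of the right type.
      val (c, d) = fn(r(i1).asInstanceOf[A], r(i2).asInstanceOf[B])
      Vector[Any](c, d)
    }
    NamedRows(Vector(out1, out2), mapped)
  }
}
```

A call then reads much like the first form on the slide: cars.mapFields(("name", "cylinders") -> ("name", "moreThan4")) { (n: String, c: Int) => (n, if (c > 4) 1 else 0) }. A distributed version would additionally need the function to be shippable between nodes, which is what the IcedFunctor2to2 wrapper in the second form addresses.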

Page 12: Michal Malohlava presents: Open Source H2O and Scala

Try and contribute !

> git clone [email protected]:0xdata/h2o.git

> git checkout -b h2oscala origin/h2oscala

> cd h2o-scala && ./depl.sh # or sbt compile

=== Welcome to the world of ScAlH2O ===
Type `help` or `example` to begin...

h2o>

Thank you!