Michal Malohlava presents: Open Source H2O and Scala


DESCRIPTION

Michal Malohlava discusses the magic behind the math: how H2O, an open source platform for big data analysis, uses Scala to get work done. He also demos how users can work with H2O from Scala to get the most out of their data analysis.


val ScAlH2O = Scala ++ H2O

San Francisco Data Science

Why Scala & H2O?

● H2O ~ fast, distributed, large scale computation platform providing rich Java API

– But low-level and for many users too complicated

public class ShuffleTask extends MRTask2<ShuffleTask> {
  @Override public void map(Chunk ic, Chunk oc) {
    if (ic._len==0) return;
    // Each vector is shuffled in the same way
    Random rng = Utils.getRNG(0xe031e74f321f7e29L + (ic.cidx() << 32L));
    oc.set0(0,ic.at0(0));
    for (int row=1; row<ic._len; row++) {
      int j = rng.nextInt(row+1); // inclusive upper bound <0,row>
      if (j!=row) oc.set0(row, oc.at0(j));
      oc.set0(j, ic.at0(row));
    }
  }
}
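The loop in the Java task above is an inside-out Fisher-Yates shuffle applied to one chunk. As a rough point of comparison, here is a minimal sketch of the same loop in plain Scala on a local array. It uses no H2O classes, and `ShuffleSketch`, `shuffled`, and the `seed` parameter are illustrative names that do not come from the talk.

```scala
import scala.util.Random

object ShuffleSketch {
  // Inside-out Fisher-Yates: builds a shuffled copy of `in` in a single pass.
  def shuffled(in: Array[Double], seed: Long): Array[Double] = {
    val out = new Array[Double](in.length)
    if (in.nonEmpty) {
      val rng = new Random(seed)
      out(0) = in(0)
      for (row <- 1 until in.length) {
        val j = rng.nextInt(row + 1)    // j is drawn from 0..row inclusive
        if (j != row) out(row) = out(j) // move the current occupant of slot j to the new slot
        out(j) = in(row)                // place the incoming value into slot j
      }
    }
    out
  }

  def main(args: Array[String]): Unit =
    println(shuffled(Array(1.0, 2.0, 3.0, 4.0, 5.0), seed = 42L).mkString(", "))
}
```

The local version is short because it ignores everything the H2O task handles: distribution, chunk boundaries, and serialization of the task itself.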

What we provide

● ScAlH2O - Scala library providing a DSL

– Abstraction of the H2O low-level API

– Easy data manipulation and distributed computation

– BUT still inside the JVM

● Scala REPL integration into H2O

– Console for experimenting with ScAlH2O

Basic concepts

● First-class entities

– Scalars

– Frames

● Scala expressions

● Access to H2O algos

– And still preserving access to low-level H2O API

Frame operations

● Parse data

● Basic slicing

– Column/Rows selectors, append

● Scalar operations

● Support head/tail/ncols/nrows/...

● Cooperation with H2O distributed KV store

– Load/save operations

val f = parse("smalldata/cars.csv")

val f1 = f("name") ++ f(*, 5 to 7)

val f2 = f1("year") + 1900

val g = load("cars.hex")
val g1 = g ++ g("year") > 80
save("cars.hex", g1)
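To make the operator style of these frame operations concrete, here is a toy in-memory sketch of how column selection, append (`++`), and scalar operations could be expressed as methods in plain Scala. It is not the ScAlH2O implementation. `FrameSketch`, `Frame`, and `columns` are hypothetical names, and the data lives in local vectors rather than in the H2O KV store.

```scala
object FrameSketch {
  // Toy frame: named columns of doubles, all of equal length.
  final case class Frame(names: Vector[String], cols: Vector[Vector[Double]]) {
    def nrows: Int = cols.headOption.map(_.length).getOrElse(0)
    def ncols: Int = cols.length

    // Column selection by name, e.g. f("year")
    def apply(name: String): Frame = {
      val i = names.indexOf(name)
      require(i >= 0, s"unknown column: $name")
      Frame(Vector(names(i)), Vector(cols(i)))
    }

    // Column selection by index range, e.g. f.columns(0 to 1)
    def columns(r: Range): Frame =
      Frame(r.map(names).toVector, r.map(cols).toVector)

    // Column append, e.g. a ++ b
    def ++(other: Frame): Frame = {
      require(ncols == 0 || other.ncols == 0 || nrows == other.nrows, "row counts differ")
      Frame(names ++ other.names, cols ++ other.cols)
    }

    // Scalar operation applied to every cell, e.g. f("year") + 1900
    def +(scalar: Double): Frame = Frame(names, cols.map(_.map(_ + scalar)))

    // Comparison producing a 0/1 indicator frame, e.g. f("year") > 80
    def >(scalar: Double): Frame =
      Frame(names, cols.map(_.map(v => if (v > scalar) 1.0 else 0.0)))
  }

  def main(args: Array[String]): Unit = {
    val f = Frame(Vector("cylinders", "year"),
                  Vector(Vector(4.0, 8.0), Vector(78.0, 82.0)))
    println((f("year") + 1900).cols)
    println((f ++ (f("year") > 80)).cols)
  }
}
```

In Scala an operator starting with `+` binds tighter than one starting with `>`, so the sketch parenthesizes the comparison before appending its result.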

Map/filter/collect operations

● Map

– Per value/row

● Filter

● Collect

// Collect all cars with more than 4 cylinders
val ff = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});

// Returns a boolean vector
val fm = f map ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});

// Compute the sum of column 2 (the cylinders column used above)
val fc = f collect ( 0.0, new CDOp() {
  def apply(acc:scala.Double, rhs:Array[scala.Double]) = acc + rhs(2)
  def reduce(l:scala.Double,r:scala.Double) = l+r
} )
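The three operations can be sketched on local collections to show what each one returns. This is only an illustration of the semantics under stated assumptions, not the ScAlH2O code: `RowOpsSketch` is a hypothetical name, and `partitions` stands in for the pieces of a frame that would live on different nodes.

```scala
object RowOpsSketch {
  type Row = Array[Double]

  // filter: keep the rows for which the predicate holds
  def filter(rows: Seq[Row])(p: Row => Boolean): Seq[Row] = rows.filter(p)

  // map: one output value per input row
  def map[B](rows: Seq[Row])(f: Row => B): Seq[B] = rows.map(f)

  // collect: fold every partition starting from `zero`,
  // then merge the partial results with `reduce`
  def collect(partitions: Seq[Seq[Row]], zero: Double)(
      step: (Double, Row) => Double,
      reduce: (Double, Double) => Double): Double =
    partitions.map(_.foldLeft(zero)(step)).reduceOption(reduce).getOrElse(zero)

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array(1.0, 18.0, 8.0), Array(2.0, 30.0, 4.0), Array(3.0, 15.0, 6.0))
    println(filter(rows)(r => r(2) > 4).length)
    println(map(rows)(r => r(2) > 4))
    println(collect(Seq(rows.take(2), rows.drop(2)), 0.0)((acc, r) => acc + r(2), _ + _))
  }
}
```

Because `zero` seeds every partition in this sketch, it should be a neutral element of `reduce` (0.0 for a sum); otherwise the result would depend on how the rows are partitioned.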

Internals

It's magic

Internals

● No magic, BUT there are key tricks

– connect H2O classloaders with Scala ecosystem

  ● Make sure that all distributed objects are correctly iced

– make translation of Scala code into calls of Java API

  ● Create H2O MR tasks

  ● Pass operations around the cloud

  ● Create new frames

– preserve primitive types

  ● do not introduce overhead of boxing/unboxing

Internals – translation to H2O MR tasks

def filter(af: T_A2B_Transf[scala.Double]):T = {
  val f = frame()
  val mrt = new MRTask2() {
    override def map(in:Array[Chunk], out:Array[NewChunk]) = {
      val rlen = in(0)._len
      val tmprow = new Array[scala.Double](in.length)
      for (row:Int <- 0 until rlen ) {
        if (af(Utils.readRow(in,row,tmprow))) {
          for (i:Int <- 0 until in.length) out(i).addNum(tmprow(i))
        }
      }
    }
  }
  mrt.doAll(f.numCols(), f)
  val result = mrt.outputFrame(f.names(), f.domains())
  apply(result) // return the DFrame
}

// Collect all cars with more than 4 cylinders
val f5 = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});

T_A2B_Transf has to be water.Freezable
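The shape of that translation can be imitated locally. The sketch below assumes a simplified column-major `Chunk` and runs the per-chunk step over a plain sequence instead of a cluster. `ChunkFilterSketch`, `filterChunk`, and this `Chunk` class are stand-ins invented for illustration and are not H2O types. Serialization of the predicate, which is why the real transformer has to be freezable, is left out.

```scala
import scala.collection.mutable.ArrayBuffer

object ChunkFilterSketch {
  // A chunk holds a contiguous run of rows in column-major layout:
  // cols(c)(r) is the value of column c at chunk-local row r.
  final case class Chunk(cols: Array[Array[Double]]) {
    def len: Int = if (cols.isEmpty) 0 else cols(0).length
  }

  // Copy one row out of the column arrays into a reusable buffer.
  private def readRow(in: Chunk, row: Int, buf: Array[Double]): Array[Double] = {
    var c = 0
    while (c < buf.length) { buf(c) = in.cols(c)(row); c += 1 }
    buf
  }

  // Per-chunk "map" step: append every accepted row to the output columns.
  def filterChunk(in: Chunk, p: Array[Double] => Boolean): Chunk = {
    val out = Array.fill(in.cols.length)(ArrayBuffer.empty[Double])
    val tmprow = new Array[Double](in.cols.length) // reused for every row
    for (row <- 0 until in.len) {
      if (p(readRow(in, row, tmprow))) {
        for (c <- out.indices) out(c).append(tmprow(c))
      }
    }
    Chunk(out.map(_.toArray))
  }

  // Stand-in for running the task on every chunk of a frame.
  def filter(chunks: Seq[Chunk])(p: Array[Double] => Boolean): Seq[Chunk] =
    chunks.map(filterChunk(_, p))
}
```

Each chunk is processed independently, so the same predicate can be applied to all chunks in parallel and the filtered chunks together form the new frame.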

Party demo time!

Towards Scalding-like API

● Vision is to provide Scalding-like syntax

● But so far DSL is still ugly

f map ( ('name, 'cylinders) -> ('name, 'moreThan4) ) { (n:String, c:Int) => (n, if (c>4) 1 else 0) }

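(Slide labels: ('name, 'cylinders) is the input scheme, ('name, 'moreThan4) is the output scheme, and the function block is the transformation.)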

f map (f, ('name, 'cylinders) -> ('name, 'moreThan4) ) {
  new IcedFunctor2to2[Double,Int,Double,Int] {
    def apply(n:Double, c:Int) = (n, if (c>4) 1 else 0)
  }
}
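The idea of mapping from an input scheme to an output scheme can be sketched on a toy local frame. This is a hypothetical illustration of the intended syntax, not the ScAlH2O DSL. `SchemeMapSketch`, `Frame`, and `mapCols` are made-up names, all columns are doubles, and columns are named with strings because symbol literals such as `'name` are deprecated in newer Scala versions.

```scala
object SchemeMapSketch {
  // Toy frame with named, double-valued columns stored row by row.
  final case class Frame(names: Vector[String], rows: Vector[Vector[Double]]) {
    // Read the two input columns of each row, apply `fn`,
    // and emit a frame that has the two output columns.
    def mapCols(scheme: ((String, String), (String, String)))(
        fn: (Double, Double) => (Double, Double)): Frame = {
      val ((in1, in2), (out1, out2)) = scheme
      val i1 = names.indexOf(in1)
      val i2 = names.indexOf(in2)
      require(i1 >= 0 && i2 >= 0, "unknown input column")
      Frame(Vector(out1, out2), rows.map { r =>
        val (a, b) = fn(r(i1), r(i2))
        Vector(a, b)
      })
    }
  }

  def main(args: Array[String]): Unit = {
    val f = Frame(Vector("id", "cylinders"), Vector(Vector(1.0, 8.0), Vector(2.0, 4.0)))
    val g = f.mapCols(("id", "cylinders") -> ("id", "moreThan4")) { (n, c) =>
      (n, if (c > 4) 1.0 else 0.0)
    }
    println(g)
  }
}
```

A plain Scala function is enough here because nothing leaves the local JVM. In the distributed setting the function has to travel between nodes, which is what the `IcedFunctor2to2` wrapper in the current DSL accounts for.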

Try it and contribute!

> git clone git@github.com:0xdata/h2o.git

> cd h2o

> git checkout -b h2oscala origin/h2oscala

> cd h2o-scala && ./depl.sh # or sbt compile

=== Welcome to the world of ScAlH2O ===
Type `help` or `example` to begin...

h2o>

Thank you!
