
R user group 2011 09


DESCRIPTION

Talk given in September 2011 to the Bay Area R User Group. The talk walks a stochastic projection SVD algorithm through the steps from an initial implementation in R to a proposed implementation using map-reduce that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.


Page 1: R user group 2011 09


R, Hadoop, and MapR

Page 2: R user group 2011 09


The bad old days (i.e. now)

• Hadoop is a silo

• HDFS isn’t a normal file system

• Hadoop doesn’t really like C++

• R is limited

• One machine, one memory space

• Isn’t there any way we can just get along?

Page 3: R user group 2011 09


The white knight

• MapR changes things

• Lots of new stuff like snapshots, NFS

• All you need to know, you already know

• NFS provides cluster-wide file access

• Everything works the way you expect

• Performance high enough to use as a message bus
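To make the last points concrete, here is a minimal sketch (not from the talk) of what NFS access looks like from R. The mount point /mapr/my.cluster and the file names are placeholders for illustration.

# Hypothetical NFS mount point for the cluster file system
cluster <- "/mapr/my.cluster"

# Read data produced by a Hadoop job exactly as if it were a local file
ratings <- read.delim(file.path(cluster, "projects/svd/part-00000"),
                      header = FALSE,
                      col.names = c("user", "item", "score"))

# Write results back where the next map-reduce stage can pick them up
write.table(ratings, file.path(cluster, "projects/svd/checked.tsv"),
            sep = "\t", row.names = FALSE)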

Page 4: R user group 2011 09


Example, out-of-core SVD

• SVD provides compressed matrix form

• Based on sum of rank-1 matrices

$$A = \sigma_1 u_1 v_1' + \sigma_2 u_2 v_2' + e$$

[Figure: A drawn as a sum of rank-1 outer products plus a small error term]
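As an illustration (added here, not in the slides), the rank-1 expansion can be checked directly with R's built-in svd():

set.seed(1)
A <- matrix(rnorm(100 * 30), nrow = 100)
s <- svd(A)

# Sum of the first two rank-1 terms, sigma_i * u_i v_i'
A2 <- s$d[1] * outer(s$u[, 1], s$v[, 1]) +
      s$d[2] * outer(s$u[, 2], s$v[, 2])

# Whatever the first two terms miss is the error term e
norm(A - A2, "F")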

Page 5: R user group 2011 09


More on SVD

• SVD provides a very nice basis

$$A x = A \sum_i a_i v_i = \left[ \sum_j \sigma_j u_j v_j' \right] \left[ \sum_i a_i v_i \right] = \sum_i a_i \sigma_i u_i$$
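A quick numerical check of this identity (added for illustration): write x in the basis of right singular vectors and compare A x with the weighted sum of left singular vectors.

set.seed(2)
A <- matrix(rnorm(50 * 20), nrow = 50)
s <- svd(A)

x <- rnorm(20)
a <- drop(t(s$v) %*% x)    # coordinates a_i of x in the v-basis

max(abs(A %*% x - s$u %*% (a * s$d)))   # essentially zero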

Page 6: R user group 2011 09


• And a nifty approximation property

$$A x = \sigma_1 a_1 u_1 + \sigma_2 a_2 u_2 + \sum_{i>2} \sigma_i a_i u_i$$

$$\|e\|^2 \le \sum_{i>2} \sigma_i^2$$
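And a numerical check of the bound (again added here), assuming x has unit norm so that each a_i^2 is at most 1:

set.seed(3)
A <- matrix(rnorm(50 * 20), nrow = 50)
s <- svd(A)

x <- rnorm(20); x <- x / sqrt(sum(x^2))   # unit-norm input
a <- drop(t(s$v) %*% x)

tail <- A %*% x - s$u[, 1:2] %*% (a[1:2] * s$d[1:2])
sum(tail^2)          # squared error of the rank-2 part
sum(s$d[-(1:2)]^2)   # the bound above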

Page 7: R user group 2011 09


Also known as …

• Latent Semantic Indexing

• PCA

• Eigenvectors

Page 8: R user group 2011 09


An application, approximate translation

• Translation distributes over concatenation

• But counting turns concatenation into addition

• This means that translation is linear!

$$T(s_1 | s_2) = T(s_1) | T(s_2)$$

$$k(s_1 | s_2) = k(s_1) + k(s_2)$$

$$k(T(s_1 | s_2)) = k(T(s_1)) + k(T(s_2))$$
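A toy version of the counting step (added here; k() below is a hypothetical word-count helper): counting the concatenation of two strings gives the same vector as adding their counts, which is what makes the map linear.

# Word-count vector of a string over a fixed vocabulary
k <- function(s, vocab) table(factor(strsplit(s, " ")[[1]], levels = vocab))

s1 <- "the cat sat"
s2 <- "the dog sat down"
vocab <- unique(unlist(strsplit(c(s1, s2), " ")))

# Concatenation turns into addition of count vectors
all(k(paste(s1, s2), vocab) == k(s1, vocab) + k(s2, vocab))   # TRUE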

Page 9: R user group 2011 09


ish

Page 10: R user group 2011 09


Traditional computation

• Repeated products with A are dominated by the largest singular values and their corresponding vectors

• Subtracting these dominant singular values allows the next ones to appear

• Lanczos method; more generally, Krylov sub-space methods

$$A \left( A' A \right)^n = U S^{2n+1} V'$$
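A small illustration of why repeated products pick out the leading singular vectors (added, not in the slides): each multiplication by A'A scales the i-th component by sigma_i^2, so a few iterations converge to v_1.

set.seed(4)
A <- matrix(rnorm(60 * 40), nrow = 60)

x <- rnorm(40)
for (i in 1:10) {
  x <- crossprod(A, A %*% x)   # x <- A'A x
  x <- x / sqrt(sum(x^2))
}

abs(sum(x * svd(A)$v[, 1]))    # close to 1: x has converged to v_1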

Page 11: R user group 2011 09


But …

Page 12: R user group 2011 09


The gotcha

• Iteration in Hadoop is death

• Huge process invocation costs

• Lose all memory residency of data

• Total lost cause

Page 13: R user group 2011 09


Randomness to the rescue

• To save the day, run all iterations at the same time

$$Y = A W$$

$$Q R = Y$$

$$B = Q' A$$

$$U S V' = B$$

$$\left( Q U \right) S V' \approx A$$

Page 14: R user group 2011 09


In R

lsa = function(a, k, p) {
  n = dim(a)[1]
  m = dim(a)[2]
  # Project a onto k+p random directions (p is the oversampling)
  y = a %*% matrix(rnorm(m * (k + p)), nrow = m)
  y.qr = qr(y)
  # B = Q'a is small: (k+p) x m
  b = t(qr.Q(y.qr)) %*% a
  b.qr = qr(t(b))
  # SVD of the small triangular factor gives the singular values
  svd = svd(t(qr.R(b.qr)))
  list(u = qr.Q(y.qr) %*% svd$u[, 1:k],
       d = svd$d[1:k],
       v = qr.Q(b.qr) %*% svd$v[, 1:k])
}
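A usage sketch (added for illustration): run lsa() on a noisy low-rank matrix and compare the leading singular values with the exact decomposition.

set.seed(5)
A <- matrix(rnorm(1000 * 10), 1000) %*% matrix(rnorm(10 * 200), 10) +
     0.01 * matrix(rnorm(1000 * 200), 1000)

fit <- lsa(A, k = 10, p = 5)

round(fit$d, 2)             # randomized estimate
round(svd(A)$d[1:10], 2)    # exact values, for comparison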

Page 15: R user group 2011 09


Not good enough yet

• Limited to memory size

• After memory limits, feature extraction dominates

Page 16: R user group 2011 09


Hybrid architecture

[Diagram: input and side-data are joined, feature-extracted, and down-sampled in a map-reduce stage; the result is passed via NFS to a sequential SVD]

Page 17: R user group 2011 09


Hybrid architecture

[Diagram: the same pipeline, with the sequential SVD and the visualization done in R, reading the map-reduce output via NFS]

Page 18: R user group 2011 09


Randomness to the rescue

• To save the day again, use blocks

$$Y_i = A_i W$$

$$R' R = Y' Y = \sum_i Y_i' Y_i$$

$$B_j = \sum_i \left( A_i W R^{-1} \right)' A_{ij}$$

$$L L' = B B'$$

$$U S V' = L$$

$$\left( A W R^{-1} U \right) S \left( V' L^{-1} B \right) \approx A$$
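A sequential sketch of this block-wise recipe (added here; in the talk this step is what gets proposed for map-reduce, with R reading per-block results over NFS). The helper name block_ssvd, the block sizes, and the test matrix are assumptions for illustration only.

# Block-wise randomized SVD: process A as a list of row blocks A_i,
# never holding all of A in memory at once.
block_ssvd <- function(blocks, k, p) {
  m <- ncol(blocks[[1]])
  W <- matrix(rnorm(m * (k + p)), nrow = m)

  # Pass 1: accumulate Y'Y = sum_i Y_i'Y_i with Y_i = A_i W, then R'R = Y'Y
  YtY <- Reduce(`+`, lapply(blocks, function(Ai) crossprod(Ai %*% W)))
  R <- chol(YtY)
  Rinv <- solve(R)

  # Pass 2: B = sum_i (A_i W R^{-1})' A_i, i.e. Q'A computed per block
  B <- Reduce(`+`, lapply(blocks, function(Ai) crossprod(Ai %*% W %*% Rinv, Ai)))

  # Small dense work: L L' = B B', then SVD of L
  L <- t(chol(tcrossprod(B)))
  s <- svd(L)

  list(u = do.call(rbind,
         lapply(blocks, function(Ai) Ai %*% W %*% Rinv %*% s$u[, 1:k])),
       d = s$d[1:k],
       v = (t(solve(L, B)) %*% s$v)[, 1:k])
}

# Check against the exact SVD on a small noisy low-rank example
set.seed(6)
A <- matrix(rnorm(400 * 10), 400) %*% matrix(rnorm(10 * 80), 10) +
     0.01 * matrix(rnorm(400 * 80), 400)
blocks <- lapply(split(1:400, rep(1:4, each = 100)), function(i) A[i, ])
fit <- block_ssvd(blocks, k = 10, p = 5)
max(abs(fit$d - svd(A)$d[1:10]))   # small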

Page 19: R user group 2011 09


Hybrid architecture

[Diagram: one map-reduce stage for feature extraction and down-sampling, a second map-reduce stage for the block-wise parallel SVD, with results flowing via NFS to R for visualization]

Page 20: R user group 2011 09


Conclusions

• Inter-operability allows massive scalability

• Prototyping in R not wasted

• Map-reduce iteration not needed for SVD

• Feasible scale ~10^9 non-zeros or more