Machine Learning on Big Data
using Map Reduce

Michael Bowles, PhD
Winter, 2012
Where Does Big Data Come From?
- Web data (web logs, click histories)
- e-commerce applications (purchase histories)
- Retail purchase histories (Walmart)
- Bank and credit card transactions
What is Data Mining?
- What page will a visitor visit next? Given:
  Visitor's browsing history
  Visitor's demographics
- Should the card company approve the transaction that's waiting? Given:
  User's usage history
  Item being purchased
  Location of merchant
- What isn't data mining?
  What pages did visitors view most often?
  What products are most popular?
Data mining tells us something that isn't in the data or isn't a simple summary.
Approaches for Data Mining Large Data
- Data mine a sample
  Take a manageable subset (fits in memory, runs in reasonable time)
  Develop models
- Limitations of this method?
  Generally, more data supports finer-grained models
  e.g. making specific purchase recommendations
  "customers who bought ..." requires much more data than "top ten most popular are ..."
Side Note on Large Data:
- Rhine's paradox
  ESP experiment in the 50's
  1 in 1000 subjects can correctly identify the color (red or blue) of 10 cards they can't see
  Do they have ESP?
- Bonferroni's principle: given enough data, any combination of outcomes can be found
  Is this a reason to avoid large data sets?
  No
  It's a reason not to draw conclusions that the data don't support (and this is true no matter how large the data set)
Use Multiple Processor Cores
- What if we want to use the full data set? How can we devote more computational power to our job?
- There are always performance limits with a single processor core => use multiple cores simultaneously.
- Traditional approach: add structure to the programming language (C++, Java)
- Issues with this approach:
  High communication costs (if data must be distributed over a network)
  Difficult to deal with CPU failures at this level (processor failures are inevitable as scale increases)
- To deal with these issues Google developed the Map-Reduce paradigm
What is Map-Reduce?
- Arrangement of compute tasks enabling relatively easy scaling.
- Includes:
  Hardware arrangement - racks of CPUs with direct access to local disk, networked to one another
  File system - distributed storage across multiple disks, with redundancy
- Software processes running on the various CPUs in the assembly:
  Controller - manages mapper and reducer tasks, fault detection and recovery
  Mapper - identical tasks assigned to multiple CPUs, each running over its local data
  Reducer - aggregates output from several mappers to form the end product
- The programmer only needs to author the mapper and reducer. The rest of the structure is provided.

Dean, Jeff and Ghemawat, Sanjay. "MapReduce: Simplified Data Processing on Large Clusters" http://labs.google.com/papers/mapreduce-osdi04.pdf
Simple Code Example (using mrjob) (this is running code)

    from mrjob.job import MRJob
    from math import sqrt
    import json

    class mrMeanVar(MRJob):
        DEFAULT_PROTOCOL = 'json'

        def mapper(self, key, line):
            num = json.loads(line)
            var = [num, num * num]
            yield 1, var

        def reducer(self, n, vars):
            N = 0.0
            sum = 0.0
            sumsq = 0.0
            for x in vars:
                N += 1
                sum += x[0]
                sumsq += x[1]
            mean = sum / N
            sd = sqrt(sumsq / N - mean * mean)
            results = [mean, sd]
            yield 1, results

    if __name__ == '__main__':
        mrMeanVar.run()
Example - Sum Values from a Large Data Set
- Data set D divided into d1, d2, ..., dn (each a list of things that can be summed - say real numbers or vectors)
- Mappers running on CPU 1, CPU 2, ..., CPU n
- Each mapper forms a sum over its piece of the data and emits the sums s1, s2, ..., sn.

  CPU 1 - mapper: sum elements of d1 -> s1
  CPU 2 - mapper: sum elements of d2 -> s2
  ...
  CPU n - mapper: sum elements of dn -> sn

  Reducer: sum mapper outputs
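The data flow above can be simulated in a few lines of plain Python (no cluster; the chunking is illustrative):

```python
# Local simulation of the distributed sum: each "mapper" produces a
# partial sum s_i over its chunk d_i; the "reducer" adds the partials.

def mapper(chunk):
    # runs independently over one CPU's local piece of the data
    return sum(chunk)

def reducer(partial_sums):
    # aggregates the mapper outputs s1..sn into the grand total
    return sum(partial_sums)

D = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = [D[0:2], D[2:4], D[4:6]]        # d1, d2, d3
partials = [mapper(d) for d in chunks]   # s1, s2, s3
total = reducer(partials)                # 21.0
```

The same two functions work unchanged whether the chunks live on one machine or a thousand; that independence is what makes the pattern scale.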
Machine Learning w. Map Reduce
- Statistical Query Model (see ref below) - a two-step process:
  1. Compute sufficient statistics by summing some functions over the data
  2. Perform a calculation on the sums to yield the data mining model
  Conceptually similar to the "sum" example above.
- Consider ordinary least squares regression:
  Given m outputs y_i (also called labels, observations, etc.)
  And m corresponding attribute vectors x_i (also called regressors, predictors, etc.)
  Fit a model of the form y = theta^T x by solving theta* = argmin_theta sum_i (y_i - theta^T x_i)^2
  theta* = A^-1 b, where A = sum_i (x_i x_i^T) and b = sum_i (x_i y_i)
- See the natural division into mapper and reducer? (Hint: look for a sum)

"Map-Reduce for Machine Learning on Multicore" http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
Machine Learning with Map Reduce
With OLS the mappers compute partial sums of the form sum(x_i x_i^T) and sum(x_i y_i). The reducer aggregates the partial sums into totals and completes the calculation theta* = A^-1 b.

  CPU 1 - mapper: sum x_i x_i^T and x_i y_i for x_i in d1
  CPU 2 - mapper: sum x_i x_i^T and x_i y_i for x_i in d2
  ...
  CPU n - mapper: sum x_i x_i^T and x_i y_i for x_i in dn

  Reducer: aggregate mapper outputs
    A = sum x_i x_i^T, b = sum x_i y_i
    Calculate theta* = A^-1 b
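A local sketch of this OLS split, using numpy (the chunk sizes and synthetic data are illustrative):

```python
# Map-reduce OLS, simulated: each mapper accumulates the partial sums
# sum(x_i x_i^T) and sum(x_i y_i) over its chunk; the reducer adds the
# partials into A and b and solves theta* = A^-1 b.
import numpy as np

def mapper(X_chunk, y_chunk):
    A_part = X_chunk.T @ X_chunk      # partial sum of x_i x_i^T
    b_part = X_chunk.T @ y_chunk      # partial sum of x_i y_i
    return A_part, b_part

def reducer(partials):
    A = sum(p[0] for p in partials)   # aggregate into totals
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)      # theta* = A^-1 b

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])
X = rng.normal(size=(100, 2))
y = X @ theta_true                    # noiseless synthetic labels
chunks = [(X[i:i + 25], y[i:i + 25]) for i in range(0, 100, 25)]
theta = reducer([mapper(Xc, yc) for Xc, yc in chunks])
```

Note that only the fixed-size sufficient statistics (an n-by-n matrix and an n-vector) cross the "network", not the raw data.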
Machine Learning w. Map Reduce
- The referenced paper demonstrates that the following algorithms can all be arranged in this Statistical Query Model form:
  Locally Weighted Linear Regression,
  Naive Bayes,
  Gaussian Discriminative Analysis,
  k-Means,
  Neural Networks,
  Principal Component Analysis,
  Independent Component Analysis,
  Expectation Maximization,
  Support Vector Machines
- In some cases, iteration is required. Each iterative step involves a map-reduce sequence
- Other machine learning algorithms can be arranged for map reduce but not in Statistical Query Model form (e.g. canopy clustering or binary decision trees)
More Map Reduce Detail
- Mappers emit key-value pairs
  The controller sorts key-value pairs by key
  The reducer gets pairs grouped by key
- The mapper can be a two-step process
  With OLS, for example, we might have had the mapper emit each x_i x_i^T and x_i y_i, instead of emitting sums of these quantities
  Post-processing the mapper output (e.g. forming the sums) is done by a mapper-side function called the "combiner"
  Combiners reduce network traffic
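The combiner idea can be simulated locally (function names and the x*x payload are illustrative):

```python
# Sketch of the combiner: the mapper emits one key-value pair per point,
# the combiner collapses each mapper's local pairs into a single partial
# sum BEFORE anything crosses the network, and the reducer then sees one
# record per mapper instead of one per data point.

def mapper(chunk):
    # one pair per point - heavy network traffic if shipped as-is
    return [(1, x * x) for x in chunk]

def combiner(pairs):
    # runs on the mapper's machine: collapse local pairs into one sum
    key = pairs[0][0]
    return (key, sum(v for _, v in pairs))

def reducer(combined):
    # only n combiner outputs arrive, not m mapper outputs
    return sum(v for _, v in combined)

chunks = [[1.0, 2.0], [3.0, 4.0]]
combined = [combiner(mapper(c)) for c in chunks]
total = reducer(combined)            # 1 + 4 + 9 + 16 = 30.0
```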
Some Algorithms
- Canopy Clustering
- K-means
- EM algo for Gaussian Mixture Model
- Glmnet
- SVM
Canopy Clustering
- Usually used as a rough clustering to reduce computation (e.g. search a zip code for the closest pizza versus search the world)
- Also finds well-distributed initial conditions for other clustering methods (e.g. k-means)
- Algorithm:
  Given a set of points P and a distance measure d(,)
  Pick two distance thresholds T1 > T2 > 0
  Step 1: Find cluster centers
    1. Initialize the set of centers C = null
    2. Iterate over points p_i in P:
       If there isn't a c in C s.t. d(c, p_i) < T2, add p_i to C
       Get the next p_i
  Step 2: Assign points to clusters
    For each p_i in P, assign p_i to {c in C : d(p_i, c) < T1}
- Notice that points generally get assigned to more than one cluster.

McCallum, Nigam, and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching", http://www.kamalnigam.com/papers/canopy-kdd00.pdf
Canopy Clustering Picture
[Figure: scattered points (x's) grouped into four overlapping canopies, Cluster 1 - Cluster 4]
Canopy Clustering w. Map Reduce
1st Pass - find centers
  Mappers - run canopy clustering on a subset, pass centers to the reducer
  Reducer - run canopy clustering on the centers from the mappers
2nd Pass - make cluster assignments (if necessary)
  Mappers - compare points p_i to the centers to form the set c_i = {c in C | d(p_i, c) < T1}
  Emit a <center, point> pair for each c in c_i
  Reducer - since the reducer input is sorted on key value (here, that's the cluster center), the reducer input will be a list of all the points assigned to a given center
- One small problem is that the centers picked by the reducer may not cover all the points of the combined original set. Pick T1 > 2*T2, or use a larger T2 in the reducer, in order to ensure that all points are covered.

The Apache Mahout project has a lot of great algorithms and documentation: https://cwiki.apache.org/MAHOUT/canopy-clustering.html
K-Means Clustering
- The K-means algorithm seeks to partition a data set into K disjoint sets, such that the sum of the within-set variances is minimized.
- Using Euclidean distance, the within-set variance is the sum of squared distances from the set's centroid
- Lloyd's algorithm for K-means goes as follows:
  Initialize:
    Pick K starting guesses for centroids (at random)
  Iterate:
    Assign points to the cluster whose centroid is closest
    Recalculate the cluster centroids
- Here's a sequence from the Wikipedia page on k-means clustering: http://en.wikipedia.org/wiki/K-means_clustering
K-means in Map Reduce
Mapper - given K initial guesses, run through the local data and for each point determine which centroid is closest; accumulate the vector sum (and count) of the points closest to each centroid; the combiner emits a <sum, n> pair for each of the K centroids.
Reducer - for each old centroid i, aggregate the sums and n from all the mappers and calculate the new centroid.
This map-reduce pair completes an iteration of Lloyd's algorithm.
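One Lloyd iteration as this map-reduce pair, simulated with numpy (data, chunking, and the two starting centroids are illustrative):

```python
# K-means map-reduce iteration: each mapper accumulates (sum, count)
# for the points closest to each centroid; the reducer adds the
# partials and emits the new centroids (mean of assigned points).
import numpy as np

def mapper(chunk, centroids):
    K = len(centroids)
    sums = np.zeros_like(centroids)
    counts = np.zeros(K)
    for p in chunk:
        i = np.argmin(np.linalg.norm(centroids - p, axis=1))
        sums[i] += p          # vector sum of points nearest centroid i
        counts[i] += 1
    return sums, counts

def reducer(partials):
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    return sums / counts[:, None]     # new centroid = mean of its points

data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
chunks = [data[:2], data[2:]]
new_centroids = reducer([mapper(c, centroids) for c in chunks])
```

Repeating the pair until the centroids stop moving completes Lloyd's algorithm; each repetition is one full map-reduce job.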
Support Vector Machine - SVM
- For classification, SVM finds the separating hyperplane
- H3 does not separate the classes. H1 separates, but not with max margin. H2 separates with max margin.
Figure from http://en.wikipedia.org/wiki/Support_vector_machine
SVM as Mathematical Optimization Problem
- Given a training set S = {(x_i, y_i)}
- Where x_i in R^n and y_i in {+1, -1}
- We can write any linear functional of x as w^T x + b, where w in R^n and b is a scalar
- Find the weight vector w and constant b to minimize

    min_{w,b}  (lambda/2) ||w||^2 + (1/m) sum_{(x,y) in S} l(w,b;(x,y))

  where

    l(w,b;(x,y)) = max{0, 1 - y (w^T x + b)}
Solution by Batch Gradient Descent
- We can solve this by using batch gradient descent. Take a derivative wrt w & b.
  Notice that if 1 - y (w^T x + b) < 0, then l() = 0 and grad l() = 0.
  Denote by S+ = {(x,y) in S | 1 - y (w^T x + b) > 0}. Then the gradient wrt w is

    lambda*w + (1/m) sum_{(x,y) in S+} (-y x)

  and wrt b is:

    (1/m) sum_{(x,y) in S+} (-y)

For reference see "Map Reduce for Machine Learning on Multicore" mentioned earlier. Also see Shalev-Shwartz, "Pegasos: Primal Estimated sub-Gradient Solver for SVM"
Calculating a Gradient Step
- What's important about what we just developed? Several things:
  1. In the equation we wound up with terms like lambda*w + sum(-y x); only the term inside the sum is data dependent.
  2. The data-dependent term sum(-y x) is summed over the points in the input data where the constraints are active.
- Let's summarize by drawing up the mapper and reducer functions.
  Initialize w and b, the step size eta, the regularization constant lambda, and m (# of instances).
  Iterate:
    Mapper - each mapper has access to a subset Sm of S. For points (x,y) in Sm check 1 - y (w^T x + b) > 0. If yes, accumulate -y and -y x. Emit the accumulated sums. We'll put in a dummy key value, but all the output gets summarized in a single (vector) quantity to be processed by a single reducer.
    Reducer - accumulate -y and -y x from all mappers and update the estimates of w and b using

      w_new = w_old - eta (lambda*w_old + sum(-y x))
      b_new = b_old - eta sum(-y)
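The mapper/reducer pair sketched above, simulated locally (the two-point toy data, eta, and lambda values are illustrative; the 1/m factors are folded into eta for brevity):

```python
# One batch-gradient step per map-reduce pair: mappers accumulate
# sum(-y*x) and sum(-y) over their margin violators S+ (points with
# 1 - y(w.x + b) > 0); the reducer applies the update.
import numpy as np

def mapper(X, y, w, b):
    active = 1.0 - y * (X @ w + b) > 0        # the set S+
    return -X[active].T @ y[active], -np.sum(y[active])

def reducer(partials, w, b, lam, eta):
    gw = sum(p[0] for p in partials)          # aggregate sum(-y*x)
    gb = sum(p[1] for p in partials)          # aggregate sum(-y)
    w_new = w - eta * (lam * w + gw)
    b_new = b - eta * gb
    return w_new, b_new

X = np.array([[2.0, 0.0], [-2.0, 0.0]])       # two separable points
y = np.array([1.0, -1.0])
w, b = np.zeros(2), 0.0
for _ in range(50):                           # each pass = one map-reduce job
    w, b = reducer([mapper(X, y, w, b)], w, b, lam=0.01, eta=0.1)
```

After the loop, both training points sit on the correct side of the learned hyperplane.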
GLMNet Algorithm
- Regularized regression
- To avoid over-fitting we have to exert control over the degrees of freedom in the regression
  - cut back on attributes - subset selection
  - penalize regression coefficients - coefficient shrinkage, ridge regression, lasso regression
- With coefficient shrinkage, different penalties give solutions with different properties

"Regularization Paths for Generalized Linear Models via Coordinate Descent", Friedman, Hastie and Tibshirani, http://www.stanford.edu/~hastie/Papers/glmnet.pdf
Regularizing Regression with Coefficient Shrinkage
- Start with the OLS problem formulation and add a penalty to the OLS error
- Suppose y in R and a vector of predictors x in R^n
- As with OLS, seek a fit for y of the form y ~ beta_0 + x^T beta
- Assume that the x_i have been standardized to mean zero and unit variance
- Find

    min_{beta_0, beta}  (1/2m) sum_{i=1}^{m} (y_i - beta_0 - x_i^T beta)^2 + lambda P(beta)

- The part in the sum is the ordinary least squares error. The bit at the end (lambda P(beta)) is new.
- Notice that the minimizing set of beta's is a function of the parameter lambda. If lambda = 0 we get the OLS coefficients.
Penalty Term
- The coefficient penalty term is given by

    P_alpha(beta) = sum_{j=1}^{n} [ (1/2)(1 - alpha) beta_j^2 + alpha |beta_j| ]

- This is called the "elasticnet" penalty (see ref below).
- alpha = 1 gives the l1 penalty (sum of absolute values)
- alpha = 0 gives the squared l2 penalty (sum of squares)
- Why is this important? The choice of penalty influences the nature of the solutions. l1 gives sparse solutions; it ignores attributes. l2 tends to average correlated attributes.
- Consider the coefficient paths as functions of the parameter lambda.

H. Zou and T. Hastie. "Regularization and variable selection via the elastic net." J. Royal. Stat. Soc. B, 67(2):301-320, 2005.
Coefficient Trajectories
- Here's Figure 1 from the Friedman et al. paper
Pathwise Solution
- This algorithm works as follows:
  Initialize with a value of lambda large enough that all beta's are 0.
  Decrease lambda slightly and update the beta's by taking a gradient step from the old beta based on the new value of lambda.
  Small changes in lambda mean that the update converges quickly.
  In many cases it is faster to generate the entire coefficient trajectory with this algorithm than to generate a point solution.
- Friedman et al. show that each element beta_j of the coefficient vector satisfies

    beta_j = S( (1/m) sum_i x_ij (y_i - y~_i^(j)), lambda*alpha ) / (1 + lambda (1 - alpha))

  where y~_i^(j) is the fit omitting the contribution of x_ij, and the function S() is given by

    S(z, gamma) = z - gamma   if z > 0 and gamma < |z|
                = z + gamma   if z < 0 and gamma < |z|
                = 0           if gamma >= |z|

- The point is: this algorithm fits the Statistical Query Model.
  The sum inside the function S() can be spread over any number of mappers.
  This algorithm handles elasticnet-regularized regressions, logistic regression and multiclass logistic regression.
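The soft-threshold function and the coordinate update it feeds can be written in a few lines (a sketch of the inner step only, not the full pathwise solver; the argument names are illustrative):

```python
# The soft-threshold function S(z, gamma) and the elasticnet
# coordinate-descent update from the Friedman et al. paper.

def soft_threshold(z, gamma):
    # S(z, gamma): shrink z toward zero by gamma; snap to 0 inside the band
    if gamma >= abs(z):
        return 0.0
    return z - gamma if z > 0 else z + gamma

def update_beta_j(z_j, lam, alpha):
    # z_j is the partial-residual sum (1/m) sum_i x_ij (y_i - y~_i^(j)),
    # which is exactly the piece that can be spread across mappers;
    # shrink it by lambda*alpha, then scale by 1 + lambda*(1 - alpha)
    return soft_threshold(z_j, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```

With alpha = 1 the denominator is 1 and the update reduces to the lasso's pure soft threshold; with alpha = 0 it is pure ridge shrinkage.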
Summary
- Here's what we covered:
  1. Where do big-data machine learning problems arise?
     - e-commerce, retail, narrow targeting, bank and credit cards
  2. What ways are there to deal with big-data machine learning problems?
     - sample, language-level parallelization, map-reduce
  3. What is map-reduce?
     - a programming formalism that isolates the programming task to mapper and reducer functions
  4. Application of map-reduce to some familiar machine learning algorithms
Hope you learned something from the class. To get into more detail, come to the big-data class.