Machine Learning on Big Data
using Map Reduce

Michael Bowles, PhD
Winter, 2012
Where Does Big Data Come From?
- Web data (web logs, click histories)
- e-commerce applications (purchase histories)
- Retail purchase histories (Walmart)
- Bank and credit card transactions
What is Data Mining?
- What page will a visitor visit next? Given:
  Visitor's browsing history
  Visitor's demographics
- Should the card company approve the transaction that's waiting? Given:
  User's usage history
  Item being purchased
  Location of merchant
- What isn't data mining?
  What pages did visitors view most often?
  What products are most popular?
Data mining tells us something that isn't in the data or isn't a simple summary.
Approaches for Data Mining Large Data
- Data mine a sample
  Take a manageable subset (fits in memory, runs in reasonable time)
  Develop models
- Limitations of this method?
  Generally, more data supports finer-grained models
  e.g. making specific purchase recommendations
  "customers who bought ..." requires much more data than "top ten most popular are ..."
Side Note on Large Data:
- Rhine's paradox
  ESP experiment in the 50's
  1 in 1000 subjects can correctly identify the color (red or blue) of 10 cards they can't see
  Do they have ESP?
- Bonferroni's principle: given enough data, any combination of outcomes can be found
  Is this a reason to avoid large data sets?
  No
  It's a reason not to draw conclusions that the data don't support (and this is true no matter how large the data set)
Use Multiple Processor Cores
- What if we want to use the full data set? How can we devote more computational power to our job?
- There are always performance limits with a single processor core => use multiple cores simultaneously.
- Traditional approach: add structure to the programming language (C++, Java)
- Issues with this approach:
  High communication costs (if data must be distributed over a network)
  Difficult to deal with CPU failures at this level (processor failures are inevitable as scale increases)
- To deal with these issues Google developed the Map-Reduce paradigm
What is Map-Reduce?
- Arrangement of compute tasks enabling relatively easy scaling.
- Includes:
  Hardware arrangement - racks of CPUs with direct access to local disk, networked to one another
  File system - distributed storage across multiple disks, with redundancy
- Software processes running on the various CPUs in the assembly:
  Controller - manages mapper and reducer tasks, fault detection and recovery
  Mapper - identical tasks assigned to multiple CPUs, each running over its local data
  Reducer - aggregates output from several mappers to form the end product
- The programmer only needs to author the mapper and reducer. The rest of the structure is provided.

Dean, Jeff and Ghemawat, Sanjay. "MapReduce: Simplified Data Processing on Large Clusters" http://labs.google.com/papers/mapreduce-osdi04.pdf
Simple Code Example (using mrjob) (this is running code)

    from mrjob.job import MRJob
    from math import sqrt
    import json

    class mrMeanVar(MRJob):
        DEFAULT_PROTOCOL = 'json'

        def mapper(self, key, line):
            num = json.loads(line)
            var = [num, num * num]
            yield 1, var

        def reducer(self, n, vars):
            N = 0.0
            sum = 0.0
            sumsq = 0.0
            for x in vars:
                N += 1
                sum += x[0]
                sumsq += x[1]
            mean = sum / N
            sd = sqrt(sumsq / N - mean * mean)
            results = [mean, sd]
            yield 1, results

    if __name__ == '__main__':
        mrMeanVar.run()
Example - Sum Values from a Large Data Set
- Data set D divided into d1, d2, ..., dn (each a list of things that can be summed - say real numbers or vectors)
- Mappers running on CPU 1, CPU 2, ..., CPU n
- Each mapper forms a sum over its piece of the data and emits the sums s1, s2, ..., sn.

  CPU 1 - mapper: sum elements of d1 -> s1
  CPU 2 - mapper: sum elements of d2 -> s2
  ...
  CPU n - mapper: sum elements of dn -> sn

  Reducer: sum mapper outputs
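The data flow above can be simulated in a few lines of plain Python (no cluster; the chunking is illustrative):

```python
# Local simulation of the distributed sum: each "mapper" produces a
# partial sum s_i over its chunk d_i; the "reducer" adds the partials.

def mapper(chunk):
    # runs independently over one CPU's local piece of the data
    return sum(chunk)

def reducer(partial_sums):
    # aggregates the mapper outputs s1..sn into the grand total
    return sum(partial_sums)

D = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = [D[0:2], D[2:4], D[4:6]]        # d1, d2, d3
partials = [mapper(d) for d in chunks]   # s1, s2, s3
total = reducer(partials)                # 21.0
```

The same two functions work unchanged whether the chunks live on one machine or a thousand; that independence is what makes the pattern scale.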
Machine Learning w. Map Reduce
- Statistical Query Model (see ref below) - a two-step process:
  1. Compute sufficient statistics by summing some functions over the data
  2. Perform a calculation on the sums to yield the data mining model
  Conceptually similar to the "sum" example above.
- Consider ordinary least squares regression:
  Given m outputs y_i (also called labels, observations, etc.)
  And m corresponding attribute vectors x_i (also called regressors, predictors, etc.)
  Fit a model of the form y = theta^T x by solving theta* = argmin_theta sum_i (y_i - theta^T x_i)^2
  theta* = A^-1 b, where A = sum_i (x_i x_i^T) and b = sum_i (x_i y_i)
- See the natural division into mapper and reducer? (Hint: look for a sum)

"Map-Reduce for Machine Learning on Multicore" http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
Machine Learning with Map Reduce
With OLS the mappers compute partial sums of the form sum(x_i x_i^T) and sum(x_i y_i). The reducer aggregates the partial sums into totals and completes the calculation theta* = A^-1 b.

  CPU 1 - mapper: sum x_i x_i^T and x_i y_i for x_i in d1
  CPU 2 - mapper: sum x_i x_i^T and x_i y_i for x_i in d2
  ...
  CPU n - mapper: sum x_i x_i^T and x_i y_i for x_i in dn

  Reducer: aggregate mapper outputs
    A = sum x_i x_i^T, b = sum x_i y_i
    Calculate theta* = A^-1 b
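A local sketch of this OLS split, using numpy (the chunk sizes and synthetic data are illustrative):

```python
# Map-reduce OLS, simulated: each mapper accumulates the partial sums
# sum(x_i x_i^T) and sum(x_i y_i) over its chunk; the reducer adds the
# partials into A and b and solves theta* = A^-1 b.
import numpy as np

def mapper(X_chunk, y_chunk):
    A_part = X_chunk.T @ X_chunk      # partial sum of x_i x_i^T
    b_part = X_chunk.T @ y_chunk      # partial sum of x_i y_i
    return A_part, b_part

def reducer(partials):
    A = sum(p[0] for p in partials)   # aggregate into totals
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)      # theta* = A^-1 b

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])
X = rng.normal(size=(100, 2))
y = X @ theta_true                    # noiseless synthetic labels
chunks = [(X[i:i + 25], y[i:i + 25]) for i in range(0, 100, 25)]
theta = reducer([mapper(Xc, yc) for Xc, yc in chunks])
```

Note that only the fixed-size sufficient statistics (an n-by-n matrix and an n-vector) cross the "network", not the raw data.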
Machine Learning w. Map Reduce
- The referenced paper demonstrates that the following algorithms can all be arranged in this Statistical Query Model form:
  Locally Weighted Linear Regression,
  Naive Bayes,
  Gaussian Discriminative Analysis,
  k-Means,
  Neural Networks,
  Principal Component Analysis,
  Independent Component Analysis,
  Expectation Maximization,
  Support Vector Machines
- In some cases, iteration is required. Each iterative step involves a map-reduce sequence
- Other machine learning algorithms can be arranged for map reduce but not in Statistical Query Model form (e.g. canopy clustering or binary decision trees)
More Map Reduce Detail
- Mappers emit key-value pairs
  The controller sorts key-value pairs by key
  The reducer gets pairs grouped by key
- The mapper can be a two-step process
  With OLS, for example, we might have had the mapper emit each x_i x_i^T and x_i y_i, instead of emitting sums of these quantities
  Post-processing the mapper output (e.g. forming the sums) is done by a mapper-side function called the "combiner"
  Combiners reduce network traffic
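The combiner idea can be simulated locally (function names and the x*x payload are illustrative):

```python
# Sketch of the combiner: the mapper emits one key-value pair per point,
# the combiner collapses each mapper's local pairs into a single partial
# sum BEFORE anything crosses the network, and the reducer then sees one
# record per mapper instead of one per data point.

def mapper(chunk):
    # one pair per point - heavy network traffic if shipped as-is
    return [(1, x * x) for x in chunk]

def combiner(pairs):
    # runs on the mapper's machine: collapse local pairs into one sum
    key = pairs[0][0]
    return (key, sum(v for _, v in pairs))

def reducer(combined):
    # only n combiner outputs arrive, not m mapper outputs
    return sum(v for _, v in combined)

chunks = [[1.0, 2.0], [3.0, 4.0]]
combined = [combiner(mapper(c)) for c in chunks]
total = reducer(combined)            # 1 + 4 + 9 + 16 = 30.0
```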
Some Algorithms
- Canopy Clustering
- K-means
- EM algo for Gaussian Mixture Model
- Glmnet
- SVM
Canopy Clustering
- Usually used as a rough clustering to reduce computation (e.g. search a zip code for the closest pizza versus search the world)
- Also finds well-distributed initial conditions for other clustering methods (e.g. k-means)
- Algorithm:
  Given a set of points P and a distance measure d(,)
  Pick two distance thresholds T1 > T2 > 0
  Step 1: Find cluster centers
    1. Initialize the set of centers C = null
    2. Iterate over points p_i in P:
       If there isn't a c in C s.t. d(c, p_i) < T2, add p_i to C
       Get the next p_i
  Step 2: Assign points to clusters
    For each p_i in P, assign p_i to {c in C : d(p_i, c) < T1}
- Notice that points generally get assigned to more than one cluster.

McCallum, Nigam, and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching", http://www.kamalnigam.com/papers/canopy-kdd00.pdf
Canopy Clustering Picture
[Figure: scattered points (x's) grouped into four overlapping canopies, Cluster 1 - Cluster 4]
Canopy Clustering w. Map Reduce
1st Pass - find centers
  Mappers - run canopy clustering on a subset, pass centers to the reducer
  Reducer - run canopy clustering on the centers from the mappers
2nd Pass - make cluster assignments (if necessary)
  Mappers - compare points p_i to the centers to form the set c_i = {c in C | d(p_i, c) < T1}
  Emit a <center, point> pair for each c in c_i
  Reducer - since the reducer input is sorted on key value (here, that's the cluster center), the reducer input will be a list of all the points assigned to a given center
- One small problem is that the centers picked by the reducer may not cover all the points of the combined original set. Pick T1 > 2*T2, or use a larger T2 in the reducer, in order to ensure that all points are covered.

The Apache Mahout project has a lot of great algorithms and documentation: https://cwiki.apache.org/MAHOUT/canopy-clustering.html
K-Means Clustering
- The K-means algorithm seeks to partition a data set into K disjoint sets, such that the sum of the within-set variances is minimized.
- Using Euclidean distance, the within-set variance is the sum of squared distances from the set's centroid
- Lloyd's algorithm for K-means goes as follows:
  Initialize:
    Pick K starting guesses for centroids (at random)
  Iterate:
    Assign points to the cluster whose centroid is closest
    Recalculate the cluster centroids
- Here's a sequence from the Wikipedia page on k-means clustering: http://en.wikipedia.org/wiki/K-means_clustering
K-means in Map Reduce
Mapper - given K initial guesses, run through the local data and for each point determine which centroid is closest; accumulate the vector sum (and count) of the points closest to each centroid; the combiner emits a <sum, n> pair for each of the K centroids.
Reducer - for each old centroid i, aggregate the sums and n from all the mappers and calculate the new centroid.
This map-reduce pair completes an iteration of Lloyd's algorithm.
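One Lloyd iteration as this map-reduce pair, simulated with numpy (data, chunking, and the two starting centroids are illustrative):

```python
# K-means map-reduce iteration: each mapper accumulates (sum, count)
# for the points closest to each centroid; the reducer adds the
# partials and emits the new centroids (mean of assigned points).
import numpy as np

def mapper(chunk, centroids):
    K = len(centroids)
    sums = np.zeros_like(centroids)
    counts = np.zeros(K)
    for p in chunk:
        i = np.argmin(np.linalg.norm(centroids - p, axis=1))
        sums[i] += p          # vector sum of points nearest centroid i
        counts[i] += 1
    return sums, counts

def reducer(partials):
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    return sums / counts[:, None]     # new centroid = mean of its points

data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
chunks = [data[:2], data[2:]]
new_centroids = reducer([mapper(c, centroids) for c in chunks])
```

Repeating the pair until the centroids stop moving completes Lloyd's algorithm; each repetition is one full map-reduce job.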
Support Vector Machine - SVM
- For classification, SVM finds the separating hyperplane
- H3 does not separate the classes. H1 separates, but not with max margin. H2 separates with max margin.
Figure from http://en.wikipedia.org/wiki/Support_vector_machine
SVM as Mathematical Optimization Problem
- Given a training set S = {(x_i, y_i)}
- Where x_i in R^n and y_i in {+1, -1}
- We can write any linear functional of x as w^T x + b, where w in R^n and b is a scalar
- Find the weight vector w and constant b to minimize

    min_{w,b}  (lambda/2) ||w||^2 + (1/m) sum_{(x,y) in S} l(w,b;(x,y))

  where

    l(w,b;(x,y)) = max{0, 1 - y (w^T x + b)}
Solution by Batch Gradient Descent
- We can solve this by using batch gradient descent. Take a derivative wrt w & b.
  Notice that if 1 - y (w^T x + b) < 0, then l() = 0 and grad l() = 0.
  Denote by S+ = {(x,y) in S | 1 - y (w^T x + b) > 0}. Then the gradient wrt w is

    lambda*w + (1/m) sum_{(x,y) in S+} (-y x)

  and wrt b is:

    (1/m) sum_{(x,y) in S+} (-y)

For reference see "Map Reduce for Machine Learning on Multicore" mentioned earlier. Also see Shalev-Shwartz, "Pegasos: Primal Estimated sub-Gradient Solver for SVM"
Calculating a Gradient Step
- What's important about what we just developed? Several things:
  1. In the equation we wound up with terms like lambda*w + sum(-y x); only the term inside the sum is data dependent.
  2. The data-dependent term sum(-y x) is summed over the points in the input data where the constraints are active.
- Let's summarize by drawing up the mapper and reducer functions.
  Initialize w and b, the step size eta, the regularization constant lambda, and m (# of instances).
  Iterate:
    Mapper - each mapper has access to a subset Sm of S. For points (x,y) in Sm check 1 - y (w^T x + b) > 0. If yes, accumulate -y and -y x. Emit the accumulated sums. We'll put in a dummy key value, but all the output gets summarized in a single (vector) quantity to be processed by a single reducer.
    Reducer - accumulate -y and -y x from all mappers and update the estimates of w and b using

      w_new = w_old - eta (lambda*w_old + sum(-y x))
      b_new = b_old - eta sum(-y)
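The mapper/reducer pair sketched above, simulated locally (the two-point toy data, eta, and lambda values are illustrative; the 1/m factors are folded into eta for brevity):

```python
# One batch-gradient step per map-reduce pair: mappers accumulate
# sum(-y*x) and sum(-y) over their margin violators S+ (points with
# 1 - y(w.x + b) > 0); the reducer applies the update.
import numpy as np

def mapper(X, y, w, b):
    active = 1.0 - y * (X @ w + b) > 0        # the set S+
    return -X[active].T @ y[active], -np.sum(y[active])

def reducer(partials, w, b, lam, eta):
    gw = sum(p[0] for p in partials)          # aggregate sum(-y*x)
    gb = sum(p[1] for p in partials)          # aggregate sum(-y)
    w_new = w - eta * (lam * w + gw)
    b_new = b - eta * gb
    return w_new, b_new

X = np.array([[2.0, 0.0], [-2.0, 0.0]])       # two separable points
y = np.array([1.0, -1.0])
w, b = np.zeros(2), 0.0
for _ in range(50):                           # each pass = one map-reduce job
    w, b = reducer([mapper(X, y, w, b)], w, b, lam=0.01, eta=0.1)
```

After the loop, both training points sit on the correct side of the learned hyperplane.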
GLMNet Algorithm
- Regularized regression
- To avoid over-fitting we have to exert control over the degrees of freedom in the regression
  - cut back on attributes - subset selection
  - penalize regression coefficients - coefficient shrinkage, ridge regression, lasso regression
- With coefficient shrinkage, different penalties give solutions with different properties

"Regularization Paths for Generalized Linear Models via Coordinate Descent", Friedman, Hastie and Tibshirani, http://www.stanford.edu/~hastie/Papers/glmnet.pdf
Regularizing Regression with Coefficient Shrinkage
- Start with the OLS problem formulation and add a penalty to the OLS error
- Suppose y in R and a vector of predictors x in R^n
- As with OLS, seek a fit for y of the form y ~ beta_0 + x^T beta
- Assume that the x_i have been standardized to mean zero and unit variance
- Find

    min_{beta_0, beta}  (1/2m) sum_{i=1}^{m} (y_i - beta_0 - x_i^T beta)^2 + lambda P(beta)

- The part in the sum is the ordinary least squares error. The bit at the end (lambda P(beta)) is new.
- Notice that the minimizing set of beta's is a function of the parameter lambda. If lambda = 0 we get the OLS coefficients.
Penalty Term
- The coefficient penalty term is given by

    P_alpha(beta) = sum_{j=1}^{n} [ (1/2)(1 - alpha) beta_j^2 + alpha |beta_j| ]

- This is called the "elasticnet" penalty (see ref below).
- alpha = 1 gives the l1 penalty (sum of absolute values)
- alpha = 0 gives the squared l2 penalty (sum of squares)
- Why is this important? The choice of penalty influences the nature of the solutions. l1 gives sparse solutions; it ignores attributes. l2 tends to average correlated attributes.
- Consider the coefficient paths as functions of the parameter lambda.

H. Zou and T. Hastie. "Regularization and variable selection via the elastic net." J. Royal. Stat. Soc. B, 67(2):301-320, 2005.
Coefficient Trajectories
- Here's Figure 1 from the Friedman et al. paper
Pathwise Solution
- This algorithm works as follows:
  Initialize with a value of lambda large enough that all beta's are 0.
  Decrease lambda slightly and update the beta's by taking a gradient step from the old beta based on the new value of lambda.
  Small changes in lambda mean that the update converges quickly.
  In many cases it is faster to generate the entire coefficient trajectory with this algorithm than to generate a point solution.
- Friedman et al. show that each element beta_j of the coefficient vector satisfies

    beta_j = S( (1/m) sum_i x_ij (y_i - y~_i^(j)), lambda*alpha ) / (1 + lambda (1 - alpha))

  where y~_i^(j) is the fit omitting the contribution of x_ij, and the function S() is given by

    S(z, gamma) = z - gamma   if z > 0 and gamma < |z|
                = z + gamma   if z < 0 and gamma < |z|
                = 0           if gamma >= |z|

- The point is: this algorithm fits the Statistical Query Model.
  The sum inside the function S() can be spread over any number of mappers.
  This algorithm handles elasticnet-regularized regressions, logistic regression and multiclass logistic regression.
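The soft-threshold function and the coordinate update it feeds can be written in a few lines (a sketch of the inner step only, not the full pathwise solver; the argument names are illustrative):

```python
# The soft-threshold function S(z, gamma) and the elasticnet
# coordinate-descent update from the Friedman et al. paper.

def soft_threshold(z, gamma):
    # S(z, gamma): shrink z toward zero by gamma; snap to 0 inside the band
    if gamma >= abs(z):
        return 0.0
    return z - gamma if z > 0 else z + gamma

def update_beta_j(z_j, lam, alpha):
    # z_j is the partial-residual sum (1/m) sum_i x_ij (y_i - y~_i^(j)),
    # which is exactly the piece that can be spread across mappers;
    # shrink it by lambda*alpha, then scale by 1 + lambda*(1 - alpha)
    return soft_threshold(z_j, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```

With alpha = 1 the denominator is 1 and the update reduces to the lasso's pure soft threshold; with alpha = 0 it is pure ridge shrinkage.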
Summary
- Here's what we covered:
  1. Where do big-data machine learning problems arise?
     - e-commerce, retail, narrow targeting, bank and credit cards
  2. What ways are there to deal with big-data machine learning problems?
     - sample, language-level parallelization, map-reduce
  3. What is map-reduce?
     - a programming formalism that isolates the programming task to mapper and reducer functions
  4. Application of map-reduce to some familiar machine learning algorithms
Hope you learned something from the class. To get into more detail, come to the big-data class.