Stanford CS 246H Winter ‘14 Stanford CS 246H: Mining Massive Data Sets Hadoop Lab

Lecture 7 - CS 246h


Page 1: Lecture 7 - CS 246h

Stanford CS 246H: Mining Massive Data Sets Hadoop Lab

Page 2: Lecture 7 - CS 246h

Machine Learning & Hadoop

Page 3: Lecture 7 - CS 246h

Peanut Butter and Chocolate?

• The Promise of Big Data™
  • Sounds great, but how?
• Hadoop talent pool is small
• ML talent pool is tiny
• Tools and toolkits starting to appear
  • Mahout, Oryx, Alpine, Ayasdi, Skytree, etc.
• Summary: Hadoop is hard, and ML is hard
  1. Lots of people/companies are trying to make it easy
  2. Don’t believe anyone who tells you they make it easy

Page 4: Lecture 7 - CS 246h

Hadoop & ML: A Brief History

• 2005 – Taste project started on SourceForge
• 2007 – Mahout project started at Apache
• 2008 – Taste donated to Mahout
• … time passes …
• 2012 – Myrrix is launched
• 2013 – Cloudera ML project started on GitHub
• Late 2013 – Oryx project started on GitHub

Page 5: Lecture 7 - CS 246h

Hadoop ML Family Tree

[Diagram: family tree with nodes Taste, Lucene, Andrew Ng, Mahout, Myrrix, Cloudera ML, and Oryx.]

Page 6: Lecture 7 - CS 246h

Apache Mahout

Page 7: Lecture 7 - CS 246h

What is Mahout?

• “Scalable machine learning”
  • not just Hadoop-oriented machine learning
  • not entirely, that is. Just mostly.
• Components
  • math library
  • clustering
  • classification
  • decompositions
  • recommendations

©MapR Technologies 2013

Page 8: Lecture 7 - CS 246h

Mahout Math

• Goals are
  • basic linear algebra,
  • and statistical sampling,
  • and good clustering,
  • decent speed,
  • extensibility,
  • especially for sparse data
• But not
  • totally badass speed
  • comprehensive set of algorithms
  • optimization, root finders, quadrature

Page 9: Lecture 7 - CS 246h

Caveat Emptor

• Mahout is a toolkit
• There is a command line interface
  • You can’t always use it
• Very often end up writing code
• Documentation is… ahem… scant
  • Best reference is Mahout in Action
• Varying levels of maturity
• Varying levels of Hadoop support

Page 10: Lecture 7 - CS 246h

Matrices and Vectors

• At the core:
  • DenseVector, RandomAccessSparseVector
  • DenseMatrix, SparseRowMatrix
• Highly composable API
• Important ideas:
  • view*, assign and aggregate
  • iteration

m.viewDiagonal().assign(v)

Page 11: Lecture 7 - CS 246h

Assign? View?

• Why assign?
  • Copying is the major cost for naïve matrix packages
  • In-place operations critical to reasonable performance
  • Many kinds of updates required, so functional style very helpful
• Why view?
  • In-place operations often required for blocks, rows, columns or diagonals
  • With views, we need #assign + #views methods
  • Without views, we need #assign × #views methods
• Synergies
  • With both views and assign, many loops become single line
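The view and assign ideas can be sketched in plain Java without Mahout on the classpath. This is only an illustration under stated assumptions: `ViewAssignSketch` and `DiagonalView` are hypothetical names, the matrix is a square `double[][]`, and the method names merely mirror `viewDiagonal`, `assign`, and `zSum` from the slides. It is not Mahout's implementation.

```java
import java.util.function.DoubleUnaryOperator;

public class ViewAssignSketch {
    /** Write-through view of the main diagonal of a square array (hypothetical, not Mahout's class). */
    static final class DiagonalView {
        private final double[][] m;

        DiagonalView(double[][] m) { this.m = m; }

        /** Set every diagonal entry to a constant, in place. */
        DiagonalView assign(double value) {
            for (int i = 0; i < m.length; i++) m[i][i] = value;
            return this;
        }

        /** Apply f to every diagonal entry, in place (functional style). */
        DiagonalView assign(DoubleUnaryOperator f) {
            for (int i = 0; i < m.length; i++) m[i][i] = f.applyAsDouble(m[i][i]);
            return this;
        }

        /** Sum of the viewed entries; for a diagonal view this is the trace. */
        double zSum() {
            double s = 0;
            for (int i = 0; i < m.length; i++) s += m[i][i];
            return s;
        }
    }

    /** No copy is made: the view shares storage with the matrix. */
    static DiagonalView viewDiagonal(double[][] m) { return new DiagonalView(m); }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        System.out.println(viewDiagonal(m).zSum());   // trace of m
        viewDiagonal(m).assign(x -> -x);              // negate the diagonal in place
        System.out.println(m[0][0] + " " + m[1][1]);
    }
}
```

Because the view writes through to the underlying storage, one set of `assign` overloads serves every kind of view, which is the "#assign + #views" point made on the slide.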

Page 12: Lecture 7 - CS 246h

Assign

• Matrices

Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);

• Vectors

Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);

Page 13: Lecture 7 - CS 246h

Views

• Matrices

Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();

• Vectors

Vector viewPart(int offset, int length);

Page 14: Lecture 7 - CS 246h

Aggregates

• Matrices

double zSum();
Vector aggregateRows(VectorFunction f);
Vector aggregateColumns(VectorFunction f);
double aggregate(DoubleDoubleFunction combiner,
                 DoubleFunction mapper);

• Vectors

double zSum();
double aggregate(
    DoubleDoubleFunction reduce, DoubleFunction map);
double aggregate(Vector other,
    DoubleDoubleFunction aggregator,
    DoubleDoubleFunction combiner);

Page 15: Lecture 7 - CS 246h

Predefined Functions

• Many handy functions

ABS         LOG2
ACOS        NEGATE
ASIN        RINT
ATAN        SIGN
CEIL        SIN
COS         SQRT
EXP         SQUARE
FLOOR       SIGMOID
IDENTITY    SIGMOIDGRADIENT
INV         TAN
LOGARITHM

Page 16: Lecture 7 - CS 246h

Examples

A = α

double alpha;
a.assign(alpha);

A = αB + β

a.assign(b, Functions.chain(
    Functions.plus(beta),
    Functions.mult(alpha)));

Page 17: Lecture 7 - CS 246h

Sparse Optimizations

• DoubleDoubleFunction abstract properties

public boolean isLikeRightPlus();
public boolean isLikeLeftMult();
public boolean isLikeRightMult();
public boolean isLikeMult();
public boolean isCommutative();
public boolean isAssociative();
public boolean isAssociativeAndCommutative();
public boolean isDensifying();

• And Vector properties

public boolean isDense();
public boolean isSequentialAccess();
public double getLookupCost();
public double getIteratorAdvanceCost();
public boolean isAddConstantTime();

Page 18: Lecture 7 - CS 246h

Examples

• The trace of a matrix

m.viewDiagonal().zSum()

• Set diagonal to zero

m.viewDiagonal().assign(0)

• Set diagonal to negative of row sums excluding the diagonal

Vector diag = m.viewDiagonal().assign(0);
diag.assign(m.rowSums().assign(Functions.NEGATE));

Page 19: Lecture 7 - CS 246h

Iteration

• Matrices are Iterable in Mahout

// compute both row and column sums in one pass
for (MatrixSlice row : m) {
  rSums.set(row.index(), row.zSum());
  cSums.assign(row, Functions.PLUS);
}

• Vectors are densely or sparsely iterable

double entropy = 0;
for (Vector.Element e : v.iterateNonZero()) {
  entropy -= e.get() * Math.log(e.get());
}

Page 20: Lecture 7 - CS 246h

Random Sampling

• Samples from some type

public interface Sampler<T> {
  T sample();
}

public abstract class AbstractSamplerFunction
    extends DoubleFunction
    implements Sampler<Double>

• Lots of kinds

ChineseRestaurant    Missing        Normal
Empirical            Multinomial    PoissonSampler
IndianBuffet         MultiNormal    Sampler

Page 21: Lecture 7 - CS 246h

Mahout Math Summary

• Matrices, Vectors
  • views
  • in-place assignment
  • aggregations
  • iterations
• Functions
  • lots built-in
  • cooperate with sparse vector optimizations
• Sampling
  • abstract samplers
  • samplers as functions
• Other stuff … clustering, SVD

Page 22: Lecture 7 - CS 246h

Other Stuff

• Matrix Decomposition
• Classification
• Clustering
• Recommendations

Page 23: Lecture 7 - CS 246h

Focus: Machine Learning

[Diagram: Mahout component stack with boxes for Applications, Examples, Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic, Math (Vectors/Matrices/SVD), Utilities (Lucene/Vectorizer), Collections (primitives), and Apache Hadoop.]

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

©Lucid Imagination 2010

Page 24: Lecture 7 - CS 246h

Prepare Data from Raw Content

• Data Sources:
  • Lucene integration
    • bin/mahout lucenevector …
  • Document Vectorizer
    • bin/mahout seqdirectory …
    • bin/mahout seq2sparse …
  • Programmatically
    • See the Utils module in Mahout
  • Database
  • File system

Page 25: Lecture 7 - CS 246h

Recommendations

• Extensive framework for collaborative filtering
• Recommenders
  • User based, Item based, ALS, SlopeOne, SVD, others
• Online and Offline support
  • Offline can utilize Hadoop
• Many different Similarity measures
  • Cosine, LLR, Tanimoto, Pearson, others
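Two of the similarity measures named above, cosine and Tanimoto, are simple enough to sketch directly. The class below is a hypothetical stand-alone illustration using the textbook definitions; the names `SimilaritySketch`, `cosine`, and `tanimoto` are assumptions and this is not how Mahout's recommender classes are structured.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SimilaritySketch {
    /** Cosine similarity of two equal-length rating vectors: dot(a, b) / (|a| * |b|). */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Tanimoto coefficient on sets of item IDs: |A and B| / |A or B|. */
    static double tanimoto(Set<Integer> a, Set<Integer> b) {
        int common = 0;
        for (int item : a) {
            if (b.contains(item)) common++;
        }
        int union = a.size() + b.size() - common;
        return union == 0 ? 0.0 : (double) common / union;
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {5, 3, 0}, new double[] {4, 0, 2}));
        Set<Integer> u1 = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> u2 = new HashSet<>(Arrays.asList(2, 3, 4));
        System.out.println(tanimoto(u1, u2));
    }
}
```

Cosine uses the rating values, while Tanimoto only looks at which items two users have in common, so it suits implicit "has interacted with" data.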

Page 26: Lecture 7 - CS 246h

Clustering

• Document level
  • Group documents based on a notion of similarity
  • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
  • Distance Measures
    • Manhattan, Euclidean, other
• Topic Modeling
  • Cluster words across documents to identify topics
  • Latent Dirichlet Allocation

Page 27: Lecture 7 - CS 246h

Categorization

• Place new items into predefined categories:
  • Sports, politics, entertainment
• Mahout has several implementations
  • Naïve Bayes
  • Complementary Naïve Bayes
  • Decision Forests
  • Logistic Regression (SGD)

Page 28: Lecture 7 - CS 246h

Freq. Pattern Mining

• Identify frequently co-occurrent items
• Useful for:
  • Query Recommendations
    • Apple -> iPhone, orange, OS X
  • Related product placement
    • “Beer and Diapers”
  • Spam Detection
    • Yahoo: http://www.slideshare.net/hadoopusergroup/mail-antispam

http://www.amazon.com

Page 29: Lecture 7 - CS 246h

Evolutionary

• Map-Reduce ready fitness functions for genetic programming
• Integration with Watchmaker
  • http://watchmaker.uncommons.org/index.php
• Problems solved:
  • Traveling salesman
  • Class discovery
  • Many others

Page 30: Lecture 7 - CS 246h

Singular Value Decomposition

• Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts
• Mahout has fully distributed Lanczos implementation

<MAHOUT_HOME>/bin/mahout svd -Dmapred.input.dir=path/to/corpus --tempDir path/for/svd-output --rank 300 --numColumns <numcols> --numRows <num rows in the input>
<MAHOUT_HOME>/bin/mahout cleansvd --eigenInput path/for/svd-output --corpusInput path/to/corpus --output path/for/cleanOutput --maxError 0.1 --minEigenvalue 10.0

• https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

Page 31: Lecture 7 - CS 246h

How to: Command Line

• Most algorithms have a Driver program
  • Shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the Data
  • Different algorithms require different setup
• Run the algorithm
  • Single Node
  • Hadoop
• Print out the results
  • Several helper classes:
    • LDAPrintTopics, ClusterDumper, etc.

Page 32: Lecture 7 - CS 246h

Ugly Demo II - Prep

• Data Set: Reuters
  • http://www.daviddlewis.com/resources/testcollections/reuters21578/
  • Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
  • Convert to Sequence File:

bin/mahout seqdirectory --input <PATH> --output <PATH> --charset UTF-8

  • Convert to Sparse Vector:

bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles/ --norm 2 --weight TF --output <PATH>/content/reuters/seqfiles-TF/ --minDF 5 --maxDFPercent 90

Page 33: Lecture 7 - CS 246h

Ugly Demo II: Topic Modeling

• Latent Dirichlet Allocation

./mahout lda --input <PATH>/content/reuters/seqfiles-TF/vectors/ --output <PATH>/content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 10
./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile

• Good feature reduction (stopword removal) required

Page 34: Lecture 7 - CS 246h

Ugly Demo III: Clustering

• K-Means
  • Same Prep as UD II, except use TFIDF weight

./mahout kmeans --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters

• Print out the clusters:

./mahout clusterdump --seqFileDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary <PATH>/content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20

Page 35: Lecture 7 - CS 246h

Ugly Demo IV: Frequent Pattern Mining

• Data: http://fimi.cs.helsinki.fi/data/
• ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
• ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000

Page 36: Lecture 7 - CS 246h

Cloudera ML

Page 37: Lecture 7 - CS 246h

Cloudera ML

• Collection of Java libraries and command-line tools
• Goal: make data scientists more productive with CDH
  • Exploratory data analysis
  • Data preparation
  • Model fitting
  • Model evaluation
• Apache 2.0 licensed
• Developed on GitHub
  • http://github.com/cloudera/ml

Page 38: Lecture 7 - CS 246h

Cloudera ML: Building Blocks

• Apache Hadoop
  • Scalable data storage (HDFS) and processing (MapReduce)
• Apache Hive
  • Metadata for structured data in HDFS
• Apache Crunch
  • Easy MapReduce pipelines
• Apache Mahout
  • Vector interface
• Apache Avro
  • Serialization format

Page 39: Lecture 7 - CS 246h

Cloudera ML Workflow: Clustering

Page 40: Lecture 7 - CS 246h

Cloudera ML: summary

• client/bin/ml summary
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --header-file examples/kdd99/header.csv (local FS)
    --summary-file examples/kdd99/s.json (local FS)

Page 41: Lecture 7 - CS 246h

Cloudera ML: summary

[Diagram: step 1, summary. Inputs: kddcup.data_10_percent (HDFS) and header.csv (local FS).]

Page 42: Lecture 7 - CS 246h

Cloudera ML: summary

[Diagram: step 1, summary. Inputs: kddcup.data_10_percent (HDFS) and header.csv (local FS). Output: s.json (local FS).]

Page 43: Lecture 7 - CS 246h

Cloudera ML: summary

• s.json
  • Categorical features: histogram
  • Numerical features: distribution summary

Page 44: Lecture 7 - CS 246h

Cloudera ML: normalize

• client/bin/ml normalize
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --summary-file examples/kdd99/s.json (local FS)
    --transform Z
    --output-path kdd99 (HDFS)
    --output-type avro
    --id-column category
    --compress

Page 45: Lecture 7 - CS 246h

Cloudera ML: normalize

[Diagram: step 2, normalize. Inputs: kddcup.data_10_percent (HDFS) and s.json (local FS).]

Page 46: Lecture 7 - CS 246h

Cloudera ML: normalize

[Diagram: step 2, normalize. Inputs: kddcup.data_10_percent (HDFS) and s.json (local FS). Output: kdd99/ (HDFS).]

Page 47: Lecture 7 - CS 246h

Cloudera ML: normalize

• kdd99/part-m-0000[0|1].avro
  • Examples (rows)
    • Part 0: 442,454 vectors
    • Part 1: 51,567 vectors
    • Total: 494,021 vectors
  • Features (columns)
    • Before: 41 fields
    • After: 143 fields
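A minimal sketch of what a Z transform on one numeric column looks like, assuming that `--transform Z` denotes the usual z-score (subtract the mean, divide by the standard deviation). The class and method names are hypothetical, the population standard deviation is an arbitrary choice here, and none of this is taken from the Cloudera ML source.

```java
public class ZScoreSketch {
    /** Returns (x - mean) / stddev for every value; a constant column maps to zeros. */
    static double[] zScore(double[] column) {
        double mean = 0;
        for (double x : column) mean += x;
        mean /= column.length;

        double variance = 0;
        for (double x : column) variance += (x - mean) * (x - mean);
        double std = Math.sqrt(variance / column.length);  // population standard deviation (assumption)

        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = std == 0 ? 0.0 : (column[i] - mean) / std;
        }
        return out;
    }

    public static void main(String[] args) {
        for (double z : zScore(new double[] {2, 4, 6})) {
            System.out.println(z);
        }
    }
}
```

In the workflow above, the mean and deviation would come from the s.json summary rather than being recomputed, which is why summary runs before normalize.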

Page 48: Lecture 7 - CS 246h

Cloudera ML: ksketch

• client/bin/ml ksketch
    --input-paths kdd99 (HDFS)
    --format avro
    --points-per-iteration 500
    --output-file wc.avro (local FS)
    --seed 1729
    --iterations 5
    --cross-folds 2

Page 49: Lecture 7 - CS 246h

Cloudera ML: ksketch

[Diagram: step 3, ksketch. Input: kdd99/ (HDFS).]

Page 50: Lecture 7 - CS 246h

Cloudera ML: ksketch

[Diagram: step 3, ksketch. Input: kdd99/ (HDFS). Output: wc.avro (local FS).]

Page 51: Lecture 7 - CS 246h

Cloudera ML: ksketch

• wc.avro
  • Examples (rows)
    • 2 “folds” of 2501 examples
      • 1 initial example
      • 500 examples from each iteration (5 iterations)
    • Each example has an associated weight
  • Features (columns)
    • 143 features (still)

Page 52: Lecture 7 - CS 246h

Cloudera ML: kmeans

• client/bin/ml kmeans
    --input-file wc.avro (local FS)
    --centers-file centers.avro (local FS)
    --seed 19
    --clusters 1,10,25,35,45
    --best-of 2
    --num-threads 4
    --eval-stats-file kmeans_stats.csv (local FS)

Page 53: Lecture 7 - CS 246h

Cloudera ML: kmeans

[Diagram: step 4, kmeans. Input: wc.avro (local FS).]

Page 54: Lecture 7 - CS 246h

Cloudera ML: kmeans

[Diagram: step 4, kmeans. Input: wc.avro (local FS). Outputs: centers.avro and kmeans_stats.csv (local FS).]

Page 55: Lecture 7 - CS 246h

Cloudera ML: kmeans

• centers.avro
  • 1 row for each run of k-means++
  • 9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45
• kmeans_stats.csv
  • Clustering quality scores

Page 56: Lecture 7 - CS 246h

Cloudera ML: kassign

• client/bin/ml kassign
    --input-paths kdd99 (HDFS)
    --format avro
    --centers-file centers.avro (local FS)
    --center-ids 4
    --output-path assigned (HDFS)
    --output-type csv

Page 57: Lecture 7 - CS 246h

Cloudera ML: kassign

[Diagram: step 5, kassign. Inputs: kdd99/ (HDFS) and centers.avro (local FS).]

Page 58: Lecture 7 - CS 246h

Cloudera ML: kassign

[Diagram: step 5, kassign. Inputs: kdd99/ (HDFS) and centers.avro (local FS). Output: assigned/ (HDFS).]

Page 59: Lecture 7 - CS 246h

Cloudera ML: kassign

• assigned/part-m-0000[0|1]
  • Rows
    • Part 0: 442,454
    • Part 1: 51,567
    • Total: 494,021
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster
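The last two output columns, assigned cluster ID and squared distance, come from a nearest-center search. A rough sketch of that step is below, assuming squared Euclidean distance over dense `double[]` points; the class and method names are hypothetical and this is not the Cloudera ML code.

```java
public class NearestCenterSketch {
    /** Squared Euclidean distance between two equal-length points. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns {index of the closest center, squared distance to it}. */
    static double[] assign(double[] point, double[][] centers) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = squaredDistance(point, centers[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return new double[] {best, bestDist};
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 10}};
        double[] result = assign(new double[] {1, 1}, centers);
        System.out.println("cluster " + (int) result[0] + ", squared distance " + result[1]);
    }
}
```

Each point is scored independently, so the step maps naturally onto a map-only job over the HDFS input.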

Page 60: Lecture 7 - CS 246h

Cloudera ML: sample

• client/bin/ml sample
    --input-paths assigned (HDFS)
    --format text
    --header-file examples/kdd99/kassign_header.csv (local FS)
    --weight-field squared_distance
    --group-fields clustering_id,closest_center_id
    --output-type csv
    --size 20
    --output-path extremal (HDFS)

Page 61: Lecture 7 - CS 246h

Cloudera ML: sample

[Diagram: step 6, sample. Inputs: assigned/ (HDFS) and kassign_header.csv (local FS).]

Page 62: Lecture 7 - CS 246h

Cloudera ML: sample

[Diagram: step 6, sample. Inputs: assigned/ (HDFS) and kassign_header.csv (local FS). Output: extremal/ (HDFS).]

Page 63: Lecture 7 - CS 246h

Cloudera ML: sample

• extremal/part-r-00000
  • Rows
    • Up to 20 examples from each cluster
    • Examples that are furthest from the center of the cluster
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster

Page 64: Lecture 7 - CS 246h

Oryx

Page 65: Lecture 7 - CS 246h

2014: Lab to Factory

Page 66: Lecture 7 - CS 246h

Data Science Will Be Operational Analytics

Page 67: Lecture 7 - CS 246h

I Built A Model. Now What?

[Diagram: cycle of Collect Input, Build Model, Query Model, Repeat.]

Page 68: Lecture 7 - CS 246h

I Built A Model On Hadoop. Now What?

[Diagram: cycle of Collect Input, Build Model, Query Model, Repeat, with the steps marked "? ? ?".]

Page 69: Lecture 7 - CS 246h

Example: Oryx

Page 70: Lecture 7 - CS 246h

www.mwCl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwCl.jpg

Page 71: Lecture 7 - CS 246h

Gaps to fill, and Goals

• Model Building
  • Large-scale
  • Continuous
  • Apache Hadoop™-based
  • Few, good algorithms
• Model Serving
  • Real-time query
  • Real-time update
• Algorithms
  • Parallelizable
  • Updateable
  • Works on diverse input
• Interoperable
  • PMML model format
  • Simple REST API
  • Open source

Page 72: Lecture 7 - CS 246h

Large-Scale or Real-Time?

Large-Scale Offline Batch
vs
Real-Time Online Streaming

Why Don’t We Have Both?

λ!

Page 73: Lecture 7 - CS 246h

Lambda Architecture

• Batch, Stream Processing are different
• Tackle separately in 2+ Layers
• Batch Layer: offline, asynchronous
• Serving / Speed Layer: real-time, incremental, approximate

jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting

… λ?

Page 74: Lecture 7 - CS 246h

[Diagram: architecture with a Batch layer and a Serving/Speed layer.]

Page 75: Lecture 7 - CS 246h

Two Layers

• Computation Layer
  • Java-based server process
  • Client of Hadoop 2.x
  • Periodically builds “generation” from recent data and past model
  • Baby-sits MapReduce* jobs (or, locally in-core)
  • Publishes models
• Serving Layer
  • Apache Tomcat™-based server process
  • Consumes models from HDFS (or local FS)
  • Serves queries from model in memory
  • Updates from new input
  • Also writes input to HDFS
  • Replicas for scale

* Apache Spark later

Page 76: Lecture 7 - CS 246h

Collaborative Filtering : ALS

• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable

[Diagram: factor matrices X and Yᵀ.]

Page 77: Lecture 7 - CS 246h

Clustering : k-means++

• Well-known and understood
• Parallelizable
• Clusters updateable

cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
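For readers who have not seen it, the "++" refers to the seeding step: each new center is drawn with probability proportional to its squared distance from the centers chosen so far. The sketch below shows that seeding under simple assumptions (dense `double[]` points, squared Euclidean distance, a caller-supplied `Random`); the names are hypothetical and it is not Oryx's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlusSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Picks k initial centers from points using k-means++ style D^2 weighting. */
    static List<double[]> seed(double[][] points, int k, Random rng) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points[rng.nextInt(points.length)]);  // first center: uniform at random
        double[] d2 = new double[points.length];
        while (centers.size() < k) {
            // Squared distance from each point to its nearest chosen center.
            double total = 0;
            for (int i = 0; i < points.length; i++) {
                double best = Double.POSITIVE_INFINITY;
                for (double[] c : centers) {
                    best = Math.min(best, squaredDistance(points[i], c));
                }
                d2[i] = best;
                total += best;
            }
            // Sample an index with probability proportional to d2[i].
            double r = rng.nextDouble() * total;
            int chosen = points.length - 1;  // fallback for rounding or all-zero weights
            for (int i = 0; i < points.length; i++) {
                r -= d2[i];
                if (r < 0) {
                    chosen = i;
                    break;
                }
            }
            centers.add(points[chosen]);
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        for (double[] c : seed(points, 2, new Random(42))) {
            System.out.println(c[0] + ", " + c[1]);
        }
    }
}
```

Points that coincide with an existing center have zero weight, so the seeding tends to spread the initial centers across the data before the usual k-means iterations begin.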

Page 78: Lecture 7 - CS 246h

Classification / Regression : RDF

• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems

[Diagram: example decision tree with tests on age > 30, female?, and income > 20000, and Yes/No branches.]

Page 79: Lecture 7 - CS 246h

PMML

• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support

<PMML xmlns="http://www.dmg.org/PMML-4_1"
      version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="temperature"
               optype="continuous"
               dataType="double"/>
    …
  </DataDictionary>
  <TreeModel modelName="golfing"
             functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      …
    </MiningSchema>
    <Node score="will play">
      <Node score="will play">
        <SimplePredicate field="outlook"
                         operator="equal"
                         value="sunny"/>
        …
      </Node>
    </Node>
  </TreeModel>
</PMML>

www.dmg.org/v4-1/TreeModel.html

Page 80: Lecture 7 - CS 246h

HTTP REST API

• Convention for RPC-like request / response
• HTTP verbs, transport
  • GET : query
  • POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.

GET /recommend/jwills

HTTP/1.1 200 OK
Content-Type: text/plain

"Ray LaMontagne",0.951
"Fleet Foxes",0.7905
"The National",0.688
"Shearwater",0.3017
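On the client side, consuming this kind of response takes very little code. The sketch below builds the request URI and parses a plain-text body of `"item",score` lines like the one shown; the base URL, class name, and method names are hypothetical, and the parsing assumes exactly one quoted item and one score per line.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

public class RecommendClientSketch {
    /** URI for a user's recommendations; baseUrl (e.g. a local serving layer) is an assumption. */
    static URI recommendUri(String baseUrl, String userId) {
        return URI.create(baseUrl + "/recommend/" + userId);
    }

    /** Parses lines of the form "item name",score into an insertion-ordered map. */
    static Map<String, Double> parse(String body) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String line : body.split("\\r?\\n")) {
            if (line.trim().isEmpty()) continue;
            int comma = line.lastIndexOf(',');  // the score follows the last comma
            String item = line.substring(0, comma).replace("\"", "").trim();
            scores.put(item, Double.parseDouble(line.substring(comma + 1).trim()));
        }
        return scores;
    }

    public static void main(String[] args) {
        System.out.println(recommendUri("http://localhost:8080", "jwills"));
        String body = "\"Ray LaMontagne\",0.951\n\"Fleet Foxes\",0.7905\n";
        System.out.println(parse(body));
    }
}
```

Any HTTP client can issue the GET against that URI; keeping the response as plain text is what makes the API easy to use from a browser or the command line as well.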

 

Page 81: Lecture 7 - CS 246h

Wish List

• Revamp workflow
  • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
  • Well-solved
  • Bring your own
• More component-ized
  • Less black-box service
  • Emphasize integration
    • PMML, etc.
• “Pull” options
  • Kafka?
  • Hive / Impala?

Page 82: Lecture 7 - CS 246h

Open Source

github.com/cloudera/oryx

100% Apache License 2.0

Page 83: Lecture 7 - CS 246h
