KNITTING BOAR
Machine Learning, Mahout, and Parallel Iterative Algorithms

Josh Patterson, Principal Solutions Architect
Hello
✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga
> Email: [email protected]
Outline
✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
Introduction to
MACHINE LEARNING
Basic Concepts
✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data is essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data
✛ Machine Learning
> Algorithms for acquiring structural descriptions from data “examples”
∗ The process of learning “concepts”
Shades of Gray
✛ Information Retrieval
> Draws on information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
> Grounded in machine learning, especially statistical machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence
Hadoop in Traditional Enterprises Today
✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization
“Descriptive Statistics”
Hadoop All The Time?
✛ Don’t always assume you need “scale” and parallelization
> Try it out on a single machine first
> See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy machine?
✛ We can always use the constructed model back in MapReduce to score a ton of new data
Twitter Pipeline
✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
> Looks to study data with descriptive statistics in the hopes of building models for predictive analytics
✛ Does the majority of its ML work via custom Pig integrations
> Pipeline is very “Pig-centric”
> Example: https://github.com/tdunning/pig-vector
> They mostly use SGD and ensemble methods, which are conducive to large-scale data mining
✛ Questions they try to answer
> Is this tweet spam?
> What star rating might this user give this movie?
Typical Pipeline for Cloudera Customer
✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout
Introduction to
MAHOUT
Algorithm Groups in Apache Mahout
Copyright 2010 Cloudera Inc. All rights reserved
✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset Mining
Classification
✛ Stochastic Gradient Descent
> Single process
> Logistic Regression Model Construction
✛ Naïve Bayes
> MapReduce-based
> Text Classification
✛ Random Forests
> MapReduce-based
What Are Recommenders?
✛ An algorithm that looks at a user’s past actions and suggests
> Products
> Services
> People
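The idea above can be shown with a toy co-occurrence sketch (this is illustrative only, not Mahout’s recommender code; all names are made up): score unseen items by how much the users who consumed them overlap with the target user’s history.

```python
def recommend(history, all_histories, k=3):
    """Suggest up to k items the target user has not seen,
    weighted by overlap with other users' past actions."""
    scores = {}
    for other in all_histories:
        overlap = len(history & other)
        if overlap == 0:
            continue  # this user shares nothing with us
        for item in other - history:
            scores[item] = scores.get(item, 0) + overlap
    # highest score first; break ties alphabetically
    return [item for item, _ in
            sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
```

Real recommenders add rating values, similarity normalization, and scale, but the shape (look at similar users’ pasts, suggest what they liked) is the same.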
✛ Advertisement
> Cloudera has a great Data Science training course on this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
Clustering: Topic Modeling
✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
Taking a Breath For a Minute
✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time Consuming
> The “need for speed”
> “More data beats a cleverer algorithm”
Introducing
KNITTING BOAR
Goals
✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
Stochastic Gradient Descent
✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid function using (parameter vector · example) as the exponential parameter
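The prediction step can be written out in a few lines (an illustrative sketch, not Mahout’s implementation): the sigmoid of the parameter vector dotted with the example’s feature vector.

```python
import math

def predict(w, x):
    """Logistic regression prediction: sigmoid of the dot
    product of parameter vector w and feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

The output is a probability in (0, 1); a zero parameter vector always predicts 0.5.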
[Diagram: training data flows into SGD, which produces a model]
Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> Need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
✛ Dekel, 2010
> Optimal Distributed Online Prediction Using Mini-Batches
MapReduce vs. Parallel Iterative
[Diagram: a MapReduce job (Input → Map → Reduce → Output) contrasted with a parallel iterative job, where processors synchronize at superstep boundaries (Superstep 1, Superstep 2, …)]
Why Stay on Hadoop?
“Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution?
If no, then operationally, we can consider the Hadoop stack …
there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”
–– Lin, 2012
The Boar
✛ Parallel iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps a global copy of the merged parameter vector
Worker
✛ Each worker is given a split of the total dataset
> Similar to a map task
✛ Uses a modified OLR (online logistic regression)
> Processes N samples in an epoch (a subset of its split)
✛ Sends its local parameter vector to the master node
> Master averages all workers’ vectors together
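One worker epoch can be sketched as a plain SGD pass over its samples (illustrative only, not the actual Knitting Boar code; `lr` is an assumed learning-rate parameter):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_epoch(w, samples, lr=0.1):
    """One OLR-style epoch: for each (features, label) sample
    in this worker's subset, take a gradient step on the
    logistic loss and return the updated local vector."""
    w = list(w)  # don't mutate the caller's vector
    for x, y in samples:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        # gradient of the logistic loss is (p - y) * x
        for i, xi in enumerate(x):
            w[i] -= lr * (p - y) * xi
    return w
```

The resulting local vector is what each worker ships to the master at the end of the epoch.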
Master
✛ Gathers and averages worker parameter vectors
> From the workers’ OLR runs
✛ Produces a new global parameter vector
> By averaging the workers’ vectors
✛ Sends the update to all workers
> Workers replace their local parameter vector with the new global parameter vector
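The master’s merge step is just an element-wise average of the workers’ vectors (a minimal sketch with illustrative names):

```python
def average_vectors(worker_vectors):
    """Master step: average the workers' local parameter
    vectors element-wise into one global vector."""
    n = len(worker_vectors)
    dim = len(worker_vectors[0])
    return [sum(v[i] for v in worker_vectors) / n for i in range(dim)]
```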
IterativeReduce
✛ ComputableMaster
> Setup()
> Compute()
> Complete()
✛ ComputableWorker
> Setup()
> Compute()
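The superstep loop these callbacks imply can be sketched as follows. The real IterativeReduce API is Java; only the class and method names mirror the slide, and the worker’s `delta` is a stand-in for the effect of one local training epoch:

```python
class ComputableWorker:
    """Worker callbacks: setup() receives this worker's split,
    compute() does a local pass and returns its vector."""
    def setup(self, delta):
        # stand-in for loading a data split; delta mimics how
        # one local epoch would move the parameter vector
        self.delta = delta

    def compute(self, global_w):
        return [g + d for g, d in zip(global_w, self.delta)]

class ComputableMaster:
    """Master callbacks: compute() merges worker vectors by
    averaging; complete() exposes the current global vector."""
    def setup(self, dim):
        self.w = [0.0] * dim

    def compute(self, worker_vectors):
        n = len(worker_vectors)
        self.w = [sum(v[i] for v in worker_vectors) / n
                  for i in range(len(self.w))]
        return self.w

    def complete(self):
        return self.w

def run(master, workers, supersteps):
    # each superstep: workers compute on the current global
    # vector, then the master merges their results
    for _ in range(supersteps):
        global_w = master.complete()
        results = [wk.compute(global_w) for wk in workers]
        master.compute(results)
    return master.complete()
```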
[Diagram: in each superstep the workers report to the master, which synchronizes them before the next superstep, repeating until done]
Comparison: OLR vs POLR
[Diagram: OnlineLogisticRegression builds one model from all the training data on a single machine; in Knitting Boar’s POLR, Workers 1…N build partial models from Splits 1…N and the Master merges them into a global model]
20Newsgroups
Input Size vs Processing Time
[Chart: processing time for OLR vs. POLR as input size grows]
Knitting Boar
PARTING THOUGHTS
Knitting Boar Lessons Learned
✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated
Bits
✛ Knitting Boar
> https://github.com/jpatanooga/KnittingBoar
> 100% Java
> ASF 2.0 Licensed
> Quick Start
∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start
✛ IterativeReduce
> https://github.com/emsixteeen/IterativeReduce
> 100% Java
> ASF 2.0 Licensed
✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes time
> Lots of iterations
> Speed is key here
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
References
✛ Strata / Hadoop World 2012 Slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
References
✛ Langford
> http://hunch.net/~vw/
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068
Photo Credits
✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg