KNITTING BOAR
Machine Learning, Mahout, and Parallel Iterative Algorithms

Josh Patterson, Principal Solutions Architect
Hello
✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga
> Email: [email protected]
Outline
✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
Introduction to
MACHINE LEARNING
Basic Concepts
✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data is essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data
✛ Machine Learning
> Algorithms for acquiring structural descriptions from data “examples”
∗ The process of learning “concepts”
Shades of Gray
✛ Information Retrieval
> Draws on information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
> Grounded in machine learning, especially statistical machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence
Hadoop in Traditional Enterprises Today
✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization
“Descriptive Statistics”
Hadoop All The Time?
✛ Don’t always assume you need “scale” and parallelization
> Try it out on a single machine first
> See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy machine?
✛ We can always use the constructed model back in MapReduce to score a ton of new data
Twitter Pipeline
✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
> Looks to study data with descriptive statistics in the hopes of building models for predictive analytics
✛ Does the majority of its ML work via custom Pig integrations
> Pipeline is very “Pig-centric”
> Example: https://github.com/tdunning/pig-vector
> They mostly use SGD and ensemble methods, which are conducive to large-scale data mining
✛ Questions they try to answer
> Is this tweet spam?
> What star rating might this user give this movie?
Typical Pipeline for Cloudera Customer
✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout
Introduction to
MAHOUT
Algorithm Groups in Apache Mahout
Copyright 2010 Cloudera Inc. All rights reserved
✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset Mining
Classification
✛ Stochastic Gradient Descent
> Single process
> Logistic Regression Model Construction
✛ Naïve Bayes
> MapReduce-based
> Text Classification
✛ Random Forests
> MapReduce-based
What Are Recommenders?
✛ An algorithm that looks at a user’s past actions and suggests
> Products
> Services
> People
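The idea above can be shown with a toy co-occurrence sketch (this is illustrative only, not Mahout’s recommender code; all names are made up): score unseen items by how much the users who consumed them overlap with the target user’s history.

```python
def recommend(history, all_histories, k=3):
    """Suggest up to k items the target user has not seen,
    weighted by overlap with other users' past actions."""
    scores = {}
    for other in all_histories:
        overlap = len(history & other)
        if overlap == 0:
            continue  # this user shares nothing with us
        for item in other - history:
            scores[item] = scores.get(item, 0) + overlap
    # highest score first; break ties alphabetically
    return [item for item, _ in
            sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
```

Real recommenders add rating values, similarity normalization, and scale, but the shape (look at similar users’ pasts, suggest what they liked) is the same.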
✛ Advertisement
> Cloudera has a great Data Science training course on this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
Clustering: Topic Modeling
✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
Taking a Breath For a Minute
✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time Consuming
> The “need for speed”
> “More data beats a cleverer algorithm”
Introducing
KNITTING BOAR
Goals
✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
Stochastic Gradient Descent
✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid function using (parameter vector · example) as the exponential parameter
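The prediction step can be written out in a few lines (an illustrative sketch, not Mahout’s implementation): the sigmoid of the parameter vector dotted with the example’s feature vector.

```python
import math

def predict(w, x):
    """Logistic regression prediction: sigmoid of the dot
    product of parameter vector w and feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

The output is a probability in (0, 1); a zero parameter vector always predicts 0.5.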
[Diagram: training data flows into SGD, which produces a model]
Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> Need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
✛ Dekel, 2010
> Optimal Distributed Online Prediction Using Mini-Batches
MapReduce vs. Parallel Iterative
[Diagram: a MapReduce job (Input → Map → Reduce → Output) contrasted with a parallel iterative job, where processors synchronize at superstep boundaries (Superstep 1, Superstep 2, …)]
Why Stay on Hadoop?
“Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution?
If no, then operationally, we can consider the Hadoop stack …
there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”
–– Lin, 2012
The Boar
✛ Parallel iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps a global copy of the merged parameter vector
Worker
✛ Each worker is given a split of the total dataset
> Similar to a map task
✛ Uses a modified OLR (online logistic regression)
> Processes N samples in an epoch (a subset of its split)
✛ Sends its local parameter vector to the master node
> Master averages all workers’ vectors together
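One worker epoch can be sketched as a plain SGD pass over its samples (illustrative only, not the actual Knitting Boar code; `lr` is an assumed learning-rate parameter):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_epoch(w, samples, lr=0.1):
    """One OLR-style epoch: for each (features, label) sample
    in this worker's subset, take a gradient step on the
    logistic loss and return the updated local vector."""
    w = list(w)  # don't mutate the caller's vector
    for x, y in samples:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        # gradient of the logistic loss is (p - y) * x
        for i, xi in enumerate(x):
            w[i] -= lr * (p - y) * xi
    return w
```

The resulting local vector is what each worker ships to the master at the end of the epoch.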
Master
✛ Gathers and averages worker parameter vectors
> From the workers’ OLR runs
✛ Produces a new global parameter vector
> By averaging the workers’ vectors
✛ Sends the update to all workers
> Workers replace their local parameter vector with the new global parameter vector
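The master’s merge step is just an element-wise average of the workers’ vectors (a minimal sketch with illustrative names):

```python
def average_vectors(worker_vectors):
    """Master step: average the workers' local parameter
    vectors element-wise into one global vector."""
    n = len(worker_vectors)
    dim = len(worker_vectors[0])
    return [sum(v[i] for v in worker_vectors) / n for i in range(dim)]
```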
IterativeReduce
✛ ComputableMaster
> Setup()
> Compute()
> Complete()
✛ ComputableWorker
> Setup()
> Compute()
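The superstep loop these callbacks imply can be sketched as follows. The real IterativeReduce API is Java; only the class and method names mirror the slide, and the worker’s `delta` is a stand-in for the effect of one local training epoch:

```python
class ComputableWorker:
    """Worker callbacks: setup() receives this worker's split,
    compute() does a local pass and returns its vector."""
    def setup(self, delta):
        # stand-in for loading a data split; delta mimics how
        # one local epoch would move the parameter vector
        self.delta = delta

    def compute(self, global_w):
        return [g + d for g, d in zip(global_w, self.delta)]

class ComputableMaster:
    """Master callbacks: compute() merges worker vectors by
    averaging; complete() exposes the current global vector."""
    def setup(self, dim):
        self.w = [0.0] * dim

    def compute(self, worker_vectors):
        n = len(worker_vectors)
        self.w = [sum(v[i] for v in worker_vectors) / n
                  for i in range(len(self.w))]
        return self.w

    def complete(self):
        return self.w

def run(master, workers, supersteps):
    # each superstep: workers compute on the current global
    # vector, then the master merges their results
    for _ in range(supersteps):
        global_w = master.complete()
        results = [wk.compute(global_w) for wk in workers]
        master.compute(results)
    return master.complete()
```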
[Diagram: in each superstep the workers report to the master, which synchronizes them before the next superstep, repeating until done]
Comparison: OLR vs POLR
[Diagram: OnlineLogisticRegression builds one model from all the training data on a single machine; in Knitting Boar’s POLR, Workers 1…N build partial models from Splits 1…N and the Master merges them into a global model]
20Newsgroups
Input Size vs Processing Time
[Chart: processing time for OLR vs. POLR as input size grows]
Knitting Boar
PARTING THOUGHTS
Knitting Boar Lessons Learned
✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated
Bits
✛ Knitting Boar
> https://github.com/jpatanooga/KnittingBoar
> 100% Java
> ASF 2.0 Licensed
> Quick Start
∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start
✛ IterativeReduce
> https://github.com/emsixteeen/IterativeReduce
> 100% Java
> ASF 2.0 Licensed
✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes time
> Lots of iterations
> Speed is key here
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
References
✛ Strata / Hadoop World 2012 Slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
References
✛ Langford
> http://hunch.net/~vw/
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068
Photo Credits
✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg