Introducing Apache Mahout Scalable Machine Learning for All! Grant Ingersoll Lucid Imagination

Introducing Apache Mahout

Scalable Machine Learning for All!

Grant Ingersoll

Lucid Imagination

Overview

• What is Machine Learning?

• Mahout

Definition

• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”

– Intro. to Machine Learning by E. Alpaydin

• Subset of Artificial Intelligence

– Many other fields: comp sci., biology, math, psychology, etc.

Types

• Supervised

– Using labeled training data, create function that predicts output of unseen inputs

• Unsupervised

– Using unlabeled data, create function that predicts output

• Semi-Supervised

– Uses labeled and unlabeled data
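
To make the supervised case concrete, here is a minimal Python sketch (not Mahout code): a 1-nearest-neighbor predictor that uses labeled training data to predict the output for an unseen input. The function name and the toy data are assumptions made for illustration only.

```python
import math

def nearest_neighbor_predict(training, query):
    """Return the label of the labeled training point closest to `query`."""
    _, closest_label = min(
        training, key=lambda pair: math.dist(pair[0], query)
    )
    return closest_label

# Hypothetical labeled training data: (features, label) pairs.
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

# Predict the label of an input that was not in the training data.
prediction = nearest_neighbor_predict(training, (2.0, 1.0))
print(prediction)
```

An unsupervised method would receive the same feature tuples without the labels and would have to find the two groups on its own.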

Characterizations

• Lots of Data

• Identifiable Features in that Data

• Too big/costly for people to handle

– People still can help

Clustering

• Unsupervised

• Find Natural Groupings

– Documents

– Search Results

– People

– Genetic traits in groups

– Many, many more uses
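
One common way to find natural groupings is k-means. The sketch below is a plain, single-machine Python illustration of the idea, not Mahout's implementation; the naive "first k points" initialization and the toy points are assumptions chosen to keep the example deterministic.

```python
import math

def k_means(points, k, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments

# Hypothetical unlabeled data forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
centroids, assignments = k_means(points, k=2)
print(assignments)
```

No labels are supplied; the grouping comes only from the distances between points.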

Example: Clustering

Google News

Collaborative Filtering

• Unsupervised

• Recommend people and products

– User-User

• User likes X, you might too

– Item-Item

• People who bought X also bought Y
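
The item-item idea ("people who bought X also bought Y") can be sketched by counting how often two items appear in the same purchase history. This is an illustrative Python sketch under that assumption, not the Taste API; the function names and baskets are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(baskets):
    """Count, for every item, how often each other item shares a basket with it."""
    co = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            co[a][b] += 1
    return co

def also_bought(co, item, n=3):
    """Return up to n items most often bought together with `item`."""
    return [other for other, _ in co[item].most_common(n)]

# Hypothetical purchase histories, one list per customer.
baskets = [["X", "Y"], ["X", "Y", "Z"], ["X", "Y"], ["Z", "W"]]
co = build_cooccurrence(baskets)
print(also_bought(co, "X", n=1))
```

A user-user variant would instead compare customers to each other and recommend what similar customers liked.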

Example: Collab Filtering

Amazon.com

Classification/Categorization

• Many, many types

• Spam Filtering

• Named Entity Recognition

• Phrase Identification

• Sentiment Analysis

• Classification into a Taxonomy
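
As an illustration of spam filtering as classification, here is a small multinomial Naive Bayes sketch in Python with add-one smoothing. It is a teaching sketch, not Mahout's classifier; the training messages, labels, and function names are assumptions.

```python
import math
from collections import Counter, defaultdict

def train(documents):
    """documents: list of (text, label). Returns the counts the classifier needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training messages.
model = train([
    ("win cash prize now", "spam"),
    ("cheap prize offer now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
])
print(classify(model, "win a cheap prize"))
```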

Example: NER

NER?

Excerpt from Yahoo News

Example: Categorization

Info. Retrieval

• Learning Ranking Functions

• Learning Spelling Corrections

• User Click Analysis and Tracking

Other

• Image Analysis

• Robotics

• Games

• Higher level natural language processing

• Many, many others

What is Apache Mahout?

• A Mahout is an elephant trainer/driver/keeper, hence…

[image] + Machine Learning = [image] (and other distributed techniques)

What?

• Hadoop brings:

– Map/Reduce API

– HDFS

– In other words, scalability and fault-tolerance

• Mahout brings:

– Library of machine learning algorithms

– Examples
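
For readers new to the Map/Reduce model, the following Python sketch imitates its three stages (map, shuffle, reduce) in memory on a word count. It does not use the Hadoop API and gives none of Hadoop's scalability or fault tolerance; all function names are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(run_job(["big data big", "data wins"]))
```

Because each map call sees one record and each reduce call sees one key, a framework can spread both stages across many machines.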

Why Mahout?

• Many Open Source ML libraries either:

– Lack Community

– Lack Documentation and Examples

– Lack Scalability

– Lack the Apache License ;-)

– Or are research-oriented

Why Mahout?

• Intelligent Apps are the Present and Future

• Thus, Mahout’s Goal is:

– Scalable Machine Learning with Apache License

Current Status

• What’s in it:

– Simple Matrix/Vector library

– Taste Collaborative Filtering

– Clustering

• Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet

– Classifiers

• Naïve Bayes

• Complementary NB

– Evolutionary

• Integration with Watchmaker for fitness function
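
Of the clustering algorithms listed above, canopy clustering is probably the least familiar. The Python sketch below follows the textbook description (a loose threshold T1 and a tight threshold T2, with T1 > T2); it is not Mahout's implementation, and the function name, thresholds, and points are assumptions.

```python
import math

def canopy(points, t1, t2):
    """Return a list of (center, members) canopies. Requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # next unclaimed point becomes a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: may seed or join others
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

# Hypothetical one-dimensional layout written as 2-D points.
points = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]
for center, members in canopy(points, t1=4, t2=2):
    print(center, members)
```

Canopies may overlap, which is why the result is often used as a cheap first pass before a more exact algorithm such as k-means.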

How?

• Examples

– Taste

– Clustering

– Classification

– Evolutionary

Taste: Movie Recommendations

• Given ratings by users of movies, recommend other movies

• http://lucene.apache.org/mahout/taste.html#demo
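
The movie recommendation task can be sketched as a user-based recommender: find users whose ratings correlate with yours, then estimate your rating for movies you have not rated. The Python below is an illustrative sketch using Pearson correlation, not the Taste API; the users, movies, ratings, and function names are all hypothetical.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    a = [ratings_a[m] for m in common]
    b = [ratings_b[m] for m in common]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def recommend(ratings, user, n=2):
    """Rank unseen movies by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:
            continue  # only positively correlated users contribute
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    estimates = {m: scores[m] / weights[m] for m in scores}
    return sorted(estimates, key=estimates.get, reverse=True)[:n]

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "alice": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 5, "Brazil": 5, "Casablanca": 1, "Dune": 5, "Eraserhead": 2},
    "carol": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1, "Eraserhead": 5},
}
print(recommend(ratings, "alice"))
```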

Taste Demo

• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true

• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true

Clustering: Synthetic Control Data

• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series

• Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples

– org.apache.mahout.clustering.syntheticcontrol.*

• Outputs clusters…

Classification: NB and CNB Examples

• 20 Newsgroups

– http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups

• Wikipedia

– http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample

Evolutionary

• Traveling Salesman

– http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman

• Class Discovery

– http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
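
To show what an evolutionary approach to the Traveling Salesman problem looks like, here is a minimal Python sketch with swap mutation and truncation selection. It uses neither Watchmaker nor Mahout; the parameters, function names, and city coordinates are assumptions, and the fitness evaluation is marked as the step that a distributed setup would spread across machines.

```python
import math
import random

def tour_length(tour, cities):
    """Total length of the closed tour visiting cities in the given order."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def mutate(tour, rng):
    """Swap two randomly chosen positions, keeping the tour a permutation."""
    child = list(tour)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def evolve(cities, population_size=20, generations=200, seed=0):
    rng = random.Random(seed)
    base = list(range(len(cities)))
    population = [rng.sample(base, len(base)) for _ in range(population_size)]
    for _ in range(generations):
        # Fitness evaluation: the costly step a distributed setup would parallelize.
        population.sort(key=lambda t: tour_length(t, cities))
        # Selection: keep the shorter half of the tours unchanged.
        survivors = population[: max(1, population_size // 2)]
        # Variation: refill the population with mutated copies of survivors.
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=lambda t: tour_length(t, cities))

# Hypothetical cities on the corners of a unit square.
cities = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
best = evolve(cities)
print(best, tour_length(best, cities))
```

Because survivors are carried over unchanged, the best tour in the population never gets worse from one generation to the next.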

What’s Next?

• More Examples

• Winnow/Perceptron (MAHOUT-85)

• Text Clustering

• Association Rules (MAHOUT-108)

• Logistic Regression

• Solr Integration (SOLR-769)

• GSOC

When, Who

• When? Now!

– Mahout is growing

• Who? You!

– We want programmers who:

• Are comfortable with math

• Like to work on hard problems

– We want others to:

• Kick the tires

Where?

• http://lucene.apache.org/mahout

– Hadoop - http://hadoop.apache.org

• http://cwiki.apache.org/MAHOUT

• mahout-{user|dev}@lucene.apache.org

– http://www.lucidimagination.com/search/p:mahout

Resources

• “Programming Collective Intelligence” by Segaran

• “Data Mining - Practical Machine Learning Tools and Techniques” by Witten and Frank

• “Taming Text” by Ingersoll and Morton