30
The what, the why and the how Speaker: Robin Anil, Apache Mahout PMC Member SDEC, Seoul, Korea, June 2011

SDEC2011 Mahout - the what, the how and the why

Embed Size (px)

DESCRIPTION

Mahout is an open source machine learning library from Apache. From its humble beginnings at Apache Lucene, the project has grown into a active community of developers, machine learning experts and enthusiasts. With v0.5 released recently, the project has been focussing full steam on developing stable APIs with an eye on our major milestone of v1.0. The speaker has been with Mahout from his days in college as a computer science student. The talk will focus on the major use cases of Mahout. The design decisions, things that worked, things that didn't, and things to expect in the future releases.http://sdec.kr/

Citation preview

Page 1: SDEC2011 Mahout - the what, the how and the why

The what, the why and the howSpeaker: Robin Anil, Apache Mahout PMC Member

SDEC, Seoul, Korea, June 2011

Page 2: SDEC2011 Mahout - the what, the how and the why

about:me● Apache Mahout PMC member● A ML Believer● Author of Mahout in Action ● Software Engineer @ Google

*in no particular order

● Previous Life: Google Summer of Code student for 2 years.

Page 3: SDEC2011 Mahout - the what, the how and the why

about:agenda● Introducing Mahout● Why Mahout?● Birds eye view of Mahout ● Classic Machine learning problems● Short overview of Mahout clustering

Page 4: SDEC2011 Mahout - the what, the how and the why

about:mission

To build a scalable machine learning library

Page 5: SDEC2011 Mahout - the what, the how and the why

Scale!● Scale to large datasets

○ Hadoop MapReduce implementations that scales linearly with data.

○ Fast sequential algorithms whose runtime doesn’t depend on the size of the data

○ Goal: To be as fast as possible for any algorithm● Scalable to support your business case

○ Apache Software License 2● Scalable community

○ Vibrant, responsive and diverse○ Come to the mailing list and find out more

Page 6: SDEC2011 Mahout - the what, the how and the why

about:why● Lack community● Lack scalability● Lack documentations and examples● Lack Apache licensing● Are not well tested● Are Research oriented● Not built over existing production quality libraries

Page 7: SDEC2011 Mahout - the what, the how and the why

Birds eye view of Mahout● If you want to:

○ Encode ○ Analyze○ Predict○ Get top best

Page 8: SDEC2011 Mahout - the what, the how and the why

about:encode● Process data and convert to vectors ● Dictionary based v/s Randomizer based ● Get best signals for generating vectors

○ Collocation information (ngrams)○ Lp Normalization

Data Engineering Camp

1:1.0 2:1.0 3:1.0

Page 9: SDEC2011 Mahout - the what, the how and the why

about:analyze● Cluster and group data to ● Cluster data

○ K-Means○ Fuzzy K-Means○ Canopy○ Mean Shift○ Dirichlet process clustering ○ Spectral Clustering

● Co-cluster features / dimensionality reduction○ Latent Dirichlet Allocation (LDA)○ Singular Value Decomposition

Page 10: SDEC2011 Mahout - the what, the how and the why

about:clusteringNews clusters

Page 11: SDEC2011 Mahout - the what, the how and the why

about:lda● Grouping similar or co-occurring features into a topic

○ Topic “Lol Cat”:■ Cat■ Meow■ Purr■ Haz■ Cheeseburger■ Lol

Page 12: SDEC2011 Mahout - the what, the how and the why

about:predict● Classification and Recommendation

● Classification:○ Use features learn model○ Apply model on unknown

● Recommendation○ Use pairwise(user-item) information to learn model○ For a given user return highly likely items

Page 13: SDEC2011 Mahout - the what, the how and the why

about:classify● Predicting the type of a new object based on its features● The types are predetermined

Dog Cat

Page 14: SDEC2011 Mahout - the what, the how and the why

about:classify ● Plenty of algorithms

○ Naïve Bayes○ Complementary Naïve Bayes○ Random Forests○ Stocastic Gradient Descent(regression)

● Learn a model from a manually classified data● Predict the class of a new object based on its

features and the learned model

Page 15: SDEC2011 Mahout - the what, the how and the why

about:recommend● Predict what the user likes based on

○ His/Her historical behavior○ Aggregate behavior of people similar to him

Page 16: SDEC2011 Mahout - the what, the how and the why

about:recommend ● Different types of recommenders

○ User based○ Item based○ Co-occurrence based

● Full framework for storage, onlineonline and offline computation of recommendations

● Like clustering, there is a notion of similarity in users or items○ Cosine, Tanimoto, Pearson and LLR

Page 17: SDEC2011 Mahout - the what, the how and the why

about:top-n● Frequent Pattern Mining:

○ Identify top-K patterns

Page 18: SDEC2011 Mahout - the what, the how and the why

about:frequent-pattern-mining● Find interesting groups of items based on how they co-occur in

a dataset

Page 19: SDEC2011 Mahout - the what, the how and the why

about:parallel-fp-growth● Identify the most commonly

occurring patterns from○ Sales Transactions

buy “Milk, eggs and bread”○ Query Logs

ipad -> apple, tablet, iphone○ Spam Detection

Yahoo! http://www.slideshare.net/hadoopusergroup/mail-antispam

Page 20: SDEC2011 Mahout - the what, the how and the why

summary:in-short● Plenty of overlap. There is no one algorithm to fit all problems. ● Analyze and iterate fast● MapReduce implementations makes these Fly!

Page 21: SDEC2011 Mahout - the what, the how and the why

Did you know?● Apache Mahout uses Colt high-performance collections

○ Open HashMaps instead of Chained HashMaps○ Arrays of Primitive types○ Available as Mahout Math library

● Mahout Vector uses integer encoding techniques to reduce space.

● Fastest classifier in Mahout doesn’t use MapReduce!○ And it learns online○ And It doesn’t look at all the data.

Page 22: SDEC2011 Mahout - the what, the how and the why

How to use mahout● Command line launcher bin/mahout● See the list of tools and algorithms by running bin/mahout● Run any algorithm by its shortname:

○ bin/mahout kmeans –help● By default runs locally● export HADOOP_HOME = /pathto/hadoop-0.20.2/

○ Runs on the cluster configured as per the conf files in the hadoop directory

● Use driver classes to launch jobs: ○ KMeansDriver.runjob(Path input, Path output …)

Page 23: SDEC2011 Mahout - the what, the how and the why

Clustering Walkthrough (tiny example)● Input: set of text files in a directory● Download Mahout and unzip

○ mvn install○ bin/mahout seqdirectory –i <input> –o <seq-output>○ bin/mahout seq2sparse –i seq-output –o <vector-output>○ bin/mahout kmeans –i<vector-output>

-c <cluster-temp> -o <cluster-output> -k 10 –cd 0.01 –x 20

Page 24: SDEC2011 Mahout - the what, the how and the why

Clustering Walkthrough (a bit more)

● Use bigrams: -ng 2● Prune low frequency: –s 10● Normalize: -n 2

● Use a distance measure : -dm org.apache.mahout.common.distance.CosineDistanceMeasure

Page 25: SDEC2011 Mahout - the what, the how and the why

Clustering Walkthrough (viewing results)● bin/mahout clusterdump

–s cluster-output/clusters-9/part-00000 -d vector-output/dictionary.file-* -dt sequencefile -n 5 -b 100

● Top terms in a typical clustercomic => 9.793121272867376comics => 6.115341078151356con => 5.015090566692931sdcc => 3.927590843402978webcomics => 2.916910980686997

Page 26: SDEC2011 Mahout - the what, the how and the why

Road to Mahout v1.0● Guiding Principles

○ Use the stable Hadoop API○ Make vector the de-factor input format for all parts of code○ Provide stable API for developers

Page 27: SDEC2011 Mahout - the what, the how and the why

Get Started● http://mahout.apache.org● [email protected] - Developer mailing list● [email protected] - User mailing list● Check out the documentations and wiki for quickstart● http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code

Page 28: SDEC2011 Mahout - the what, the how and the why

Resources● “Mahout in Action” Owen, Anil, Dunning, Friedman

http://www.manning.com/owen

● “Taming Text” Ingersoll, Morton, Farrishttp://www.manning.com/ingersoll

● “Introducing Apache Mahout”http://www.ibm.com/developerworks/java/library/j-mahout/

Page 29: SDEC2011 Mahout - the what, the how and the why

Thanks to● Apache Foundation● Mahout Committers● Google Summer of Code Organizers● And Students● Open source!

● And NHN for hosting this at Seoul!● and the wonderful engineers (present and future) in the room.

Page 30: SDEC2011 Mahout - the what, the how and the why

References● news.google.com● Cat http://www.flickr.com/photos/gattou/3178745634/● Dog http://www.flickr.com/photos/30800139@N04/3879737638/ ● Milk Eggs Bread http://www.flickr.

com/photos/nauright/4792775946/ ● Amazon Recommendations● twitter