MAHOUT classifier tour


Mahout: Scalable Data Mining for Everybody

What is Mahout?

• Recommendations (people who x this also x that)

• Clustering (segment data into groups of similar things)

• Classification (learn decision making from examples)

• Stuff (LDA, SVD, frequent item-set mining, math)

Classification in Detail

• Naive Bayes Family

  • Hadoop based training

• Decision Forests

  • Hadoop based training

• Logistic Regression (aka SGD)

  • fast on-line (sequential) training



So What?

Online training has low overhead for small and moderate size data-sets

[Chart annotation: "big" starts here]

An Example


And Another

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....

And Another

Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?


Mahout’s SGD

• Learns on-line per example

• O(1) memory

• O(1) time per training example

• Sequential implementation

  • fast, but not parallel

Special Features

• Hashed feature encoding

• Per-term annealing

  • learn the boring stuff once

• Auto-magical learning knob turning

  • learns the correct learning rate, the correct learning rate for learning the learning rate, ...

Feature Encoding

Hashed Encoding

Feature Collisions
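To make the hashed-encoding and collision slides concrete, here is a minimal sketch in Java. It assumes the Mahout 0.5+ encoder package (org.apache.mahout.vectorizer.encoders); the field names, terms, and the deliberately tiny 100-slot vector are made up for illustration.

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedEncodingSketch {
  public static void main(String[] args) {
    // Deliberately small (100 slots) so collisions are easy to provoke;
    // real feature spaces are much wider, which keeps collisions rare.
    Vector v = new RandomAccessSparseVector(100);

    // Encoders hash (field name, term) pairs straight into vector slots --
    // no dictionary to build or ship.
    FeatureVectorEncoder body = new StaticWordValueEncoder("body");
    FeatureVectorEncoder bias = new ConstantValueEncoder("intercept");

    // Two probes per term: a collision in one slot rarely hits the other too.
    body.setProbes(2);

    for (String term : "pleasure talking hadoop user group lunch".split(" ")) {
      body.addToVector(term, v);
    }
    bias.addToVector("", v);   // constant intercept term

    System.out.println(v);
  }
}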

Learning Rate Annealing

[Chart: learning rate vs. # training examples seen]

Per-term Annealing

[Chart: learning rate vs. # training examples seen, with one curve for a common feature and one for a rare feature]
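Only the axis labels of those charts survive in this text, but the shape they describe can be written down. As a sketch of the idea (not necessarily Mahout's exact formula), per-term annealing gives each feature j a learning rate that decays with n_j, the number of training examples in which that feature has appeared:

\[
  \eta_j(n_j) \;=\; \frac{\eta_0}{(n_j + s)^{d}}
\]

Common features accumulate a large n_j and cool off quickly (the boring stuff gets learned once), while rare features keep a high learning rate until they have been seen often enough. The symbols eta_0, s, and d are illustrative, not Mahout parameter names.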

General Structure

• OnlineLogisticRegression

  • Traditional logistic regression

  • Stochastic Gradient Descent

  • Per-term annealing

  • So fast that the disk + encoder become the bottleneck
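A minimal training loop against the primitive learner, as a sketch in Java. The feature count, prior, and hyperparameter values are illustrative assumptions, not settings from the talk.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class OlrSketch {
  private static final int FEATURES = 10000;   // assumed size of the hashed feature space

  public static void main(String[] args) {
    // Two categories (e.g. spam / ham), L1 prior to keep the weight vector sparse.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, FEATURES, new L1())
            .learningRate(1)     // initial rate; annealing takes it down from here
            .lambda(1e-4);       // regularization strength (made-up value)

    // In real use the vector would come from the hashed encoders shown earlier.
    Vector example = new RandomAccessSparseVector(FEATURES);
    example.set(0, 1.0);

    learner.train(1, example);                       // (target category, feature vector)
    double score = learner.classifyScalar(example);  // score for category 1 in the binary case
    System.out.println(score);
  }
}

The same classifyScalar call shows up again on the deployment slide: training and serving use the same model class.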

Next Level

• CrossFoldLearner

  • contains multiple primitive learners

  • online cross validation

  • 5x more work

And again

• AdaptiveLogisticRegression

  • 20 x CrossFoldLearner

  • evolves good learning and regularization rates

  • 100 x more work than basic learner

  • still faster than disk + encoding
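The adaptive wrapper is driven the same way; here is a sketch in Java with invented toy data so the snippet runs end to end (the feature count and loop size are assumptions).

import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class AdaptiveSketch {
  private static final int FEATURES = 10000;   // assumed size of the hashed feature space

  public static void main(String[] args) {
    // Evolves learning rate and regularization across a pool of CrossFoldLearners.
    AdaptiveLogisticRegression learner =
        new AdaptiveLogisticRegression(2, FEATURES, new L1());

    // Feed it (target, vector) pairs exactly like the basic learner.
    for (int i = 0; i < 5000; i++) {
      Vector v = new RandomAccessSparseVector(FEATURES);
      v.set(i % 10, 1.0);            // toy feature
      learner.train(i % 2, v);       // toy target, just to exercise the API
    }
    learner.close();                 // flush pending updates

    // getBest() stays null until enough examples have been scored internally.
    if (learner.getBest() != null) {
      CrossFoldLearner best = learner.getBest().getPayload().getLearner();
      System.out.println("held-out AUC so far: " + best.auc());
    }
  }
}

The held-out AUC comes from the CrossFoldLearner's online cross validation, which is what lets the wrapper kill losing parameter settings early.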

A comparison

• Traditional view

  • 400 x (read + OLR)

• Revised Mahout view

  • 1 x (read + mu x 100 x OLR) x eta

  • the 100 is 20 CrossFoldLearners x 5 folds each

  • mu = efficiency from killing losers early

  • eta = efficiency from stopping early

Deployment

• Training

  • ModelSerializer.writeBinary(..., model)

• Deployment

  • m = ModelSerializer.readBinary(...)

  • r = m.classifyScalar(featureVector)
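Filling in the elided arguments with made-up values, a round-trip sketch in Java (the path, feature count, and exact ModelSerializer overloads are assumptions and vary a little between Mahout versions):

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.ModelSerializer;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class DeploymentSketch {
  private static final int FEATURES = 10000;   // assumed size of the hashed feature space

  public static void main(String[] args) throws IOException {
    // Training side: train (as in the earlier sketches), then write the model out.
    OnlineLogisticRegression model = new OnlineLogisticRegression(2, FEATURES, new L1());
    model.train(1, new RandomAccessSparseVector(FEATURES));
    ModelSerializer.writeBinary("/tmp/example.model", model);   // made-up path

    // Serving side: read the model back and score incoming feature vectors.
    OnlineLogisticRegression m =
        ModelSerializer.readBinary(new FileInputStream("/tmp/example.model"),
                                   OnlineLogisticRegression.class);
    Vector featureVector = new RandomAccessSparseVector(FEATURES);  // built with the same encoders
    double r = m.classifyScalar(featureVector);
    System.out.println(r);
  }
}

Because the serving side only needs the small serialized model plus the encoders, each server can score requests independently, which is what makes the "simple sample server farm" on the final slide workable.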

The Upshot

• One machine can go fast

  • SITM trains on 2 billion examples in 3 hours

• Deployability pays off big

  • simple sample server farm