
Page 1: Tamara Berg Machine Learning

Tamara Berg

Machine Learning

790-133: Recognizing People, Objects, & Actions

Page 2: Tamara Berg Machine Learning

Announcements

• Topic presentation groups posted. Anyone not have a group yet?

• Last day of background material

• For Monday - Object recognition papers will be posted online. Please read!


Page 3: Tamara Berg Machine Learning

What is machine learning?

• Computer programs that can learn from data

• Two key components
  – Representation: how should we represent the data?
  – Generalization: the system should generalize from its past experience (observed data items) to perform well on unseen data items.

Page 4: Tamara Berg Machine Learning

Types of ML algorithms

• Unsupervised
  – Algorithms operate on unlabeled examples

• Supervised
  – Algorithms operate on labeled examples

• Semi/Partially-supervised
  – Algorithms combine both labeled and unlabeled examples

Page 5: Tamara Berg Machine Learning

Unsupervised Learning


Page 7: Tamara Berg Machine Learning

K-means clustering

• Want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k

Algorithm:
• Randomly initialize K cluster centers
• Iterate until convergence:
  – Assign each data point to the nearest center
  – Recompute each cluster center as the mean of all points assigned to it

D(X, M) = Σ_k Σ_{x_i ∈ cluster k} ||x_i - m_k||^2

source: Svetlana Lazebnik
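
The K-means loop above fits in a few lines. This is a minimal NumPy sketch (random initialization, assign, recompute), written only for illustration; it is not the lecture's reference implementation.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X is an (N, D) array; returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    # Randomly pick K data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assign each point to the nearest center (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it.
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return centers, assign
```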

[Pages 8-18: image-only slides; no extractable text]

Page 19: Tamara Berg Machine Learning

Different clustering strategies

• Agglomerative clustering
  – Start with each point in a separate cluster
  – At each iteration, merge two of the “closest” clusters

• Divisive clustering
  – Start with all points grouped into a single cluster
  – At each iteration, split the “largest” cluster

• K-means clustering
  – Iterate: assign points to clusters, compute means

• K-medoids
  – Same as k-means, only the cluster center cannot be computed by averaging
  – The “medoid” of each cluster is the most centrally located point in that cluster (i.e., the point with the lowest average distance to the other points)

source: Svetlana Lazebnik
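
For comparison, a hedged sketch of agglomerative clustering using SciPy's hierarchical-clustering utilities; the toy points, the average-linkage choice, and the cut at 2 clusters are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points: two loose groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Agglomerative clustering: start with singletons, repeatedly merge the
# two "closest" clusters (here, closest by average linkage).
Z = linkage(X, method="average", metric="euclidean")

# Cut the merge tree so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```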

Page 20: Tamara Berg Machine Learning

Supervised Learning

[Pages 21-24: slides from Dan Klein; figures only, no extractable text]
Page 25: Tamara Berg Machine Learning

Example: Image classification

Input: images. Desired output: category labels (apple, pear, tomato, cow, dog, horse).

Slide credit: Svetlana Lazebnik

Page 26: Tamara Berg Machine Learning

Slide from Dan Klein (http://yann.lecun.com/exdb/mnist/index.html)

Page 27: Tamara Berg Machine Learning

Example: Seismic data

Scatter plot of surface wave magnitude vs. body wave magnitude, with nuclear explosions and earthquakes forming separate classes.

Slide credit: Svetlana Lazebnik

Page 28: Tamara Berg Machine Learning

Slide from Dan Klein

Page 29: Tamara Berg Machine Learning

The basic classification framework

y = f(x)

• Learning: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the parameters of the prediction function f

• Inference: apply f to a never-before-seen test example x and output the predicted value y = f(x)

(y: output; f: classification function; x: input)

Slide credit: Svetlana Lazebnik
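
As an illustration of the learning/inference split, here is a hedged sketch using scikit-learn's estimator interface, where fit plays the role of learning and predict the role of inference; the toy data and the nearest-neighbor model are made up for this example.

```python
from sklearn.neighbors import KNeighborsClassifier

# Learning: estimate the parameters of f from labeled pairs {(x_i, y_i)}.
X_train = [[0.0, 0.0], [0.1, 0.3], [5.0, 5.2], [4.8, 5.1]]
y_train = ["apple", "apple", "cow", "cow"]
f = KNeighborsClassifier(n_neighbors=1)
f.fit(X_train, y_train)

# Inference: apply f to a never-before-seen test example x.
x_test = [[4.9, 5.0]]
print(f.predict(x_test))  # -> ['cow']
```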

Page 30: Tamara Berg Machine Learning

Some ML classification methods

(10^6 examples)

• Nearest neighbor: Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; …
• Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; …
• Support Vector Machines and Kernels: Guyon, Vapnik; Heisele, Serre, Poggio 2001; …
• Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …

Slide credit: Antonio Torralba

Page 31: Tamara Berg Machine Learning

Example: Training and testing

• Key challenge: generalization to unseen examples

Training set (labels known) Test set (labels unknown)

Slide credit: Svetlana Lazebnik

Page 32: Tamara Berg Machine Learning

Slide credit: Dan Klein

Page 33: Tamara Berg Machine Learning

Slide from Min-Yen Kan

Classification by Nearest Neighbor

Word vector document classification – here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in?


Page 34: Tamara Berg Machine Learning

Slide from Min-Yen Kan

Classification by Nearest Neighbor


Page 35: Tamara Berg Machine Learning

Classification by Nearest Neighbor

Classify the test document as the class of the document “nearest” to the query document (use vector similarity to find most similar doc)

Slide from Min-Yen Kan


Page 36: Tamara Berg Machine Learning

Classification by kNN

Classify the test document as the majority class of the k documents “nearest” to the query document.

Slide from Min-Yen Kan
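
A minimal sketch of kNN document classification as described above, using word-count vectors and cosine similarity; the toy documents, the similarity choice, and k = 3 are assumptions made for illustration.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training, k=3):
    """Label the query with the majority class of its k most similar training documents."""
    q = Counter(query.lower().split())
    ranked = sorted(training, key=lambda doc: cosine(q, Counter(doc[0].lower().split())),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

docs = [("the match ended in a late goal", "sports"),
        ("shares fell as markets opened", "finance"),
        ("the striker scored a goal", "sports")]
print(knn_classify("another goal in the match", docs, k=3))  # -> 'sports'
```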

Page 37: Tamara Berg Machine Learning

Slide from Min-Yen Kan

What are the features? What’s the training data? Testing data? Parameters?

Classification by kNN


[Pages 38-42: slides from Min-Yen Kan; figures only, no extractable text]

Page 43: Tamara Berg Machine Learning

Slide from Min-Yen Kan

What are the features? What’s the training data? Testing data? Parameters?

Classification by kNN


Page 44: Tamara Berg Machine Learning

NN for vision

Fast Pose Estimation with Parameter Sensitive Hashing (Shakhnarovich, Viola, Darrell)

Page 45: Tamara Berg Machine Learning

J. Hays and A. Efros, Scene Completion using Millions of Photographs, SIGGRAPH 2007

NN for vision

Page 46: Tamara Berg Machine Learning

J. Hays and A. Efros, IM2GPS: estimating geographic information from a single image, CVPR 2008

NN for vision

Page 47: Tamara Berg Machine Learning

Decision tree classifier

Example problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Slide credit: Svetlana Lazebnik

Page 48: Tamara Berg Machine Learning

Decision tree classifier

Slide credit: Svetlana Lazebnik

Page 49: Tamara Berg Machine Learning

Decision tree classifier

Slide credit: Svetlana Lazebnik
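
A hedged sketch of a decision tree on a made-up numeric encoding of a few of the restaurant attributes above, using scikit-learn's DecisionTreeClassifier; the encoding, data, and labels are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny invented dataset: [Patrons (0=None,1=Some,2=Full), Hungry (0/1), WaitEstimate bucket (0..3)]
X = [[1, 1, 0], [2, 1, 3], [0, 0, 0], [2, 0, 1], [1, 0, 0], [2, 1, 2]]
y = [1, 0, 0, 1, 1, 0]  # 1 = wait for a table, 0 = leave

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict([[2, 1, 1]]))  # predicted decision for a new situation
```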

Page 50: Tamara Berg Machine Learning

Linear classifier

• Find a linear function to separate the classes

f(x) = sgn(w_1 x_1 + w_2 x_2 + … + w_D x_D) = sgn(w · x)

Slide credit: Svetlana Lazebnik
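
A two-line NumPy illustration of the decision rule f(x) = sgn(w · x); the weight vector and points are made up.

```python
import numpy as np

w = np.array([1.0, -2.0])            # weight vector (made-up values)
X = np.array([[3.0, 1.0], [0.5, 2.0]])
print(np.sign(X @ w))                # -> [ 1. -1.]: predicted class of each point
```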

Page 51: Tamara Berg Machine Learning

Discriminant Function

• It can be an arbitrary function of x, such as: Nearest Neighbor, Decision Tree, Linear Functions

g(x) = w^T x + b

Slide credit: Jinwei Gu

Page 52: Tamara Berg Machine Learning

Linear Discriminant Function

• g(x) is a linear function:

g(x) = w^T x + b

In the (x_1, x_2) feature space this defines a hyperplane: w^T x + b = 0 on the boundary, w^T x + b > 0 on one side, and w^T x + b < 0 on the other (the markers in the figure denote the +1 and -1 classes).

Slide credit: Jinwei Gu

Page 53: Tamara Berg Machine Learning

Linear Discriminant Function

• How would you classify these points (classes +1 and -1) using a linear discriminant function in order to minimize the error rate?

Infinite number of answers!

Slide credit: Jinwei Gu

[Pages 54-55 repeat the question from Page 53]

Page 56: Tamara Berg Machine Learning

Linear Discriminant Function

• How would you classify these points using a linear discriminant function in order to minimize the error rate?

Infinite number of answers! Which one is the best?

Slide credit: Jinwei Gu

Page 57: Tamara Berg Machine Learning

Large Margin Linear Classifier

• The linear discriminant function (classifier) with the maximum margin is the best

• The margin is defined as the width by which the boundary could be increased before hitting a data point (the “safe zone”)

• Why is it the best? Strong generalization ability

Linear SVM. Slide credit: Jinwei Gu

Page 58: Tamara Berg Machine Learning

Large Margin Linear Classifier

Figure: in the (x_1, x_2) plane, the decision boundary w^T x + b = 0 lies midway between the margin hyperplanes w^T x + b = 1 and w^T x + b = -1; the points x+ and x- lying on these margin hyperplanes are the support vectors.

Slide credit: Jinwei Gu

Page 59: Tamara Berg Machine Learning

Large Margin Linear Classifier

• Formulation:

minimize (1/2) ||w||^2

such that

for y_i = +1: w^T x_i + b ≥ 1
for y_i = -1: w^T x_i + b ≤ -1

Slide credit: Jinwei Gu

Page 60: Tamara Berg Machine Learning

Large Margin Linear Classifier

• Formulation:

minimize (1/2) ||w||^2   such that   y_i (w^T x_i + b) ≥ 1

Slide credit: Jinwei Gu

Page 61: Tamara Berg Machine Learning

Solving the Optimization Problem

minimize (1/2) ||w||^2   s.t.   y_i (w^T x_i + b) ≥ 1

This is a quadratic program with linear constraints.

Slide credit: Jinwei Gu

Page 62: Tamara Berg Machine Learning

Solving the Optimization Problem

The linear discriminant function is:

g(x) = Σ_{i ∈ SV} α_i y_i x_i^T x + b

Notice that it relies on a dot product between the test point x and the support vectors x_i.

Slide credit: Jinwei Gu

Page 63: Tamara Berg Machine Learning

Linear separability

Slide credit: Svetlana Lazebnik

Page 64: Tamara Berg Machine Learning

Non-linear SVMs: Feature Space

General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Slide courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Page 65: Tamara Berg Machine Learning

Nonlinear SVMs: The Kernel Trick

With this mapping, our discriminant function becomes:

g(x) = w^T φ(x) + b = Σ_{i ∈ SV} α_i y_i φ(x_i)^T φ(x) + b

No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.

A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

K(x_i, x_j) = φ(x_i)^T φ(x_j)

Slide credit: Jinwei Gu

Page 66: Tamara Berg Machine Learning

Nonlinear SVMs: The Kernel Trick

Examples of commonly used kernel functions:

• Linear kernel: K(x_i, x_j) = x_i^T x_j

• Polynomial kernel: K(x_i, x_j) = (1 + x_i^T x_j)^p

• Gaussian (Radial Basis Function, RBF) kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))

• Sigmoid kernel: K(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1)

Slide credit: Jinwei Gu
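
A small numeric check of the kernel idea (my example, not from the slides): for 2-D inputs, the polynomial kernel (1 + x_i^T x_j)^2 equals an explicit dot product in a 6-dimensional feature space, so that feature mapping never has to be formed.

```python
import numpy as np

def phi(x):
    """Explicit feature map whose dot product reproduces (1 + x.y)^2 for 2-D x."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
k_direct = (1.0 + xi @ xj) ** 2      # kernel evaluated in the input space
k_mapped = phi(xi) @ phi(xj)         # dot product in the expanded feature space
print(k_direct, k_mapped)            # both print 4.0
```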

Page 67: Tamara Berg Machine Learning

Support Vector Machine: Algorithm

1. Choose a kernel function

2. Choose a value for C and any other parameters (e.g. σ)

3. Solve the quadratic programming problem (many software packages available)

4. Classify held out validation instances using the learned model

5. Select the best learned model based on validation accuracy

6. Classify test instances using the final selected model
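
A hedged sketch of steps 1-6 with scikit-learn's SVC; the toy dataset, the candidate values of C and gamma, and the validation split are assumptions made for illustration (for the RBF kernel, gamma plays the role of 1/(2σ^2)).

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Toy data split into train / validation / test.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_model, best_acc = None, 0.0
for C in [0.1, 1.0, 10.0]:                            # step 2: candidate values of C
    for gamma in [0.1, 1.0]:                          # step 2: kernel parameter
        model = SVC(kernel="rbf", C=C, gamma=gamma)   # step 1: choose a kernel
        model.fit(X_train, y_train)                   # step 3: solve the QP
        acc = model.score(X_val, y_val)               # step 4: classify validation instances
        if acc > best_acc:                            # step 5: keep the best model
            best_model, best_acc = model, acc

print("test accuracy:", best_model.score(X_test, y_test))   # step 6
```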

Page 68: Tamara Berg Machine Learning

Some Issues

• Choice of kernel
  – Gaussian or polynomial kernel is the default
  – If ineffective, more elaborate kernels are needed
  – Domain experts can give assistance in formulating appropriate similarity measures

• Choice of kernel parameters
  – e.g. σ in the Gaussian kernel
  – In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Page 69: Tamara Berg Machine Learning

Summary: Support Vector Machine

1. Large Margin Classifier
  – Better generalization ability & less over-fitting

2. The Kernel Trick
  – Map data points to a higher-dimensional space in order to make them linearly separable.
  – Since only the dot product is used, we do not need to represent the mapping explicitly.

Slide credit: Jinwei Gu

Page 70: Tamara Berg Machine Learning

Boosting

• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998

• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003

• Easy to implement, doesn’t require external optimization tools.

Slide credit: Antonio Torralba

Page 71: Tamara Berg Machine Learning

Boosting

• Defines a classifier using an additive model:

H(x) = α_1 f_1(x) + α_2 f_2(x) + α_3 f_3(x) + …

where H is the strong classifier, each f_t is a weak classifier, each α_t is a weight, and x is the features vector.

Slide credit: Antonio Torralba

Page 72: Tamara Berg Machine Learning

Boosting

• Defines a classifier using an additive model:

H(x) = α_1 f_1(x) + α_2 f_2(x) + α_3 f_3(x) + …

• We need to define a family of weak classifiers: each f_t(x) is drawn from a family of weak classifiers.

Slide credit: Antonio Torralba

Page 73: Tamara Berg Machine Learning

Adaboost

Slide credit: Antonio Torralba

Page 74: Tamara Berg Machine Learning

Boosting

• It is a sequential procedure. Each data point x_t has a class label y_t = +1 or -1 and a weight, initially w_t = 1.

Slide credit: Antonio Torralba

Page 75: Tamara Berg Machine Learning

Toy example

• Weak learners from the family of lines
• h => p(error) = 0.5: it is at chance
• Each data point has a class label y_t = +1 or -1 and a weight, initially w_t = 1

Slide credit: Antonio Torralba

Page 76: Tamara Berg Machine Learning

Toy example

• This one seems to be the best.
• This is a ‘weak classifier’: it performs slightly better than chance.

Slide credit: Antonio Torralba

Page 77: Tamara Berg Machine Learning

Toy example

• Each data point has a class label y_t = +1 or -1 and a weight. We update the weights: w_t ← w_t exp{-y_t H_t}

Slide credit: Antonio Torralba

[Pages 78-80 repeat the same weight-update step of the toy example]

Page 81: Tamara Berg Machine Learning

Toy example

• The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, f_2, f_3, f_4.

Slide credit: Antonio Torralba
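
To make the sequential procedure concrete, here is a hedged, minimal AdaBoost sketch that uses axis-aligned decision stumps as the weak classifiers (a choice made for illustration; the toy example above uses lines). Each round fits the best weak classifier on the weighted data, computes its weight α, and re-weights the points, which is the per-round form of the update shown above.

```python
import numpy as np

def stump_predict(X, feat, thresh, sign):
    """Axis-aligned decision stump: sign * (+1 if x[feat] > thresh else -1)."""
    return sign * np.where(X[:, feat] > thresh, 1.0, -1.0)

def adaboost(X, y, n_rounds=5):
    """Minimal AdaBoost with decision stumps; labels y must be +1/-1."""
    n = len(y)
    w = np.ones(n) / n                          # every point starts with equal weight
    stumps, alphas = [], []
    for _ in range(n_rounds):
        best = None
        # Choose the weak classifier (stump) with the lowest weighted error.
        for feat in range(X.shape[1]):
            for thresh in np.unique(X[:, feat]):
                for sign in (1.0, -1.0):
                    pred = stump_predict(X, feat, thresh, sign)
                    err = np.sum(w * (pred != y))
                    if best is None or err < best[0]:
                        best = (err, feat, thresh, sign, pred)
        err, feat, thresh, sign, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak classifier
        w = w * np.exp(-alpha * y * pred)       # re-weight: misclassified points get heavier
        w = w / w.sum()
        stumps.append((feat, thresh, sign))
        alphas.append(alpha)
    return stumps, alphas

def strong_classify(X, stumps, alphas):
    """Strong classifier: sign of the weighted sum of the weak classifiers."""
    H = sum(a * stump_predict(X, f, t, s) for (f, t, s), a in zip(stumps, alphas))
    return np.sign(H)

# Made-up toy data: two clusters of 2-D points with labels +1 / -1.
X = np.array([[1, 1], [2, 2], [3, 2], [2, 3], [5, 5], [6, 4], [4, 6], [6, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, +1, +1, +1, +1], dtype=float)
stumps, alphas = adaboost(X, y)
print(strong_classify(X, stumps, alphas))       # predictions on the training points
```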

Page 82: Tamara Berg Machine Learning

Adaboost

Slide credit: Antonio Torralba

Page 83: Tamara Berg Machine Learning

Semi-Supervised Learning


Page 85: Tamara Berg Machine Learning

However, for many problems, labeled data can be rare or expensive: you need to pay someone to do it, it may require special testing, …

Unlabeled data is much cheaper.

Slide Credit: Avrim Blum

Page 86: Tamara Berg Machine Learning

However, for many problems, labeled data can be rare or expensive (you need to pay someone to do it, it may require special testing, …), while unlabeled data is much cheaper. Examples: speech, images, medical outcomes, customer modeling, protein sequences, web pages.

Slide Credit: Avrim Blum

Page 87: Tamara Berg Machine Learning

[Figure from Jerry Zhu, making the same point]

Slide Credit: Avrim Blum

Page 88: Tamara Berg Machine Learning


Can we make use of cheap unlabeled data?

Slide Credit: Avrim Blum

Page 89: Tamara Berg Machine Learning

Semi-Supervised Learning

Can we use unlabeled data to augment a small labeled sample to improve learning?

But unlabeled data is missing the most important info! Still, it may have useful regularities that we can use.

Slide Credit: Avrim Blum

Page 90: Tamara Berg Machine Learning

Method 1: EM

Page 91: Tamara Berg Machine Learning

How to use unlabeled data

• One way is to use the EM algorithm
  – EM: Expectation Maximization

• The EM algorithm is a popular iterative algorithm for maximum likelihood estimation in problems with missing data.

• The EM algorithm consists of two steps:
  – Expectation step, i.e., filling in the missing data
  – Maximization step, i.e., calculating a new maximum a posteriori estimate for the parameters

Page 92: Tamara Berg Machine Learning

Algorithm Outline

1. Train a classifier with only the labeled documents.

2. Use it to probabilistically classify the unlabeled documents.

3. Use ALL the documents to train a new classifier.
4. Iterate steps 2 and 3 to convergence.
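
A hedged sketch of this outline for text, using scikit-learn's multinomial Naive Bayes; treating the probabilistic labels as confidence-weighted pseudo-labels is a simplification of full EM, and the documents, classes, and number of iterations are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_docs = ["goal scored in the final match", "stocks fell on the open market"]
labels = np.array([0, 1])                       # 0 = sports, 1 = finance
unlabeled_docs = ["a late goal won the match", "markets rallied as stocks rose"]

vec = CountVectorizer()
X_lab = vec.fit_transform(labeled_docs)
X_unl = vec.transform(unlabeled_docs)

# 1. Train a classifier with only the labeled documents.
clf = MultinomialNB().fit(X_lab, labels)

for _ in range(5):                              # 4. iterate steps 2 and 3
    # 2. Probabilistically classify the unlabeled documents.
    proba = clf.predict_proba(X_unl)
    pseudo = proba.argmax(axis=1)
    conf = proba.max(axis=1)
    # 3. Use ALL the documents to train a new classifier,
    #    weighting pseudo-labeled documents by the model's confidence.
    X_all = np.vstack([X_lab.toarray(), X_unl.toarray()])
    y_all = np.concatenate([labels, pseudo])
    w_all = np.concatenate([np.ones(len(labels)), conf])
    clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

print(clf.predict(vec.transform(["another goal in the match"])))
```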

Page 93: Tamara Berg Machine Learning

Method 2: Co-Training

Page 94: Tamara Berg Machine Learning

Co-training[Blum&Mitchell’98] Many problems have two different sources of info

(“features/views”) you can use to determine label.E.g., classifying faculty webpages: can use words on page or words on links pointing to the page.

My AdvisorProf. Avrim Blum My AdvisorProf. Avrim Blum

x2- Text infox1- Link infox - Link info & Text info

Slide Credit: Avrim BlumSlide 99 of 113

Page 95: Tamara Berg Machine Learning

Co-training

Idea: Use a small labeled sample to learn initial rules.
  – E.g., “my advisor” pointing to a page is a good indicator that it is a faculty home page.
  – E.g., “I am teaching” on a page is a good indicator that it is a faculty home page.

Slide Credit: Avrim Blum

Page 96: Tamara Berg Machine Learning

Co-training

Idea: Use a small labeled sample to learn initial rules.
  – E.g., “my advisor” pointing to a page is a good indicator that it is a faculty home page.
  – E.g., “I am teaching” on a page is a good indicator that it is a faculty home page.

Then look for unlabeled examples ⟨x1, x2⟩ where one view is confident and the other is not. Have it label the example for the other.

Train 2 classifiers, one on each type of info, using each to help train the other.

Slide Credit: Avrim Blum

Page 97: Tamara Berg Machine Learning


Co-training Algorithm [Blum and Mitchell, 1998]

Given: labeled data L,

unlabeled data U

Loop:

Train h1 (e.g., hyperlink classifier) using L

Train h2 (e.g., page classifier) using L

Allow h1 to label p positive, n negative examples from U

Allow h2 to label p positive, n negative examples from U

Add these most confident self-labeled examples to L
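
A hedged sketch of the loop above with two Naive Bayes classifiers, one per view; the aligned two-view toy data, the Gaussian model, and letting each classifier self-label only its single most confident example per round (p = n = 1 in spirit) are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def cotrain(L1, L2, yL, U1, U2, rounds=3):
    """Minimal co-training sketch.
    L1, L2: the two views of the labeled pool; yL: its labels (0/1).
    U1, U2: the two (aligned) views of the unlabeled pool."""
    L1, L2, yL = list(L1), list(L2), list(yL)
    U1, U2 = list(U1), list(U2)
    for _ in range(rounds):
        h1 = GaussianNB().fit(L1, yL)   # classifier on view 1 (e.g., hyperlink text)
        h2 = GaussianNB().fit(L2, yL)   # classifier on view 2 (e.g., page text)
        for h, U_own in ((h1, U1), (h2, U2)):
            if not U1:
                break
            proba = h.predict_proba(U_own)
            i = int(np.max(proba, axis=1).argmax())   # this view's most confident example
            label = int(proba[i].argmax())
            # Move example i from the unlabeled pool to L (both views, self-labeled).
            L1.append(U1.pop(i)); L2.append(U2.pop(i)); yL.append(label)
    return GaussianNB().fit(L1, yL), GaussianNB().fit(L2, yL)

# Toy two-view data: each view is a 1-D feature; rows of U1 and U2 describe the same examples.
L1 = [[0.0], [5.0]]; L2 = [[0.1], [4.9]]; yL = [0, 1]
U1 = [[0.2], [4.8], [0.3], [5.2]]; U2 = [[0.0], [5.1], [0.4], [4.7]]
h1, h2 = cotrain(L1, L2, yL, U1, U2)
print(h1.predict([[0.1]]), h2.predict([[5.0]]))   # -> [0] [1]
```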

Page 98: Tamara Berg Machine Learning

Watch, Listen & Learn: Co-training on Captioned Images and Videos

Sonal Gupta, Joohyun Kim, Kristen Grauman, Raymond Mooney
The University of Texas at Austin, U.S.A.

Page 99: Tamara Berg Machine Learning

Goals

• Classify images and videos with the help of visual information and associated text captions

• Use unlabeled image and video examples

Page 100: Tamara Berg Machine Learning

Image Examples

Example captions: “Cultivating farming at Nabataean Ruins of the Ancient Avdat”, “Bedouin Leads His Donkey That Carries Load Of Straw”, “Ibex Eating In The Nature”, “Entrance To Mikveh Israel Agricultural School”. Classes: Desert, Trees.

Page 101: Tamara Berg Machine Learning

Approach

• Combining two views of images and videos using the Co-training (Blum and Mitchell ’98) learning algorithm

• Views: Text and Visual

• Text View
  – Caption of image or video
  – Readily available

• Visual View
  – Color, texture, temporal information in image/video

Page 102: Tamara Berg Machine Learning

Co-training

Diagram: initially labeled instances (+, +, -, +), each with a Text View and a Visual View, feed a Text Classifier and a Visual Classifier.

Page 103: Tamara Berg Machine Learning

Co-training

Diagram: supervised learning step. The initially labeled instances (+, +, -, +) train the Text Classifier on their text views and the Visual Classifier on their visual views.

Page 104: Tamara Berg Machine Learning

Co-training

Diagram: unlabeled instances, each with a text view and a visual view, are presented to the Text Classifier and the Visual Classifier.

Page 105: Tamara Berg Machine Learning

Co-training

Diagram: each classifier labels the unlabeled instances it is most confident about (+, +, -, -), producing classifier-labeled instances.

Page 106: Tamara Berg Machine Learning

Co-training

Diagram: the classifiers are retrained on the enlarged labeled set, which now includes the newly self-labeled instances (+, +, -, -).

Page 107: Tamara Berg Machine Learning

Video Features

• Detect interest points
  – Harris-Förstner corner detector over both spatial and temporal space

• Describe interest points
  – Histogram of Oriented Gradients (HoG)

• Create spatio-temporal vocabulary
  – Quantize interest points to create a 200-visual-word dictionary

• Represent each video as a histogram of visual words

[Laptev, IJCV ’05]

Page 108: Tamara Berg Machine Learning

Textual Features

Raw text commentary, e.g.:
• That was a very nice forward camel.
• Well I remember her performance last time.
• He has some delicate hand movement.
• She gave a small jump while gliding.
• He runs in to chip the ball with his right foot.
• He runs in to take the instep drive and executes it well.
• The small kid pushes the ball ahead with his tiny kicks.

Pipeline: raw text commentary → Porter stemmer → remove stop words → standard bag-of-words representation.
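
A hedged sketch of this kind of text pipeline (stemming, stop-word removal, bag-of-words counts); the tiny stop-word list is made up, and NLTK's PorterStemmer stands in for whatever stemmer the authors actually used.

```python
from collections import Counter
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "the", "to", "with", "his", "her", "in", "that", "was", "i", "he", "she"}  # toy list
stemmer = PorterStemmer()

def bag_of_words(sentence):
    """Lowercase, drop stop words, stem, and count the remaining tokens."""
    tokens = sentence.lower().replace(".", "").split()
    return Counter(stemmer.stem(t) for t in tokens if t not in STOP_WORDS)

print(bag_of_words("He runs in to chip the ball with his right foot."))
# e.g. Counter({'run': 1, 'chip': 1, 'ball': 1, 'right': 1, 'foot': 1})
```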

Page 109: Tamara Berg Machine Learning

Conclusion

• Combining textual and visual features can help improve accuracy

• Co-training can be useful to combine textual and visual features to classify images and videos

• Co-training helps in reducing the labeling of images and videos

[More information on http://www.cs.utexas.edu/users/ml/co-training]


Page 110: Tamara Berg Machine Learning

Co-training vs. EM

• Co-training splits features, EM does not.

• Co-training incrementally uses the unlabeled data.

• EM probabilistically labels all the data at each round; EM iteratively uses the unlabeled data.
