Prof. Feng Liu
Winter 2020
http://www.cs.pdx.edu/~fliu/courses/cs410/
02/27/2020
Last Time
Introduction to object recognition
The slides for this topic are used from Prof. S. Lazebnik.
Today
Machine learning approach to object recognition
◼ Classifiers
◼ Bag-of-features models
Recognition: A machine learning approach
Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, Kristen Grauman, and Derek Hoiem
The machine learning framework
Apply a prediction function to a feature representation of
the image to get the desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
The machine learning framework
y = f(x)
Training: given a training set of labeled examples {(x1,y1),
…, (xN,yN)}, estimate the prediction function f by minimizing
the prediction error on the training set
Testing: apply f to a never-before-seen test example x and
output the predicted value y = f(x)
y: output, f: prediction function, x: image feature
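As a toy instance of this framework (all data and the threshold family here are invented for illustration), training picks the f that minimizes error on the labeled examples, and testing applies it to an unseen x:

```python
# Toy training set: a 1-D "redness" feature with labels (invented numbers).
train = [(0.2, "tomato"), (0.4, "tomato"), (1.1, "apple"), (1.5, "apple")]

def make_f(t):
    # Hypothesis family: threshold classifiers on the single feature.
    return lambda x: "apple" if x > t else "tomato"

def train_error(f):
    return sum(f(x) != y for x, y in train)

# Training: estimate f by minimizing prediction error on the training set.
f = min((make_f(t / 10) for t in range(20)), key=train_error)

# Testing: apply f to a never-before-seen example.
prediction = f(1.3)   # -> "apple"
```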
Steps
Training: Training Images + Training Labels → Image Features → Training → Learned model
Testing: Test Image → Image Features → Learned model → Prediction
Slide credit: D. Hoiem
Features
Raw pixels
Histograms
GIST descriptors
…
Classifiers: Nearest neighbor
f(x) = label of the training example nearest to x
All we need is a distance function for our inputs
No training required!
[Figure: a test example in feature space, surrounded by training examples
from class 1 and class 2]
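A minimal sketch of the nearest-neighbor rule (toy 2-D points; Euclidean distance assumed as the distance function):

```python
import math

def nearest_neighbor_classify(x, train_X, train_y):
    """f(x) = label of the training example nearest to x."""
    dists = [math.dist(x, xi) for xi in train_X]   # distance to every example
    return train_y[dists.index(min(dists))]        # label of the closest one

# Toy training examples from two classes (made-up coordinates).
train_X = [(0, 0), (1, 0), (4, 4), (5, 4)]
train_y = ["class 1", "class 1", "class 2", "class 2"]
```

Note there is no training step at all: the entire training set is the model.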
Classifiers: Linear
Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
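In code the decision rule is just the sign of a dot product plus a bias; here w and b are arbitrary stand-ins for weights a learner (e.g. a perceptron or linear SVM) would produce:

```python
def linear_classify(x, w, b):
    # f(x) = sgn(w · x + b): which side of the hyperplane w · x + b = 0
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical learned parameters: separate points by whether x0 + x1 > 3.
w, b = (1.0, 1.0), -3.0
```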
Recognition task and supervision
Images in the training set must be annotated with the
“correct answer” that the model is expected to produce
Example annotation: “Contains a motorbike”
Unsupervised “Weakly” supervised Fully supervised
Definition depends on task
Generalization
How well does a learned model generalize from
the data it was trained on to a new test set?
Training set (labels known) Test set (labels unknown)
Generalization
Components of generalization error
◼ Bias: how much does the average model over all training sets differ
from the true model?
Error due to inaccurate assumptions/simplifications made by the model
◼ Variance: how much do models estimated from different training sets
differ from each other?
Underfitting: model is too “simple” to represent all the relevant
class characteristics
◼ High bias and low variance
◼ High training error and high test error
Overfitting: model is too “complex” and fits irrelevant
characteristics (noise) in the data
◼ Low bias and high variance
◼ Low training error and high test error
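A tiny, deterministic illustration of the two regimes (numbers invented): a constant predictor underfits, while a memorizing 1-NN predictor drives training error to zero by fitting the noise:

```python
# Noisy 1-D samples of an underlying y = x relationship (invented numbers).
train = [(0, 0.0), (1, 1.2), (2, 1.9), (3, 3.1)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

mean_y = sum(y for _, y in train) / len(train)

def constant(x):
    # "Too simple": ignores x entirely -> high bias, high training error.
    return mean_y

def memorize(x):
    # "Too complex": returns the y of the nearest training x, fitting the
    # noise exactly -> low bias, high variance, zero training error.
    return min(train, key=lambda p: abs(p[0] - x))[1]
```

The memorizer's zero training error says nothing about its error on new data, which is exactly the gap the bias-variance tradeoff describes.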
Bias-variance tradeoff
[Figure: training and test error vs. model complexity. Simple models (high
bias, low variance) underfit; complex models (low bias, high variance)
overfit. Training error keeps falling with complexity, while test error is
minimized in between.]
Slide credit: D. Hoiem
Bias-variance tradeoff
[Figure: test error vs. model complexity, plotted for many vs. few training
examples; more training data supports more complex models.]
Slide credit: D. Hoiem
Effect of Training Size
[Figure: for a fixed prediction model, training error and testing error vs.
number of training examples; the two curves approach the generalization
error as the training set grows.]
Slide credit: D. Hoiem
Datasets
Circa 2001: 5 categories, 100s of images per
category
Circa 2004: 101 categories
Today: up to thousands of categories, millions
of images
Caltech 101 & 256
Griffin, Holub, Perona, 2007
Fei-Fei, Fergus, Perona, 2004
http://www.vision.caltech.edu/Image_Datasets/Caltech101/
http://www.vision.caltech.edu/Image_Datasets/Caltech256/
Caltech-101: Intraclass variability
The PASCAL Visual Object Classes
Challenge (2005-present)
Challenge classes:
Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
http://host.robots.ox.ac.uk/pascal/VOC/
Main competitions
◼ Classification: For each of the twenty classes,
predicting presence/absence of an example of that
class in the test image
◼ Detection: Predicting the bounding box and label of
each object from the twenty target classes in the test
image
The PASCAL Visual Object Classes
Challenge (2005-present)
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
“Taster” challenges
◼ Segmentation: generating pixel-wise segmentations giving the class of
the object visible at each pixel, or "background" otherwise
◼ Person layout: predicting the bounding box and label of each part of a
person (head, hands, feet)
The PASCAL Visual Object Classes
Challenge (2005-present)
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
“Taster” challenges
◼ Action classification
The PASCAL Visual Object Classes
Challenge (2005-present)
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
LabelMe: http://labelme.csail.mit.edu/
Russell, Torralba, Murphy, Freeman, 2008
80 Million Tiny Images
http://people.csail.mit.edu/torralba/tinyimages/
ImageNet http://www.image-net.org/
Today
Machine learning approach to object recognition
◼ Classifiers
◼ Bag-of-features models
Bag-of-features models
Origin 1: Texture recognition
Texture is characterized by the repetition of basic
elements or textons
For stochastic textures, it is the identity of the
textons, not their spatial arrangement, that matters
Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003
Origin 1: Texture recognition
Universal texton dictionary
histogram
Origin 2: Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary (Salton & McGill, 1983)
US Presidential Speeches Tag Cloud: http://chir.ag/phernalia/preztags/
Bag-of-features steps
1. Extract features
2. Learn “visual vocabulary”
3. Quantize features using visual vocabulary
4. Represent images by frequencies of “visual words”
1. Feature extraction
Regular grid or interest regions
Normalize patch
Detect patches
Compute descriptor
Slide credit: Josef Sivic
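A sketch of the regular-grid variant of step 1, using mean-subtracted, length-normalized raw pixels as the descriptor (real pipelines typically use SIFT-like descriptors instead):

```python
def extract_patches(image, size=4, stride=4):
    """Detect patches on a regular grid, normalize each, compute a descriptor."""
    h, w = len(image), len(image[0])
    descriptors = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            # Flatten the size x size patch into a vector of pixel values.
            patch = [float(image[top + i][left + j])
                     for i in range(size) for j in range(size)]
            # Normalize the patch: subtract the mean, divide by the L2 norm.
            mean = sum(patch) / len(patch)
            norm = sum((p - mean) ** 2 for p in patch) ** 0.5 or 1.0
            descriptors.append([(p - mean) / norm for p in patch])
    return descriptors
```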
2. Learning the visual vocabulary
[Figure: descriptors in feature space are clustered; the cluster centers
form the visual vocabulary]
Slide credit: Josef Sivic
K-means clustering
• Want to minimize sum of squared Euclidean
distances between points xi and their
nearest cluster centers mk
Algorithm:
• Randomly initialize K cluster centers
• Iterate until convergence:
◼ Assign each data point to the nearest center
◼ Re-compute each cluster center as the mean of
all points assigned to it
D(X, M) = Σ_k Σ_(x_i in cluster k) ||x_i − m_k||²
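A from-scratch sketch of the algorithm above on toy 2-D points; a fixed seed stands in for the random initialization:

```python
import random

def kmeans(points, K, iters=20, seed=0):
    """Minimize D(X, M) = sum_k sum_{x_i in cluster k} ||x_i - m_k||^2."""
    centers = random.Random(seed).sample(points, K)  # random initial centers
    for _ in range(iters):
        # Assignment step: each data point goes to its nearest center.
        clusters = [[] for _ in range(K)]
        for p in points:
            j = min(range(K), key=lambda k: sum((a - b) ** 2
                                                for a, b in zip(p, centers[k])))
            clusters[j].append(p)
        # Update step: re-compute each center as the mean of its points.
        for k, members in enumerate(clusters):
            if members:
                centers[k] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers

# Two well-separated toy blobs; k-means should recover their means.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
```

A fixed iteration count stands in for a convergence test; a fuller version would stop when assignments no longer change.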
Clustering and vector quantization
• Clustering is a common method for learning a visual
vocabulary or codebook
◼ Unsupervised learning process
◼ Each cluster center produced by k-means becomes a codevector
◼ Codebook can be learned on separate training set
◼ Provided the training set is sufficiently representative, the
codebook will be “universal”
• The codebook is used for quantizing features
◼ A vector quantizer takes a feature vector and maps it to the index
of the nearest codevector in a codebook
◼ Codebook = visual vocabulary
◼ Codevector = visual word
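The quantizer and the resulting bag-of-features histogram can be sketched as follows (toy codebook of two codevectors):

```python
def quantize(descriptor, codebook):
    """Map a feature vector to the index of its nearest codevector."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2
                                 for a, b in zip(descriptor, codebook[k])))

def bag_of_features(descriptors, codebook):
    """Represent an image by the frequencies of its visual words."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[quantize(d, codebook)] += 1   # count each visual word
    return hist

# Toy codebook (visual vocabulary) and image descriptors.
codebook = [(0, 0), (10, 10)]
descriptors = [(1, 1), (9, 9), (0, 2)]
```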
Example codebook
[Figure: appearance codebook of clustered patches. Source: B. Leibe]
Another codebook
[Figure: a second appearance codebook. Source: B. Leibe]
Yet another codebook
Fei-Fei et al. 2005
Visual vocabularies: Issues
• How to choose vocabulary size?
◼ Too small: visual words not representative of all
patches
◼ Too large: quantization artifacts, overfitting
• Computational efficiency
◼ Vocabulary trees
(Nister & Stewenius, 2006)
Spatial pyramid representation
Extension of a bag of features
Locally orderless representation at several levels of resolution
[Figure: histograms computed over a 1×1 grid (level 0), a 2×2 grid
(level 1), and a 4×4 grid (level 2)]
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural
Scene Categories. Lazebnik, Schmid & Ponce (CVPR 2006)
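A sketch of the representation for a single image, given quantized word indices and normalized keypoint positions (the original paper additionally weights the levels, which is omitted here):

```python
def spatial_pyramid(points, words, K, levels=3):
    """Concatenate visual-word histograms over 1x1, 2x2 and 4x4 grids.
    points: (x, y) positions in [0, 1); words: visual-word index per point."""
    feats = []
    for level in range(levels):
        n = 2 ** level                      # n x n cells at this level
        hists = [[0] * K for _ in range(n * n)]
        for (x, y), w in zip(points, words):
            cx = min(int(x * n), n - 1)     # cell containing the point
            cy = min(int(y * n), n - 1)
            hists[cy * n + cx][w] += 1
        for h in hists:                     # locally orderless: one word
            feats.extend(h)                 # histogram per grid cell
    return feats
```

Level 0 alone is exactly the plain bag of features; the finer grids add coarse spatial layout.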
Scene category dataset
Multi-class classification results (100 training images per class)
Caltech101 dataset
Multi-class classification results (30 training images per class)
Next Time
More classification
Visual saliency