Prof. Feng Liu
Winter 2020
http://www.cs.pdx.edu/~fliu/courses/cs410/
02/27/2020
Last Time
Introduction to object recognition
The slides for this topic are used from Prof. S. Lazebnik.
Today
Machine learning approach to object recognition
◼ Classifiers
◼ Bag-of-features models
Recognition: A machine learning approach
Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, Kristen Grauman, and Derek Hoiem
The machine learning framework
Apply a prediction function to a feature representation of
the image to get the desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
The machine learning framework
y = f(x)
Training: given a training set of labeled examples {(x1,y1),
…, (xN,yN)}, estimate the prediction function f by minimizing
the prediction error on the training set
Testing: apply f to a never-before-seen test example x and
output the predicted value y = f(x)
y: output, f: prediction function, x: image feature
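As a toy instance of this framework (all data and the threshold family here are invented for illustration), training picks the f that minimizes error on the labeled examples, and testing applies it to an unseen x:

```python
# Toy training set: a 1-D "redness" feature with labels (invented numbers).
train = [(0.2, "tomato"), (0.4, "tomato"), (1.1, "apple"), (1.5, "apple")]

def make_f(t):
    # Hypothesis family: threshold classifiers on the single feature.
    return lambda x: "apple" if x > t else "tomato"

def train_error(f):
    return sum(f(x) != y for x, y in train)

# Training: estimate f by minimizing prediction error on the training set.
f = min((make_f(t / 10) for t in range(20)), key=train_error)

# Testing: apply f to a never-before-seen example.
prediction = f(1.3)   # -> "apple"
```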
Steps
Training: Training Images + Training Labels → Image Features → Training → Learned model
Testing: Test Image → Image Features → Learned model → Prediction
Slide credit: D. Hoiem
Features
Raw pixels
Histograms
GIST descriptors
…
Classifiers: Nearest neighbor
f(x) = label of the training example nearest to x
All we need is a distance function for our inputs
No training required!
[Figure: a test example in feature space, surrounded by training examples
from class 1 and class 2]
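A minimal sketch of the nearest-neighbor rule (toy 2-D points; Euclidean distance assumed as the distance function):

```python
import math

def nearest_neighbor_classify(x, train_X, train_y):
    """f(x) = label of the training example nearest to x."""
    dists = [math.dist(x, xi) for xi in train_X]   # distance to every example
    return train_y[dists.index(min(dists))]        # label of the closest one

# Toy training examples from two classes (made-up coordinates).
train_X = [(0, 0), (1, 0), (4, 4), (5, 4)]
train_y = ["class 1", "class 1", "class 2", "class 2"]
```

Note there is no training step at all: the entire training set is the model.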
Classifiers: Linear
Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
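In code the decision rule is just the sign of a dot product plus a bias; here w and b are arbitrary stand-ins for weights a learner (e.g. a perceptron or linear SVM) would produce:

```python
def linear_classify(x, w, b):
    # f(x) = sgn(w · x + b): which side of the hyperplane w · x + b = 0
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical learned parameters: separate points by whether x0 + x1 > 3.
w, b = (1.0, 1.0), -3.0
```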
Recognition task and supervision
Images in the training set must be annotated with the
“correct answer” that the model is expected to produce
Example annotation: “Contains a motorbike”
Unsupervised “Weakly” supervised Fully supervised
Definition depends on task
Generalization
How well does a learned model generalize from
the data it was trained on to a new test set?
Training set (labels known) Test set (labels unknown)
Generalization
Components of generalization error
◼ Bias: how much does the average model over all training sets differ
from the true model?
Error due to inaccurate assumptions/simplifications made by the model
◼ Variance: how much do models estimated from different training sets
differ from each other?
Underfitting: model is too “simple” to represent all the relevant
class characteristics
◼ High bias and low variance
◼ High training error and high test error
Overfitting: model is too “complex” and fits irrelevant
characteristics (noise) in the data
◼ Low bias and high variance
◼ Low training error and high test error
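A tiny, deterministic illustration of the two regimes (numbers invented): a constant predictor underfits, while a memorizing 1-NN predictor drives training error to zero by fitting the noise:

```python
# Noisy 1-D samples of an underlying y = x relationship (invented numbers).
train = [(0, 0.0), (1, 1.2), (2, 1.9), (3, 3.1)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

mean_y = sum(y for _, y in train) / len(train)

def constant(x):
    # "Too simple": ignores x entirely -> high bias, high training error.
    return mean_y

def memorize(x):
    # "Too complex": returns the y of the nearest training x, fitting the
    # noise exactly -> low bias, high variance, zero training error.
    return min(train, key=lambda p: abs(p[0] - x))[1]
```

The memorizer's zero training error says nothing about its error on new data, which is exactly the gap the bias-variance tradeoff describes.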
Bias-variance tradeoff
[Figure: training and test error vs. model complexity. Simple models (high
bias, low variance) underfit; complex models (low bias, high variance)
overfit. Training error keeps falling with complexity, while test error is
minimized in between.]
Slide credit: D. Hoiem
Bias-variance tradeoff
[Figure: test error vs. model complexity, plotted for many vs. few training
examples; more training data supports more complex models.]
Slide credit: D. Hoiem
Effect of Training Size
[Figure: for a fixed prediction model, training error and testing error vs.
number of training examples; the two curves approach the generalization
error as the training set grows.]
Slide credit: D. Hoiem
Datasets
Circa 2001: 5 categories, 100s of images per
category
Circa 2004: 101 categories
Today: up to thousands of categories, millions
of images
Caltech 101 & 256
Griffin, Holub, Perona, 2007
Fei-Fei, Fergus, Perona, 2004
http://www.vision.caltech.edu/Image_Datasets/Caltech101/
http://www.vision.caltech.edu/Image_Datasets/Caltech256/
Caltech-101: Intraclass variability
The PASCAL Visual Object Classes
Challenge (2005-present)
Challenge classes:
Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
http://host.robots.ox.ac.uk/pascal/VOC/
Main competitions
◼ Classification: For each of the twenty classes,
predicting presence/absence of an example of that
class in the test image
◼ Detection: Predicting the bounding box and label of
each object from the twenty target classes in the test
image
The PASCAL Visual Object Classes
Challenge (2005-present)
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
“Taster” challenges
◼ Segmentation: generating pixel-wise segmentations giving the class of
the object visible at each pixel, or "background" otherwise
◼ Person layout: predicting the bounding box and label of each part of a
person (head, hands, feet)
The PASCAL Visual Object Classes
Challenge (2005-present)
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
“Taster” challenges
◼ Action classification
The PASCAL Visual Object Classes
Challenge (2005-present)
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
LabelMe: http://labelme.csail.mit.edu/
Russell, Torralba, Murphy, Freeman, 2008
80 Million Tiny Images
http://people.csail.mit.edu/torralba/tinyimages/
ImageNet http://www.image-net.org/
Today
Machine learning approach to object recognition
◼ Classifiers
◼ Bag-of-features models
Bag-of-features models
Origin 1: Texture recognition
Texture is characterized by the repetition of basic
elements or textons
For stochastic textures, it is the identity of the
textons, not their spatial arrangement, that matters
Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003
Origin 1: Texture recognition
Universal texton dictionary
histogram
Origin 2: Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary (Salton & McGill, 1983)
US Presidential Speeches Tag Cloud: http://chir.ag/phernalia/preztags/
Bag-of-features steps
1. Extract features
2. Learn “visual vocabulary”
3. Quantize features using visual vocabulary
4. Represent images by frequencies of “visual words”
1. Feature extraction
Regular grid or interest regions
Normalize patch
Detect patches
Compute descriptor
Slide credit: Josef Sivic
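A sketch of the regular-grid variant of step 1, using mean-subtracted, length-normalized raw pixels as the descriptor (real pipelines typically use SIFT-like descriptors instead):

```python
def extract_patches(image, size=4, stride=4):
    """Detect patches on a regular grid, normalize each, compute a descriptor."""
    h, w = len(image), len(image[0])
    descriptors = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            # Flatten the size x size patch into a vector of pixel values.
            patch = [float(image[top + i][left + j])
                     for i in range(size) for j in range(size)]
            # Normalize the patch: subtract the mean, divide by the L2 norm.
            mean = sum(patch) / len(patch)
            norm = sum((p - mean) ** 2 for p in patch) ** 0.5 or 1.0
            descriptors.append([(p - mean) / norm for p in patch])
    return descriptors
```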
2. Learning the visual vocabulary
[Figure: descriptors in feature space are clustered; the cluster centers
form the visual vocabulary]
Slide credit: Josef Sivic
K-means clustering
• Want to minimize sum of squared Euclidean
distances between points xi and their
nearest cluster centers mk
Algorithm:
• Randomly initialize K cluster centers
• Iterate until convergence:
◼ Assign each data point to the nearest center
◼ Re-compute each cluster center as the mean of
all points assigned to it
D(X, M) = Σ_k Σ_(x_i in cluster k) ||x_i − m_k||²
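A from-scratch sketch of the algorithm above on toy 2-D points; a fixed seed stands in for the random initialization:

```python
import random

def kmeans(points, K, iters=20, seed=0):
    """Minimize D(X, M) = sum_k sum_{x_i in cluster k} ||x_i - m_k||^2."""
    centers = random.Random(seed).sample(points, K)  # random initial centers
    for _ in range(iters):
        # Assignment step: each data point goes to its nearest center.
        clusters = [[] for _ in range(K)]
        for p in points:
            j = min(range(K), key=lambda k: sum((a - b) ** 2
                                                for a, b in zip(p, centers[k])))
            clusters[j].append(p)
        # Update step: re-compute each center as the mean of its points.
        for k, members in enumerate(clusters):
            if members:
                centers[k] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers

# Two well-separated toy blobs; k-means should recover their means.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
```

A fixed iteration count stands in for a convergence test; a fuller version would stop when assignments no longer change.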
Clustering and vector quantization
• Clustering is a common method for learning a visual
vocabulary or codebook
◼ Unsupervised learning process
◼ Each cluster center produced by k-means becomes a codevector
◼ Codebook can be learned on separate training set
◼ Provided the training set is sufficiently representative, the
codebook will be “universal”
• The codebook is used for quantizing features
◼ A vector quantizer takes a feature vector and maps it to the index
of the nearest codevector in a codebook
◼ Codebook = visual vocabulary
◼ Codevector = visual word
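The quantizer and the resulting bag-of-features histogram can be sketched as follows (toy codebook of two codevectors):

```python
def quantize(descriptor, codebook):
    """Map a feature vector to the index of its nearest codevector."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2
                                 for a, b in zip(descriptor, codebook[k])))

def bag_of_features(descriptors, codebook):
    """Represent an image by the frequencies of its visual words."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[quantize(d, codebook)] += 1   # count each visual word
    return hist

# Toy codebook (visual vocabulary) and image descriptors.
codebook = [(0, 0), (10, 10)]
descriptors = [(1, 1), (9, 9), (0, 2)]
```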
Example codebook
[Figure: appearance codebook of clustered patches. Source: B. Leibe]
Another codebook
[Figure: a second appearance codebook. Source: B. Leibe]
Yet another codebook
Fei-Fei et al. 2005
Visual vocabularies: Issues
• How to choose vocabulary size?
◼ Too small: visual words not representative of all
patches
◼ Too large: quantization artifacts, overfitting
• Computational efficiency
◼ Vocabulary trees
(Nister & Stewenius, 2006)
Spatial pyramid representation
Extension of a bag of features
Locally orderless representation at several levels of resolution
[Figure: histograms computed over a 1×1 grid (level 0), a 2×2 grid
(level 1), and a 4×4 grid (level 2)]
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural
Scene Categories. Lazebnik, Schmid & Ponce (CVPR 2006)
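A sketch of the representation for a single image, given quantized word indices and normalized keypoint positions (the original paper additionally weights the levels, which is omitted here):

```python
def spatial_pyramid(points, words, K, levels=3):
    """Concatenate visual-word histograms over 1x1, 2x2 and 4x4 grids.
    points: (x, y) positions in [0, 1); words: visual-word index per point."""
    feats = []
    for level in range(levels):
        n = 2 ** level                      # n x n cells at this level
        hists = [[0] * K for _ in range(n * n)]
        for (x, y), w in zip(points, words):
            cx = min(int(x * n), n - 1)     # cell containing the point
            cy = min(int(y * n), n - 1)
            hists[cy * n + cx][w] += 1
        for h in hists:                     # locally orderless: one word
            feats.extend(h)                 # histogram per grid cell
    return feats
```

Level 0 alone is exactly the plain bag of features; the finer grids add coarse spatial layout.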
Scene category dataset
Multi-class classification results (100 training images per class)
Caltech101 dataset
Multi-class classification results (30 training images per class)
Next Time
More classification
Visual saliency