Loss-based Learning with Latent Variables


Loss-based Learning with Latent Variables

M. Pawan Kumar

École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France

Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Image Classification

Feature Ψ(x,y,h) (e.g. HOG)

Score f : Ψ(x,y,h) → (-∞, +∞)

f(Ψ(x,y1,h)) for the candidate boxes h:

0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00

Image Classification

Feature Ψ(x,y,h) (e.g. HOG)

Score f : Ψ(x,y,h) → (-∞, +∞)

f(Ψ(x,y1,h)):

0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00

f(Ψ(x,y2,h)):

0.00 0.23 0.00
0.00 0.00 0.01
0.01 0.00 0.00

Prediction y(f)

Learn f

Loss-based Learning

User-defined loss function Δ(y,y(f))

f* = argminf Σi Δ(yi,yi(f))

Minimize loss between predicted and ground-truth output

No restriction on the loss function

General framework (object detection, segmentation, …)
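To make the objective concrete, here is a minimal sketch (not from the slides) of evaluating the quantity being minimised, assuming a 0/1 loss as the user-defined Δ; all names are illustrative.

```python
def zero_one_loss(y_true, y_pred):
    """User-defined loss Delta(y, y(f)); here the 0/1 loss."""
    return 0.0 if y_true == y_pred else 1.0

def empirical_loss(predict, data, loss=zero_one_loss):
    """Sum_i Delta(y_i, y_i(f)): the quantity loss-based learning
    minimises over the scoring function f (via its predictor)."""
    return sum(loss(y_i, predict(x_i)) for x_i, y_i in data)
```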

Outline

• Latent SVM

• Max-Margin Min-Entropy Models

• Dissimilarity Coefficient Learning

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005;Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Latent SVM

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

w: parameters; Ψ(x,y,h): features
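A minimal sketch of this prediction rule, assuming a joint feature map psi(x, y, h) returning a NumPy vector and small, enumerable label and box sets; the names are illustrative, not from the slides.

```python
import numpy as np

def predict_latent_svm(w, x, labels, boxes, psi):
    """y(w), h(w) = argmax over (y, h) of w^T Psi(x, y, h)."""
    best_score, best_yh = -np.inf, None
    for y in labels:
        for h in boxes:
            score = float(w @ psi(x, y, h))  # linear score w^T Psi(x, y, h)
            if score > best_score:
                best_score, best_yh = score, (y, h)
    return best_yh
```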

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

w* = argminw Σi Δ(yi,yi(w))

Highly non-convex in w

Cannot regularize w to prevent overfitting

Learning Latent SVM

Δ(yi,yi(w)) = Δ(yi,yi(w)) + wTΨ(xi,yi(w),hi(w)) - wTΨ(xi,yi(w),hi(w))

≤ Δ(yi,yi(w)) + wTΨ(xi,yi(w),hi(w)) - maxhi wTΨ(xi,yi,hi)

≤ maxy,h { wTΨ(xi,y,h) + Δ(yi,y) } - maxhi wTΨ(xi,yi,hi)

Training data {(xi,yi), i = 1,2,…,n}

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, ∀ y, h

Difference-of-convex program in w

Local minimum or saddle point solution (CCCP)

Self-Paced Learning, NIPS 2010
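A schematic of the CCCP-style alternation typically used for this difference-of-convex program: impute the best latent variable for each ground-truth label under the current w, then solve the resulting convex structural SVM with those latent variables fixed. `solve_structural_svm` is a placeholder for any convex solver; this is an illustrative sketch, not the authors' code.

```python
def cccp_latent_svm(w_init, data, boxes, psi, solve_structural_svm, n_iters=10):
    """Alternate between (1) imputing h_i* = argmax_h w^T Psi(x_i, y_i, h)
    (linearising the concave part) and (2) solving the convex structural
    SVM problem with the imputed h_i* held fixed."""
    w = w_init
    for _ in range(n_iters):
        # Step 1: impute latent variables for the ground-truth labels
        imputed = [(x_i, y_i, max(boxes, key=lambda h: w @ psi(x_i, y_i, h)))
                   for x_i, y_i in data]
        # Step 2: convex update of w (structural SVM with h_i* fixed)
        w = solve_structural_svm(imputed)
    return w
```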

Recap

minw ||w||2 + C Σiξi

wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, ∀ y, h

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Learning

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Image Classification

Score wTΨ(x,y,h) → (-∞, +∞)

wTΨ(x,y1,h) for the candidate boxes h:

0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

Image Classification

wTΨ(x,y1,h):

0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

wTΨ(x,y2,h):

0.00 0.24 0.00
0.00 0.00 0.00
0.01 0.00 0.00

Only maximum score used

No other useful cue?

Score wTΨ(x,y,h) → (-∞, +∞)

Uncertainty in h

Outline

• Latent SVM

• Max-Margin Min-Entropy (M3E) Models

• Dissimilarity Coefficient Learning

Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

M3E

Scoring function

Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

Prediction

y(w) = argminy { Hα(Pw(h|y,x)) - log Pw(y|x) }

Z(x): Partition Function

Pw(y|x): Marginalized Probability

Hα: Rényi Entropy

Rényi Entropy of Generalized Distribution Gα(y;x,w)

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = 1. Shannon Entropy of Generalized Distribution:

- [ Σh Pw(y,h|x) log Pw(y,h|x) ] / [ Σh Pw(y,h|x) ]

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = Infinity. Minimum Entropy of Generalized Distribution

- maxh log(Pw(y,h|x))

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = Infinity. Minimum Entropy of Generalized Distribution:

- maxh wTΨ(x,y,h) (up to the constant log Z(x), which does not affect the prediction)

Same prediction as latent SVM
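A sketch of Gα and the M3E prediction rule, assuming the scores wTΨ(x,y,h) are stored in a 2-D array indexed by (y, h); the α = 1 and α = ∞ branches implement the limits above. Names are illustrative.

```python
import numpy as np

def renyi_generalized_entropy(scores, alpha):
    """G_alpha(y; x, w) for every label y, where
    scores[y, h] = w^T Psi(x, y, h) and P_w(y, h | x) = exp(scores) / Z(x)."""
    log_z = np.logaddexp.reduce(scores.ravel())        # log Z(x)
    log_joint = scores - log_z                         # log P_w(y, h | x)
    p = np.exp(log_joint)
    mass = p.sum(axis=1)                               # P_w(y | x)
    if np.isinf(alpha):                                # alpha -> inf: minimum entropy
        return -log_joint.max(axis=1)
    if alpha == 1.0:                                   # alpha -> 1: Shannon limit
        return -(p * log_joint).sum(axis=1) / mass
    return (np.log((p ** alpha).sum(axis=1)) - np.log(mass)) / (1.0 - alpha)

def m3e_predict(scores, alpha):
    """y(w) = argmin_y G_alpha(y; x, w); for alpha = inf this coincides
    with the latent SVM prediction argmax_{y,h} w^T Psi(x, y, h)."""
    return int(np.argmin(renyi_generalized_entropy(scores, alpha)))
```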

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

w* = argminw Σi Δ(yi,yi(w))

Highly non-convex in w

Cannot regularize w to prevent overfitting

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w)) = Δ(yi,yi(w)) + Gα(yi(w);xi,w) - Gα(yi(w);xi,w)

≤ Δ(yi,yi(w)) + Gα(yi;xi,w) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + maxy { Δ(yi,y) - Gα(y;xi,w) }

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

Gα(yi;xi,w) + Δ(yi,y) - Gα(y;xi,w) ≤ ξi, ∀ y

When α tends to infinity, M3E = Latent SVM

Other values can give better results
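For a single sample the tightest slack satisfying these constraints can be written down directly; a small sketch, reusing renyi_generalized_entropy from the snippet above, with `delta` the user-defined loss Δ(yi, y).

```python
def m3e_slack(scores, y_true, alpha, delta):
    """xi_i = max_y { G_alpha(y_i; x_i, w) + Delta(y_i, y) - G_alpha(y; x_i, w) },
    clipped at zero (the constraint for y = y_i already gives zero)."""
    g = renyi_generalized_entropy(scores, alpha)
    return max(0.0, max(g[y_true] + delta(y_true, y) - g[y]
                        for y in range(len(g))))
```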

Image Classification

271 images, 6 classes

90/10 train/test split

5 folds

Mammals Dataset

0/1 loss

Image Classification

HOG-Based Model. Dalal and Triggs, 2005

Motif Finding

~ 40,000 sequences

50/50 train/test split

5 Proteins, 5 folds

UniProbe Dataset

Binding vs. Not-Binding

Motif + Markov Background Model. Yu and Joachims, 2009

Motif Finding

Recap

Scoring function

Prediction

Learning

Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

y(w) = argminy Gα(y;x,w)

minw ||w||2 + C Σiξi

Gα(yi;xi,w) + Δ(yi,y) - Gα(y;xi,w) ≤ ξi, ∀ y

Outline

• Latent SVM

• Max-Margin Min-Entropy Models

• Dissimilarity Coefficient Learning

Kumar, Packer and Koller, ICML 2012

Object Detection

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Minimizing General Loss

minw Σi Δ(yi,hi,yi(w),hi(w)) (supervised samples)

+ Σi Δ'(yi,yi(w),hi(w)) (weakly supervised samples)

Unknown latent variable values

Minimizing General Loss

minw Σi Σhi Δ(yi,hi,yi(w),hi(w)) Pw(hi|xi,yi)

A single distribution to achieve two objectives

Problem

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Solution

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks


[Figure: two distributions over the latent space hi, Pw(yi,hi|xi) and Pθ(hi|yi,xi), with the ground truth (yi,hi) and the prediction (yi(w),hi(w)) marked]

The Ideal Case: No latent variable uncertainty, correct prediction


In Practice: Restrictions in the representation power of models


Our Framework: Minimize the dissimilarity between the two distributions


User-defined dissimilarity measure

Our Framework: Minimize Rao's Dissimilarity Coefficient


Hi(w,θ) = Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Our Framework: Minimize Rao's Dissimilarity Coefficient


Hi(w,θ) - β Σh,h' Δ(yi,h,yi,h') Pθ(h|yi,xi) Pθ(h'|yi,xi)

Our Framework: Minimize Rao's Dissimilarity Coefficient


Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w))

(the last term is zero, since the loss between identical outputs vanishes)

Our Framework: Minimize Rao's Dissimilarity Coefficient


minw,θ Σi Hi(w,θ) - β Hi(θ,θ)
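A sketch of the two terms of this objective, assuming a latent space small enough to enumerate; p_theta is the vector Pθ(h|yi,xi), delta is the joint loss Δ(y,h,y',h'), and all names are illustrative.

```python
def H_w_theta(p_theta, y_true, y_pred, h_pred, delta):
    """H_i(w, theta) = sum_h Delta(y_i, h, y_i(w), h_i(w)) P_theta(h | y_i, x_i)."""
    return sum(p * delta(y_true, h, y_pred, h_pred)
               for h, p in enumerate(p_theta))

def H_theta_theta(p_theta, y_true, delta):
    """H_i(theta, theta) = sum_{h,h'} Delta(y_i, h, y_i, h')
       P_theta(h | y_i, x_i) P_theta(h' | y_i, x_i)."""
    return sum(p * q * delta(y_true, h, y_true, h2)
               for h, p in enumerate(p_theta)
               for h2, q in enumerate(p_theta))

def disc_objective(samples, beta, delta):
    """sum_i H_i(w, theta) - beta * H_i(theta, theta); each sample carries its
    P_theta vector, ground-truth label y_i and current prediction (y_i(w), h_i(w))."""
    return sum(H_w_theta(p, y, y_p, h_p, delta) - beta * H_theta_theta(p, y, delta)
               for p, y, y_p, h_p in samples)
```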

Optimization

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)

Initialize the parameters to w0 and θ0

Repeat until convergence:
  Fix w and optimize θ
  Fix θ and optimize w
End
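A schematic of this block-coordinate scheme; optimize_theta (the stochastic subgradient step below), optimize_w (the CCCP step below) and objective are placeholders for the inner solvers and for evaluating Σi Hi(w,θ) - β Hi(θ,θ).

```python
def learn_disc(w0, theta0, data, beta,
               optimize_theta, optimize_w, objective,
               max_iters=20, tol=1e-4):
    """Alternating minimisation of sum_i H_i(w, theta) - beta H_i(theta, theta):
    fix w and update theta, then fix theta and update w, until the
    objective stops decreasing."""
    w, theta = w0, theta0
    prev = float("inf")
    for _ in range(max_iters):
        theta = optimize_theta(w, theta, data, beta)   # fix w, optimise theta
        w = optimize_w(w, theta, data, beta)           # fix theta, optimise w
        obj = objective(w, theta, data, beta)
        if prev - obj < tol:                           # converged
            break
        prev = obj
    return w, theta
```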

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

[Figure: Pθ(hi|yi,xi) over the latent space hi, with hi(w) marked]

Case I: yi(w) = yi



Case II: yi(w) ≠ yi


Stochastic subgradient descent

Optimization of w

minw Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi)

Expected loss, models uncertainty

Form of optimization similar to Latent SVM

If Δ is independent of h, this reduces to latent SVM

Concave-Convex Procedure (CCCP)

Object Detection

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Input x

Output y = “Deer”

Latent Variable h

Mammals Dataset

60/40 Train/Test Split

5 Folds

Training data: Input xi, Output yi

Results – 0/1 Loss

[Bar chart: average test loss per fold (Fold 1-5), LSVM vs. Our method]

Statistically Significant

Results – Overlap Loss

[Bar chart: average test loss per fold (Fold 1-5), LSVM vs. Our method]

Action Detection

Input x

Output y = “Using Computer”

Latent Variable h

PASCAL VOC 2011

60/40 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Training data: Input xi, Output yi

Results – 0/1 Loss

[Bar chart: average test loss per fold (Fold 1-5), LSVM vs. Our method]

Statistically Significant

Results – Overlap Loss

[Bar chart: average test loss per fold (Fold 1-5), LSVM vs. Our method]

Statistically Significant

Conclusions

• Latent SVM ignores latent variable uncertainty

• M3E for latent-variable-independent losses

• DISC (Dissimilarity Coefficient Learning) for latent-variable-dependent losses

• Strict generalizations of Latent SVM

• Code available online

Questions?

http://cvc.centrale-ponts.fr/personnel/pawan
