Loss-based Learning with Latent Variables
M. Pawan Kumar
École Centrale Paris / École des Ponts ParisTech / INRIA Saclay, Île-de-France
Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman
Image Classification
Images xi Labels yi
Bison
Deer
Elephant
Giraffe
Llama
Rhino
Boxes hi
y = “Deer”
Image x
Image Classification
Feature Ψ(x,y,h) (e.g. HOG)
Score f : Ψ(x,y,h) → (-∞, +∞)
f (Ψ(x,y1,h)):
0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00
Image Classification
Feature Ψ(x,y,h) (e.g. HOG)
Score f : Ψ(x,y,h) → (-∞, +∞)
f (Ψ(x,y1,h)):
0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00
f (Ψ(x,y2,h)):
0.00 0.23 0.00
0.00 0.00 0.01
0.01 0.00 0.00
Prediction y(f)
Learn f
Loss-based Learning
User-defined loss function Δ(y,y(f))
f* = argminf Σi Δ(yi,yi(f))
Minimize loss between predicted and ground-truth output
No restriction on the loss function
General framework (object detection, segmentation, …)
Outline
• Latent SVM
• Max-Margin Min-Entropy Models
• Dissimilarity Coefficient Learning

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009
Latent SVM
Scoring function
wTΨ(x,y,h)
Prediction
y(w),h(w) = argmaxy,h wTΨ(x,y,h)
(w: parameters; Ψ: features)
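As a concrete illustration, here is a minimal sketch of this prediction rule over small discrete label and box sets; the feature map `psi` and the candidate sets `labels` and `boxes` are hypothetical placeholders, not the implementation from the papers above.

```python
import numpy as np

def lsvm_predict(w, psi, x, labels, boxes):
    """Latent SVM prediction: score every (y, h) pair and return the
    argmax, i.e. y(w), h(w) = argmax_{y,h} w^T Psi(x, y, h).
    psi(x, y, h) is a hypothetical feature map returning a numpy vector."""
    best_score, best_pair = -np.inf, None
    for y in labels:
        for h in boxes:
            score = w @ psi(x, y, h)  # w^T Psi(x, y, h)
            if score > best_score:
                best_score, best_pair = score, (y, h)
    return best_pair
```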
Learning Latent SVM
Training data {(xi,yi), i = 1,2,…,n}
w* = argminw Σi Δ(yi,yi(w))
Highly non-convex in w
Cannot regularize w to prevent overfitting
Learning Latent SVM
Training data {(xi,yi), i = 1,2,…,n}
Δ(yi,yi(w)) = wTΨ(x,yi(w),hi(w)) + Δ(yi,yi(w)) - wTΨ(x,yi(w),hi(w))
≤ wTΨ(x,yi(w),hi(w)) + Δ(yi,yi(w)) - maxhi wTΨ(x,yi,hi)
≤ maxy,h {wTΨ(x,y,h) + Δ(yi,y)} - maxhi wTΨ(x,yi,hi)
Learning Latent SVM
Training data {(xi,yi), i = 1,2,…,n}
minw ||w||2 + C Σiξi
wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, for all y, h
Difference-of-convex program in w
Local minimum or saddle point solution (CCCP)
Self-Paced Learning, NIPS 2010
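A sketch of the CCCP-style alternation commonly used for this difference-of-convex objective: impute the latent variables with the current w, then solve the resulting fully supervised convex problem. Here `solve_ssvm` is a hypothetical stand-in for a structural SVM solver, and `psi`/`boxes` are the placeholders from the previous sketch.

```python
import numpy as np

def cccp_latent_svm(w0, data, psi, boxes, solve_ssvm, max_iters=20, tol=1e-4):
    """CCCP for latent SVM. Step 1 linearizes the concave part
    -max_h w^T Psi(x, y_i, h) by imputing h_i under the current w;
    step 2 solves the resulting convex structural SVM. Converges to a
    local minimum or saddle point of the upper bound."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        # Step 1: impute latent variables for the ground-truth labels
        imputed = [(x, y, max(boxes, key=lambda h: w @ psi(x, y, h)))
                   for (x, y) in data]
        # Step 2: fully supervised convex problem with imputed (y_i, h_i)
        w_new = solve_ssvm(imputed)
        if np.linalg.norm(w_new - w) < tol:
            break
        w = w_new
    return w
```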
Recap
Scoring function
wTΨ(x,y,h)
Prediction
y(w),h(w) = argmaxy,h wTΨ(x,y,h)
Learning
minw ||w||2 + C Σiξi
wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, for all y, h
Image Classification
Score wTΨ(x,y,h) → (-∞, +∞)
wTΨ(x,y1,h):
0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00
Image Classification
wTΨ(x,y1,h):
0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00
wTΨ(x,y2,h):
0.00 0.24 0.00
0.00 0.00 0.00
0.01 0.00 0.00
Only maximum score used
No other useful cue?
Uncertainty in h
Outline
• Latent SVM
• Max-Margin Min-Entropy (M3E) Models
• Dissimilarity Coefficient Learning
Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012
M3E
Scoring function
Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)
Prediction
y(w) = argminy { Hα(Pw(h|y,x)) – log Pw(y|x) }
(Z(x): partition function; Pw(y|x): marginalized probability; Hα: Rényi entropy)
= argminy Gα(y;x,w), the Rényi entropy of the generalized distribution {Pw(y,h|x)}
Rényi Entropy
Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = 1: Shannon entropy of the generalized distribution
G1(y;x,w) = - Σh Pw(y,h|x) log Pw(y,h|x) / Σh Pw(y,h|x)
α → ∞: minimum entropy of the generalized distribution
G∞(y;x,w) = - maxh log Pw(y,h|x)
= - maxh wTΨ(x,y,h) + log Z(x)
log Z(x) is constant in y, so minimizing G∞ maximizes maxh wTΨ(x,y,h):
Same prediction as latent SVM
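The entropies above reduce to log-sum-exp computations over the scores, so Gα and the M3E prediction can be sketched directly; `psi`, `labels`, and `boxes` are again hypothetical placeholders, and α = 1 would need the Shannon limit shown above.

```python
import numpy as np
from scipy.special import logsumexp

def renyi_G(scores_y, logZ, alpha):
    """Renyi entropy of the generalized distribution for one label y:
    G_alpha(y) = 1/(1-alpha) * log(sum_h P(y,h)^alpha / sum_h P(y,h)),
    computed in log space. scores_y[h] = w^T Psi(x, y, h).
    alpha == 1 needs the Shannon-entropy limit (not handled in this sketch)."""
    logp = scores_y - logZ                    # log P_w(y, h | x)
    if np.isinf(alpha):                       # minimum-entropy limit
        return -np.max(logp)
    return (logsumexp(alpha * logp) - logsumexp(logp)) / (1.0 - alpha)

def m3e_predict(w, psi, x, labels, boxes, alpha):
    """M3E prediction: y(w) = argmin_y G_alpha(y; x, w)."""
    scores = np.array([[w @ psi(x, y, h) for h in boxes] for y in labels])
    logZ = logsumexp(scores)                  # partition over all (y, h)
    G = [renyi_G(scores[i], logZ, alpha) for i in range(len(labels))]
    return labels[int(np.argmin(G))]
```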
Learning M3E
Training data {(xi,yi), i = 1,2,…,n}
w* = argminw Σi Δ(yi,yi(w))
Highly non-convex in w
Cannot regularize w to prevent overfitting
Learning M3E
Training data {(xi,yi), i = 1,2,…,n}
Δ(yi,yi(w)) = Δ(yi,yi(w)) + Gα(yi(w);xi,w) - Gα(yi(w);xi,w)
≤ Gα(yi;xi,w) + Δ(yi,yi(w)) - Gα(yi(w);xi,w)
≤ Gα(yi;xi,w) + maxy {Δ(yi,y) - Gα(y;xi,w)}
Learning M3E
Training data {(xi,yi), i = 1,2,…,n}
minw ||w||2 + C Σiξi
Gα(yi;xi,w) + Δ(yi,y) – Gα(y;xi,w) ≤ ξi, for all y
When α tends to infinity, M3E = latent SVM
Other values of α can give better results
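For intuition, the slack for one example is the largest violation of the constraint above. A minimal sketch, assuming `G` maps each label to its entropy Gα(y;xi,w) (e.g. the renyi_G values from the earlier sketch), `delta` is the user-defined loss, and Δ(y,y) = 0.

```python
def m3e_slack(G, y_true, delta):
    """Hinge slack xi_i = max_y [G_alpha(y_i) + Delta(y_i, y) - G_alpha(y)].
    Non-negative when Delta(y_i, y_i) = 0, since the y = y_i term vanishes."""
    return max(G[y_true] + delta(y_true, y) - G[y] for y in G)
```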
Image Classification
Mammals Dataset: 271 images, 6 classes
90/10 train/test split, 5 folds
0/1 loss
HOG-based model (Dalal and Triggs, 2005)
Motif Finding
UniProbe Dataset: ~40,000 sequences
5 proteins, 5 folds, 50/50 train/test split
Binding vs. not-binding
Motif + Markov background model (Yu and Joachims, 2009)
Recap
Scoring function
Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)
Prediction
y(w) = argminy Gα(y;x,w)
Learning
minw ||w||2 + C Σiξi
Gα(yi;xi,w) + Δ(yi,y) – Gα(y;xi,w) ≤ ξi, for all y
Outline
• Latent SVM
• Max-Margin Min-Entropy Models
• Dissimilarity Coefficient Learning
Kumar, Packer and Koller, ICML 2012
Object Detection
Images xi Labels yi
Bison
Deer
Elephant
Giraffe
Llama
Rhino
Boxes hi
y = “Deer”
Image x
Minimizing General Loss
minw Σi Δ(yi,hi,yi(w),hi(w))   [supervised samples]
   + Σi Δ’(yi,yi(w),hi(w))   [weakly supervised samples]
Unknown latent variable values for the weakly supervised samples
Minimizing General Loss
minw Σi Σhi Δ(yi,hi,yi(w),hi(w)) Pw(hi|xi,yi)
A single distribution to achieve two objectives
Problem
Model Uncertainty in Latent Variables
Model Accuracy of Latent Variable Predictions
Solution
Use two different distributions for the two different tasks
Pθ(hi|yi,xi): a distribution over the latent variables, models uncertainty in hi
Pw(yi,hi|xi): a delta distribution at the prediction (yi(w),hi(w)), models accuracy
The Ideal Case
No latent variable uncertainty, correct prediction:
Pθ(hi|yi,xi) puts all its mass on the ground-truth hi, and Pw predicts (yi,hi(w)) = (yi,hi)
In Practice
Restrictions in the representation power of models:
Pθ(hi|yi,xi) spreads its mass over several hi, and the prediction (yi(w),hi(w)) can differ from (yi,hi)
Our Framework
Minimize the dissimilarity between Pw(yi,hi|xi) and Pθ(hi|yi,xi)
User-defined dissimilarity measure
Our Framework
Minimize Rao’s Dissimilarity Coefficient between the two distributions:
Hi(w,θ) = Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)
Hi(θ,θ) = Σh,h’ Δ(yi,h,yi,h’) Pθ(h|yi,xi) Pθ(h’|yi,xi)
Hi(w,w) = Δ(yi(w),hi(w),yi(w),hi(w)) = 0
Dissimilarity coefficient: Hi(w,θ) - β Hi(θ,θ) - (1-β) Hi(w,w)
minw,θ Σi Hi(w,θ) - β Hi(θ,θ)
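For a discrete box set, the per-example terms can be written out directly. A minimal sketch, assuming `p_theta` maps each candidate box h to Pθ(h|yi,xi) and `delta` is the user-defined loss; these names are placeholders, not the released code.

```python
def disc_objective_i(p_theta, boxes, delta, y_i, y_pred, h_pred, beta):
    """Contribution of example i to sum_i H_i(w,theta) - beta*H_i(theta,theta):
    H_i(w,theta)     = sum_h    Delta(y_i, h, y(w), h(w)) * P_theta(h | y_i, x_i)
    H_i(theta,theta) = sum_{h,h'} Delta(y_i, h, y_i, h') * P_theta(h) * P_theta(h')"""
    H_wt = sum(delta(y_i, h, y_pred, h_pred) * p_theta[h] for h in boxes)
    H_tt = sum(delta(y_i, h, y_i, h2) * p_theta[h] * p_theta[h2]
               for h in boxes for h2 in boxes)
    return H_wt - beta * H_tt
```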
Optimization
minw,θ Σi Hi(w,θ) - β Hi(θ,θ)
Initialize the parameters to w0 and θ0
Repeat until convergence:
  Fix w and optimize θ
  Fix θ and optimize w
End
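In code, the alternation is a simple outer loop; `optimize_theta`, `optimize_w`, and `objective` are hypothetical stand-ins for the two inner solvers described on the following slides.

```python
def disc_learn(w0, theta0, data, optimize_theta, optimize_w, objective,
               max_iters=50, tol=1e-4):
    """Alternating minimization of sum_i H_i(w,theta) - beta*H_i(theta,theta):
    fix w and update theta, then fix theta and update w, until the
    objective stops improving. Each inner step cannot increase the objective."""
    w, theta = w0, theta0
    prev = float("inf")
    for _ in range(max_iters):
        theta = optimize_theta(w, theta, data)  # e.g. stochastic subgradient
        w = optimize_w(w, theta, data)          # e.g. CCCP, as in latent SVM
        cur = objective(w, theta, data)
        if prev - cur < tol:
            break
        prev = cur
    return w, theta
```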
Optimization of θ
minθ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)
Case I: yi(w) = yi
Case II: yi(w) ≠ yi
Stochastic subgradient descent
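For a log-linear choice Pθ(h|yi,xi) ∝ exp(θᵀφ(xi,yi,h)) (an assumption made for this sketch, with φ a hypothetical feature map), the gradient of the expected-loss term has the closed form E[Δ·φ] − E[Δ]·E[φ], which plugs straight into a stochastic subgradient step; the −β Hi(θ,θ) term is differentiated the same way.

```python
import numpy as np

def expected_loss_grad(theta, phi_feats, losses):
    """Gradient of sum_h Delta(h) * P_theta(h) for a log-linear model
    P_theta(h) proportional to exp(theta^T phi(h)):
        grad = E[Delta * phi] - E[Delta] * E[phi].
    phi_feats: (num_h, d) feature matrix; losses: (num_h,) loss values."""
    s = phi_feats @ theta
    p = np.exp(s - s.max())
    p /= p.sum()                           # P_theta(h), computed stably
    E_phi = p @ phi_feats                  # E[phi]
    E_loss = p @ losses                    # E[Delta]
    E_loss_phi = (p * losses) @ phi_feats  # E[Delta * phi]
    return E_loss_phi - E_loss * E_phi
```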
Optimization of w
minw Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)
Expected loss, models uncertainty
Form of optimization similar to latent SVM
If Δ is independent of h, this reduces to latent SVM
Concave-Convex Procedure (CCCP)
Object Detection
Bison
Deer
Elephant
Giraffe
Llama
Rhino
Input x
Output y = “Deer”, Latent Variable h
Mammals Dataset
60/40 Train/Test Split
5 Folds
Train Input xi Output yi
Results – 0/1 Loss
[Bar chart: average test loss on folds 1–5, LSVM vs. our method; improvement is statistically significant]
Results – Overlap Loss
[Bar chart: average test loss on folds 1–5, LSVM vs. our method]
Action Detection
Input x
Output y = “Using Computer”, Latent Variable h
PASCAL VOC 2011
60/40 Train/Test Split
5 Folds
Jumping
Phoning
Playing Instrument
Reading
Riding Bike
Riding Horse
Running
Taking Photo
Using Computer
Walking
Train Input xi Output yi
Results – 0/1 Loss
[Bar chart: average test loss on folds 1–5, LSVM vs. our method; improvement is statistically significant]
Results – Overlap Loss
[Bar chart: average test loss on folds 1–5, LSVM vs. our method; improvement is statistically significant]
Conclusions
• Latent SVM ignores latent variable uncertainty
• M3E for latent-variable-independent losses
• DISC for latent-variable-dependent losses
• Both are strict generalizations of latent SVM
• Code available online
Questions?
http://cvc.centrale-ponts.fr/personnel/pawan