Loss-based Learning with Latent Variables


Loss-based Learning with Latent Variables

M. Pawan Kumar

École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France

Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Image Classification

Feature Ψ(x,y,h) (e.g. HOG)

Score f : Ψ(x,y,h) → (-∞, +∞)

f(Ψ(x,y1,h)) for the candidate boxes h:

0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00

Image Classification

Feature Ψ(x,y,h) (e.g. HOG)

Score f : Ψ(x,y,h) → (-∞, +∞)

f(Ψ(x,y1,h)):

0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00

f(Ψ(x,y2,h)):

0.00 0.23 0.00
0.00 0.00 0.01
0.01 0.00 0.00

Prediction y(f)

Learn f

Loss-based Learning

User-defined loss function Δ(y,y(f))

f* = argminf Σi Δ(yi,yi(f))

Minimize loss between predicted and ground-truth output

No restriction on the loss function

General framework (object detection, segmentation, …)
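To make the objective concrete, here is a minimal sketch (not from the slides) of evaluating the quantity being minimised, assuming a 0/1 loss as the user-defined Δ; all names are illustrative.

```python
def zero_one_loss(y_true, y_pred):
    """User-defined loss Delta(y, y(f)); here the 0/1 loss."""
    return 0.0 if y_true == y_pred else 1.0

def empirical_loss(predict, data, loss=zero_one_loss):
    """Sum_i Delta(y_i, y_i(f)): the quantity loss-based learning
    minimises over the scoring function f (via its predictor)."""
    return sum(loss(y_i, predict(x_i)) for x_i, y_i in data)
```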

Outline

• Latent SVM

• Max-Margin Min-Entropy Models

• Dissimilarity Coefficient Learning

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005;Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Latent SVM

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

w: parameters; Ψ(x,y,h): features
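A minimal sketch of this prediction rule, assuming a joint feature map psi(x, y, h) returning a NumPy vector and small, enumerable label and box sets; the names are illustrative, not from the slides.

```python
import numpy as np

def predict_latent_svm(w, x, labels, boxes, psi):
    """y(w), h(w) = argmax over (y, h) of w^T Psi(x, y, h)."""
    best_score, best_yh = -np.inf, None
    for y in labels:
        for h in boxes:
            score = float(w @ psi(x, y, h))  # linear score w^T Psi(x, y, h)
            if score > best_score:
                best_score, best_yh = score, (y, h)
    return best_yh
```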

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

w* = argminw Σi Δ(yi,yi(w))

Highly non-convex in w

Cannot regularize w to prevent overfitting

Learning Latent SVM

Δ(yi,yi(w)) = Δ(yi,yi(w)) + wTΨ(xi,yi(w),hi(w)) - wTΨ(xi,yi(w),hi(w))

≤ Δ(yi,yi(w)) + wTΨ(xi,yi(w),hi(w)) - maxhi wTΨ(xi,yi,hi)

≤ maxy,h { wTΨ(xi,y,h) + Δ(yi,y) } - maxhi wTΨ(xi,yi,hi)

Training data {(xi,yi), i = 1,2,…,n}

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, ∀ y, h

Difference-of-convex program in w

Local minimum or saddle point solution (CCCP)

Self-Paced Learning, NIPS 2010
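A schematic of the CCCP-style alternation typically used for this difference-of-convex program: impute the best latent variable for each ground-truth label under the current w, then solve the resulting convex structural SVM with those latent variables fixed. `solve_structural_svm` is a placeholder for any convex solver; this is an illustrative sketch, not the authors' code.

```python
def cccp_latent_svm(w_init, data, boxes, psi, solve_structural_svm, n_iters=10):
    """Alternate between (1) imputing h_i* = argmax_h w^T Psi(x_i, y_i, h)
    (linearising the concave part) and (2) solving the convex structural
    SVM problem with the imputed h_i* held fixed."""
    w = w_init
    for _ in range(n_iters):
        # Step 1: impute latent variables for the ground-truth labels
        imputed = [(x_i, y_i, max(boxes, key=lambda h: w @ psi(x_i, y_i, h)))
                   for x_i, y_i in data]
        # Step 2: convex update of w (structural SVM with h_i* fixed)
        w = solve_structural_svm(imputed)
    return w
```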

Recap

minw ||w||2 + C Σiξi

wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, ∀ y, h

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Learning

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Image Classification

Score wTΨ(x,y,h) → (-∞, +∞)

wTΨ(x,y1,h) for the candidate boxes h:

0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

Image Classification

wTΨ(x,y1,h):

0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

wTΨ(x,y2,h):

0.00 0.24 0.00
0.00 0.00 0.00
0.01 0.00 0.00

Only maximum score used

No other useful cue?

Score wTΨ(x,y,h) → (-∞, +∞)

Uncertainty in h

Outline

• Latent SVM

• Max-Margin Min-Entropy (M3E) Models

• Dissimilarity Coefficient Learning

Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

M3E

Scoring function

Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

Prediction

y(w) = argminy { Hα(Pw(h|y,x)) - log Pw(y|x) }

Z(x): Partition Function

Pw(y|x): Marginalized Probability

Hα: Rényi Entropy

Rényi Entropy of Generalized Distribution Gα(y;x,w)

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = 1. Shannon Entropy of Generalized Distribution:

- [ Σh Pw(y,h|x) log Pw(y,h|x) ] / [ Σh Pw(y,h|x) ]

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = Infinity. Minimum Entropy of Generalized Distribution

- maxh log(Pw(y,h|x))

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = Infinity. Minimum Entropy of Generalized Distribution:

- maxh wTΨ(x,y,h) (up to the constant log Z(x), which does not affect the prediction)

Same prediction as latent SVM
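A sketch of Gα and the M3E prediction rule, assuming the scores wTΨ(x,y,h) are stored in a 2-D array indexed by (y, h); the α = 1 and α = ∞ branches implement the limits above. Names are illustrative.

```python
import numpy as np

def renyi_generalized_entropy(scores, alpha):
    """G_alpha(y; x, w) for every label y, where
    scores[y, h] = w^T Psi(x, y, h) and P_w(y, h | x) = exp(scores) / Z(x)."""
    log_z = np.logaddexp.reduce(scores.ravel())        # log Z(x)
    log_joint = scores - log_z                         # log P_w(y, h | x)
    p = np.exp(log_joint)
    mass = p.sum(axis=1)                               # P_w(y | x)
    if np.isinf(alpha):                                # alpha -> inf: minimum entropy
        return -log_joint.max(axis=1)
    if alpha == 1.0:                                   # alpha -> 1: Shannon limit
        return -(p * log_joint).sum(axis=1) / mass
    return (np.log((p ** alpha).sum(axis=1)) - np.log(mass)) / (1.0 - alpha)

def m3e_predict(scores, alpha):
    """y(w) = argmin_y G_alpha(y; x, w); for alpha = inf this coincides
    with the latent SVM prediction argmax_{y,h} w^T Psi(x, y, h)."""
    return int(np.argmin(renyi_generalized_entropy(scores, alpha)))
```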

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

w* = argminw Σi Δ(yi,yi(w))

Highly non-convex in w

Cannot regularize w to prevent overfitting

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w)) = Δ(yi,yi(w)) + Gα(yi(w);xi,w) - Gα(yi(w);xi,w)

≤ Δ(yi,yi(w)) + Gα(yi;xi,w) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + maxy { Δ(yi,y) - Gα(y;xi,w) }

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

Gα(yi;xi,w) + Δ(yi,y) - Gα(y;xi,w) ≤ ξi, ∀ y

When α tends to infinity, M3E = Latent SVM

Other values can give better results
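For a single sample the tightest slack satisfying these constraints can be written down directly; a small sketch, reusing renyi_generalized_entropy from the snippet above, with `delta` the user-defined loss Δ(yi, y).

```python
def m3e_slack(scores, y_true, alpha, delta):
    """xi_i = max_y { G_alpha(y_i; x_i, w) + Delta(y_i, y) - G_alpha(y; x_i, w) },
    clipped at zero (the constraint for y = y_i already gives zero)."""
    g = renyi_generalized_entropy(scores, alpha)
    return max(0.0, max(g[y_true] + delta(y_true, y) - g[y]
                        for y in range(len(g))))
```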

Image Classification

271 images, 6 classes

90/10 train/test split

5 folds

Mammals Dataset

0/1 loss

Image Classification

HOG-Based Model. Dalal and Triggs, 2005

Motif Finding

~ 40,000 sequences

50/50 train/test split

5 Proteins, 5 folds

UniProbe Dataset

Binding vs. Not-Binding

Motif + Markov Background Model. Yu and Joachims, 2009

Motif Finding

Recap

Scoring function

Prediction

Learning

Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

y(w) = argminy Gα(y;x,w)

minw ||w||2 + C Σiξi

Gα(yi;xi,w) + Δ(yi,y) - Gα(y;xi,w) ≤ ξi, ∀ y

Outline

• Latent SVM

• Max-Margin Min-Entropy Models

• Dissimilarity Coefficient Learning

Kumar, Packer and Koller, ICML 2012

Object Detection

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Minimizing General Loss

minw Σi Δ(yi,hi,yi(w),hi(w)) (supervised samples)

+ Σi Δ'(yi,yi(w),hi(w)) (weakly supervised samples)

Unknown latent variable values

Minimizing General Loss

minw Σi Σhi Δ(yi,hi,yi(w),hi(w)) Pw(hi|xi,yi)

A single distribution to achieve two objectives

Problem

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Solution

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks


[Figure: two distributions over the latent space hi, Pw(yi,hi|xi) and Pθ(hi|yi,xi), with the ground truth (yi,hi) and the prediction (yi(w),hi(w)) marked]

The Ideal Case: No latent variable uncertainty, correct prediction


In Practice: Restrictions in the representation power of models


Our Framework: Minimize the dissimilarity between the two distributions


User-defined dissimilarity measure

Our Framework: Minimize Rao's Dissimilarity Coefficient


Hi(w,θ) = Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Our Framework: Minimize Rao's Dissimilarity Coefficient


Hi(w,θ) - β Σh,h' Δ(yi,h,yi,h') Pθ(h|yi,xi) Pθ(h'|yi,xi)

Our Framework: Minimize Rao's Dissimilarity Coefficient


Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w))

(the last term is zero, since the loss between identical outputs vanishes)

Our Framework: Minimize Rao's Dissimilarity Coefficient


minw,θ Σi Hi(w,θ) - β Hi(θ,θ)
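A sketch of the two terms of this objective, assuming a latent space small enough to enumerate; p_theta is the vector Pθ(h|yi,xi), delta is the joint loss Δ(y,h,y',h'), and all names are illustrative.

```python
def H_w_theta(p_theta, y_true, y_pred, h_pred, delta):
    """H_i(w, theta) = sum_h Delta(y_i, h, y_i(w), h_i(w)) P_theta(h | y_i, x_i)."""
    return sum(p * delta(y_true, h, y_pred, h_pred)
               for h, p in enumerate(p_theta))

def H_theta_theta(p_theta, y_true, delta):
    """H_i(theta, theta) = sum_{h,h'} Delta(y_i, h, y_i, h')
       P_theta(h | y_i, x_i) P_theta(h' | y_i, x_i)."""
    return sum(p * q * delta(y_true, h, y_true, h2)
               for h, p in enumerate(p_theta)
               for h2, q in enumerate(p_theta))

def disc_objective(samples, beta, delta):
    """sum_i H_i(w, theta) - beta * H_i(theta, theta); each sample carries its
    P_theta vector, ground-truth label y_i and current prediction (y_i(w), h_i(w))."""
    return sum(H_w_theta(p, y, y_p, h_p, delta) - beta * H_theta_theta(p, y, delta)
               for p, y, y_p, h_p in samples)
```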

Optimization

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)

Initialize the parameters to w0 and θ0

Repeat until convergence:
  Fix w and optimize θ
  Fix θ and optimize w
End
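A schematic of this block-coordinate scheme; optimize_theta (the stochastic subgradient step below), optimize_w (the CCCP step below) and objective are placeholders for the inner solvers and for evaluating Σi Hi(w,θ) - β Hi(θ,θ).

```python
def learn_disc(w0, theta0, data, beta,
               optimize_theta, optimize_w, objective,
               max_iters=20, tol=1e-4):
    """Alternating minimisation of sum_i H_i(w, theta) - beta H_i(theta, theta):
    fix w and update theta, then fix theta and update w, until the
    objective stops decreasing."""
    w, theta = w0, theta0
    prev = float("inf")
    for _ in range(max_iters):
        theta = optimize_theta(w, theta, data, beta)   # fix w, optimise theta
        w = optimize_w(w, theta, data, beta)           # fix theta, optimise w
        obj = objective(w, theta, data, beta)
        if prev - obj < tol:                           # converged
            break
        prev = obj
    return w, theta
```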

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

[Figure: Pθ(hi|yi,xi) over the latent space hi, with hi(w) marked]

Case I: yi(w) = yi



Case II: yi(w) ≠ yi


Stochastic subgradient descent

Optimization of w

minw Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi)

Expected loss, models uncertainty

Form of optimization similar to Latent SVM

If Δ is independent of h, this reduces to latent SVM

Concave-Convex Procedure (CCCP)

Object Detection

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Input x

Output y = “Deer”

Latent Variable h

Mammals Dataset

60/40 Train/Test Split

5 Folds

Training data: Input xi, Output yi

Results – 0/1 Loss

[Bar chart: average test loss per fold (Fold 1-5), LSVM vs. Our method]

Statistically Significant

Results – Overlap Loss

[Bar chart: average test loss per fold (Fold 1-5), LSVM vs. Our method]

Action Detection

Input x

Output y = “Using Computer”

Latent Variable h

PASCAL VOC 2011

60/40 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Training data: Input xi, Output yi

Results – 0/1 Loss

[Bar chart: average test loss per fold (Fold 1-5), LSVM vs. Our method]

Statistically Significant

Results – Overlap Loss

[Bar chart: average test loss per fold (Fold 1-5), LSVM vs. Our method]

Statistically Significant

Conclusions

• Latent SVM ignores latent variable uncertainty

• M3E for latent-variable-independent losses

• DISC (Dissimilarity Coefficient Learning) for latent-variable-dependent losses

• Strict generalizations of Latent SVM

• Code available online

Questions?

http://cvc.centrale-ponts.fr/personnel/pawan
