Modeling Latent Variable Uncertainty for Loss-based Learning
M. Pawan Kumar, École Centrale Paris / École des Ponts ParisTech / INRIA Saclay, Île-de-France
Ben Packer, Stanford University
Daphne Koller, Stanford University
Aim: Accurate learning with weakly supervised data
Train: Input x_i → Output y_i
Bison
Deer
Elephant
Giraffe
Llama
Rhino
Object Detection
Input x
Output y = “Deer”
Latent Variable h
Prediction: (y(f), h(f)) = argmax_{y,h} f(Ψ(x,y,h))
Aim: Accurate learning with weakly supervised data
Input x
Feature Ψ(x,y,h) (e.g. HOG)
Output y = “Deer”
Latent Variable h
Function f : Ψ(x,y,h) → (-∞, +∞)
Learning: f* = argmin_f Objective(f)
The objective encourages accurate prediction, under a user-specified criterion for accuracy
Aim: Find a suitable objective function to learn f*
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
Latent SVM
Linear function parameterized by w
Prediction: (y(w), h(w)) = argmax_{y,h} wᵀΨ(x,y,h)
Learning: min_w Σ_i Δ(y_i, y_i(w), h_i(w))   (Δ: user-defined loss)
✔ Loss-based learning
✖ Loss independent of true (unknown) latent variable
✖ Doesn’t model uncertainty in latent variables
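The Latent SVM prediction rule above can be sketched by brute-force enumeration over a finite set of labels and latent values. This is a toy illustration only: `psi`, the label and latent sets, and the weights are all made up for the example, not taken from the paper's code.

```python
import numpy as np

def lsvm_predict(w, psi, x, labels, latents):
    """Latent SVM prediction: (y(w), h(w)) = argmax_{y,h} w^T Psi(x, y, h).

    psi(x, y, h) returns a joint feature vector; labels and latents are
    the finite sets to enumerate."""
    best, best_score = None, -np.inf
    for y in labels:
        for h in latents:
            score = w @ psi(x, y, h)
            if score > best_score:
                best, best_score = (y, h), score
    return best

# Toy joint feature map: one weight slot per (label, latent) pair.
def psi(x, y, h):
    v = np.zeros(4)
    v[2 * y + h] = x  # scalar input x routed to the (y, h) slot
    return v

w = np.array([0.1, 0.9, 0.3, 0.2])
print(lsvm_predict(w, psi, 1.0, labels=[0, 1], latents=[0, 1]))  # (0, 1)
```

In practice the argmax is computed by an efficient inference routine (e.g. sliding-window search over bounding boxes) rather than explicit enumeration.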
Expectation Maximization
Joint probability: Pθ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z
Prediction: (y(θ), h(θ)) = argmax_{y,h} θᵀΨ(x,y,h)
Learning: max_θ Σ_i log Σ_{h_i} Pθ(y_i,h_i|x_i)
✔ Models uncertainty in latent variables
✖ Doesn’t model accuracy of latent variable prediction
✖ No user-defined loss function
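The EM quantities above can be sketched numerically when the label and latent sets are small enough to enumerate. A minimal version; `psi`, `theta`, and the sets are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def joint_prob(theta, psi, x, labels, latents):
    """P_theta(y,h|x) = exp(theta^T Psi(x,y,h)) / Z, with Z summing
    over all (y,h) pairs. Rows index labels, columns latent values."""
    scores = np.array([[theta @ psi(x, y, h) for h in latents] for y in labels])
    p = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return p / p.sum()

def label_log_likelihood(theta, psi, x, y_idx, labels, latents):
    """log P_theta(y|x) = log sum_h P_theta(y,h|x): the quantity EM maximizes."""
    p = joint_prob(theta, psi, x, labels, latents)
    return np.log(p[y_idx].sum())

# Toy joint feature map: one parameter slot per (label, latent) pair.
def psi(x, y, h):
    v = np.zeros(4)
    v[2 * y + h] = x
    return v

theta = np.array([0.1, 0.9, 0.3, 0.2])
p = joint_prob(theta, psi, 1.0, [0, 1], [0, 1])
print(p.sum())  # sums to 1 (up to floating point)
```

Note the contrast with Latent SVM: the training objective is the marginal log-likelihood of the observed label, with no loss function anywhere.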
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
Problem
Model Uncertainty in Latent Variables
Model Accuracy of Latent Variable Predictions
Solution: Use two different distributions for the two different tasks
Pθ(h_i|y_i,x_i) models the uncertainty in the latent variables
Pw(y_i,h_i|x_i), a delta distribution at the prediction (y_i(w),h_i(w)), models the accuracy of the predictions
The Ideal Case: No latent variable uncertainty, correct prediction
[Figure: Pθ(h_i|y_i,x_i) places all its mass on the true latent variable h_i, and the delta distribution Pw(y_i,h_i|x_i) sits at the prediction (y_i(w),h_i(w)) = (y_i,h_i)]
In Practice: Restrictions in the representation power of models
[Figure: Pθ(h_i|y_i,x_i) spreads its mass over several values of h_i, and the prediction (y_i(w),h_i(w)) need not match the true (y_i,h_i)]
Our Framework: Minimize the dissimilarity between the two distributions
User-defined dissimilarity measure: Rao’s Dissimilarity Coefficient
H_i(w,θ) = Σ_h Δ(y_i,h,y_i(w),h_i(w)) Pθ(h|y_i,x_i)
H_i(θ,θ) = Σ_{h,h'} Δ(y_i,h,y_i,h') Pθ(h|y_i,x_i) Pθ(h'|y_i,x_i)
Dissimilarity: H_i(w,θ) − β H_i(θ,θ) − (1−β) Δ(y_i(w),h_i(w),y_i(w),h_i(w))
The last term vanishes, since a prediction incurs zero loss against itself
Learning: min_{w,θ} Σ_i [ H_i(w,θ) − β H_i(θ,θ) ]
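The per-sample objective can be sketched directly from the two H terms. A toy illustration: `p_theta` and the loss `delta01` are made-up inputs, and Pw is represented implicitly by the single prediction (y_w, h_w) it puts all its mass on.

```python
def sample_dissimilarity(p_theta, delta, y_i, y_w, h_w, beta):
    """H_i(w,theta) - beta * H_i(theta,theta) for one training sample.

    p_theta[h] = P_theta(h | y_i, x_i); delta(y, h, y', h') is the
    user-defined loss. P_w is a delta at the prediction (y_w, h_w),
    so H_i(w,w) = 0 and drops out of Rao's coefficient."""
    n = len(p_theta)
    # H_i(w,theta): expected loss of the prediction under P_theta
    h_wt = sum(delta(y_i, h, y_w, h_w) * p_theta[h] for h in range(n))
    # H_i(theta,theta): self-dissimilarity of P_theta under the same loss
    h_tt = sum(delta(y_i, h, y_i, hp) * p_theta[h] * p_theta[hp]
               for h in range(n) for hp in range(n))
    return h_wt - beta * h_tt

# 0/1 loss on (label, latent) pairs; uniform P_theta over two latent values
delta01 = lambda y, h, yp, hp: float((y, h) != (yp, hp))
print(sample_dissimilarity([0.5, 0.5], delta01, 0, 0, 0, beta=0.0))  # 0.5
```

With β = 1 on the same inputs the two terms cancel: a spread-out Pθ is penalized less for disagreeing with the prediction, which is exactly how the framework trades prediction accuracy against latent-variable uncertainty.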
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
Optimization: min_{w,θ} Σ_i [ H_i(w,θ) − β H_i(θ,θ) ]
Initialize the parameters to w0 and θ0
Repeat until convergence:
  Fix w and optimize θ
  Fix θ and optimize w
End
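The alternating scheme above is generic block-coordinate minimization, and can be sketched as a skeleton. The sub-problem solvers are assumed inputs here (the slides use stochastic subgradient descent for θ and a Latent-SVM-style solver for w); the toy quadratic objective and its exact block solvers are made up to make the skeleton runnable.

```python
def alternate_minimize(obj, w0, theta0, solve_w, solve_theta,
                       tol=1e-8, max_iter=100):
    """Repeat { fix w, optimize theta; fix theta, optimize w }
    until the objective stops decreasing by more than tol."""
    w, theta = w0, theta0
    prev = obj(w, theta)
    for _ in range(max_iter):
        theta = solve_theta(w)   # fix w and optimize theta
        w = solve_w(theta)       # fix theta and optimize w
        cur = obj(w, theta)
        if prev - cur < tol:     # converged
            break
        prev = cur
    return w, theta

# Toy smooth objective with coupled blocks, and its exact block solvers.
obj = lambda w, t: (w - 1) ** 2 + (t - 2) ** 2 + 0.1 * w * t
solve_t = lambda w: 2 - 0.05 * w   # argmin over t with w fixed
solve_w = lambda t: 1 - 0.05 * t   # argmin over w with t fixed
w, t = alternate_minimize(obj, 0.0, 0.0, solve_w, solve_t)
```

Each block update can only decrease the objective, so the loop converges to a fixed point of the two solvers; as with CCCP, only a local minimum is guaranteed.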
Optimization of θ: min_θ Σ_i [ Σ_h Δ(y_i,h,y_i(w),h_i(w)) Pθ(h|y_i,x_i) − β H_i(θ,θ) ]
Case I: y_i(w) = y_i [Figure: Pθ(h_i|y_i,x_i) shifts towards the predicted latent variable h_i(w)]
Case II: y_i(w) ≠ y_i [Figure: Pθ(h_i|y_i,x_i)]
Solved by stochastic subgradient descent
Optimization of w: min_w Σ_i Σ_h Δ(y_i,h,y_i(w),h_i(w)) Pθ(h|y_i,x_i)
Expected loss: models the uncertainty in the latent variables
The form of the optimization is similar to Latent SVM
Observation: when Δ is independent of the true h, our framework is equivalent to Latent SVM
Solved by the Concave-Convex Procedure (CCCP)
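The expected loss above, and the observation that it collapses to the Latent SVM loss when Δ ignores the true latent variable, can be checked in a few lines. A toy sketch: `p_theta` and the label-only loss are made-up inputs.

```python
def expected_loss(p_theta, delta, y_i, y_w, h_w):
    """Sum_h Delta(y_i, h, y_w, h_w) * P_theta(h | y_i, x_i): the loss of
    prediction (y_w, h_w), averaged over uncertainty in the true h."""
    return sum(delta(y_i, h, y_w, h_w) * p for h, p in enumerate(p_theta))

# A loss that ignores the true latent variable h: the expectation
# collapses to Delta(y_i, y_w, h_w), exactly the Latent SVM loss.
delta_label = lambda y, h, yp, hp: float(y != yp)
print(expected_loss([0.2, 0.3, 0.5], delta_label, 0, 1, 0))  # 1.0
print(expected_loss([0.9, 0.1, 0.0], delta_label, 0, 1, 0))  # 1.0, same for any P_theta
```

Since Pθ sums to one, any Δ that does not depend on h factors out of the sum, which is why the framework strictly generalizes Latent SVM rather than replacing it.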
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
Object Detection
Bison
Deer
Elephant
Giraffe
Llama
Rhino
Input x
Output y = “Deer”
Latent Variable h
Mammals Dataset
60/40 Train/Test Split
5 Folds
Train: Input x_i → Output y_i
Results – 0/1 Loss
[Bar chart: average test loss on each of the 5 folds, LSVM vs. our method]
Statistically Significant
Results – Overlap Loss
[Bar chart: average test loss on each of the 5 folds, LSVM vs. our method]
Action Detection
Input x
Output y = “Using Computer”
Latent Variable h
PASCAL VOC 2011
60/40 Train/Test Split
5 Folds
Jumping
Phoning
Playing Instrument
Reading
Riding Bike
Riding Horse
Running
Taking Photo
Using Computer
Walking
Train: Input x_i → Output y_i
Results – 0/1 Loss
[Bar chart: average test loss on each of the 5 folds, LSVM vs. our method]
Statistically Significant
Results – Overlap Loss
[Bar chart: average test loss on each of the 5 folds, LSVM vs. our method]
Statistically Significant
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
[Ongoing and Future Work slides omitted from the uploaded deck]