
18.1 The Log-likelihood Gradient

Sargur N. Srihari
srihari@cedar.buffalo.edu


Topics
• Definition of the Partition Function
1. The log-likelihood gradient
2. Stochastic maximum likelihood and contrastive divergence
3. Pseudolikelihood
4. Score matching and ratio matching
5. Denoising score matching
6. Noise-contrastive estimation
7. Estimating the partition function



Undirected models in deep learning

• p_model(x) is an undirected model
• We study how its parameters are to be determined

[Figures: a deep Boltzmann machine; a restricted Boltzmann machine]


Finding the most likely parameters θ

• Task of interest:
  – Determine the parameters θ of a Gibbs distribution
      p(x;\theta) = \frac{1}{Z(\theta)}\,\tilde{p}(x;\theta)
  – where
      Z(\theta) = \sum_x \tilde{p}(x;\theta)
    is the partition function
• Learning an undirected model by MLE is difficult because the partition function depends on the parameters
• First recall the Maximum Likelihood principle
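As a concrete toy illustration of the Gibbs form above, the sketch below enumerates Z(θ) explicitly for a tiny binary model so that p(x;θ) can be normalized exactly. The quadratic form x^T θ x used as the unnormalized log-probability and the helper names (p_tilde, partition_function) are illustrative assumptions of this sketch, not anything defined in these slides.

```python
import numpy as np

# Toy energy-based model over binary vectors x in {0,1}^n:
# log p̃(x; θ) = x^T θ x  (i.e., energy E(x) = -x^T θ x).

def p_tilde(x, theta):
    """Unnormalized probability p̃(x; θ) = exp(x^T θ x)."""
    return np.exp(x @ theta @ x)

def partition_function(theta, n):
    """Z(θ) = sum of p̃(x; θ) over all 2^n binary states (feasible only for tiny n)."""
    states = [np.array(bits) for bits in np.ndindex(*([2] * n))]
    return sum(p_tilde(x, theta) for x in states)

def p(x, theta, n):
    """Normalized Gibbs distribution p(x; θ) = p̃(x; θ) / Z(θ)."""
    return p_tilde(x, theta) / partition_function(theta, n)

n = 3
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(n, n))
print(p(np.array([1, 0, 1]), theta, n))   # a single normalized probability
```

The point of the example is the scaling problem the slides go on to discuss: the enumeration inside partition_function grows as 2^n and is hopeless for realistically sized models.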

Maximum Likelihood Expression

• Given m i.i.d. examples X = {x^{(1)}, x^{(2)}, ..., x^{(m)}}
  – drawn from the true but unknown distribution p_data(x)
• Let p_model(x; θ) be a parametric family indexed by θ
  – i.e., p_model(x; θ) maps any x to an estimate of the true probability p_data(x)
• The MLE for θ is:
      \theta_{ML} = \arg\max_\theta \prod_m p_{model}(x^{(m)};\theta)
• Equivalently, by taking logarithms:
      \theta_{ML} = \arg\max_\theta \sum_m \log p_{model}(x^{(m)};\theta)
• Replacing summation with expectation:
      \theta_{ML} = \arg\max_\theta \, E_{x\sim \hat{p}_{data}} \log p_{model}(x;\theta)
• This is solved using gradient ascent:
      \theta \leftarrow \theta + \epsilon g
• where g is the gradient, with per-example terms
      g_m = \nabla_\theta \log p(x^{(m)};\theta) = \nabla_\theta \log \tilde{p}(x^{(m)};\theta) - \nabla_\theta \log Z(\theta)
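As a minimal numerical sketch of the update θ ← θ + εg (a toy example of this write-up, not from the slides), consider a one-parameter model p̃(x;θ) = exp(θx) on x ∈ {0,1}. Here Z(θ) = 1 + e^θ is tractable, so both the positive-phase term (a data average) and the negative-phase term ∇_θ log Z(θ) = σ(θ) can be computed exactly.

```python
import numpy as np

# MLE by gradient ascent, θ ← θ + εg, for p̃(x; θ) = exp(θ·x), x ∈ {0,1},
# so Z(θ) = 1 + exp(θ) and ∇θ log p(x; θ) = x − σ(θ):
# the data term x is the positive phase, σ(θ) = E_model[x] = ∇θ log Z(θ) the negative phase.

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.8, size=1000)     # m i.i.d. samples from p_data

theta, eps = 0.0, 0.1
for _ in range(500):
    positive_phase = data.mean()           # (1/m) Σ ∇θ log p̃(x^(m); θ) = mean of x
    negative_phase = sigmoid(theta)        # ∇θ log Z(θ) = E_{x~p_model}[x]
    g = positive_phase - negative_phase    # gradient of the average log-likelihood
    theta += eps * g                       # gradient ascent step

print(theta, sigmoid(theta))               # σ(θ) ≈ empirical mean ≈ 0.8
```

Gradient ascent drives σ(θ), the model's expectation of x, toward the empirical mean of the data, which is exactly the balance between positive and negative phases described on the next slides.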


Gradient has two phases

• Positive and negative phases of learning
  – The gradient of the log-likelihood wrt the parameters has a term corresponding to the gradient of the partition function:
      \nabla_\theta \log p(x;\theta) = \nabla_\theta \log \tilde{p}(x;\theta) - \nabla_\theta \log Z(\theta),
      \qquad \text{where} \quad p(x;\theta) = \frac{1}{Z(\theta)}\,\tilde{p}(x;\theta)
  – The term \nabla_\theta \log \tilde{p}(x;\theta) is the positive phase; the term -\nabla_\theta \log Z(\theta) is the negative phase


Tractability: Positive, Negative phases

• For most undirected models the negative phase, computing \nabla_\theta \log Z(\theta), is difficult
• Models with no latent variables, or with few interactions between latent variables, have a tractable positive phase
  – RBM: straightforward positive phase, difficult negative phase
• This chapter is about the difficulties with the negative phase


Computing Gradient for Negative Phase

• For models that guarantee p(x) > 0 for all x, we can substitute exp(log p̃(x)) for p̃(x) when differentiating log Z(θ)
• The derivation makes use of summation over the discrete x
  – A similar result applies using integration over continuous x; in the continuous version we use Leibniz's rule for differentiation under the integral sign
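Written out for the discrete case (using the substitution above and p(x) = p̃(x)/Z(θ)), the derivation the slide refers to is:

\begin{align*}
\nabla_\theta \log Z
 &= \frac{\nabla_\theta Z}{Z}
  = \frac{\nabla_\theta \sum_x \tilde{p}(x)}{Z}
  = \frac{\sum_x \nabla_\theta \tilde{p}(x)}{Z} \\
 &= \frac{\sum_x \nabla_\theta \exp\!\big(\log \tilde{p}(x)\big)}{Z}
  = \frac{\sum_x \exp\!\big(\log \tilde{p}(x)\big)\,\nabla_\theta \log \tilde{p}(x)}{Z} \\
 &= \frac{\sum_x \tilde{p}(x)\,\nabla_\theta \log \tilde{p}(x)}{Z}
  = \sum_x p(x)\,\nabla_\theta \log \tilde{p}(x)
  = E_{x\sim p(x)}\,\nabla_\theta \log \tilde{p}(x)
\end{align*}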


MLE for RBM

• Probability distribution of an undirected model (Gibbs):
      p(x;\theta) = \frac{1}{Z(\theta)}\,\tilde{p}(x;\theta), \qquad Z(\theta) = \sum_x \tilde{p}(x;\theta)
• For an RBM: x = {v, h}, with parameters θ = {W, a, b}

[Figure: RBM with binary visible units v (biases a_i), binary hidden units h (biases b_j), and connection weights W]

• Energy function:
      E(v,h) = -h^T W v - a^T v - b^T h
             = -\sum_i \sum_j W_{i,j}\, v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j
• so that
      p(v,h) = \frac{1}{Z}\exp\!\big(-E(v,h)\big), \qquad
      \tilde{p}(x) = \exp\!\big(-E(v,h)\big), \qquad
      Z = \sum_{v,h} \exp\!\big(-E(v,h)\big)
• Determine the parameters θ that maximize the log-likelihood (the negative of the loss):
      \max_\theta \; L(\{x^{(1)},\dots,x^{(M)}\};\theta)
        = \sum_m \log p(x^{(m)};\theta)
        = \sum_m \log \tilde{p}(x^{(m)};\theta) - \sum_m \log Z(\theta)
  – the second term contains the intractable partition function
• For stochastic gradient ascent, θ ← θ + εg, take derivatives:
      g_m = \nabla_\theta \log p(x^{(m)};\theta) = \nabla_\theta \log \tilde{p}(x^{(m)};\theta) - \nabla_\theta \log Z(\theta)
• Derivative of the positive phase:
      \frac{1}{M}\,\nabla_\theta \sum_{m=1}^{M} \log \tilde{p}(x^{(m)};\theta)
  – summation over samples from the training set (the 1/M factor only rescales the gradient and does not change the maximizer)
• Derivative of the negative phase (an identity):
      \nabla_\theta \log Z(\theta) = E_{x\sim p(x)}\,\nabla_\theta \log \tilde{p}(x)
  – expectation over samples from the RBM
• For the weights:
      \frac{\partial}{\partial W_{i,j}} E(v,h) = -v_i h_j
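To make the positive/negative-phase statistics concrete, here is a short sketch (illustrative naming, sizes, and learning rate of my own choosing, not the slides' code) that estimates the W-gradient of an RBM's log-likelihood: the positive phase uses E[h|v] on training vectors, while the negative phase uses approximate model samples obtained by a few steps of block Gibbs sampling, a contrastive-divergence-style stand-in for the exact model expectation (exact sampling from the RBM is itself intractable; the stochastic maximum likelihood and contrastive divergence topics listed earlier treat this properly).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# RBM with energy E(v,h) = -h^T W v - a^T v - b^T h.
# Since ∂E/∂W_ij = -v_i h_j, the W-gradient of log p(v) is
#   E_data[h v^T] - E_model[h v^T].
n_v, n_h = 6, 4
W = 0.01 * rng.normal(size=(n_h, n_v))   # weights, hidden x visible
a = np.zeros(n_v)                        # visible biases
b = np.zeros(n_h)                        # hidden biases

def sample_h_given_v(v):
    p = sigmoid(v @ W.T + b)             # p(h_j = 1 | v)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h):
    p = sigmoid(h @ W + a)               # p(v_i = 1 | h)
    return p, (rng.random(p.shape) < p).astype(float)

def grad_W(v_data, k=1):
    """Estimate of ∇_W log p(v): positive phase from data, negative phase from
    k steps of block Gibbs sampling started at the data (a CD-k-style stand-in)."""
    ph_data, h = sample_h_given_v(v_data)
    positive = ph_data.T @ v_data / len(v_data)      # E_data[h v^T]
    v = v_data
    for _ in range(k):                               # approximate model samples
        _, v = sample_v_given_h(h)
        ph_model, h = sample_h_given_v(v)
    negative = ph_model.T @ v / len(v_data)          # ≈ E_model[h v^T]
    return positive - negative

batch = (rng.random((32, n_v)) < 0.5).astype(float)  # placeholder "training" batch
W += 0.05 * grad_W(batch)                            # one gradient-ascent step on W
```

Because ∂E/∂W_{i,j} = -v_i h_j, both phases reduce to averages of v_i h_j: over the data in the positive phase and over (approximate) model samples in the negative phase.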


Monte Carlo methods

• The identity
      \nabla_\theta \log Z(\theta) = E_{x\sim p(x)}\,\nabla_\theta \log \tilde{p}(x)
  – is the basis for Monte Carlo methods for maximum likelihood learning of models with intractable partition functions
• The MC approach provides an intuitive framework for both the positive and negative phases
  – Positive phase: increase \log \tilde{p}(x) for x drawn from the data
  – Negative phase: decrease the partition function by decreasing \log \tilde{p}(x) for x drawn from the model distribution
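The identity can be checked numerically on a model small enough to enumerate. In the sketch below (an illustrative toy of this write-up, not from the slides), p̃(x;θ) = exp(θ_x) over K states, so ∇_{θ_k} log Z is exactly p(x = k), while the Monte Carlo estimate averages ∇_θ log p̃(x), a one-hot vector, over samples drawn from the model.

```python
import numpy as np

# Check ∇θ log Z(θ) = E_{x~p(x)}[∇θ log p̃(x)] on a toy categorical model:
# x ∈ {0,...,K-1}, p̃(x;θ) = exp(θ_x), Z(θ) = Σ_k exp(θ_k), ∇_{θ_k} log p̃(x) = 1[x = k].
rng = np.random.default_rng(0)
K = 5
theta = rng.normal(size=K)

probs = np.exp(theta) / np.exp(theta).sum()   # p(x) = p̃(x) / Z

# Exact gradient of log Z: ∇_{θ_k} log Z = exp(θ_k)/Z = p(x = k)
exact = probs

# Monte Carlo estimate: average the one-hot vectors ∇θ log p̃(x) over x ~ p(x)
samples = rng.choice(K, size=100_000, p=probs)
mc_estimate = np.bincount(samples, minlength=K) / len(samples)

print(np.abs(exact - mc_estimate).max())      # small, and shrinks with more samples
```

The hard part in realistic models is drawing the samples x ∼ p(x) in the first place, which is what the methods listed in the topics (stochastic maximum likelihood, contrastive divergence, and so on) address.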
