
18.1 The Log-likelihood Gradient

Sargur N. Srihari
srihari@cedar.buffalo.edu


Topics
• Definition of the Partition Function
1. The log-likelihood gradient
2. Stochastic maximum likelihood and contrastive divergence
3. Pseudolikelihood
4. Score matching and ratio matching
5. Denoising score matching
6. Noise-contrastive estimation
7. Estimating the partition function



Undirected models in deep learning

• p_model(x) is an undirected model
• We study how its parameters are to be determined

[Figures: a deep Boltzmann machine; a restricted Boltzmann machine]


Finding the most likely parameters θ

• Task of interest:
  – Determine the parameters θ of a Gibbs distribution
      p(x;\theta) = \frac{1}{Z(\theta)}\,\tilde{p}(x;\theta)
  – where
      Z(\theta) = \sum_x \tilde{p}(x;\theta)
    is the partition function
• Learning an undirected model by MLE is difficult because the partition function depends on the parameters
• First recall the Maximum Likelihood principle
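As a concrete toy illustration of the Gibbs form above, the sketch below enumerates Z(θ) explicitly for a tiny binary model so that p(x;θ) can be normalized exactly. The quadratic form x^T θ x used as the unnormalized log-probability and the helper names (p_tilde, partition_function) are illustrative assumptions of this sketch, not anything defined in these slides.

```python
import numpy as np

# Toy energy-based model over binary vectors x in {0,1}^n:
# log p̃(x; θ) = x^T θ x  (i.e., energy E(x) = -x^T θ x).

def p_tilde(x, theta):
    """Unnormalized probability p̃(x; θ) = exp(x^T θ x)."""
    return np.exp(x @ theta @ x)

def partition_function(theta, n):
    """Z(θ) = sum of p̃(x; θ) over all 2^n binary states (feasible only for tiny n)."""
    states = [np.array(bits) for bits in np.ndindex(*([2] * n))]
    return sum(p_tilde(x, theta) for x in states)

def p(x, theta, n):
    """Normalized Gibbs distribution p(x; θ) = p̃(x; θ) / Z(θ)."""
    return p_tilde(x, theta) / partition_function(theta, n)

n = 3
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(n, n))
print(p(np.array([1, 0, 1]), theta, n))   # a single normalized probability
```

The point of the example is the scaling problem the slides go on to discuss: the enumeration inside partition_function grows as 2^n and is hopeless for realistically sized models.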

Maximum Likelihood Expression

• Given m i.i.d. examples X = {x^{(1)}, x^{(2)}, ..., x^{(m)}}
  – drawn from the true but unknown distribution p_data(x)
• Let p_model(x; θ) be a parametric family indexed by θ
  – i.e., p_model(x; θ) maps any x to an estimate of the true probability p_data(x)
• The MLE for θ is:
      \theta_{ML} = \arg\max_\theta \prod_m p_{model}(x^{(m)};\theta)
• Equivalently, by taking logarithms:
      \theta_{ML} = \arg\max_\theta \sum_m \log p_{model}(x^{(m)};\theta)
• Replacing summation with expectation:
      \theta_{ML} = \arg\max_\theta \, E_{x\sim \hat{p}_{data}} \log p_{model}(x;\theta)
• This is solved using gradient ascent:
      \theta \leftarrow \theta + \epsilon g
• where g is the gradient, with per-example terms
      g_m = \nabla_\theta \log p(x^{(m)};\theta) = \nabla_\theta \log \tilde{p}(x^{(m)};\theta) - \nabla_\theta \log Z(\theta)
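As a minimal numerical sketch of the update θ ← θ + εg (a toy example of this write-up, not from the slides), consider a one-parameter model p̃(x;θ) = exp(θx) on x ∈ {0,1}. Here Z(θ) = 1 + e^θ is tractable, so both the positive-phase term (a data average) and the negative-phase term ∇_θ log Z(θ) = σ(θ) can be computed exactly.

```python
import numpy as np

# MLE by gradient ascent, θ ← θ + εg, for p̃(x; θ) = exp(θ·x), x ∈ {0,1},
# so Z(θ) = 1 + exp(θ) and ∇θ log p(x; θ) = x − σ(θ):
# the data term x is the positive phase, σ(θ) = E_model[x] = ∇θ log Z(θ) the negative phase.

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.8, size=1000)     # m i.i.d. samples from p_data

theta, eps = 0.0, 0.1
for _ in range(500):
    positive_phase = data.mean()           # (1/m) Σ ∇θ log p̃(x^(m); θ) = mean of x
    negative_phase = sigmoid(theta)        # ∇θ log Z(θ) = E_{x~p_model}[x]
    g = positive_phase - negative_phase    # gradient of the average log-likelihood
    theta += eps * g                       # gradient ascent step

print(theta, sigmoid(theta))               # σ(θ) ≈ empirical mean ≈ 0.8
```

Gradient ascent drives σ(θ), the model's expectation of x, toward the empirical mean of the data, which is exactly the balance between positive and negative phases described on the next slides.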


Gradient has two phases

• Positive and negative phases of learning
  – The gradient of the log-likelihood wrt the parameters has a term corresponding to the gradient of the partition function:
      \nabla_\theta \log p(x;\theta) = \nabla_\theta \log \tilde{p}(x;\theta) - \nabla_\theta \log Z(\theta),
      \qquad \text{where} \quad p(x;\theta) = \frac{1}{Z(\theta)}\,\tilde{p}(x;\theta)
  – The term \nabla_\theta \log \tilde{p}(x;\theta) is the positive phase; the term -\nabla_\theta \log Z(\theta) is the negative phase


Tractability: Positive, Negative phases

• For most undirected models the negative phase, computing \nabla_\theta \log Z(\theta), is difficult
• Models with no latent variables, or with few interactions between latent variables, have a tractable positive phase
  – RBM: straightforward positive phase, difficult negative phase
• This chapter is about the difficulties with the negative phase


Computing Gradient for Negative Phase

• For models that guarantee p(x) > 0 for all x, we can substitute exp(log p̃(x)) for p̃(x) when differentiating log Z(θ)
• The derivation makes use of summation over the discrete x
  – A similar result applies using integration over continuous x; in the continuous version we use Leibniz's rule for differentiation under the integral sign
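Written out for the discrete case (using the substitution above and p(x) = p̃(x)/Z(θ)), the derivation the slide refers to is:

\begin{align*}
\nabla_\theta \log Z
 &= \frac{\nabla_\theta Z}{Z}
  = \frac{\nabla_\theta \sum_x \tilde{p}(x)}{Z}
  = \frac{\sum_x \nabla_\theta \tilde{p}(x)}{Z} \\
 &= \frac{\sum_x \nabla_\theta \exp\!\big(\log \tilde{p}(x)\big)}{Z}
  = \frac{\sum_x \exp\!\big(\log \tilde{p}(x)\big)\,\nabla_\theta \log \tilde{p}(x)}{Z} \\
 &= \frac{\sum_x \tilde{p}(x)\,\nabla_\theta \log \tilde{p}(x)}{Z}
  = \sum_x p(x)\,\nabla_\theta \log \tilde{p}(x)
  = E_{x\sim p(x)}\,\nabla_\theta \log \tilde{p}(x)
\end{align*}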


MLE for RBM

• Probability distribution of an undirected model (Gibbs):
      p(x;\theta) = \frac{1}{Z(\theta)}\,\tilde{p}(x;\theta), \qquad Z(\theta) = \sum_x \tilde{p}(x;\theta)
• For an RBM: x = {v, h}, with parameters θ = {W, a, b}

[Figure: RBM with binary visible units v (biases a_i), binary hidden units h (biases b_j), and connection weights W]

• Energy function:
      E(v,h) = -h^T W v - a^T v - b^T h
             = -\sum_i \sum_j W_{i,j}\, v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j
• so that
      p(v,h) = \frac{1}{Z}\exp\!\big(-E(v,h)\big), \qquad
      \tilde{p}(x) = \exp\!\big(-E(v,h)\big), \qquad
      Z = \sum_{v,h} \exp\!\big(-E(v,h)\big)
• Determine the parameters θ that maximize the log-likelihood (the negative of the loss):
      \max_\theta \; L(\{x^{(1)},\dots,x^{(M)}\};\theta)
        = \sum_m \log p(x^{(m)};\theta)
        = \sum_m \log \tilde{p}(x^{(m)};\theta) - \sum_m \log Z(\theta)
  – the second term contains the intractable partition function
• For stochastic gradient ascent, θ ← θ + εg, take derivatives:
      g_m = \nabla_\theta \log p(x^{(m)};\theta) = \nabla_\theta \log \tilde{p}(x^{(m)};\theta) - \nabla_\theta \log Z(\theta)
• Derivative of the positive phase:
      \frac{1}{M}\,\nabla_\theta \sum_{m=1}^{M} \log \tilde{p}(x^{(m)};\theta)
  – summation over samples from the training set (the 1/M factor only rescales the gradient and does not change the maximizer)
• Derivative of the negative phase (an identity):
      \nabla_\theta \log Z(\theta) = E_{x\sim p(x)}\,\nabla_\theta \log \tilde{p}(x)
  – expectation over samples from the RBM
• For the weights:
      \frac{\partial}{\partial W_{i,j}} E(v,h) = -v_i h_j
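To make the positive/negative-phase statistics concrete, here is a short sketch (illustrative naming, sizes, and learning rate of my own choosing, not the slides' code) that estimates the W-gradient of an RBM's log-likelihood: the positive phase uses E[h|v] on training vectors, while the negative phase uses approximate model samples obtained by a few steps of block Gibbs sampling, a contrastive-divergence-style stand-in for the exact model expectation (exact sampling from the RBM is itself intractable; the stochastic maximum likelihood and contrastive divergence topics listed earlier treat this properly).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# RBM with energy E(v,h) = -h^T W v - a^T v - b^T h.
# Since ∂E/∂W_ij = -v_i h_j, the W-gradient of log p(v) is
#   E_data[h v^T] - E_model[h v^T].
n_v, n_h = 6, 4
W = 0.01 * rng.normal(size=(n_h, n_v))   # weights, hidden x visible
a = np.zeros(n_v)                        # visible biases
b = np.zeros(n_h)                        # hidden biases

def sample_h_given_v(v):
    p = sigmoid(v @ W.T + b)             # p(h_j = 1 | v)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h):
    p = sigmoid(h @ W + a)               # p(v_i = 1 | h)
    return p, (rng.random(p.shape) < p).astype(float)

def grad_W(v_data, k=1):
    """Estimate of ∇_W log p(v): positive phase from data, negative phase from
    k steps of block Gibbs sampling started at the data (a CD-k-style stand-in)."""
    ph_data, h = sample_h_given_v(v_data)
    positive = ph_data.T @ v_data / len(v_data)      # E_data[h v^T]
    v = v_data
    for _ in range(k):                               # approximate model samples
        _, v = sample_v_given_h(h)
        ph_model, h = sample_h_given_v(v)
    negative = ph_model.T @ v / len(v_data)          # ≈ E_model[h v^T]
    return positive - negative

batch = (rng.random((32, n_v)) < 0.5).astype(float)  # placeholder "training" batch
W += 0.05 * grad_W(batch)                            # one gradient-ascent step on W
```

Because ∂E/∂W_{i,j} = -v_i h_j, both phases reduce to averages of v_i h_j: over the data in the positive phase and over (approximate) model samples in the negative phase.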


Monte Carlo methods

• The identity
      \nabla_\theta \log Z(\theta) = E_{x\sim p(x)}\,\nabla_\theta \log \tilde{p}(x)
  – is the basis for Monte Carlo methods for maximum likelihood learning of models with intractable partition functions
• The MC approach provides an intuitive framework for both the positive and negative phases
  – Positive phase: increase \log \tilde{p}(x) for x drawn from the data
  – Negative phase: decrease the partition function by decreasing \log \tilde{p}(x) for x drawn from the model distribution
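The identity can be checked numerically on a model small enough to enumerate. In the sketch below (an illustrative toy of this write-up, not from the slides), p̃(x;θ) = exp(θ_x) over K states, so ∇_{θ_k} log Z is exactly p(x = k), while the Monte Carlo estimate averages ∇_θ log p̃(x), a one-hot vector, over samples drawn from the model.

```python
import numpy as np

# Check ∇θ log Z(θ) = E_{x~p(x)}[∇θ log p̃(x)] on a toy categorical model:
# x ∈ {0,...,K-1}, p̃(x;θ) = exp(θ_x), Z(θ) = Σ_k exp(θ_k), ∇_{θ_k} log p̃(x) = 1[x = k].
rng = np.random.default_rng(0)
K = 5
theta = rng.normal(size=K)

probs = np.exp(theta) / np.exp(theta).sum()   # p(x) = p̃(x) / Z

# Exact gradient of log Z: ∇_{θ_k} log Z = exp(θ_k)/Z = p(x = k)
exact = probs

# Monte Carlo estimate: average the one-hot vectors ∇θ log p̃(x) over x ~ p(x)
samples = rng.choice(K, size=100_000, p=probs)
mc_estimate = np.bincount(samples, minlength=K) / len(samples)

print(np.abs(exact - mc_estimate).max())      # small, and shrinks with more samples
```

The hard part in realistic models is drawing the samples x ∼ p(x) in the first place, which is what the methods listed in the topics (stochastic maximum likelihood, contrastive divergence, and so on) address.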
