Foundations of Machine Learning
Maximum Entropy Models, Logistic Regression
Mehryar Mohri
Courant Institute and Google Research
Motivation
Probabilistic models:
• density estimation.
• classification.
This Lecture
Notions of information theory.
Introduction to density estimation.
Maxent models.
Conditional Maxent models.
Entropy (Shannon, 1948)
Definition: the entropy of a discrete random variable $X$ with probability mass function $p(x) = \Pr[X = x]$ is
$$H(X) = -\mathbb{E}[\log p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
Properties:
• measure of the uncertainty of $X$.
• $H(X) \geq 0$.
• maximal for the uniform distribution. For a finite support of size $N$, by Jensen's inequality:
$$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \leq \log \mathbb{E}\left[\frac{1}{p(X)}\right] = \log N.$$
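The definition translates directly into a few lines of code. Below is a minimal numerical sketch (ours, not part of the lecture): it computes $H(p)$ for a discrete distribution and checks the Jensen bound $H(X) \leq \log N$ on a made-up example.

```python
import numpy as np

def entropy(p):
    """Entropy H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

N = 4
uniform = np.full(N, 1.0 / N)
skewed = np.array([0.7, 0.1, 0.1, 0.1])

print(entropy(uniform))                        # log N ~ 1.386, the maximum
print(entropy(skewed))                         # ~0.94, strictly smaller
assert entropy(skewed) <= np.log(N) + 1e-12    # Jensen bound H(X) <= log N
```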
Entropy
Base of logarithm: not critical; for base 2, $-\log_2 p(x)$ is the number of bits needed to represent $p(x)$.
Definition and notation: the entropy of a distribution $p$ is defined by the same quantity and denoted by $H(p)$.
Special case of Rényi entropy (Rényi, 1961).
Binary entropy: $H(p) = -p \log p - (1 - p)\log(1 - p)$.
Relative Entropy (Shannon, 1948; Kullback and Leibler, 1951)
Definition: the relative entropy (or Kullback-Leibler divergence) between two distributions $p$ and $q$ (discrete case) is
$$D(p \,\|\, q) = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)},$$
with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = +\infty$.
Properties:
• asymmetric: in general, $D(p \,\|\, q) \neq D(q \,\|\, p)$ for $p \neq q$.
• non-negative: $D(p \,\|\, q) \geq 0$ for all $p$ and $q$.
• definite: $(D(p \,\|\, q) = 0) \Rightarrow (p = q)$.
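As a small illustration (ours, with made-up distributions), the sketch below computes $D(p\,\|\,q)$ with the conventions above and exhibits the asymmetry.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q), with 0 log(0/q) = 0 and p log(p/0) = +inf."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.25, 0.25, 0.5])

print(kl(p, q))   # ~0.693, non-negative
print(kl(q, p))   # +inf: q puts mass where p has none, so D is asymmetric
```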
Non-Negativity of Rel. Entropy
By the concavity of $\log$ and Jensen's inequality,
$$-D(p \,\|\, q) = \sum_{x:\, p(x) > 0} p(x) \log\left(\frac{q(x)}{p(x)}\right) \leq \log\left(\sum_{x:\, p(x) > 0} p(x)\,\frac{q(x)}{p(x)}\right) = \log\left(\sum_{x:\, p(x) > 0} q(x)\right) \leq \log(1) = 0.$$
Bregman Divergence (Bregman, 1967)
[Figure: $B_F(y \,\|\, x)$ shown as the gap at $y$ between $F(y)$ and the tangent $F(x) + (y - x)\cdot\nabla F(x)$.]
Definition: let $F$ be a convex and differentiable function defined over a convex set $C$ in a Hilbert space $\mathbb{H}$. Then, the Bregman divergence $B_F$ associated to $F$ is defined by
$$B_F(x \,\|\, y) = F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle.$$
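To make the definition concrete, here is a short sketch (ours, not from the slides) that evaluates $B_F$ numerically from $F$ and a hand-supplied gradient, and checks it against the closed forms listed in the table on the next slide.

```python
import numpy as np

def bregman(F, gradF, x, y):
    """Bregman divergence B_F(x || y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - np.dot(gradF(y), x - y)

# F(x) = ||x||^2, grad F(x) = 2x  ->  B_F(x || y) = ||x - y||^2
F = lambda x: np.dot(x, x)
gradF = lambda x: 2.0 * x
x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, 0.0, 1.0])
print(bregman(F, gradF, x, y))     # 8.25
print(np.dot(x - y, x - y))        # same value: ||x - y||^2

# F(x) = sum_i x_i log x_i - x_i  ->  B_F = unnormalized relative entropy
G = lambda x: np.sum(x * np.log(x) - x)
gradG = lambda x: np.log(x)
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(bregman(G, gradG, p, q))     # equals sum_x p log(p/q) + (q - p)
```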
Bregman Divergence
Examples ($F(x)$ and the corresponding $B_F(x \,\|\, y)$):
• Squared L2-distance: $F(x) = \|x\|^2$, $B_F(x \,\|\, y) = \|x - y\|^2$.
• Mahalanobis distance: $F(x) = x^\top K^{-1} x$, $B_F(x \,\|\, y) = (x - y)^\top K^{-1}(x - y)$.
• Unnormalized relative entropy: $F(x) = \sum_{i \in I} x_i \log x_i - x_i$, $B_F(x \,\|\, y) = \widetilde{D}(x \,\|\, y)$.
• note: relative entropy is not a Bregman divergence since it is not defined over an open set; but, on the simplex, it coincides with the unnormalized relative entropy
$$\widetilde{D}(p \,\|\, q) = \sum_{x \in \mathcal{X}} \left[p(x) \log \frac{p(x)}{q(x)}\right] + \big(q(x) - p(x)\big).$$
Conditional Relative Entropy
Definition: let $p$ and $q$ be two probability distributions over $\mathcal{X} \times \mathcal{Y}$. Then, the conditional relative entropy of $p$ and $q$ with respect to a distribution $r$ over $\mathcal{X}$ is defined by
$$\mathbb{E}_{X \sim r}\Big[D\big(p(\cdot\,|\,X) \,\|\, q(\cdot\,|\,X)\big)\Big] = \sum_{x \in \mathcal{X}} r(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{q(y|x)} = D(\widetilde{p} \,\|\, \widetilde{q}),$$
with $\widetilde{p}(x, y) = r(x)\, p(y|x)$, $\widetilde{q}(x, y) = r(x)\, q(y|x)$, and the conventions $0 \log 0 = 0$, $0 \log \frac{0}{0} = 0$, and $p \log \frac{p}{0} = +\infty$ for $p > 0$.
• note: the definition of conditional relative entropy is not intrinsic; it depends on a third distribution $r$.
This Lecture
Notions of information theory.
Introduction to density estimation.
Maxent models.
Conditional Maxent models.
Density Estimation Problem
Training data: sample $S$ of size $m$ drawn i.i.d. from set $\mathcal{X}$ according to some distribution $D$,
$$S = (x_1, \ldots, x_m).$$
Problem: find distribution $p$ out of hypothesis set $\mathcal{P}$ that best estimates $D$.
Maximum Likelihood Solution
Maximum Likelihood principle: select distribution $p \in \mathcal{P}$ maximizing the likelihood of the observed sample $S$,
$$p_{\mathrm{ML}} = \operatorname*{argmax}_{p \in \mathcal{P}} \Pr[S \,|\, p] = \operatorname*{argmax}_{p \in \mathcal{P}} \prod_{i=1}^m p(x_i) = \operatorname*{argmax}_{p \in \mathcal{P}} \sum_{i=1}^m \log p(x_i).$$
Relative Entropy Formulation
Lemma: let $\widehat{p}_S$ be the empirical distribution for sample $S$, then
$$p_{\mathrm{ML}} = \operatorname*{argmin}_{p \in \mathcal{P}} D(\widehat{p}_S \,\|\, p).$$
Proof:
$$\begin{aligned}
D(\widehat{p}_S \,\|\, p) &= \sum_x \widehat{p}_S(x)\log \widehat{p}_S(x) - \sum_x \widehat{p}_S(x)\log p(x) \\
&= -H(\widehat{p}_S) - \sum_x \frac{\sum_{i=1}^m 1_{x=x_i}}{m}\log p(x) \\
&= -H(\widehat{p}_S) - \sum_{i=1}^m \sum_x \frac{1_{x=x_i}}{m}\log p(x) \\
&= -H(\widehat{p}_S) - \sum_{i=1}^m \frac{\log p(x_i)}{m}.
\end{aligned}$$
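The lemma can be checked numerically. A small sketch (ours; the sample and the candidate distribution $q$ are made up) verifies that $D(\widehat{p}_S \,\|\, q)$ equals $-H(\widehat{p}_S)$ plus the negative average log-likelihood of $q$, so the two criteria have the same minimizer over $\mathcal{P}$.

```python
import numpy as np
from collections import Counter

sample = [0, 1, 1, 2, 1, 0, 2, 1]          # observations over X = {0, 1, 2}
m = len(sample)
counts = Counter(sample)
p_hat = np.array([counts[x] / m for x in range(3)])   # empirical distribution

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def neg_avg_log_lik(q):
    return -np.mean([np.log(q[x]) for x in sample])

H_hat = -np.sum(p_hat[p_hat > 0] * np.log(p_hat[p_hat > 0]))

q = np.array([0.2, 0.5, 0.3])              # some candidate distribution p in P
# D(p_hat || q) = -H(p_hat) + (negative average log-likelihood of q)
print(kl(p_hat, q), -H_hat + neg_avg_log_lik(q))   # identical up to rounding
```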
Maximum a Posteriori (MAP)
Maximum a Posteriori principle: select distribution $p \in \mathcal{P}$ that is the most likely, given the observed sample $S$ and assuming a prior distribution $\Pr[p]$ over $\mathcal{P}$,
$$p_{\mathrm{MAP}} = \operatorname*{argmax}_{p \in \mathcal{P}} \Pr[p \,|\, S] = \operatorname*{argmax}_{p \in \mathcal{P}} \frac{\Pr[S \,|\, p]\,\Pr[p]}{\Pr[S]} = \operatorname*{argmax}_{p \in \mathcal{P}} \Pr[S \,|\, p]\,\Pr[p].$$
• note: for a uniform prior, ML = MAP.
This Lecture
Notions of information theory.
Introduction to density estimation.
Maxent models.
Conditional Maxent models.
Density Estimation + Features
Training data: sample $S$ of size $m$ drawn i.i.d. from set $\mathcal{X}$ according to some distribution $D$,
$$S = (x_1, \ldots, x_m).$$
Features: associated to elements of $\mathcal{X}$,
$$\Phi \colon \mathcal{X} \to \mathbb{R}^N, \quad x \mapsto \Phi(x) = \begin{bmatrix} \Phi_1(x) \\ \vdots \\ \Phi_N(x) \end{bmatrix}.$$
Problem: find distribution $p$ out of hypothesis set $\mathcal{P}$ that best estimates $D$.
• for simplicity, in what follows, $\mathcal{X}$ is assumed to be finite.
Features
Feature functions $\Phi_j$ assumed to be in $H$ and $\|\Phi\|_\infty \leq \Lambda$.
Examples of $H$:
• family of threshold functions $\{x \mapsto 1_{x_i \leq \theta} \colon x \in \mathbb{R}^N, \theta \in \mathbb{R}\}$ defined over the $N$ variables.
• functions defined via decision trees with larger depths.
• $k$-degree monomials of the original features.
• zero-one features (often used in NLP, e.g., presence/absence of a word or POS tag).
Maximum Entropy Principle (E. T. Jaynes, 1957, 1983)
Idea: the empirical feature vector average is close to its expectation. For any $\delta > 0$, with probability at least $1 - \delta$,
$$\Big\| \mathbb{E}_{x \sim D}[\Phi(x)] - \mathbb{E}_{x \sim \widehat{D}}[\Phi(x)] \Big\|_\infty \leq 2\mathfrak{R}_m(H) + \Lambda \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Maxent principle: find distribution $p$ that is closest to a prior distribution $p_0$ (typically the uniform distribution) while verifying
$$\Big\| \mathbb{E}_{x \sim p}[\Phi(x)] - \mathbb{E}_{x \sim \widehat{D}}[\Phi(x)] \Big\|_\infty \leq \lambda.$$
Closeness is measured using relative entropy.
• note: no set $\mathcal{P}$ needed to be specified.
Maxent Formulation
Optimization problem:
$$\min_{p \in \Delta} D(p \,\|\, p_0) \quad \text{subject to:} \quad \Big\| \mathbb{E}_{x \sim p}[\Phi(x)] - \mathbb{E}_{x \sim S}[\Phi(x)] \Big\|_\infty \leq \lambda.$$
• convex optimization problem, unique solution.
• $\lambda = 0$: standard Maxent (or unregularized Maxent).
• $\lambda > 0$: regularized Maxent.
Relation with Entropy
Relationship with entropy: for a uniform prior $p_0$,
$$D(p \,\|\, p_0) = \sum_{x \in \mathcal{X}} p(x)\log\frac{p(x)}{p_0(x)} = -\sum_{x \in \mathcal{X}} p(x)\log p_0(x) + \sum_{x \in \mathcal{X}} p(x)\log p(x) = \log|\mathcal{X}| - H(p).$$
Maxent Problem
Optimization: convex optimization problem.
$$\begin{aligned}
\min_{p} \quad & \sum_{x \in \mathcal{X}} p(x)\log p(x) \\
\text{subject to:} \quad & p(x) \geq 0, \;\forall x \in \mathcal{X}; \qquad \sum_{x \in \mathcal{X}} p(x) = 1; \\
& \Bigg| \sum_{x \in \mathcal{X}} p(x)\,\Phi_j(x) - \frac{1}{m}\sum_{i=1}^m \Phi_j(x_i) \Bigg| \leq \lambda, \;\forall j \in [1, N].
\end{aligned}$$
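For a small finite $\mathcal{X}$ this program can be handed directly to an off-the-shelf solver. A sketch (ours, with made-up features and targets; a dedicated maxent solver would be preferable in practice) using scipy's SLSQP:

```python
import numpy as np
from scipy.optimize import minimize

# Small instance: |X| = 5 points, two features, made-up empirical averages.
X = np.linspace(0.0, 1.0, 5)
Phi = np.stack([X, X ** 2], axis=1)      # Phi_j(x), shape (5, 2)
emp = np.array([0.3, 0.125])             # stands in for (1/m) sum_i Phi_j(x_i)
lam = 0.01

def neg_entropy(p):
    q = np.clip(p, 1e-12, None)          # guard log(0); objective is sum p log p
    return np.sum(q * np.log(q))

cons = [
    {"type": "eq",   "fun": lambda p: p.sum() - 1.0},
    # |E_p[Phi_j] - emp_j| <= lam, written as two one-sided (>= 0) constraints
    {"type": "ineq", "fun": lambda p: lam - (Phi.T @ p - emp)},
    {"type": "ineq", "fun": lambda p: lam + (Phi.T @ p - emp)},
]
res = minimize(neg_entropy, np.full(5, 0.2), method="SLSQP",
               bounds=[(0.0, 1.0)] * 5, constraints=cons)
print(res.x)                  # maxent solution p
print(Phi.T @ res.x - emp)    # within [-lam, lam] componentwise
```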
Gibbs Distributions
Gibbs distributions: set of distributions $\mathcal{Q}$ with $p_w$, $w \in \mathbb{R}^N$,
$$p_w[x] = \frac{p_0[x]\exp\big(w \cdot \Phi(x)\big)}{Z} = \frac{p_0[x]\exp\big(\sum_{j=1}^N w_j \Phi_j(x)\big)}{Z}, \quad \text{with } Z = \sum_x p_0[x]\exp\big(w \cdot \Phi(x)\big).$$
Rich family:
• for linear and quadratic features: includes Gaussians and other distributions with non-PSD quadratic forms in the exponents.
• for higher-degree polynomials of the raw features: more complex multi-modal distributions.
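A small sketch of the Gibbs family (ours, on a made-up finite domain): given a prior $p_0$, features $\Phi$, and weights $w$, it builds $p_w[x] \propto p_0[x]\exp(w\cdot\Phi(x))$ and adjusts $w$ by gradient ascent on the average log-likelihood, whose gradient is the gap between empirical and model feature averages; this is the dual route formalized on the next slides.

```python
import numpy as np

# Finite domain of 5 points, uniform prior p0, features Phi(x) = (x, x^2).
X = np.linspace(0.0, 1.0, 5)
p0 = np.full(5, 0.2)
Phi = np.stack([X, X ** 2], axis=1)              # shape (5, 2)

def gibbs(w):
    """p_w[x] = p0[x] exp(w . Phi(x)) / Z, with a max-shift for numerical stability."""
    score = Phi @ w
    unnorm = p0 * np.exp(score - score.max())
    return unnorm / unnorm.sum()

# Unregularized maxent via the dual: the gradient of the average log-likelihood
# in w is (empirical feature average) - E_{p_w}[Phi]; at the optimum they match.
target = np.array([0.3, 0.125])                  # stands in for E_S[Phi]
w = np.zeros(2)
for _ in range(2000):
    w += 0.5 * (target - Phi.T @ gibbs(w))       # gradient ascent step

print(gibbs(w))            # maxent solution, a member of the Gibbs family
print(Phi.T @ gibbs(w))    # feature expectations, approaching the target
```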
Dual Problems
Regularized Maxent problem:
$$\min_p F(p) = \overline{D}(p \,\|\, p_0) + I_C\big(\mathbb{E}_p[\Phi]\big),$$
with $\overline{D}(p \,\|\, p_0) = D(p \,\|\, p_0)$ if $p \in \Delta$, $+\infty$ otherwise; $C = \{u \colon \|u - \mathbb{E}_S[\Phi]\|_\infty \leq \lambda\}$; $I_C(u) = 0$ if $u \in C$, $+\infty$ otherwise.
Regularized Maximum Likelihood problem with Gibbs distributions:
$$\sup_{w} G(w) = \frac{1}{m}\sum_{i=1}^m \log\frac{p_w[x_i]}{p_0[x_i]} - \lambda\|w\|_1.$$
Duality Theorem (Della Pietra et al., 1997; Dudík et al., 2007; Cortes et al., 2015)
Theorem: the regularized Maxent and ML with Gibbs distributions problems are equivalent,
$$\sup_{w \in \mathbb{R}^N} G(w) = \min_p F(p).$$
• furthermore, let $p^* = \operatorname{argmin}_p F(p)$; then, for any $\epsilon > 0$,
$$\Big( |G(w) - \sup_{w \in \mathbb{R}^N} G(w)| < \epsilon \Big) \Rightarrow \Big( D(p^* \,\|\, p_w) \leq \epsilon \Big).$$
Notes
Maxent formulation:
• no explicit restriction to a family of distributions $\mathcal{P}$.
• but the solution coincides with regularized ML with a specific family $\mathcal{P}$!
• more general Bregman divergence-based formulation.
L1-Regularized Maxent (Kazama and Tsujii, 2003)
Optimization problem:
$$\inf_{w \in \mathbb{R}^N} \lambda\|w\|_1 - \frac{1}{m}\sum_{i=1}^m \log p_w[x_i], \quad \text{where } p_w[x] = \frac{1}{Z}\exp\big(w \cdot \Phi(x)\big).$$
Bayesian interpretation: equivalent to MAP with a Laplacian prior $q_{\text{prior}}(w)$ (Williams, 1994),
$$\max_{w} \log\Big(\prod_{i=1}^m p_w[x_i]\; q_{\text{prior}}(w)\Big), \quad \text{with } q_{\text{prior}}(w) = \prod_{j=1}^N \frac{\beta_j}{2}\exp(-\beta_j |w_j|).$$
Generalization Guarantee (Dudík et al., 2007)
Notation: $L_D(w) = \mathbb{E}_{x \sim D}[-\log p_w[x]]$, $L_S(w) = \mathbb{E}_{x \sim S}[-\log p_w[x]]$.
Theorem: Fix $\delta > 0$. Let $\widehat{w}$ be the solution of the L1-reg. Maxent problem for $\lambda = 2\mathfrak{R}_m(H) + \Lambda\sqrt{\log(\frac{2}{\delta})/(2m)}$. Then, with probability at least $1 - \delta$,
$$L_D(\widehat{w}) \leq \inf_{w} \Bigg\{ L_D(w) + 2\|w\|_1\Bigg[2\mathfrak{R}_m(H) + \Lambda\sqrt{\frac{\log\frac{2}{\delta}}{2m}}\Bigg] \Bigg\}.$$
Proof
By Hölder's inequality and the concentration bound for average feature vectors,
$$L_D(\widehat{w}) - L_S(\widehat{w}) = \widehat{w}\cdot\big[\mathbb{E}_S[\Phi] - \mathbb{E}_D[\Phi]\big] \leq \|\widehat{w}\|_1\,\big\|\mathbb{E}_S[\Phi] - \mathbb{E}_D[\Phi]\big\|_\infty \leq \lambda\|\widehat{w}\|_1.$$
Since $\widehat{w}$ is a minimizer,
$$L_D(\widehat{w}) - L_D(w) = L_D(\widehat{w}) - L_S(\widehat{w}) + L_S(\widehat{w}) - L_D(w) \leq \lambda\|\widehat{w}\|_1 + L_S(\widehat{w}) - L_D(w) \leq \lambda\|w\|_1 + L_S(w) - L_D(w) \leq 2\lambda\|w\|_1.$$
L2-Regularized Maxent (Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001)
Different relaxations:
• L1 constraints: $\forall j \in [1, N], \; \big|\mathbb{E}_{x \sim p}[\Phi_j(x)] - \mathbb{E}_{x \sim \widehat{p}}[\Phi_j(x)]\big| \leq \lambda_j$.
• L2 constraints: $\big\|\mathbb{E}_{x \sim p}[\Phi(x)] - \mathbb{E}_{x \sim \widehat{p}}[\Phi(x)]\big\|_2 \leq B$.
L2-Regularized Maxent
Optimization problem:
$$\inf_{w \in \mathbb{R}^N} \lambda\|w\|_2^2 - \frac{1}{m}\sum_{i=1}^m \log p_w[x_i], \quad \text{where } p_w[x] = \frac{1}{Z}\exp\big(w \cdot \Phi(x)\big).$$
Bayesian interpretation: equivalent to MAP with a Gaussian prior $q_{\text{prior}}(w)$ (Goodman, 2004),
$$\max_{w} \log\Big(\prod_{i=1}^m p_w[x_i]\; q_{\text{prior}}(w)\Big), \quad \text{with } q_{\text{prior}}(w) = \prod_{j=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{w_j^2}{2\sigma^2}}.$$
This Lecture
Notions of information theory.
Introduction to density estimation.
Maxent models.
Conditional Maxent models.
Conditional Maxent Models
Maxent models for conditional probabilities:
• conditional probability modeling each class.
• use in multi-class classification.
• can use different features for each class.
• a.k.a. multinomial logistic regression.
• logistic regression: special case of two classes.
Problem
Data: sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ drawn i.i.d. according to some distribution $D$.
• $\mathcal{Y} = \{1, \ldots, k\}$, or $\mathcal{Y} = \{0, 1\}^k$ in the multi-label case.
Features: mapping $\Phi \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^N$.
Problem: find accurate conditional probability models $\Pr[\cdot \,|\, x]$, $x \in \mathcal{X}$, based on $\Phi$.
Conditional Maxent Principle (Berger et al., 1996; Cortes et al., 2015)
Idea: the empirical feature vector average is close to its expectation. For any $\delta > 0$, with probability at least $1 - \delta$,
$$\bigg\| \mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim D[\cdot|x]}}[\Phi(x, y)] - \mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim \widehat{p}[\cdot|x]}}[\Phi(x, y)] \bigg\|_\infty \leq 2\mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Maxent principle: find conditional distributions $p[\cdot|x]$ that are closest to priors $p_0[\cdot|x]$ (typically uniform distributions) while verifying
$$\bigg\| \mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim p[\cdot|x]}}[\Phi(x, y)] - \mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim \widehat{p}[\cdot|x]}}[\Phi(x, y)] \bigg\|_\infty \leq \lambda.$$
Closeness is measured using the conditional relative entropy based on $\widehat{p}$.
Cond. Maxent Formulation (Berger et al., 1996; Cortes et al., 2015)
Optimization problem: find distribution $p$ solution of
$$\min_{p[\cdot|x] \in \Delta} \sum_{x \in \mathcal{X}} \widehat{p}[x]\, D\big(p[\cdot|x] \,\|\, p_0[\cdot|x]\big) \quad \text{s.t.} \quad \bigg\| \mathbb{E}_{x \sim \widehat{p}}\Big[\mathbb{E}_{y \sim p[\cdot|x]}[\Phi(x, y)]\Big] - \mathbb{E}_{(x, y) \sim S}[\Phi(x, y)] \bigg\|_\infty \leq \lambda.$$
• convex optimization problem, unique solution.
• $\lambda = 0$: unregularized conditional Maxent.
• $\lambda > 0$: regularized conditional Maxent.
Dual Problems
Regularized conditional Maxent problem:
$$\widetilde{F}(p) = \mathbb{E}_{x \sim \widehat{p}}\Big[D\big(p[\cdot|x] \,\|\, p_0[\cdot|x]\big)\Big] + I_\Delta\big(p[\cdot|x]\big) + I_C\bigg(\mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim p[\cdot|x]}}[\Phi]\bigg).$$
Regularized Maximum Likelihood problem with conditional Gibbs distributions:
$$\widetilde{G}(w) = \frac{1}{m}\sum_{i=1}^m \log\frac{p_w[y_i|x_i]}{p_0[y_i|x_i]} - \lambda\|w\|_1,$$
where $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$,
$$p_w[y|x] = \frac{p_0[y|x]\exp\big(w \cdot \Phi(x, y)\big)}{Z(x)}, \quad Z(x) = \sum_{y \in \mathcal{Y}} p_0[y|x]\exp\big(w \cdot \Phi(x, y)\big).$$
Duality Theorem (Cortes et al., 2015)
Theorem: the regularized conditional Maxent and ML with conditional Gibbs distributions problems are equivalent,
$$\sup_{w \in \mathbb{R}^N} \widetilde{G}(w) = \min_p \widetilde{F}(p).$$
• furthermore, let $p^* = \operatorname{argmin}_p \widetilde{F}(p)$; then, for any $\epsilon > 0$,
$$\Big( |\widetilde{G}(w) - \sup_{w \in \mathbb{R}^N} \widetilde{G}(w)| < \epsilon \Big) \Rightarrow \mathbb{E}_{x \sim \widehat{p}}\Big[D\big(p^*[\cdot|x] \,\|\, p_w[\cdot|x]\big)\Big] \leq \epsilon.$$
Regularized Cond. Maxent (Berger et al., 1996; Cortes et al., 2015)
Optimization problem: convex optimizations, regularization parameter $\lambda \geq 0$.
$$\min_{w \in \mathbb{R}^N} \lambda\|w\|_1 - \frac{1}{m}\sum_{i=1}^m \log p_w[y_i|x_i] \quad \text{or} \quad \min_{w \in \mathbb{R}^N} \lambda\|w\|_2^2 - \frac{1}{m}\sum_{i=1}^m \log p_w[y_i|x_i],$$
where $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$,
$$p_w[y|x] = \frac{\exp\big(w \cdot \Phi(x, y)\big)}{Z(x)}, \quad Z(x) = \sum_{y \in \mathcal{Y}} \exp\big(w \cdot \Phi(x, y)\big).$$
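The conditional model is a softmax over the scores $w\cdot\Phi(x,y)$, and $-\log p_w[y_i|x_i]$ is a log-sum-exp minus the correct-class score. A minimal sketch (ours, with made-up data; `scipy.special.logsumexp` computes $\log Z(x)$ stably):

```python
import numpy as np
from scipy.special import logsumexp

def cond_probs(w, features_xy):
    """p_w[y | x] = exp(w . Phi(x, y)) / Z(x) for one x.

    features_xy: array of shape (k, N), row y is Phi(x, y)."""
    scores = features_xy @ w
    return np.exp(scores - logsumexp(scores))        # log Z(x) via logsumexp

def objective(w, feats, labels, lam=0.0):
    """Regularized cond. maxent objective: lam*||w||_1 - (1/m) sum_i log p_w[y_i|x_i]."""
    losses = [logsumexp(f @ w) - (f @ w)[y] for f, y in zip(feats, labels)]
    return lam * np.abs(w).sum() + np.mean(losses)

# Toy check: k = 3 classes, N = 4 features, two examples.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(3, 4)) for _ in range(2)]   # Phi(x_i, .) for each x_i
labels = [0, 2]
w = rng.normal(size=4)
print(cond_probs(w, feats[0]))            # sums to 1
print(objective(w, feats, labels, lam=0.1))
```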
More Explicit Forms
Optimization problem:
$$\min_{w \in \mathbb{R}^N} \Big\{\lambda\|w\|_1 \text{ or } \lambda\|w\|_2^2\Big\} + \frac{1}{m}\sum_{i=1}^m \log\bigg[\sum_{y \in \mathcal{Y}} \exp\Big(w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i)\Big)\bigg],$$
or equivalently
$$\min_{w \in \mathbb{R}^N} \Big\{\lambda\|w\|_1 \text{ or } \lambda\|w\|_2^2\Big\} - w \cdot \frac{1}{m}\sum_{i=1}^m \Phi(x_i, y_i) + \frac{1}{m}\sum_{i=1}^m \log \sum_{y \in \mathcal{Y}} e^{w \cdot \Phi(x_i, y)}.$$
Related Problem
Optimization problem: log-sum-exp replaced by max.
$$\min_{w \in \mathbb{R}^N} \Big\{\lambda\|w\|_1 \text{ or } \lambda\|w\|_2^2\Big\} + \frac{1}{m}\sum_{i=1}^m \underbrace{\max_{y \in \mathcal{Y}}\Big(w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i)\Big)}_{-\rho_w(x_i, y_i)}.$$
Common Feature Choice
Multi-class features:
$$w = \begin{bmatrix} w_1 \\ \vdots \\ w_{y-1} \\ w_y \\ w_{y+1} \\ \vdots \\ w_{|\mathcal{Y}|} \end{bmatrix}, \quad
\Phi(x, y) = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \phi(x) \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
w \cdot \Phi(x, y) = w_y \cdot \phi(x).$$
L2-regularized cond. maxent optimization:
$$\min_{w \in \mathbb{R}^N} \lambda\sum_{y \in \mathcal{Y}} \|w_y\|_2^2 + \frac{1}{m}\sum_{i=1}^m \log\bigg[\sum_{y \in \mathcal{Y}} \exp\Big(w_y \cdot \phi(x_i) - w_{y_i} \cdot \phi(x_i)\Big)\bigg].$$
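With this block-structured choice of $\Phi$, the score $w\cdot\Phi(x,y)$ reduces to $w_y\cdot\phi(x)$. A short sketch (ours) building such features and checking the identity:

```python
import numpy as np

def block_features(phi_x, y, num_classes):
    """Phi(x, y): phi(x) placed in the y-th block of a (num_classes * d)-vector, zeros elsewhere."""
    d = phi_x.shape[0]
    out = np.zeros(num_classes * d)
    out[y * d:(y + 1) * d] = phi_x
    return out

k, d = 3, 4
rng = np.random.default_rng(1)
phi_x = rng.normal(size=d)
W = rng.normal(size=(k, d))        # per-class weight vectors w_1, ..., w_k
w = W.ravel()                      # stacked vector w = (w_1, ..., w_k)

for y in range(k):
    assert np.isclose(w @ block_features(phi_x, y, k), W[y] @ phi_x)
print("w . Phi(x, y) == w_y . phi(x) for every y")
```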
Prediction
Prediction with $p_w[y|x] = \frac{\exp(w \cdot \Phi(x, y))}{Z(x)}$:
$$\widehat{y}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} p_w[y|x] = \operatorname*{argmax}_{y \in \mathcal{Y}} w \cdot \Phi(x, y).$$
Binary Classification
Simpler expression:
$$\sum_{y \in \mathcal{Y}} \exp\Big(w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i)\Big) = e^{w \cdot \Phi(x_i, +1) - w \cdot \Phi(x_i, y_i)} + e^{w \cdot \Phi(x_i, -1) - w \cdot \Phi(x_i, y_i)} = 1 + e^{-y_i w \cdot [\Phi(x_i, +1) - \Phi(x_i, -1)]} = 1 + e^{-y_i w \cdot \Psi(x_i)},$$
with $\Psi(x) = \Phi(x, +1) - \Phi(x, -1)$.
Logistic Regression (Berkson, 1944)
Binary case of conditional Maxent.
Optimization problem:
$$\min_{w \in \mathbb{R}^N} \Big\{\lambda\|w\|_1 \text{ or } \lambda\|w\|_2^2\Big\} + \frac{1}{m}\sum_{i=1}^m \log\Big[1 + e^{-y_i w \cdot \Psi(x_i)}\Big].$$
• convex optimization.
• variety of solutions: SGD, coordinate descent, etc.
• coordinate descent: similar to AdaBoost with the logistic loss $\log_2(1 + e^{-u}) \geq 1_{u \leq 0}$ instead of the exponential loss.
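A minimal sketch of L2-regularized logistic regression fitted by plain gradient descent (ours, on synthetic data; in practice SGD, coordinate descent, or a library solver would be used, and step size/iterations would need tuning):

```python
import numpy as np

def fit_logistic(Psi, y, lam=0.1, lr=0.5, steps=2000):
    """Minimize lam * ||w||^2 + (1/m) sum_i log(1 + exp(-y_i w . Psi(x_i)))."""
    m, d = Psi.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (Psi @ w)                       # u_i = y_i w . Psi(x_i)
        # d/dw log(1 + e^{-u_i}) = -y_i Psi(x_i) / (1 + e^{u_i})
        coeff = -y / (1.0 + np.exp(margins))
        grad = 2 * lam * w + (Psi.T @ coeff) / m
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
m, d = 200, 2
Psi = rng.normal(size=(m, d))
w_true = np.array([2.0, -1.0])
y = np.where(Psi @ w_true + 0.3 * rng.normal(size=m) > 0, 1, -1)

w_hat = fit_logistic(Psi, y)
print(w_hat, "training accuracy:", np.mean(np.sign(Psi @ w_hat) == y))
```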
Generalization Bound
Theorem: assume that $\pm\Phi_j \in H$ for all $j \in [1, N]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $f \colon x \mapsto w \cdot \Phi(x)$,
$$R(f) \leq \frac{1}{m}\sum_{i=1}^m \log_{u_0}\Big(1 + e^{-y_i w \cdot \Phi(x_i)}\Big) + 4\|w\|_1\,\mathfrak{R}_m(H) + \sqrt{\frac{\log\log_2 2\|w\|_1}{m}} + \sqrt{\frac{\log\frac{2}{\delta}}{m}},$$
where $u_0 = \log(1 + 1/e)$.
Proof
Proof: by the learning bound for convex ensembles holding uniformly for all $\rho > 0$, with probability at least $1 - \delta$, for all $f$ and $\rho > 0$,
$$R(f) \leq \frac{1}{m}\sum_{i=1}^m 1_{\frac{y_i w \cdot \Phi(x_i)}{\rho\|w\|_1} - 1 \leq 0} + \frac{4}{\rho}\mathfrak{R}_m(H) + \sqrt{\frac{\log\log_2\frac{2}{\rho}}{m}} + \sqrt{\frac{\log\frac{2}{\delta}}{m}}.$$
Choosing $\rho = \frac{1}{\|w\|_1}$ and using $1_{u \leq 0} \leq \log_{u_0}(1 + e^{-u-1})$ yields immediately the learning bound of the theorem.
Logistic Regression (Berkson, 1944)
Logistic model:
$$\Pr[y = +1 \,|\, x] = \frac{e^{w \cdot \Phi(x, +1)}}{Z(x)}, \quad \text{where } Z(x) = e^{w \cdot \Phi(x, +1)} + e^{w \cdot \Phi(x, -1)}.$$
Properties:
• linear decision rule, sign of the log-odds ratio:
$$\log\frac{\Pr[y = +1 \,|\, x]}{\Pr[y = -1 \,|\, x]} = w \cdot \big(\Phi(x, +1) - \Phi(x, -1)\big) = w \cdot \Psi(x).$$
• logistic form:
$$\Pr[y = +1 \,|\, x] = \frac{1}{1 + e^{-w \cdot [\Phi(x, +1) - \Phi(x, -1)]}} = \frac{1}{1 + e^{-w \cdot \Psi(x)}}.$$
Logistic/Sigmoid Function
$$f \colon x \mapsto \frac{1}{1 + e^{-x}}, \qquad \Pr[y = +1 \,|\, x] = f\big(w \cdot \Psi(x)\big).$$
Applications
Natural language processing (Berger et al., 1996; Rosenfeld, 1996; Della Pietra et al., 1997; Malouf, 2002; Manning and Klein, 2003; Mann et al., 2009; Ratnaparkhi, 2010).
Species habitat modeling (Phillips et al., 2004, 2006; Dudík et al., 2007; Elith et al., 2011).
Computer vision (Jeon and Manmatha, 2004).
Extensions
Extensive theoretical study of alternative regularizations (Dudík et al., 2007) (see also (Altun and Smola, 2006), though some proofs are unclear).
Maxent models with other Bregman divergences (see for example (Altun and Smola, 2006)).
Structural Maxent models (Cortes et al., 2015):
• extension to the case of multiple feature families.
• empirically outperform Maxent and L1-Maxent.
• conditional structural Maxent: coincides with deep boosting using the logistic loss.
Conclusion
Logistic regression/maxent models:
• theoretical foundation.
• natural solution when probabilities are required.
• widely used for density estimation/classification.
• often very effective in practice.
• distributed optimization solutions.
• no natural non-linear L1-version (use of kernels).
• connections with boosting.
• connections with neural networks.
References
• Yasemin Altun and Alexander J. Smola. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of COLT, pages 139-153, 2006.
• Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
• J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357-365, 1944.
• Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967.
• Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Umar Syed. Structural Maxent models. In Proceedings of ICML, 2015.
• Imre Csiszár and Gábor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1:205-237, 1984.
• Imre Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3):1409-1413, 1989.
• J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470-1480, 1972.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, April 1997.
• Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR, 8, 2007.
• E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.
• E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
• Solomon Kullback and Richard A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79-86, 1951.
• O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
• Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 547-561. University of California Press, 1961.
• Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10:187-228, 1996.
• Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948.