Foundations of Machine Learning
Maximum Entropy Models, Logistic Regression
Mehryar Mohri
Courant Institute and Google Research
Motivation
Probabilistic models:
• density estimation.
• classification.
This Lecture
Notions of information theory.
Introduction to density estimation.
Maxent models.
Conditional Maxent models.
Entropy (Shannon, 1948)
Definition: the entropy of a discrete random variable $X$ with probability mass function $p(x) = \Pr[X = x]$ is
$$H(X) = -\mathbb{E}[\log p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
Properties:
• measure of the uncertainty of $X$.
• $H(X) \geq 0$.
• maximal for the uniform distribution. For a finite support of size $N$, by Jensen's inequality:
$$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \leq \log \mathbb{E}\left[\frac{1}{p(X)}\right] = \log N.$$
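The definition translates directly into a few lines of code. Below is a minimal numerical sketch (ours, not part of the lecture): it computes $H(p)$ for a discrete distribution and checks the Jensen bound $H(X) \leq \log N$ on a made-up example.

```python
import numpy as np

def entropy(p):
    """Entropy H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

N = 4
uniform = np.full(N, 1.0 / N)
skewed = np.array([0.7, 0.1, 0.1, 0.1])

print(entropy(uniform))                        # log N ~ 1.386, the maximum
print(entropy(skewed))                         # ~0.94, strictly smaller
assert entropy(skewed) <= np.log(N) + 1e-12    # Jensen bound H(X) <= log N
```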
Entropy
Base of logarithm: not critical; for base 2, $-\log_2 p(x)$ is the number of bits needed to represent $p(x)$.
Definition and notation: the entropy of a distribution $p$ is defined by the same quantity and denoted by $H(p)$.
Special case of Rényi entropy (Rényi, 1961).
Binary entropy: $H(p) = -p \log p - (1 - p)\log(1 - p)$.
Relative Entropy (Shannon, 1948; Kullback and Leibler, 1951)
Definition: the relative entropy (or Kullback-Leibler divergence) between two distributions $p$ and $q$ (discrete case) is
$$D(p \,\|\, q) = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)},$$
with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = +\infty$.
Properties:
• asymmetric: in general, $D(p \,\|\, q) \neq D(q \,\|\, p)$ for $p \neq q$.
• non-negative: $D(p \,\|\, q) \geq 0$ for all $p$ and $q$.
• definite: $(D(p \,\|\, q) = 0) \Rightarrow (p = q)$.
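As a small illustration (ours, with made-up distributions), the sketch below computes $D(p\,\|\,q)$ with the conventions above and exhibits the asymmetry.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q), with 0 log(0/q) = 0 and p log(p/0) = +inf."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.25, 0.25, 0.5])

print(kl(p, q))   # ~0.693, non-negative
print(kl(q, p))   # +inf: q puts mass where p has none, so D is asymmetric
```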
Non-Negativity of Rel. Entropy
By the concavity of $\log$ and Jensen's inequality,
$$-D(p \,\|\, q) = \sum_{x:\, p(x) > 0} p(x) \log\left(\frac{q(x)}{p(x)}\right) \leq \log\left(\sum_{x:\, p(x) > 0} p(x)\,\frac{q(x)}{p(x)}\right) = \log\left(\sum_{x:\, p(x) > 0} q(x)\right) \leq \log(1) = 0.$$
Bregman Divergence (Bregman, 1967)
[Figure: $B_F(y \,\|\, x)$ shown as the gap at $y$ between $F(y)$ and the tangent $F(x) + (y - x)\cdot\nabla F(x)$.]
Definition: let $F$ be a convex and differentiable function defined over a convex set $C$ in a Hilbert space $\mathbb{H}$. Then, the Bregman divergence $B_F$ associated to $F$ is defined by
$$B_F(x \,\|\, y) = F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle.$$
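To make the definition concrete, here is a short sketch (ours, not from the slides) that evaluates $B_F$ numerically from $F$ and a hand-supplied gradient, and checks it against the closed forms listed in the table on the next slide.

```python
import numpy as np

def bregman(F, gradF, x, y):
    """Bregman divergence B_F(x || y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - np.dot(gradF(y), x - y)

# F(x) = ||x||^2, grad F(x) = 2x  ->  B_F(x || y) = ||x - y||^2
F = lambda x: np.dot(x, x)
gradF = lambda x: 2.0 * x
x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, 0.0, 1.0])
print(bregman(F, gradF, x, y))     # 8.25
print(np.dot(x - y, x - y))        # same value: ||x - y||^2

# F(x) = sum_i x_i log x_i - x_i  ->  B_F = unnormalized relative entropy
G = lambda x: np.sum(x * np.log(x) - x)
gradG = lambda x: np.log(x)
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(bregman(G, gradG, p, q))     # equals sum_x p log(p/q) + (q - p)
```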
Bregman Divergence
Examples ($F(x)$ and the corresponding $B_F(x \,\|\, y)$):
• Squared L2-distance: $F(x) = \|x\|^2$, $B_F(x \,\|\, y) = \|x - y\|^2$.
• Mahalanobis distance: $F(x) = x^\top K^{-1} x$, $B_F(x \,\|\, y) = (x - y)^\top K^{-1}(x - y)$.
• Unnormalized relative entropy: $F(x) = \sum_{i \in I} x_i \log x_i - x_i$, $B_F(x \,\|\, y) = \widetilde{D}(x \,\|\, y)$.
• note: relative entropy is not a Bregman divergence since it is not defined over an open set; but, on the simplex, it coincides with the unnormalized relative entropy
$$\widetilde{D}(p \,\|\, q) = \sum_{x \in \mathcal{X}} \left[p(x) \log \frac{p(x)}{q(x)}\right] + \big(q(x) - p(x)\big).$$
Conditional Relative Entropy
Definition: let $p$ and $q$ be two probability distributions over $\mathcal{X} \times \mathcal{Y}$. Then, the conditional relative entropy of $p$ and $q$ with respect to a distribution $r$ over $\mathcal{X}$ is defined by
$$\mathbb{E}_{X \sim r}\Big[D\big(p(\cdot\,|\,X) \,\|\, q(\cdot\,|\,X)\big)\Big] = \sum_{x \in \mathcal{X}} r(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{q(y|x)} = D(\widetilde{p} \,\|\, \widetilde{q}),$$
with $\widetilde{p}(x, y) = r(x)\, p(y|x)$, $\widetilde{q}(x, y) = r(x)\, q(y|x)$, and the conventions $0 \log 0 = 0$, $0 \log \frac{0}{0} = 0$, and $p \log \frac{p}{0} = +\infty$ for $p > 0$.
• note: the definition of conditional relative entropy is not intrinsic; it depends on a third distribution $r$.
This Lecture
Notions of information theory.
Introduction to density estimation.
Maxent models.
Conditional Maxent models.
Density Estimation Problem
Training data: sample $S$ of size $m$ drawn i.i.d. from set $\mathcal{X}$ according to some distribution $D$,
$$S = (x_1, \ldots, x_m).$$
Problem: find distribution $p$ out of hypothesis set $\mathcal{P}$ that best estimates $D$.
Maximum Likelihood Solution
Maximum Likelihood principle: select distribution $p \in \mathcal{P}$ maximizing the likelihood of the observed sample $S$,
$$p_{\mathrm{ML}} = \operatorname*{argmax}_{p \in \mathcal{P}} \Pr[S \,|\, p] = \operatorname*{argmax}_{p \in \mathcal{P}} \prod_{i=1}^m p(x_i) = \operatorname*{argmax}_{p \in \mathcal{P}} \sum_{i=1}^m \log p(x_i).$$
Relative Entropy Formulation
Lemma: let $\widehat{p}_S$ be the empirical distribution for sample $S$, then
$$p_{\mathrm{ML}} = \operatorname*{argmin}_{p \in \mathcal{P}} D(\widehat{p}_S \,\|\, p).$$
Proof:
$$\begin{aligned}
D(\widehat{p}_S \,\|\, p) &= \sum_x \widehat{p}_S(x)\log \widehat{p}_S(x) - \sum_x \widehat{p}_S(x)\log p(x) \\
&= -H(\widehat{p}_S) - \sum_x \frac{\sum_{i=1}^m 1_{x=x_i}}{m}\log p(x) \\
&= -H(\widehat{p}_S) - \sum_{i=1}^m \sum_x \frac{1_{x=x_i}}{m}\log p(x) \\
&= -H(\widehat{p}_S) - \sum_{i=1}^m \frac{\log p(x_i)}{m}.
\end{aligned}$$
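The lemma can be checked numerically. A small sketch (ours; the sample and the candidate distribution $q$ are made up) verifies that $D(\widehat{p}_S \,\|\, q)$ equals $-H(\widehat{p}_S)$ plus the negative average log-likelihood of $q$, so the two criteria have the same minimizer over $\mathcal{P}$.

```python
import numpy as np
from collections import Counter

sample = [0, 1, 1, 2, 1, 0, 2, 1]          # observations over X = {0, 1, 2}
m = len(sample)
counts = Counter(sample)
p_hat = np.array([counts[x] / m for x in range(3)])   # empirical distribution

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def neg_avg_log_lik(q):
    return -np.mean([np.log(q[x]) for x in sample])

H_hat = -np.sum(p_hat[p_hat > 0] * np.log(p_hat[p_hat > 0]))

q = np.array([0.2, 0.5, 0.3])              # some candidate distribution p in P
# D(p_hat || q) = -H(p_hat) + (negative average log-likelihood of q)
print(kl(p_hat, q), -H_hat + neg_avg_log_lik(q))   # identical up to rounding
```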
Maximum a Posteriori (MAP)
Maximum a Posteriori principle: select distribution $p \in \mathcal{P}$ that is the most likely, given the observed sample $S$ and assuming a prior distribution $\Pr[p]$ over $\mathcal{P}$,
$$p_{\mathrm{MAP}} = \operatorname*{argmax}_{p \in \mathcal{P}} \Pr[p \,|\, S] = \operatorname*{argmax}_{p \in \mathcal{P}} \frac{\Pr[S \,|\, p]\,\Pr[p]}{\Pr[S]} = \operatorname*{argmax}_{p \in \mathcal{P}} \Pr[S \,|\, p]\,\Pr[p].$$
• note: for a uniform prior, ML = MAP.
This Lecture
Notions of information theory.
Introduction to density estimation.
Maxent models.
Conditional Maxent models.
Density Estimation + Features
Training data: sample $S$ of size $m$ drawn i.i.d. from set $\mathcal{X}$ according to some distribution $D$,
$$S = (x_1, \ldots, x_m).$$
Features: associated to elements of $\mathcal{X}$,
$$\Phi \colon \mathcal{X} \to \mathbb{R}^N, \quad x \mapsto \Phi(x) = \begin{bmatrix} \Phi_1(x) \\ \vdots \\ \Phi_N(x) \end{bmatrix}.$$
Problem: find distribution $p$ out of hypothesis set $\mathcal{P}$ that best estimates $D$.
• for simplicity, in what follows, $\mathcal{X}$ is assumed to be finite.
Features
Feature functions $\Phi_j$ assumed to be in $H$ and $\|\Phi\|_\infty \leq \Lambda$.
Examples of $H$:
• family of threshold functions $\{x \mapsto 1_{x_i \leq \theta} \colon x \in \mathbb{R}^N, \theta \in \mathbb{R}\}$ defined over the $N$ variables.
• functions defined via decision trees with larger depths.
• $k$-degree monomials of the original features.
• zero-one features (often used in NLP, e.g., presence/absence of a word or POS tag).
Maximum Entropy Principle (E. T. Jaynes, 1957, 1983)
Idea: the empirical feature vector average is close to its expectation. For any $\delta > 0$, with probability at least $1 - \delta$,
$$\Big\| \mathbb{E}_{x \sim D}[\Phi(x)] - \mathbb{E}_{x \sim \widehat{D}}[\Phi(x)] \Big\|_\infty \leq 2\mathfrak{R}_m(H) + \Lambda \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Maxent principle: find distribution $p$ that is closest to a prior distribution $p_0$ (typically the uniform distribution) while verifying
$$\Big\| \mathbb{E}_{x \sim p}[\Phi(x)] - \mathbb{E}_{x \sim \widehat{D}}[\Phi(x)] \Big\|_\infty \leq \lambda.$$
Closeness is measured using relative entropy.
• note: no set $\mathcal{P}$ needed to be specified.
Maxent Formulation
Optimization problem:
$$\min_{p \in \Delta} D(p \,\|\, p_0) \quad \text{subject to:} \quad \Big\| \mathbb{E}_{x \sim p}[\Phi(x)] - \mathbb{E}_{x \sim S}[\Phi(x)] \Big\|_\infty \leq \lambda.$$
• convex optimization problem, unique solution.
• $\lambda = 0$: standard Maxent (or unregularized Maxent).
• $\lambda > 0$: regularized Maxent.
Relation with Entropy
Relationship with entropy: for a uniform prior $p_0$,
$$D(p \,\|\, p_0) = \sum_{x \in \mathcal{X}} p(x)\log\frac{p(x)}{p_0(x)} = -\sum_{x \in \mathcal{X}} p(x)\log p_0(x) + \sum_{x \in \mathcal{X}} p(x)\log p(x) = \log|\mathcal{X}| - H(p).$$
Maxent Problem
Optimization: convex optimization problem.
$$\begin{aligned}
\min_{p} \quad & \sum_{x \in \mathcal{X}} p(x)\log p(x) \\
\text{subject to:} \quad & p(x) \geq 0, \;\forall x \in \mathcal{X}; \qquad \sum_{x \in \mathcal{X}} p(x) = 1; \\
& \Bigg| \sum_{x \in \mathcal{X}} p(x)\,\Phi_j(x) - \frac{1}{m}\sum_{i=1}^m \Phi_j(x_i) \Bigg| \leq \lambda, \;\forall j \in [1, N].
\end{aligned}$$
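For a small finite $\mathcal{X}$ this program can be handed directly to an off-the-shelf solver. A sketch (ours, with made-up features and targets; a dedicated maxent solver would be preferable in practice) using scipy's SLSQP:

```python
import numpy as np
from scipy.optimize import minimize

# Small instance: |X| = 5 points, two features, made-up empirical averages.
X = np.linspace(0.0, 1.0, 5)
Phi = np.stack([X, X ** 2], axis=1)      # Phi_j(x), shape (5, 2)
emp = np.array([0.3, 0.125])             # stands in for (1/m) sum_i Phi_j(x_i)
lam = 0.01

def neg_entropy(p):
    q = np.clip(p, 1e-12, None)          # guard log(0); objective is sum p log p
    return np.sum(q * np.log(q))

cons = [
    {"type": "eq",   "fun": lambda p: p.sum() - 1.0},
    # |E_p[Phi_j] - emp_j| <= lam, written as two one-sided (>= 0) constraints
    {"type": "ineq", "fun": lambda p: lam - (Phi.T @ p - emp)},
    {"type": "ineq", "fun": lambda p: lam + (Phi.T @ p - emp)},
]
res = minimize(neg_entropy, np.full(5, 0.2), method="SLSQP",
               bounds=[(0.0, 1.0)] * 5, constraints=cons)
print(res.x)                  # maxent solution p
print(Phi.T @ res.x - emp)    # within [-lam, lam] componentwise
```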
Gibbs Distributions
Gibbs distributions: set of distributions $\mathcal{Q}$ with $p_w$, $w \in \mathbb{R}^N$,
$$p_w[x] = \frac{p_0[x]\exp\big(w \cdot \Phi(x)\big)}{Z} = \frac{p_0[x]\exp\big(\sum_{j=1}^N w_j \Phi_j(x)\big)}{Z}, \quad \text{with } Z = \sum_x p_0[x]\exp\big(w \cdot \Phi(x)\big).$$
Rich family:
• for linear and quadratic features: includes Gaussians and other distributions with non-PSD quadratic forms in the exponents.
• for higher-degree polynomials of the raw features: more complex multi-modal distributions.
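A small sketch of the Gibbs family (ours, on a made-up finite domain): given a prior $p_0$, features $\Phi$, and weights $w$, it builds $p_w[x] \propto p_0[x]\exp(w\cdot\Phi(x))$ and adjusts $w$ by gradient ascent on the average log-likelihood, whose gradient is the gap between empirical and model feature averages; this is the dual route formalized on the next slides.

```python
import numpy as np

# Finite domain of 5 points, uniform prior p0, features Phi(x) = (x, x^2).
X = np.linspace(0.0, 1.0, 5)
p0 = np.full(5, 0.2)
Phi = np.stack([X, X ** 2], axis=1)              # shape (5, 2)

def gibbs(w):
    """p_w[x] = p0[x] exp(w . Phi(x)) / Z, with a max-shift for numerical stability."""
    score = Phi @ w
    unnorm = p0 * np.exp(score - score.max())
    return unnorm / unnorm.sum()

# Unregularized maxent via the dual: the gradient of the average log-likelihood
# in w is (empirical feature average) - E_{p_w}[Phi]; at the optimum they match.
target = np.array([0.3, 0.125])                  # stands in for E_S[Phi]
w = np.zeros(2)
for _ in range(2000):
    w += 0.5 * (target - Phi.T @ gibbs(w))       # gradient ascent step

print(gibbs(w))            # maxent solution, a member of the Gibbs family
print(Phi.T @ gibbs(w))    # feature expectations, approaching the target
```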
Dual Problems
Regularized Maxent problem:
$$\min_p F(p) = \overline{D}(p \,\|\, p_0) + I_C\big(\mathbb{E}_p[\Phi]\big),$$
with $\overline{D}(p \,\|\, p_0) = D(p \,\|\, p_0)$ if $p \in \Delta$, $+\infty$ otherwise; $C = \{u \colon \|u - \mathbb{E}_S[\Phi]\|_\infty \leq \lambda\}$; $I_C(u) = 0$ if $u \in C$, $+\infty$ otherwise.
Regularized Maximum Likelihood problem with Gibbs distributions:
$$\sup_{w} G(w) = \frac{1}{m}\sum_{i=1}^m \log\frac{p_w[x_i]}{p_0[x_i]} - \lambda\|w\|_1.$$
Duality Theorem (Della Pietra et al., 1997; Dudík et al., 2007; Cortes et al., 2015)
Theorem: the regularized Maxent and ML with Gibbs distributions problems are equivalent,
$$\sup_{w \in \mathbb{R}^N} G(w) = \min_p F(p).$$
• furthermore, let $p^* = \operatorname{argmin}_p F(p)$; then, for any $\epsilon > 0$,
$$\Big( |G(w) - \sup_{w \in \mathbb{R}^N} G(w)| < \epsilon \Big) \Rightarrow \Big( D(p^* \,\|\, p_w) \leq \epsilon \Big).$$
Notes
Maxent formulation:
• no explicit restriction to a family of distributions $\mathcal{P}$.
• but the solution coincides with regularized ML with a specific family $\mathcal{P}$!
• more general Bregman divergence-based formulation.
L1-Regularized Maxent (Kazama and Tsujii, 2003)
Optimization problem:
$$\inf_{w \in \mathbb{R}^N} \lambda\|w\|_1 - \frac{1}{m}\sum_{i=1}^m \log p_w[x_i], \quad \text{where } p_w[x] = \frac{1}{Z}\exp\big(w \cdot \Phi(x)\big).$$
Bayesian interpretation: equivalent to MAP with a Laplacian prior $q_{\text{prior}}(w)$ (Williams, 1994),
$$\max_{w} \log\Big(\prod_{i=1}^m p_w[x_i]\; q_{\text{prior}}(w)\Big), \quad \text{with } q_{\text{prior}}(w) = \prod_{j=1}^N \frac{\beta_j}{2}\exp(-\beta_j |w_j|).$$
Generalization Guarantee (Dudík et al., 2007)
Notation: $L_D(w) = \mathbb{E}_{x \sim D}[-\log p_w[x]]$, $L_S(w) = \mathbb{E}_{x \sim S}[-\log p_w[x]]$.
Theorem: Fix $\delta > 0$. Let $\widehat{w}$ be the solution of the L1-reg. Maxent problem for $\lambda = 2\mathfrak{R}_m(H) + \Lambda\sqrt{\log(\frac{2}{\delta})/(2m)}$. Then, with probability at least $1 - \delta$,
$$L_D(\widehat{w}) \leq \inf_{w} \Bigg\{ L_D(w) + 2\|w\|_1\Bigg[2\mathfrak{R}_m(H) + \Lambda\sqrt{\frac{\log\frac{2}{\delta}}{2m}}\Bigg] \Bigg\}.$$
Proof
By Hölder's inequality and the concentration bound for average feature vectors,
$$L_D(\widehat{w}) - L_S(\widehat{w}) = \widehat{w}\cdot\big[\mathbb{E}_S[\Phi] - \mathbb{E}_D[\Phi]\big] \leq \|\widehat{w}\|_1\,\big\|\mathbb{E}_S[\Phi] - \mathbb{E}_D[\Phi]\big\|_\infty \leq \lambda\|\widehat{w}\|_1.$$
Since $\widehat{w}$ is a minimizer,
$$L_D(\widehat{w}) - L_D(w) = L_D(\widehat{w}) - L_S(\widehat{w}) + L_S(\widehat{w}) - L_D(w) \leq \lambda\|\widehat{w}\|_1 + L_S(\widehat{w}) - L_D(w) \leq \lambda\|w\|_1 + L_S(w) - L_D(w) \leq 2\lambda\|w\|_1.$$
L2-Regularized Maxent (Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001)
Different relaxations:
• L1 constraints: $\forall j \in [1, N], \; \big|\mathbb{E}_{x \sim p}[\Phi_j(x)] - \mathbb{E}_{x \sim \widehat{p}}[\Phi_j(x)]\big| \leq \lambda_j$.
• L2 constraints: $\big\|\mathbb{E}_{x \sim p}[\Phi(x)] - \mathbb{E}_{x \sim \widehat{p}}[\Phi(x)]\big\|_2 \leq B$.
L2-Regularized Maxent
Optimization problem:
$$\inf_{w \in \mathbb{R}^N} \lambda\|w\|_2^2 - \frac{1}{m}\sum_{i=1}^m \log p_w[x_i], \quad \text{where } p_w[x] = \frac{1}{Z}\exp\big(w \cdot \Phi(x)\big).$$
Bayesian interpretation: equivalent to MAP with a Gaussian prior $q_{\text{prior}}(w)$ (Goodman, 2004),
$$\max_{w} \log\Big(\prod_{i=1}^m p_w[x_i]\; q_{\text{prior}}(w)\Big), \quad \text{with } q_{\text{prior}}(w) = \prod_{j=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{w_j^2}{2\sigma^2}}.$$
This Lecture
Notions of information theory.
Introduction to density estimation.
Maxent models.
Conditional Maxent models.
Conditional Maxent Models
Maxent models for conditional probabilities:
• conditional probability modeling each class.
• use in multi-class classification.
• can use different features for each class.
• a.k.a. multinomial logistic regression.
• logistic regression: special case of two classes.
Problem
Data: sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ drawn i.i.d. according to some distribution $D$.
• $\mathcal{Y} = \{1, \ldots, k\}$, or $\mathcal{Y} = \{0, 1\}^k$ in the multi-label case.
Features: mapping $\Phi \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^N$.
Problem: find accurate conditional probability models $\Pr[\cdot \,|\, x]$, $x \in \mathcal{X}$, based on $\Phi$.
Conditional Maxent Principle (Berger et al., 1996; Cortes et al., 2015)
Idea: the empirical feature vector average is close to its expectation. For any $\delta > 0$, with probability at least $1 - \delta$,
$$\bigg\| \mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim D[\cdot|x]}}[\Phi(x, y)] - \mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim \widehat{p}[\cdot|x]}}[\Phi(x, y)] \bigg\|_\infty \leq 2\mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Maxent principle: find conditional distributions $p[\cdot|x]$ that are closest to priors $p_0[\cdot|x]$ (typically uniform distributions) while verifying
$$\bigg\| \mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim p[\cdot|x]}}[\Phi(x, y)] - \mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim \widehat{p}[\cdot|x]}}[\Phi(x, y)] \bigg\|_\infty \leq \lambda.$$
Closeness is measured using the conditional relative entropy based on $\widehat{p}$.
Cond. Maxent Formulation (Berger et al., 1996; Cortes et al., 2015)
Optimization problem: find distribution $p$ solution of
$$\min_{p[\cdot|x] \in \Delta} \sum_{x \in \mathcal{X}} \widehat{p}[x]\, D\big(p[\cdot|x] \,\|\, p_0[\cdot|x]\big) \quad \text{s.t.} \quad \bigg\| \mathbb{E}_{x \sim \widehat{p}}\Big[\mathbb{E}_{y \sim p[\cdot|x]}[\Phi(x, y)]\Big] - \mathbb{E}_{(x, y) \sim S}[\Phi(x, y)] \bigg\|_\infty \leq \lambda.$$
• convex optimization problem, unique solution.
• $\lambda = 0$: unregularized conditional Maxent.
• $\lambda > 0$: regularized conditional Maxent.
Dual Problems
Regularized conditional Maxent problem:
$$\widetilde{F}(p) = \mathbb{E}_{x \sim \widehat{p}}\Big[D\big(p[\cdot|x] \,\|\, p_0[\cdot|x]\big)\Big] + I_\Delta\big(p[\cdot|x]\big) + I_C\bigg(\mathbb{E}_{\substack{x \sim \widehat{p} \\ y \sim p[\cdot|x]}}[\Phi]\bigg).$$
Regularized Maximum Likelihood problem with conditional Gibbs distributions:
$$\widetilde{G}(w) = \frac{1}{m}\sum_{i=1}^m \log\frac{p_w[y_i|x_i]}{p_0[y_i|x_i]} - \lambda\|w\|_1,$$
where $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$,
$$p_w[y|x] = \frac{p_0[y|x]\exp\big(w \cdot \Phi(x, y)\big)}{Z(x)}, \quad Z(x) = \sum_{y \in \mathcal{Y}} p_0[y|x]\exp\big(w \cdot \Phi(x, y)\big).$$
Duality Theorem (Cortes et al., 2015)
Theorem: the regularized conditional Maxent and ML with conditional Gibbs distributions problems are equivalent,
$$\sup_{w \in \mathbb{R}^N} \widetilde{G}(w) = \min_p \widetilde{F}(p).$$
• furthermore, let $p^* = \operatorname{argmin}_p \widetilde{F}(p)$; then, for any $\epsilon > 0$,
$$\Big( |\widetilde{G}(w) - \sup_{w \in \mathbb{R}^N} \widetilde{G}(w)| < \epsilon \Big) \Rightarrow \mathbb{E}_{x \sim \widehat{p}}\Big[D\big(p^*[\cdot|x] \,\|\, p_w[\cdot|x]\big)\Big] \leq \epsilon.$$
Regularized Cond. Maxent (Berger et al., 1996; Cortes et al., 2015)
Optimization problem: convex optimizations, regularization parameter $\lambda \geq 0$.
$$\min_{w \in \mathbb{R}^N} \lambda\|w\|_1 - \frac{1}{m}\sum_{i=1}^m \log p_w[y_i|x_i] \quad \text{or} \quad \min_{w \in \mathbb{R}^N} \lambda\|w\|_2^2 - \frac{1}{m}\sum_{i=1}^m \log p_w[y_i|x_i],$$
where $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$,
$$p_w[y|x] = \frac{\exp\big(w \cdot \Phi(x, y)\big)}{Z(x)}, \quad Z(x) = \sum_{y \in \mathcal{Y}} \exp\big(w \cdot \Phi(x, y)\big).$$
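The conditional model is a softmax over the scores $w\cdot\Phi(x,y)$, and $-\log p_w[y_i|x_i]$ is a log-sum-exp minus the correct-class score. A minimal sketch (ours, with made-up data; `scipy.special.logsumexp` computes $\log Z(x)$ stably):

```python
import numpy as np
from scipy.special import logsumexp

def cond_probs(w, features_xy):
    """p_w[y | x] = exp(w . Phi(x, y)) / Z(x) for one x.

    features_xy: array of shape (k, N), row y is Phi(x, y)."""
    scores = features_xy @ w
    return np.exp(scores - logsumexp(scores))        # log Z(x) via logsumexp

def objective(w, feats, labels, lam=0.0):
    """Regularized cond. maxent objective: lam*||w||_1 - (1/m) sum_i log p_w[y_i|x_i]."""
    losses = [logsumexp(f @ w) - (f @ w)[y] for f, y in zip(feats, labels)]
    return lam * np.abs(w).sum() + np.mean(losses)

# Toy check: k = 3 classes, N = 4 features, two examples.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(3, 4)) for _ in range(2)]   # Phi(x_i, .) for each x_i
labels = [0, 2]
w = rng.normal(size=4)
print(cond_probs(w, feats[0]))            # sums to 1
print(objective(w, feats, labels, lam=0.1))
```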
More Explicit Forms
Optimization problem:
$$\min_{w \in \mathbb{R}^N} \Big\{\lambda\|w\|_1 \text{ or } \lambda\|w\|_2^2\Big\} + \frac{1}{m}\sum_{i=1}^m \log\bigg[\sum_{y \in \mathcal{Y}} \exp\Big(w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i)\Big)\bigg],$$
or equivalently
$$\min_{w \in \mathbb{R}^N} \Big\{\lambda\|w\|_1 \text{ or } \lambda\|w\|_2^2\Big\} - w \cdot \frac{1}{m}\sum_{i=1}^m \Phi(x_i, y_i) + \frac{1}{m}\sum_{i=1}^m \log \sum_{y \in \mathcal{Y}} e^{w \cdot \Phi(x_i, y)}.$$
Related Problem
Optimization problem: log-sum-exp replaced by max.
$$\min_{w \in \mathbb{R}^N} \Big\{\lambda\|w\|_1 \text{ or } \lambda\|w\|_2^2\Big\} + \frac{1}{m}\sum_{i=1}^m \underbrace{\max_{y \in \mathcal{Y}}\Big(w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i)\Big)}_{-\rho_w(x_i, y_i)}.$$
Common Feature Choice
Multi-class features:
$$w = \begin{bmatrix} w_1 \\ \vdots \\ w_{y-1} \\ w_y \\ w_{y+1} \\ \vdots \\ w_{|\mathcal{Y}|} \end{bmatrix}, \quad
\Phi(x, y) = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \phi(x) \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
w \cdot \Phi(x, y) = w_y \cdot \phi(x).$$
L2-regularized cond. maxent optimization:
$$\min_{w \in \mathbb{R}^N} \lambda\sum_{y \in \mathcal{Y}} \|w_y\|_2^2 + \frac{1}{m}\sum_{i=1}^m \log\bigg[\sum_{y \in \mathcal{Y}} \exp\Big(w_y \cdot \phi(x_i) - w_{y_i} \cdot \phi(x_i)\Big)\bigg].$$
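With this block-structured choice of $\Phi$, the score $w\cdot\Phi(x,y)$ reduces to $w_y\cdot\phi(x)$. A short sketch (ours) building such features and checking the identity:

```python
import numpy as np

def block_features(phi_x, y, num_classes):
    """Phi(x, y): phi(x) placed in the y-th block of a (num_classes * d)-vector, zeros elsewhere."""
    d = phi_x.shape[0]
    out = np.zeros(num_classes * d)
    out[y * d:(y + 1) * d] = phi_x
    return out

k, d = 3, 4
rng = np.random.default_rng(1)
phi_x = rng.normal(size=d)
W = rng.normal(size=(k, d))        # per-class weight vectors w_1, ..., w_k
w = W.ravel()                      # stacked vector w = (w_1, ..., w_k)

for y in range(k):
    assert np.isclose(w @ block_features(phi_x, y, k), W[y] @ phi_x)
print("w . Phi(x, y) == w_y . phi(x) for every y")
```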
Prediction
Prediction with $p_w[y|x] = \frac{\exp(w \cdot \Phi(x, y))}{Z(x)}$:
$$\widehat{y}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} p_w[y|x] = \operatorname*{argmax}_{y \in \mathcal{Y}} w \cdot \Phi(x, y).$$
Binary Classification
Simpler expression:
$$\sum_{y \in \mathcal{Y}} \exp\Big(w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i)\Big) = e^{w \cdot \Phi(x_i, +1) - w \cdot \Phi(x_i, y_i)} + e^{w \cdot \Phi(x_i, -1) - w \cdot \Phi(x_i, y_i)} = 1 + e^{-y_i w \cdot [\Phi(x_i, +1) - \Phi(x_i, -1)]} = 1 + e^{-y_i w \cdot \Psi(x_i)},$$
with $\Psi(x) = \Phi(x, +1) - \Phi(x, -1)$.
Logistic Regression (Berkson, 1944)
Binary case of conditional Maxent.
Optimization problem:
$$\min_{w \in \mathbb{R}^N} \Big\{\lambda\|w\|_1 \text{ or } \lambda\|w\|_2^2\Big\} + \frac{1}{m}\sum_{i=1}^m \log\Big[1 + e^{-y_i w \cdot \Psi(x_i)}\Big].$$
• convex optimization.
• variety of solutions: SGD, coordinate descent, etc.
• coordinate descent: similar to AdaBoost with the logistic loss $\log_2(1 + e^{-u}) \geq 1_{u \leq 0}$ instead of the exponential loss.
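A minimal sketch of L2-regularized logistic regression fitted by plain gradient descent (ours, on synthetic data; in practice SGD, coordinate descent, or a library solver would be used, and step size/iterations would need tuning):

```python
import numpy as np

def fit_logistic(Psi, y, lam=0.1, lr=0.5, steps=2000):
    """Minimize lam * ||w||^2 + (1/m) sum_i log(1 + exp(-y_i w . Psi(x_i)))."""
    m, d = Psi.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (Psi @ w)                       # u_i = y_i w . Psi(x_i)
        # d/dw log(1 + e^{-u_i}) = -y_i Psi(x_i) / (1 + e^{u_i})
        coeff = -y / (1.0 + np.exp(margins))
        grad = 2 * lam * w + (Psi.T @ coeff) / m
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
m, d = 200, 2
Psi = rng.normal(size=(m, d))
w_true = np.array([2.0, -1.0])
y = np.where(Psi @ w_true + 0.3 * rng.normal(size=m) > 0, 1, -1)

w_hat = fit_logistic(Psi, y)
print(w_hat, "training accuracy:", np.mean(np.sign(Psi @ w_hat) == y))
```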
Generalization Bound
Theorem: assume that $\pm\Phi_j \in H$ for all $j \in [1, N]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $f \colon x \mapsto w \cdot \Phi(x)$,
$$R(f) \leq \frac{1}{m}\sum_{i=1}^m \log_{u_0}\Big(1 + e^{-y_i w \cdot \Phi(x_i)}\Big) + 4\|w\|_1\,\mathfrak{R}_m(H) + \sqrt{\frac{\log\log_2 2\|w\|_1}{m}} + \sqrt{\frac{\log\frac{2}{\delta}}{m}},$$
where $u_0 = \log(1 + 1/e)$.
Proof
Proof: by the learning bound for convex ensembles holding uniformly for all $\rho > 0$, with probability at least $1 - \delta$, for all $f$ and $\rho > 0$,
$$R(f) \leq \frac{1}{m}\sum_{i=1}^m 1_{\frac{y_i w \cdot \Phi(x_i)}{\rho\|w\|_1} - 1 \leq 0} + \frac{4}{\rho}\mathfrak{R}_m(H) + \sqrt{\frac{\log\log_2\frac{2}{\rho}}{m}} + \sqrt{\frac{\log\frac{2}{\delta}}{m}}.$$
Choosing $\rho = \frac{1}{\|w\|_1}$ and using $1_{u \leq 0} \leq \log_{u_0}(1 + e^{-u-1})$ yields immediately the learning bound of the theorem.
Logistic Regression (Berkson, 1944)
Logistic model:
$$\Pr[y = +1 \,|\, x] = \frac{e^{w \cdot \Phi(x, +1)}}{Z(x)}, \quad \text{where } Z(x) = e^{w \cdot \Phi(x, +1)} + e^{w \cdot \Phi(x, -1)}.$$
Properties:
• linear decision rule, sign of the log-odds ratio:
$$\log\frac{\Pr[y = +1 \,|\, x]}{\Pr[y = -1 \,|\, x]} = w \cdot \big(\Phi(x, +1) - \Phi(x, -1)\big) = w \cdot \Psi(x).$$
• logistic form:
$$\Pr[y = +1 \,|\, x] = \frac{1}{1 + e^{-w \cdot [\Phi(x, +1) - \Phi(x, -1)]}} = \frac{1}{1 + e^{-w \cdot \Psi(x)}}.$$
Logistic/Sigmoid Function
$$f \colon x \mapsto \frac{1}{1 + e^{-x}}, \qquad \Pr[y = +1 \,|\, x] = f\big(w \cdot \Psi(x)\big).$$
Applications
Natural language processing (Berger et al., 1996; Rosenfeld, 1996; Della Pietra et al., 1997; Malouf, 2002; Manning and Klein, 2003; Mann et al., 2009; Ratnaparkhi, 2010).
Species habitat modeling (Phillips et al., 2004, 2006; Dudík et al., 2007; Elith et al., 2011).
Computer vision (Jeon and Manmatha, 2004).
Extensions
Extensive theoretical study of alternative regularizations (Dudík et al., 2007) (see also (Altun and Smola, 2006), though some proofs are unclear).
Maxent models with other Bregman divergences (see for example (Altun and Smola, 2006)).
Structural Maxent models (Cortes et al., 2015):
• extension to the case of multiple feature families.
• empirically outperform Maxent and L1-Maxent.
• conditional structural Maxent: coincides with deep boosting using the logistic loss.
Conclusion
Logistic regression/maxent models:
• theoretical foundation.
• natural solution when probabilities are required.
• widely used for density estimation/classification.
• often very effective in practice.
• distributed optimization solutions.
• no natural non-linear L1-version (use of kernels).
• connections with boosting.
• connections with neural networks.
References
• Yasemin Altun and Alexander J. Smola. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of COLT, pages 139-153, 2006.
• Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
• J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357-365, 1944.
• Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967.
• Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Umar Syed. Structural Maxent models. In Proceedings of ICML, 2015.
• Imre Csiszár and Gábor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1:205-237, 1984.
• Imre Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3):1409-1413, 1989.
• J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470-1480, 1972.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, April 1997.
• Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR, 8, 2007.
• E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.
• E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
• Solomon Kullback and Richard A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79-86, 1951.
• O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
• Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 547-561. University of California Press, 1961.
• Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10:187-228, 1996.
• Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948.