Corso di Interazione Naturale
Prof. Giuseppe Boccignone
Dipartimento di Informatica, Università di Milano
[email protected]/IN_2016.html
Computation for natural interaction: optimization refresher (3)
(and first examples of Machine Learning)
• Minimize f(x) subject to (s.t.) the constraint g(x) = 0
• Necessary condition for x0 to be a solution: ∇f(x0) + α ∇g(x0) = 0
• α: Lagrange multiplier
• Multiple constraints gi(x) = 0, i = 1, …, m: one multiplier αi per constraint
A bit of basic optimization // constrained optimization
• For constraints of the form gi(x) ≤ 0 the Lagrange multiplier αi must be non-negative
• If x0 is a solution of the problem min f(x) s.t. gi(x) ≤ 0, i = 1, …, m
• then there exist αi ≥ 0, i = 1, …, m, such that x0 satisfies ∇f(x0) + Σi αi ∇gi(x0) = 0
• The function L(x, α) = f(x) + Σi αi gi(x) is the Lagrangian; its gradient must be set to 0
A bit of basic optimization // constrained optimization
• In general the Karush-Kuhn-Tucker (KKT) conditions hold
CHAPTER 4. NUMERICAL COMPUTATION

The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.

To define the Lagrangian, we first need to describe S in terms of equations and inequalities. We want a description of S in terms of m functions g_i and n functions h_j so that S = {x | ∀i, g_i(x) = 0 and ∀j, h_j(x) ≤ 0}. The equations involving g_i are called the equality constraints and the inequalities involving h_j are called the inequality constraints.

We introduce new variables λ_i and α_j for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as

L(x, λ, α) = f(x) + Σ_i λ_i g_i(x) + Σ_j α_j h_j(x).

We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. Observe that, so long as at least one feasible point exists and f(x) is not permitted to have value ∞,

min_x max_λ max_{α, α≥0} L(x, λ, α)

has the same optimal objective function value and set of optimal points x as

min_{x∈S} f(x).

This follows because any time the constraints are satisfied,

max_λ max_{α, α≥0} L(x, λ, α) = f(x),

while any time a constraint is violated,

max_λ max_{α, α≥0} L(x, λ, α) = ∞.

These properties guarantee that no infeasible point will ever be optimal, and that the optimum within the feasible points is unchanged.

To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:

min_x max_λ max_{α, α≥0} −f(x) + Σ_i λ_i g_i(x) + Σ_j α_j h_j(x).

We may also convert this to a problem with maximization in the outer loop:

max_x min_λ min_{α, α≥0} f(x) + Σ_i λ_i g_i(x) − Σ_j α_j h_j(x).
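The Lagrangian stationarity condition can be checked on a tiny equality-constrained problem. A minimal numerical sketch (not from the original text; Python/numpy is used here for illustration, though the course code is Matlab): minimize f(x) = x1² + x2² subject to g(x) = x1 + x2 − 1 = 0. Setting ∇x L = 0 together with the constraint gives a linear system in (x1, x2, λ).

```python
import numpy as np

# Minimize f(x) = x1^2 + x2^2  subject to  g(x) = x1 + x2 - 1 = 0.
# Stationarity of L(x, lam) = f(x) + lam * g(x) gives:
#   2*x1 + lam = 0
#   2*x2 + lam = 0
# together with the constraint  x1 + x2 = 1.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
x1, x2, lam = np.linalg.solve(A, b)
print(x1, x2, lam)  # x = (0.5, 0.5), lam = -1
```

At the solution the constraint is active and the gradient of f is a multiple of the gradient of g, exactly as the condition requires.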
• the two formulations have the same objective function
• when the constraints are satisfied
Example. // linear regression: regularization
[Figure: observed points t, generated from an unknown original curve plus noise, with an approximating polynomial.]
Learning problem: find the coefficients w (the parameters) given the target points t
Quadratic loss (error) function: E(w) = (1/2) Σn [y(xn, w) − tn]²
[Figures: fits with polynomials of order 0, 1, 3 and 9; the order-9 polynomial passes through every point but oscillates wildly (overfitting).]
Root-Mean-Square (RMS) error: E_RMS = √(2 E(w*) / N)
• Introduce a regularization term in the cost function
• Example: a quadratic term that penalizes large coefficients
penalized loss function: Ẽ(w) = (1/2) Σn [y(xn, w) − tn]² + (λ/2) ‖w‖²
λ: regularization parameter
the first term is data-dependent, the second is parameter-dependent
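The penalized loss has a closed-form minimizer for a polynomial model: w = (ΦᵀΦ + λI)⁻¹ Φᵀ t, where Φ is the design matrix. A minimal sketch (Python/numpy for illustration; the data and the value of λ are made up, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, degree = 10, 9
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)  # noisy targets

# Design matrix: Phi[n, j] = x_n^j
Phi = np.vander(x, degree + 1, increasing=True)

def fit(lam):
    # Closed-form minimizer of 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t)

w_ml = fit(0.0)    # unregularized order-9 fit: huge coefficients
w_reg = fit(1e-3)  # penalized fit: much smaller coefficients
print(np.linalg.norm(w_ml), np.linalg.norm(w_reg))
```

Increasing λ shrinks the coefficient norm, trading some data fit for a smoother curve.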
[Figures: order-9 fits compared for different amounts of regularization.]
Example: dimensionality reduction
original space: M-dimensional
"reduced" space: D-dimensional, with M > D
data matrix (N×M) → reduced data (N×D)
The best projections maximize the variance
Principal component analysis // intuitive description
• N centered ("whitened") data vectors yn
• that is, Σn yn = 0
• for a unit direction w the variance of the projections becomes: σ² = (1/N) Σn (wᵀ yn)²
• Substituting, σ² = wᵀ C w
• where C = (1/N) Σn yn ynᵀ is the empirical covariance matrix
Example: computing the covariance matrix
• The problem is to find w that maximizes the variance wᵀ C w
• we do so under the unit-norm (orthogonality) constraint wᵀ w = 1
• via a Lagrange multiplier this yields an eigenvector equation: C w = λ w
• each eigenvector/eigenvalue pair = projection direction / variance along it
• orthogonal directions of maximum variance
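The variance-maximization claim can be checked numerically. A hedged sketch (synthetic data; Python/numpy for illustration, though the course snippets are Matlab): the top eigenvector of C attains the largest projected variance wᵀCw among unit vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 500, 3
# centered samples with most spread along the first axis
Y = rng.standard_normal((N, M)) * np.array([3.0, 1.0, 0.5])
Y = Y - Y.mean(axis=0)

C = (Y.T @ Y) / N                     # empirical covariance (MxM)
eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues
w_top = eigvecs[:, -1]                # direction of maximum variance

# no random unit vector should beat the top eigenvector
best_random = 0.0
for v in rng.standard_normal((200, M)):
    w = v / np.linalg.norm(v)
    best_random = max(best_random, w @ C @ w)

print(w_top @ C @ w_top, best_random)
```

The projected variance along the top eigenvector equals the largest eigenvalue, consistent with the Rayleigh-quotient argument above.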
Determination of the principal components
% Subtract the mean
Y = Y - repmat(mean(Y,1),N,1);

% Compute the covariance matrix
C = (1/N)*Y'*Y;

% Find eigenvectors / eigenvalues;
% the columns of w are the projection directions
[w,lambda] = eig(C);
% eig does not sort: order directions by decreasing variance
[~,idx] = sort(diag(lambda),'descend');
w = w(:,idx);

% Project the data onto the first D directions
X = Y*w(:,1:D);
Dimensions: C is (M×M), Y is (N×M), X = Y·w(:,1:D) is (N×M)(M×D) = (N×D)
Reconstruction or back-projection
Back-projection: Ỹ = X·w(:,1:D)' — (N×D)(D×M) = (N×M)
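The projection/back-projection pipeline can be mirrored in a short sketch (Python/numpy for illustration; all data here are synthetic, not the course's):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, D = 200, 5, 2
# synthetic data with low-rank structure plus small noise
Y = rng.standard_normal((N, D)) @ rng.standard_normal((D, M))
Y = Y + 0.01 * rng.standard_normal((N, M))
Y = Y - Y.mean(axis=0)

C = (Y.T @ Y) / N
eigvals, W = np.linalg.eigh(C)
W = W[:, ::-1]                # sort directions by decreasing variance

X = Y @ W[:, :D]              # project:      (NxM)(MxD) -> (NxD)
Y_rec = X @ W[:, :D].T        # back-project: (NxD)(DxM) -> (NxM)

err = np.linalg.norm(Y - Y_rec) / np.linalg.norm(Y)
print(err)                    # small: most variance lies in D directions
```

Because the data are nearly rank D, keeping the top D directions reconstructs Y almost exactly.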
• Input data, D=
• Eigenvectors (D = 48)
Principal component analysis // examples
[Figures: reconstructions with D = 10 and D = 100.]
recall the result obtained with the SVD
Principal component analysis // SVD version
• We can recast the problem in terms of the SVD
• Define Y' = Y/√N, so that C = Y'ᵀ Y' (by def.)
• From the SVD Y' = U Σ Vᵀ: C = V Σᵀ Uᵀ U Σ Vᵀ = V (ΣᵀΣ) Vᵀ (by SVD; symmetry)
• the columns of V are the principal components; the squared singular values are the variances
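The eigendecomposition and SVD routes can be checked against each other. A minimal sketch (Python/numpy, synthetic data): the squared singular values of Y/√N coincide with the eigenvalues of C = (1/N)YᵀY.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 100, 4
Y = rng.standard_normal((N, M))
Y = Y - Y.mean(axis=0)

# route 1: eigenvalues of the covariance C = (1/N) Y^T Y
C = (Y.T @ Y) / N
eigvals = np.linalg.eigvalsh(C)[::-1]  # decreasing order

# route 2: squared singular values of Y' = Y / sqrt(N)
s = np.linalg.svd(Y / np.sqrt(N), compute_uv=False)
print(eigvals, s ** 2)
```

The two vectors of variances agree to machine precision, which is exactly the identity C = V (ΣᵀΣ) Vᵀ derived above.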
A symmetric matrix has the special property that all of its eigenvectors are not just linearly independent but also orthogonal, thus completing our proof.

In the first part of the proof, let A be just some matrix, not necessarily symmetric, and let it have independent eigenvectors (i.e. no degeneracy). Furthermore, let E = [e1 e2 … en] be the matrix of eigenvectors placed in the columns. Let D be a diagonal matrix where the i-th eigenvalue is placed in the ii-th position.

We will now show that AE = ED. We can examine the columns of the right-hand and left-hand sides of the equation.

Left-hand side:  AE = [Ae1 Ae2 … Aen]
Right-hand side: ED = [λ1e1 λ2e2 … λnen]

Evidently, if AE = ED then Aei = λiei for all i. This equation is the definition of the eigenvalue equation. Therefore, it must be that AE = ED. A little rearrangement provides A = EDE⁻¹, completing the first part of the proof.

For the second part of the proof, we show that a symmetric matrix always has orthogonal eigenvectors. For some symmetric matrix, let λ1 and λ2 be distinct eigenvalues for eigenvectors e1 and e2.

λ1 e1 · e2 = (λ1 e1)ᵀ e2
           = (A e1)ᵀ e2
           = e1ᵀ Aᵀ e2
           = e1ᵀ A e2
           = e1ᵀ (λ2 e2)
λ1 e1 · e2 = λ2 e1 · e2

By the last relation we see that (λ1 − λ2) e1 · e2 = 0. Since we have assumed that the eigenvalues are distinct, it must be the case that e1 · e2 = 0. Therefore, the eigenvectors of a symmetric matrix are orthogonal.

Let us back up now to our original postulate that A is a symmetric matrix. By the second part of the proof, we know that the eigenvectors of A are all orthonormal (we choose the eigenvectors to be normalized). This means that E is an orthogonal matrix, so by theorem 1, Eᵀ = E⁻¹, and we can rewrite the final result:

A = E D Eᵀ.

Thus, a symmetric matrix is diagonalized by a matrix of its eigenvectors.
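The diagonalization A = EDEᵀ is easy to verify numerically. A minimal sketch (Python/numpy for illustration; the random matrix is made up):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = B + B.T                    # a symmetric matrix

d, E = np.linalg.eigh(A)       # orthonormal eigenvectors in the columns of E
D = np.diag(d)

# E is orthogonal (E^T = E^-1), and A is diagonalized as E D E^T
print(np.allclose(E.T @ E, np.eye(4)), np.allclose(A, E @ D @ E.T))
```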
5. For any arbitrary m × n matrix X, the symmetric matrix XᵀX has a set of orthonormal eigenvectors {v̂1, v̂2, …, v̂n} and a set of associated eigenvalues {λ1, λ2, …, λn}. The set of vectors {Xv̂1, Xv̂2, …, Xv̂n} then forms an orthogonal basis, where each vector Xv̂i is of length √λi.

All of these properties arise from the dot product of any two vectors from this set.

(Xv̂i) · (Xv̂j) = (Xv̂i)ᵀ (Xv̂j)
              = v̂iᵀ XᵀX v̂j
              = v̂iᵀ (λj v̂j)
              = λj v̂i · v̂j
(Xv̂i) · (Xv̂j) = λj δij

The last relation arises because the set of eigenvectors of XᵀX is orthonormal, producing the Kronecker delta. In simpler terms, the last relation states:

(Xv̂i) · (Xv̂j) = λj if i = j, and 0 if i ≠ j.

This equation states that any two distinct vectors in the set are orthogonal.

The second property arises from the above equation by realizing that the squared length of each vector is:

‖Xv̂i‖² = (Xv̂i) · (Xv̂i) = λi
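Property 5 can likewise be verified numerically. A minimal sketch (Python/numpy, random matrix): the Gram matrix of {Xv̂i} is diag(λ), so the vectors are mutually orthogonal with squared lengths λi.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 4))    # an arbitrary m x n matrix

lam, V = np.linalg.eigh(X.T @ X)   # orthonormal eigenvectors of X^T X
XV = X @ V                         # the set {X v_i} as columns

# Gram matrix: (X v_i) . (X v_j) = lam_j * delta_ij
G = XV.T @ XV
print(np.allclose(G, np.diag(lam)))
print(np.allclose(np.linalg.norm(XV, axis=0) ** 2, lam))
```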
APPENDIX B: Code
This code is written for Matlab 6.5 (Release 13) from Mathworks13. The code is not computationally efficient but explanatory (terse comments begin with a %).

This first version follows Section 5 by examining the covariance of the data set.
function [signals,PC,V] = pca1(data)
% PCA1: Perform PCA using covariance.
% data - MxN matrix of input data
%        (M dimensions, N trials)
% signals - MxN matrix of projected data
% PC - each column is a PC
% V - Mx1 matrix of variances

[M,N] = size(data);

% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);

% calculate the covariance matrix
covariance = 1 / (N-1) * data * data';

% find the eigenvectors and eigenvalues
[PC, V] = eig(covariance);
13 http://www.mathworks.com
% extract diagonal of matrix as vector
V = diag(V);

% sort the variances in decreasing order
[junk, rindices] = sort(-1*V);
V = V(rindices);
PC = PC(:,rindices);

% project the original data set
signals = PC' * data;
This second version follows Section 6, computing PCA through the SVD.
function [signals,PC,V] = pca2(data)
% PCA2: Perform PCA using SVD.
% data - MxN matrix of input data
%        (M dimensions, N trials)
% signals - MxN matrix of projected data
% PC - each column is a PC
% V - Mx1 matrix of variances

[M,N] = size(data);

% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);

% construct the matrix Y
Y = data' / sqrt(N-1);

% SVD does it all
[u,S,PC] = svd(Y);

% calculate the variances
S = diag(S);
V = S .* S;

% project the original data
signals = PC' * data;