Corso di Interazione Naturale
Prof. Giuseppe Boccignone
Dipartimento di Informatica, Università di Milano
[email protected]/IN_2016.html
Computation for natural interaction: optimization refresher (3)
(and first examples of Machine Learning)
• Minimize f(x) subject to (s.t.) the constraint g(x) = 0
• Necessary condition for x0 to be a solution: ∇f(x0) + α ∇g(x0) = 0
• α: Lagrange multiplier
• Multiple constraints gi(x) = 0, i = 1, …, m: one multiplier αi per constraint
A bit of basic optimization // constrained optimization
• For constraints of the form gi(x) ≤ 0 the Lagrange multiplier αi must be non-negative
• If x0 is a solution of the problem min f(x) s.t. gi(x) ≤ 0, i = 1, …, m
• then there exist αi ≥ 0, i = 1, …, m, such that x0 satisfies ∇f(x0) + Σi αi ∇gi(x0) = 0
• The function L(x, α) = f(x) + Σi αi gi(x) is the Lagrangian; its gradient must be set to 0
A bit of basic optimization // constrained optimization
• In general the Karush-Kuhn-Tucker (KKT) conditions hold
CHAPTER 4. NUMERICAL COMPUTATION

The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.

To define the Lagrangian, we first need to describe S in terms of equations and inequalities. We want a description of S in terms of m functions g_i and n functions h_j so that S = {x | ∀i, g_i(x) = 0 and ∀j, h_j(x) ≤ 0}. The equations involving g_i are called the equality constraints and the inequalities involving h_j are called the inequality constraints.

We introduce new variables λ_i and α_j for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as

L(x, λ, α) = f(x) + Σ_i λ_i g_i(x) + Σ_j α_j h_j(x).

We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. Observe that, so long as at least one feasible point exists and f(x) is not permitted to have value ∞,

min_x max_λ max_{α, α≥0} L(x, λ, α)

has the same optimal objective function value and set of optimal points x as

min_{x∈S} f(x).

This follows because any time the constraints are satisfied,

max_λ max_{α, α≥0} L(x, λ, α) = f(x),

while any time a constraint is violated,

max_λ max_{α, α≥0} L(x, λ, α) = ∞.

These properties guarantee that no infeasible point will ever be optimal, and that the optimum within the feasible points is unchanged.

To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:

min_x max_λ max_{α, α≥0} −f(x) + Σ_i λ_i g_i(x) + Σ_j α_j h_j(x).

We may also convert this to a problem with maximization in the outer loop:

max_x min_λ min_{α, α≥0} f(x) + Σ_i λ_i g_i(x) − Σ_j α_j h_j(x).
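The Lagrangian stationarity condition can be checked on a tiny equality-constrained problem. A minimal numerical sketch (not from the original text; Python/numpy is used here for illustration, though the course code is Matlab): minimize f(x) = x1² + x2² subject to g(x) = x1 + x2 − 1 = 0. Setting ∇x L = 0 together with the constraint gives a linear system in (x1, x2, λ).

```python
import numpy as np

# Minimize f(x) = x1^2 + x2^2  subject to  g(x) = x1 + x2 - 1 = 0.
# Stationarity of L(x, lam) = f(x) + lam * g(x) gives:
#   2*x1 + lam = 0
#   2*x2 + lam = 0
# together with the constraint  x1 + x2 = 1.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
x1, x2, lam = np.linalg.solve(A, b)
print(x1, x2, lam)  # x = (0.5, 0.5), lam = -1
```

At the solution the constraint is active and the gradient of f is a multiple of the gradient of g, exactly as the condition requires.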
• the two formulations have the same objective function
• when the constraints are satisfied
Example. // linear regression: regularization
[Figure: observed points t, generated from an unknown original curve plus noise, with an approximating polynomial.]
Learning problem: find the coefficients w (the parameters) given the target points t
Quadratic loss (error) function: E(w) = (1/2) Σn [y(xn, w) − tn]²
[Figures: fits with polynomials of order 0, 1, 3 and 9; the order-9 polynomial passes through every point but oscillates wildly (overfitting).]
Root-Mean-Square (RMS) error: E_RMS = √(2 E(w*) / N)
• Introduce a regularization term in the cost function
• Example: a quadratic term that penalizes large coefficients
penalized loss function: Ẽ(w) = (1/2) Σn [y(xn, w) − tn]² + (λ/2) ‖w‖²
λ: regularization parameter
the first term is data-dependent, the second is parameter-dependent
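The penalized loss has a closed-form minimizer for a polynomial model: w = (ΦᵀΦ + λI)⁻¹ Φᵀ t, where Φ is the design matrix. A minimal sketch (Python/numpy for illustration; the data and the value of λ are made up, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, degree = 10, 9
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)  # noisy targets

# Design matrix: Phi[n, j] = x_n^j
Phi = np.vander(x, degree + 1, increasing=True)

def fit(lam):
    # Closed-form minimizer of 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t)

w_ml = fit(0.0)    # unregularized order-9 fit: huge coefficients
w_reg = fit(1e-3)  # penalized fit: much smaller coefficients
print(np.linalg.norm(w_ml), np.linalg.norm(w_reg))
```

Increasing λ shrinks the coefficient norm, trading some data fit for a smoother curve.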
[Figures: order-9 fits compared for different amounts of regularization.]
Example: dimensionality reduction
original space: M-dimensional
"reduced" space: D-dimensional, with M > D
data matrix (N×M) → reduced data (N×D)
The best projections maximize the variance
Principal component analysis // intuitive description
• N centered ("whitened") data vectors yn
• that is, Σn yn = 0
• for a unit direction w the variance of the projections becomes: σ² = (1/N) Σn (wᵀ yn)²
• Substituting, σ² = wᵀ C w
• where C = (1/N) Σn yn ynᵀ is the empirical covariance matrix
Example: computing the covariance matrix
• The problem is to find w that maximizes the variance wᵀ C w
• we do so under the unit-norm (orthogonality) constraint wᵀ w = 1
• via a Lagrange multiplier this yields an eigenvector equation: C w = λ w
• each eigenvector/eigenvalue pair = projection direction / variance along it
• orthogonal directions of maximum variance
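The variance-maximization claim can be checked numerically. A hedged sketch (synthetic data; Python/numpy for illustration, though the course snippets are Matlab): the top eigenvector of C attains the largest projected variance wᵀCw among unit vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 500, 3
# centered samples with most spread along the first axis
Y = rng.standard_normal((N, M)) * np.array([3.0, 1.0, 0.5])
Y = Y - Y.mean(axis=0)

C = (Y.T @ Y) / N                     # empirical covariance (MxM)
eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues
w_top = eigvecs[:, -1]                # direction of maximum variance

# no random unit vector should beat the top eigenvector
best_random = 0.0
for v in rng.standard_normal((200, M)):
    w = v / np.linalg.norm(v)
    best_random = max(best_random, w @ C @ w)

print(w_top @ C @ w_top, best_random)
```

The projected variance along the top eigenvector equals the largest eigenvalue, consistent with the Rayleigh-quotient argument above.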
Determination of the principal components
% Subtract the mean
Y = Y - repmat(mean(Y,1),N,1);

% Compute the covariance matrix
C = (1/N)*Y'*Y;

% Find eigenvectors / eigenvalues;
% the columns of w are the projection directions
[w,lambda] = eig(C);
% eig does not sort: order directions by decreasing variance
[~,idx] = sort(diag(lambda),'descend');
w = w(:,idx);

% Project the data onto the first D directions
X = Y*w(:,1:D);
Dimensions: C is (M×M), Y is (N×M), X = Y·w(:,1:D) is (N×M)(M×D) = (N×D)
Reconstruction or back-projection
Back-projection: Ỹ = X·w(:,1:D)' — (N×D)(D×M) = (N×M)
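The projection/back-projection pipeline can be mirrored in a short sketch (Python/numpy for illustration; all data here are synthetic, not the course's):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, D = 200, 5, 2
# synthetic data with low-rank structure plus small noise
Y = rng.standard_normal((N, D)) @ rng.standard_normal((D, M))
Y = Y + 0.01 * rng.standard_normal((N, M))
Y = Y - Y.mean(axis=0)

C = (Y.T @ Y) / N
eigvals, W = np.linalg.eigh(C)
W = W[:, ::-1]                # sort directions by decreasing variance

X = Y @ W[:, :D]              # project:      (NxM)(MxD) -> (NxD)
Y_rec = X @ W[:, :D].T        # back-project: (NxD)(DxM) -> (NxM)

err = np.linalg.norm(Y - Y_rec) / np.linalg.norm(Y)
print(err)                    # small: most variance lies in D directions
```

Because the data are nearly rank D, keeping the top D directions reconstructs Y almost exactly.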
• Input data, D=
• Eigenvectors (D = 48)
Principal component analysis // examples
[Figures: reconstructions with D = 10 and D = 100.]
recall the result obtained with the SVD
Principal component analysis // SVD version
• We can recast the problem in terms of the SVD
• Define Y' = Y/√N, so that C = Y'ᵀ Y' (by def.)
• From the SVD Y' = U Σ Vᵀ: C = V Σᵀ Uᵀ U Σ Vᵀ = V (ΣᵀΣ) Vᵀ (by SVD; symmetry)
• the columns of V are the principal components; the squared singular values are the variances
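The eigendecomposition and SVD routes can be checked against each other. A minimal sketch (Python/numpy, synthetic data): the squared singular values of Y/√N coincide with the eigenvalues of C = (1/N)YᵀY.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 100, 4
Y = rng.standard_normal((N, M))
Y = Y - Y.mean(axis=0)

# route 1: eigenvalues of the covariance C = (1/N) Y^T Y
C = (Y.T @ Y) / N
eigvals = np.linalg.eigvalsh(C)[::-1]  # decreasing order

# route 2: squared singular values of Y' = Y / sqrt(N)
s = np.linalg.svd(Y / np.sqrt(N), compute_uv=False)
print(eigvals, s ** 2)
```

The two vectors of variances agree to machine precision, which is exactly the identity C = V (ΣᵀΣ) Vᵀ derived above.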
A symmetric matrix has the special property that all of its eigenvectors are not just linearly independent but also orthogonal, thus completing our proof.

In the first part of the proof, let A be just some matrix, not necessarily symmetric, and let it have independent eigenvectors (i.e. no degeneracy). Furthermore, let E = [e1 e2 … en] be the matrix of eigenvectors placed in the columns. Let D be a diagonal matrix where the i-th eigenvalue is placed in the ii-th position.

We will now show that AE = ED. We can examine the columns of the right-hand and left-hand sides of the equation.

Left-hand side:  AE = [Ae1 Ae2 … Aen]
Right-hand side: ED = [λ1e1 λ2e2 … λnen]

Evidently, if AE = ED then Aei = λiei for all i. This equation is the definition of the eigenvalue equation. Therefore, it must be that AE = ED. A little rearrangement provides A = EDE⁻¹, completing the first part of the proof.

For the second part of the proof, we show that a symmetric matrix always has orthogonal eigenvectors. For some symmetric matrix, let λ1 and λ2 be distinct eigenvalues for eigenvectors e1 and e2.

λ1 e1 · e2 = (λ1 e1)ᵀ e2
           = (A e1)ᵀ e2
           = e1ᵀ Aᵀ e2
           = e1ᵀ A e2
           = e1ᵀ (λ2 e2)
λ1 e1 · e2 = λ2 e1 · e2

By the last relation we see that (λ1 − λ2) e1 · e2 = 0. Since we have assumed that the eigenvalues are distinct, it must be the case that e1 · e2 = 0. Therefore, the eigenvectors of a symmetric matrix are orthogonal.

Let us back up now to our original postulate that A is a symmetric matrix. By the second part of the proof, we know that the eigenvectors of A are all orthonormal (we choose the eigenvectors to be normalized). This means that E is an orthogonal matrix, so by theorem 1, Eᵀ = E⁻¹, and we can rewrite the final result:

A = E D Eᵀ.

Thus, a symmetric matrix is diagonalized by a matrix of its eigenvectors.
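The diagonalization A = EDEᵀ is easy to verify numerically. A minimal sketch (Python/numpy for illustration; the random matrix is made up):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = B + B.T                    # a symmetric matrix

d, E = np.linalg.eigh(A)       # orthonormal eigenvectors in the columns of E
D = np.diag(d)

# E is orthogonal (E^T = E^-1), and A is diagonalized as E D E^T
print(np.allclose(E.T @ E, np.eye(4)), np.allclose(A, E @ D @ E.T))
```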
5. For any arbitrary m × n matrix X, the symmetric matrix XᵀX has a set of orthonormal eigenvectors {v̂1, v̂2, …, v̂n} and a set of associated eigenvalues {λ1, λ2, …, λn}. The set of vectors {Xv̂1, Xv̂2, …, Xv̂n} then forms an orthogonal basis, where each vector Xv̂i is of length √λi.

All of these properties arise from the dot product of any two vectors from this set.

(Xv̂i) · (Xv̂j) = (Xv̂i)ᵀ (Xv̂j)
              = v̂iᵀ XᵀX v̂j
              = v̂iᵀ (λj v̂j)
              = λj v̂i · v̂j
(Xv̂i) · (Xv̂j) = λj δij

The last relation arises because the set of eigenvectors of XᵀX is orthonormal, producing the Kronecker delta. In simpler terms, the last relation states:

(Xv̂i) · (Xv̂j) = λj if i = j, and 0 if i ≠ j.

This equation states that any two distinct vectors in the set are orthogonal.

The second property arises from the above equation by realizing that the squared length of each vector is:

‖Xv̂i‖² = (Xv̂i) · (Xv̂i) = λi
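Property 5 can likewise be verified numerically. A minimal sketch (Python/numpy, random matrix): the Gram matrix of {Xv̂i} is diag(λ), so the vectors are mutually orthogonal with squared lengths λi.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 4))    # an arbitrary m x n matrix

lam, V = np.linalg.eigh(X.T @ X)   # orthonormal eigenvectors of X^T X
XV = X @ V                         # the set {X v_i} as columns

# Gram matrix: (X v_i) . (X v_j) = lam_j * delta_ij
G = XV.T @ XV
print(np.allclose(G, np.diag(lam)))
print(np.allclose(np.linalg.norm(XV, axis=0) ** 2, lam))
```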
APPENDIX B: Code
This code is written for Matlab 6.5 (Release 13) from Mathworks13. The code is not computationally efficient but explanatory (terse comments begin with a %).

This first version follows Section 5 by examining the covariance of the data set.
function [signals,PC,V] = pca1(data)
% PCA1: Perform PCA using covariance.
% data - MxN matrix of input data
%        (M dimensions, N trials)
% signals - MxN matrix of projected data
% PC - each column is a PC
% V - Mx1 matrix of variances

[M,N] = size(data);

% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);

% calculate the covariance matrix
covariance = 1 / (N-1) * data * data';

% find the eigenvectors and eigenvalues
[PC, V] = eig(covariance);
13 http://www.mathworks.com
% extract diagonal of matrix as vector
V = diag(V);

% sort the variances in decreasing order
[junk, rindices] = sort(-1*V);
V = V(rindices);
PC = PC(:,rindices);

% project the original data set
signals = PC' * data;
This second version follows Section 6, computing PCA through the SVD.
function [signals,PC,V] = pca2(data)
% PCA2: Perform PCA using SVD.
% data - MxN matrix of input data
%        (M dimensions, N trials)
% signals - MxN matrix of projected data
% PC - each column is a PC
% V - Mx1 matrix of variances

[M,N] = size(data);

% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);

% construct the matrix Y
Y = data' / sqrt(N-1);

% SVD does it all
[u,S,PC] = svd(Y);

% calculate the variances
S = diag(S);
V = S .* S;

% project the original data
signals = PC' * data;