
Page 1

Computation for natural interaction: optimization refresher (1 and 2)

(and first examples of Machine Learning)

Corso di Interazione Naturale

Prof. Giuseppe Boccignone

Dipartimento di Informatica, Università di Milano

[email protected]
boccignone.di.unimi.it/IN_2018.html

[Slide diagram: probabilistic models; inference and learning algorithms; hardware implementation; models in the cognitive sciences // tools; dataset D = (X, Y); stochastic techniques (Monte Carlo); optimization techniques (cost function); linear algebra; probability; structured probabilistic models (graphical models); psychological / neurobiological models.]

Page 2

A bit of basic optimization // the problem

• Minimize or maximize a function $f(\mathbf{x})$ by altering $\mathbf{x}$

• $\min f(\mathbf{x}) \Leftrightarrow \max -f(\mathbf{x})$: maximization can be carried out by minimizing $-f(\mathbf{x})$

• Typically $\mathbf{x}$ is a vector of parameters

• We proceed iteratively

CHAPTER 4. NUMERICAL COMPUTATION

Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation, because rounding errors in the inputs can result in large changes in the output.

Consider the function $f(\mathbf{x}) = \mathbf{A}^{-1}\mathbf{x}$. When $\mathbf{A} \in \mathbb{R}^{n \times n}$ has an eigenvalue decomposition, its condition number is

$$\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|,$$

i.e., the ratio of the magnitudes of the largest and smallest eigenvalues. When this number is large, matrix inversion is particularly sensitive to error in the input.

Note that this is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself. With iterative algorithms, such as solving a linear system (or the worked-out example of linear least squares by gradient descent, Section 4.5), ill-conditioning (in that case, of the linear system matrix) yields very slow convergence of the iterative algorithm, i.e., more iterations are needed to achieve some given degree of approximation to the final solution.
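A minimal numerical sketch (added here, not part of the original slides or excerpt), assuming NumPy: it computes the eigenvalue-ratio condition number defined above and shows that an ill-conditioned matrix turns a tiny input perturbation into a much larger change in $f(\mathbf{x}) = \mathbf{A}^{-1}\mathbf{x}$. The matrices and perturbation size are arbitrary illustrative choices.

```python
import numpy as np

# Sketch: condition number as |lambda_max| / |lambda_min|, and error amplification
# when evaluating f(x) = A^{-1} x with an ill-conditioned A.
def condition_number(A):
    mags = np.abs(np.linalg.eigvals(A))
    return mags.max() / mags.min()

well_conditioned = np.array([[2.0, 0.0],
                             [0.0, 1.0]])      # eigenvalues 2 and 1
ill_conditioned = np.array([[1.0, 0.0],
                            [0.0, 1e-6]])      # eigenvalues 1 and 1e-6

x = np.array([1.0, 1.0])
dx = 1e-8 * np.array([1.0, -1.0])              # tiny input perturbation

for name, A in [("well-conditioned", well_conditioned),
                ("ill-conditioned", ill_conditioned)]:
    f_x = np.linalg.solve(A, x)                # f(x) = A^{-1} x
    f_x_dx = np.linalg.solve(A, x + dx)        # f(x + dx)
    print(name, condition_number(A), np.linalg.norm(f_x_dx - f_x))
```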

4.3 Gradient-Based Optimization

Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function $f(\mathbf{x})$ by altering $\mathbf{x}$. We usually phrase most optimization problems in terms of minimizing $f(\mathbf{x})$. Maximization may be accomplished via a minimization algorithm by minimizing $-f(\mathbf{x})$.

The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.

We often denote the value that minimizes or maximizes a function with a superscript $*$. For example, we might say $x^* = \arg\min f(x)$.

We assume the reader is already familiar with calculus, but provide a brief review of how calculus concepts relate to optimization here.

Suppose we have a function $y = f(x)$, where both $x$ and $y$ are real numbers. The derivative of this function is denoted as $f'(x)$ or as $\frac{dy}{dx}$. The derivative $f'(x)$ gives the slope of $f(x)$ at the point $x$. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: $f(x + \epsilon) \approx f(x) + \epsilon f'(x)$.
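A quick sanity check of the first-order approximation above (my addition); $f(x) = x^2$ and its derivative $2x$ are just assumed toy choices.

```python
# Check numerically that f(x + eps) ~ f(x) + eps * f'(x) for small eps.
f = lambda x: x ** 2
f_prime = lambda x: 2.0 * x

x, eps = 1.5, 1e-4
exact = f(x + eps)
first_order = f(x) + eps * f_prime(x)
print(exact, first_order, abs(exact - first_order))   # discrepancy is O(eps**2)
```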


Appendix B

Optimization

Throughout this book, we have used iterative nonlinear optimization methods to find the maximum likelihood or MAP parameter estimates. We now provide more details about these methods. It is impossible to do full justice to this topic in the space available; many entire books have been written about nonlinear optimization. Our goal is merely to provide a brief introduction to the main ideas.

B.1 Problem statement

Continuous nonlinear optimization techniques aim to find the set of parameters $\hat{\boldsymbol\theta}$ that minimize a function $f[\bullet]$. In other words, they try to compute

$$\hat{\boldsymbol\theta} = \underset{\boldsymbol\theta}{\operatorname{argmin}} \left[ f[\boldsymbol\theta] \right], \qquad \text{(B.1)}$$

where $f[\bullet]$ is termed a cost function or objective function.

Although optimization techniques are usually described in terms of minimizing a function, most optimization problems in this book involve maximizing an objective function based on log probability. To turn a maximization problem into a minimization, we multiply the objective function by minus one. In other words, instead of maximizing the log probability, we minimize the negative log probability.
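A small sketch of this recipe (my addition, assuming NumPy and SciPy are available): the log likelihood of the mean of a unit-variance Gaussian is maximized by minimizing its negative, and the estimate matches the sample mean. The dataset and optimizer settings are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=200)     # toy dataset

def negative_log_likelihood(theta):
    # Negative log probability of the data under a unit-variance Gaussian
    # with mean theta[0]; additive constants are dropped.
    mu = theta[0]
    return 0.5 * np.sum((data - mu) ** 2)

result = minimize(negative_log_likelihood, x0=np.array([0.0]))
print(result.x[0], data.mean())                     # the two estimates agree
```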

B.1.1 Convexity

The optimization techniques that we consider here are iterative: they start with an estimate $\boldsymbol\theta^{[0]}$ and improve it by finding successive new estimates $\boldsymbol\theta^{[1]}, \boldsymbol\theta^{[2]}, \ldots, \boldsymbol\theta^{[\infty]}$, each of which is better than the last, until no more improvement can be made. The techniques are purely local in the sense that the decision about where to move next is based on only the properties of the function at the current position. Consequently, these techniques cannot guarantee the correct solution: they may find an estimate $\boldsymbol\theta^{[\infty]}$ from which no local change improves the cost.

Copyright © 2011, 2012 by Simon Prince; published by Cambridge University Press 2012. For personal use only, not for distribution.

cost function or objective function


Figure B.1 Local minima. Optimization methods aim to find the minimum of the objective function $f[\boldsymbol\theta]$ with respect to parameters $\boldsymbol\theta$. Roughly, they work by starting with an initial estimate $\boldsymbol\theta^{[0]}$ and moving iteratively downhill until no more progress can be made (final position represented by $\boldsymbol\theta^{[\infty]}$). Unfortunately, it is possible to terminate in a local minimum. For example, if we start at $\boldsymbol\theta'^{[0]}$ and move downhill, we wind up in position $\boldsymbol\theta'^{[\infty]}$.

[Figure B.1: plot of the objective function (vertical axis) against the parameter; only the axis label survived extraction.]

However, this does not mean there is not a better solution in some distant part of the function that has not yet been explored (figure B.1). In optimization parlance, they can only find local minima. One way to mitigate this problem is to start the optimization from a number of different places and choose the final solution with the lowest cost.

In the special case where the function is convex, there will only be a single minimum, and we are guaranteed to find it with sufficient iterations (figure B.2). For a 1D function, it is possible to establish convexity by looking at the second derivative of the function; if this is positive everywhere (i.e., the slope is continuously increasing), then the function is convex and the global minimum can be found. The equivalent test in higher dimensions is to examine the Hessian matrix (the matrix of second derivatives of the cost function with respect to the parameters). If this is positive definite everywhere (see appendix C.2.6), then the function is convex and the global minimum will be found. Some of the cost functions in this book are convex, but this is unusual; most optimization problems found in vision do not have this convenient property.
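An illustrative sketch (my addition) of the higher-dimensional convexity test described above, for an assumed quadratic cost whose Hessian is a constant matrix A: the cost is convex when all eigenvalues of A are strictly positive.

```python
import numpy as np

# For f(theta) = 0.5 * theta^T A theta the Hessian is A itself; the cost is
# convex if A is positive definite, i.e. all eigenvalues are strictly positive.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(A)       # eigvalsh: eigenvalues of a symmetric matrix
print(eigenvalues, bool(np.all(eigenvalues > 0)))   # all positive -> convex
```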

B.1.2 Overview of approach

In general, the parameters $\boldsymbol\theta$ over which we search are multidimensional. For example, when $\boldsymbol\theta$ has two dimensions, we can think of the function as a two-dimensional surface (figure B.3). With this in mind, the principles behind the methods we will discuss are simple. We alternately

• choose a search direction s based on the local properties of the function, and

• search to find the minimum along the chosen direction. In other words, we seek the distance $\hat{\lambda}$ to move such that

$$\hat{\lambda} = \underset{\lambda}{\operatorname{argmin}} \left[ f\!\left[ \boldsymbol\theta^{[t]} + \lambda \mathbf{s} \right] \right], \qquad \text{(B.2)}$$

and then set $\boldsymbol\theta^{[t+1]} = \boldsymbol\theta^{[t]} + \hat{\lambda} \mathbf{s}$. This is termed a line search.

We now consider each of these stages in turn.
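A rough sketch of these two alternating stages (my addition): the search direction is taken to be the negative gradient of an assumed toy cost, and the line search over λ is done by a crude grid search rather than a proper one-dimensional minimizer.

```python
import numpy as np

def f(theta):
    # Toy convex cost with minimum at (1.0, -0.5).
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2

def grad(theta):
    # Hand-coded gradient of the toy cost.
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 0.5)])

theta = np.zeros(2)
for _ in range(20):
    s = -grad(theta)                                       # search direction
    lambdas = np.linspace(0.0, 1.0, 101)                   # candidate step lengths
    lam = min(lambdas, key=lambda l: f(theta + l * s))     # crude line search along s
    theta = theta + lam * s                                # theta^[t+1] = theta^[t] + lam * s
print(theta)                                               # approaches (1.0, -0.5)
```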


A bit of basic optimization // the problem

• Minimize or maximize a function $f(\mathbf{x})$ by altering $\mathbf{x}$

• $\min f(\mathbf{x}) \Leftrightarrow \max -f(\mathbf{x})$

• The concept of derivative is fundamental: it tells us how to "scale" a small change in the input in order to obtain the corresponding change in the output


Figure 4.1: An illustration of how the derivatives of a function can be used to follow the function downhill to a minimum. This technique is called gradient descent.

The derivative is therefore useful for minimizing a function because it tells us how to change $x$ in order to make a small improvement in $y$. For example, we know that $f(x - \epsilon\,\mathrm{sign}(f'(x)))$ is less than $f(x)$ for small enough $\epsilon$. We can thus reduce $f(x)$ by moving $x$ in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847a). See Fig. 4.1 for an example of this technique.

When $f'(x) = 0$, the derivative provides no information about which direction to move. Points where $f'(x) = 0$ are known as critical points or stationary points.

A local minimum is a point where $f(x)$ is lower than at all neighboring points, so it is no longer possible to decrease $f(x)$ by making infinitesimal steps. A local maximum is a point where $f(x)$ is higher than at all neighboring points, so it is not possible to increase $f(x)$ by making infinitesimal steps. Some critical points are neither maxima nor minima. These are known as saddle points. See Fig. 4.2 for examples of each type of critical point.

A point that obtains the absolute lowest value of $f(x)$ is a global minimum.


$f(x - \epsilon\,\mathrm{sign}(f'(x))) < f(x)$ for small $\epsilon$


we move in the negative direction of the derivative
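Putting the rule into code, a minimal sketch (my addition): fixed-step 1-D gradient descent on an assumed toy objective, moving $x$ against the sign of the derivative at every step.

```python
# 1-D gradient descent with a fixed step size eps.
f = lambda x: (x - 2.0) ** 2          # toy objective, minimum at x = 2
f_prime = lambda x: 2.0 * (x - 2.0)   # its derivative

x, eps = -1.0, 0.1
for _ in range(100):
    x = x - eps * f_prime(x)          # step in the direction opposite to f'(x)
print(x)                              # close to 2.0
```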

Page 3

A bit of basic optimization // the problem

• Minimize or maximize a function $f(\mathbf{x})$ by altering $\mathbf{x}$

• $\min f(\mathbf{x}) \Leftrightarrow \max -f(\mathbf{x})$


Figure 4.2: Examples of each of the three types of critical points in 1-D. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points, a local maximum, which is higher than the neighboring points, or a saddle point, which has neighbors that are both higher and lower than the point itself. The situation in higher dimensions is qualitatively different, especially for saddle points: see Figures 4.4 and 4.5.
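A tiny sketch (my addition) that classifies a 1-D critical point by the sign of a finite-difference estimate of the second derivative; the test functions and the step size h are illustrative assumptions, and a zero second derivative is reported as inconclusive (in higher dimensions this is where saddle points arise).

```python
def classify_critical_point(f, x0, h=1e-4):
    # Classify a point x0 where f'(x0) = 0 using a central-difference
    # estimate of the second derivative.
    second = (f(x0 + h) - 2.0 * f(x0) + f(x0 - h)) / h ** 2
    if second > 0:
        return "local minimum"
    if second < 0:
        return "local maximum"
    return "inconclusive (flat or saddle-like)"

print(classify_critical_point(lambda x: x ** 2, 0.0))    # local minimum
print(classify_critical_point(lambda x: -x ** 2, 0.0))   # local maximum
print(classify_critical_point(lambda x: x ** 3, 0.0))    # inconclusive
```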

A bit of basic optimization // vector differentiation

• Vector derivative of a scalar function of a D-dimensional vector

• Example: if ...

Page 4

A bit of basic optimization // vector differentiation

• Vector derivative of a scalar function of a D-dimensional vector

where $x_{n0}$ is the first element of the $n$th row of $\mathbf{X}$, i.e., the first element of the $n$th data object, and $x_{n1}$ is the second (we've started numbering from zero to maintain the relationship with $w_0$). Similarly, the second term is equivalent to

Combining these, and noting that in our previous notation $x_{n0} = 1$ and $x_{n1} = x_n$, results in

Recalling our shorthand for the various averages and differentiating with respect to $w_0$ and $w_1$ results in

It is left as an informal exercise to show that these are indeed equivalent to the derivatives obtained from the non-vectorised loss function (Equations 1.7 and 1.6).

Fortunately, there are many standard identities that we can use that enable us to differentiate the vectorised expression directly. Those that we will need are shown in Table 1.4.

TABLE 1.4: Some useful identities when differentiating with respect to a vector.

From these identities, and equating the derivative to zero, we can directly obtain the following:
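The expression referred to above is elided with the slide image; assuming it is the usual least-squares solution $\hat{\mathbf{w}} = (\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{t}$, here is a short sketch (my addition) on synthetic data, cross-checked against the library least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
t = 1.5 + 0.8 * x + rng.normal(scale=0.3, size=n)    # toy data: t = w0 + w1*x + noise

X = np.column_stack([np.ones(n), x])                 # design matrix with a bias column
w_hat = np.linalg.solve(X.T @ X, X.T @ t)            # normal equations (X^T X) w = X^T t
print(w_hat)                                         # close to [1.5, 0.8]
print(np.linalg.lstsq(X, t, rcond=None)[0])          # cross-check with the library routine
```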


Old and New Matrix Algebra Useful for Statistics

Thomas P. Minka, December 28, 2000

Contents: 1 Derivatives; 2 Kronecker product and vec; 3 Vec-transpose; 4 Multilinear forms; 5 Hadamard product and diag; 6 Inverting partitioned matrices; 7 Polar decomposition; 8 Hessians

Warning: This paper contains a large number of matrix identities which cannot be absorbed by mere reading. The reader is encouraged to take time and check each equation by hand and work out the examples. This is advanced material; see Searle (1982) for basic results.

1 Derivatives

Maximum-likelihood problems almost always require derivatives. There are six kinds of derivatives that can be expressed as matrices:

              Scalar y               Vector y                    Matrix Y
  Scalar x    dy/dx                  dy/dx = [∂y_i/∂x]           dY/dx = [∂y_ij/∂x]
  Vector x    dy/dx = [∂y/∂x_j]      dy/dx = [∂y_i/∂x_j]
  Matrix X    dy/dX = [∂y/∂x_ji]

The partials with respect to the numerator are laid out according to the shape of Y, while the partials with respect to the denominator are laid out according to the transpose of X. For example, dy/dx with vector y and scalar x is a column vector, while dy/dx with scalar y and vector x is a row vector (assuming x and y are column vectors; otherwise it is flipped). Each of these derivatives can be tediously computed via partials, but this section shows how they can instead be computed with matrix manipulations. The material is based on Magnus and Neudecker (1988).

Define the differential dy(x) to be that part of y(x + dx) − y(x) which is linear in dx. Unlike the classical definition in terms of limits, this definition applies even when x or y are not scalars.
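A numerical sketch (my addition) of one standard vector-derivative identity in the layout convention above: for $f(\mathbf{w}) = \mathbf{w}^{\mathsf T}\mathbf{C}\mathbf{w}$ the derivative with respect to $\mathbf{w}$ is the row vector $\mathbf{w}^{\mathsf T}(\mathbf{C} + \mathbf{C}^{\mathsf T})$, checked here by finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.normal(size=(3, 3))
w = rng.normal(size=3)

analytic = w @ (C + C.T)                              # identity d(w^T C w)/dw

h = 1e-6
numeric = np.array([((w + h * e) @ C @ (w + h * e) - w @ C @ w) / h
                    for e in np.eye(3)])              # forward finite differences
print(np.allclose(analytic, numeric, atol=1e-4))      # True
```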


Appendix B

Optimization

Throughout this book, we have used iterative nonlinear optimization methods tofind the maximum likelihood or MAP parameter estimates. We now provide moredetails about these methods. It is impossible to do full justice to this topic in thespace available; many entire books have been written about nonlinear optimization.Our goal is merely to provide a brief introduction to the main ideas.

B.1 Problem statement

Continuous nonlinear optimization techniques aim to find the set of parameters ✓that minimize a function f [•]. In other words, they try to compute,

✓ = argmin✓

[f [✓]] , (B.1)

where f [•] is termed a cost function or objective function.Although optimization techniques are usually described in terms of minimizing

a function, most optimization problems in this book involve maximizing an ob-jective function based on log probability. To turn a maximization problem intoa minimization we multiply the objective function by minus one. In other words,instead of maximizing the log probability, we minimize the negative log probability.

B.1.1 Convexity

The optimization techniques that we consider here are iterative: they start with anestimate ✓[0] and improve it by finding successive new estimates ✓[1]

,✓[2], . . . ,✓[1],

each of which is better than the last, until no more improvement can be made.The techniques are purely local in the sense that the decision about where tomove next is based on only the properties of the function at the current position.Consequently, these techniques cannot guarantee the correct solution: they mayfind an estimate ✓[1] from which no local change improves the cost. However, this

Copyright c�2011,2012 by Simon Prince; published by Cambridge University Press 2012.For personal use only, not for distribution.

Appendix B

Optimization

Throughout this book, we have used iterative nonlinear optimization methods tofind the maximum likelihood or MAP parameter estimates. We now provide moredetails about these methods. It is impossible to do full justice to this topic in thespace available; many entire books have been written about nonlinear optimization.Our goal is merely to provide a brief introduction to the main ideas.

B.1 Problem statement

Continuous nonlinear optimization techniques aim to find the set of parameters ✓that minimize a function f [•]. In other words, they try to compute,

✓ = argmin✓

[f [✓]] , (B.1)

where f [•] is termed a cost function or objective function.Although optimization techniques are usually described in terms of minimizing

a function, most optimization problems in this book involve maximizing an ob-jective function based on log probability. To turn a maximization problem intoa minimization we multiply the objective function by minus one. In other words,instead of maximizing the log probability, we minimize the negative log probability.

B.1.1 Convexity

The optimization techniques that we consider here are iterative: they start with anestimate ✓[0] and improve it by finding successive new estimates ✓[1]

,✓[2], . . . ,✓[1],

each of which is better than the last, until no more improvement can be made.The techniques are purely local in the sense that the decision about where tomove next is based on only the properties of the function at the current position.Consequently, these techniques cannot guarantee the correct solution: they mayfind an estimate ✓[1] from which no local change improves the cost. However, this

Copyright c�2011,2012 by Simon Prince; published by Cambridge University Press 2012.For personal use only, not for distribution.

Appendix B

Optimization

Throughout this book, we have used iterative nonlinear optimization methods tofind the maximum likelihood or MAP parameter estimates. We now provide moredetails about these methods. It is impossible to do full justice to this topic in thespace available; many entire books have been written about nonlinear optimization.Our goal is merely to provide a brief introduction to the main ideas.

B.1 Problem statement

Continuous nonlinear optimization techniques aim to find the set of parameters ✓that minimize a function f [•]. In other words, they try to compute,

✓ = argmin✓

[f [✓]] , (B.1)

where f [•] is termed a cost function or objective function.Although optimization techniques are usually described in terms of minimizing

a function, most optimization problems in this book involve maximizing an ob-jective function based on log probability. To turn a maximization problem intoa minimization we multiply the objective function by minus one. In other words,instead of maximizing the log probability, we minimize the negative log probability.

B.1.1 Convexity

The optimization techniques that we consider here are iterative: they start with anestimate ✓[0] and improve it by finding successive new estimates ✓[1]

,✓[2], . . . ,✓[1],

each of which is better than the last, until no more improvement can be made.The techniques are purely local in the sense that the decision about where tomove next is based on only the properties of the function at the current position.Consequently, these techniques cannot guarantee the correct solution: they mayfind an estimate ✓[1] from which no local change improves the cost. However, this

Copyright c�2011,2012 by Simon Prince; published by Cambridge University Press 2012.For personal use only, not for distribution.


Reference: The Matrix Cookbook


directional derivative

A bit of basic optimization // vector differentiation

gradient = the vector of all the partial derivatives


directional derivative

direction in the parameter space


A bit of basic optimization // steepest descent (gradient descent)

• We generalize the one-dimensional reasoning:

• Example: in the parameter space

for small steps

[Figure 4.1: An illustration of how the derivatives of a function can be used to follow the function downhill to a minimum. This technique is called gradient descent.]

The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847a). See Fig. 4.1 for an example of this technique.

When f′(x) = 0, the derivative provides no information about which direction to move. Points where f′(x) = 0 are known as critical points or stationary points.

A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where f(x) is higher than at all neighboring points, so it is not possible to increase f(x) by making infinitesimal steps. Some critical points are neither maxima nor minima; these are known as saddle points. See Fig. 4.2 for examples of each type of critical point.

A point that attains the absolute lowest value of f(x) is a global minimum.

[Figure B.3: Optimization on a two-dimensional function (color represents the height of the function). We wish to find the parameters that minimize the function (green cross). Given an initial starting point θ[0] (blue cross), we choose a direction and then perform a local search to find the optimal point in that direction. a) One way to choose the direction is steepest descent: at each iteration, we head in the direction where the function changes the fastest. b) When we initialize from a different position, the steepest descent method takes many iterations to converge due to oscillatory behavior. c) Close-up of the oscillatory region (see main text). d) Setting the direction using Newton's method results in faster convergence. e) Newton's method does not undergo oscillatory behavior when we initialize from the second position.]

where the derivative ∂f/∂θ is the gradient vector, which points uphill, and λ is the distance moved downhill in the opposite direction −∂f/∂θ. The line search procedure (section B.3) selects the value of λ.

Steepest descent sounds like a good idea but can be very inefficient in certain situations (figure B.3b). For example, in a descending valley, it can oscillate ineffectually from one side to the other rather than proceeding straight down the center: the method approaches the bottom of the valley from one side, but overshoots because the valley itself is descending, so the minimum along the search direction is not exactly in the valley center (figure B.3c). When we re-measure the gradient and perform a second line search, we overshoot in the other direction. This is not an unusual situation: it is guaranteed that the gradient at the new point will be perpendicular to the previous one, so the only way to avoid this oscillation is to hit the valley at exactly right angles.

increases at point x. The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector: the gradient of f is the vector containing all of the partial derivatives, denoted ∇x f(x). Element i of the gradient is the partial derivative of f with respect to x_i. In multiple dimensions, critical points are points where every element of the gradient is equal to zero.

The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, it is the derivative of the function f(x + αu) with respect to α, evaluated at α = 0. Using the chain rule, we can see that this is uᵀ∇x f(x).

To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

min_{u, uᵀu=1} uᵀ∇x f(x) = min_{u, uᵀu=1} ||u||₂ ||∇x f(x)||₂ cos θ,

where θ is the angle between u and the gradient. Substituting ||u||₂ = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when u points in the opposite direction of the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.

Steepest descent proposes a new point

x′ = x − ε ∇x f(x),

where ε is the size of the step. We can choose ε in several different ways. A popular approach is to set ε to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate f(x − ε ∇x f(x)) for several values of ε and choose the one that results in the smallest objective function value. This last strategy is called a line search.

Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running this iterative algorithm and just jump directly to the critical point by solving the equation ∇x f(x) = 0 for x.

Sometimes we need to find all of the partial derivatives of all of the elements of a vector-valued function. The matrix containing all such partial derivatives is known as a Jacobian matrix. Specifically, if we have a function f : ℝᵐ → ℝⁿ, then the Jacobian matrix J ∈ ℝⁿˣᵐ of f is defined such that J_{i,j} = ∂f(x)_i / ∂x_j.

We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, for a function f : ℝⁿ → ℝ, the derivative with respect to x_i of the derivative of f with respect to x_j is denoted ∂²f / (∂x_i ∂x_j).

[Figure B.2: Convex functions. If the function is convex, then the global minimum can be found. A function is convex if no chord (line between two points on the function) intersects the function. The figure shows two example chords (blue dashed lines). The convexity of a function can be established algebraically by considering the matrix of second derivatives: if this is positive definite for all values of θ, then the function is convex.]

B.2 Choosing a search direction

We will describe two general methods for choosing a search direction (steepest descent and Newton's method) and one method which is specialized for least squares problems (the Gauss-Newton method). All of these methods rely on computing derivatives of the function with respect to the parameters at the current position. To this end, we are relying on the function being smooth so that the derivatives are well behaved.

For most models, it is easy to find a closed-form expression for the derivatives. If this is not the case, then an alternative is to approximate them using finite differences. For example, the first derivative of f[•] with respect to the jth element of θ can be approximated by

∂f/∂θ_j ≈ (f[θ + a e_j] − f[θ]) / a,   (B.3)

where a is a small number and e_j is the unit vector in the jth direction. In principle, as a tends to zero this estimate becomes more accurate. However, in practice the calculation is limited by the floating point precision of the computer, so a must be chosen with care.

B.2.1 Steepest descent

An intuitive way to choose the search direction is to measure the gradient and select the direction which moves us downhill fastest. We could move in this direction until the function no longer decreases, then recompute the steepest direction and move again. In this way, we gradually move toward a local minimum of the function (figure B.3a). The algorithm terminates when the gradient is zero and the second derivative is positive, indicating that we are at the minimum point and any further local changes would not result in further improvement. This approach is termed steepest descent. More precisely, we choose

θ[t+1] = θ[t] − λ (∂f/∂θ)|_{θ[t]},   (B.4)
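To make equations (B.3) and (B.4) concrete, here is a minimal Matlab/Octave sketch (not from the slides; the cost function, the step size lambda and the tolerance are illustrative choices) that minimizes a toy two-parameter quadratic by steepest descent, approximating the gradient with the finite-difference formula (B.3) and using a fixed step instead of a full line search:

% Steepest descent with a finite-difference gradient (toy example)
f = @(theta) 3*theta(1)^2 + 7*theta(2)^2;    % simple convex cost function

D      = 2;          % number of parameters
a      = 1e-6;       % finite-difference step, eq. (B.3)
lambda = 0.05;       % fixed step size, used in place of a line search
theta  = [2; 1];     % initial estimate theta[0]

for t = 1:1000
    g = zeros(D,1);                           % approximate gradient, eq. (B.3)
    for j = 1:D
        ej   = zeros(D,1); ej(j) = 1;         % unit vector in the j-th direction
        g(j) = (f(theta + a*ej) - f(theta)) / a;
    end
    theta = theta - lambda * g;               % steepest-descent update, eq. (B.4)
    if norm(g) < 1e-6, break; end             % stop when the gradient (almost) vanishes
end
disp(theta)                                   % approaches the minimizer [0; 0]

With this small fixed step the iterates converge to the minimum; a step that is too large produces the oscillatory behavior discussed around figure B.3.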


for small steps, we move in the direction of the negative gradient

• The vector derivative of a vector-valued function of a D-dimensional vector is defined via the Jacobian matrix

A bit of basic optimization // vector differentiation



• Consider a scalar function

• Its first (partial) derivative

• Its second derivative

A bit of basic optimization // vector differentiation

• The Hessian matrix

A bit of basic optimization // vector differentiation

In a single dimension, we can denote d²f/dx² by f″(x). The second derivative tells us how the first derivative will change as we vary the input. This means it can be useful for determining whether a critical point is a local maximum, a local minimum, or a saddle point. Recall that at a critical point, f′(x) = 0. When f″(x) > 0, this means that f′(x) increases as we move to the right and decreases as we move to the left. This means f′(x − ε) < 0 and f′(x + ε) > 0 for small enough ε. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when f′(x) = 0 and f″(x) > 0, we can conclude that x is a local minimum. Similarly, when f′(x) = 0 and f″(x) < 0, we can conclude that x is a local maximum. This is known as the second derivative test. Unfortunately, when f″(x) = 0, the test is inconclusive: in this case x may be a saddle point or part of a flat region.

In multiple dimensions, we need to examine all of the second derivatives of the function. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix H(f)(x) is defined such that

H(f)(x)_{i,j} = ∂²f(x) / (∂x_i ∂x_j).

Equivalently, the Hessian is the Jacobian of the gradient.

Anywhere that the second partial derivatives are continuous, the differential operators are commutative, i.e. their order can be swapped:

∂²f(x) / (∂x_i ∂x_j) = ∂²f(x) / (∂x_j ∂x_i).

This implies that H_{i,j} = H_{j,i}, so the Hessian matrix is symmetric at such points. Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.

Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where ∇x f(x) = 0, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, a local minimum, or a saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum.
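As a small illustration of the multivariate second derivative test just described (a sketch, not from the slides; the function and the critical point are arbitrary choices), we can classify a critical point by inspecting the eigenvalues of its Hessian:

% Classify the critical point x = (0,0) of f(x) = x1^2 - x2^2 (a saddle)
H = [2  0;
     0 -2];                 % Hessian of f, computed analytically for this example

lam = eig(H);               % real eigenvalues, since H is symmetric
if all(lam > 0)
    disp('local minimum (Hessian positive definite)')
elseif all(lam < 0)
    disp('local maximum (Hessian negative definite)')
elseif any(lam > 0) && any(lam < 0)
    disp('saddle point (indefinite Hessian)')
else
    disp('test inconclusive (some eigenvalues are zero)')
end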



Example // linear regression: the least squares method

• Suppose we have the following data: xtrain

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ttrain

2.4865 -0.3033 -4.0531 -4.3359 -6.1742 -5.6040 -3.5069 -2.3257 -4.6377 -0.2327 -1.9858 1.0284 -2.2640 -0.4508 1.1672 6.6524 4.1452 5.2677 6.3403 9.6264 14.7842

>> figure(1)
>> plot(xtrain,ttrain,'.')

[Figure: the available training data, target t plotted against input x]

Example // linear regression: the least squares method


• Regression problem: given an input vector xᵀ = (x1, x2, . . . , xd), we want to predict the corresponding output by finding a function f such that f(x) is not too far from the true output value (the target t) associated with x

• In the previous case we have a single feature, xᵀ = (x1) = x

• We assume the data can be modelled by a straight line

• We want to solve the following two problems

Model

free parameters

Fit (learning the parameters on the training set)

Prediction (on the test set)

Example // linear regression: the least squares method

• As a measure of the goodness of fit, we define a cost function (loss function) averaged over all the points

Model

Cost function

free parameters

Example // linear regression: the least squares method


Sample Mean Square Error (MSE)

Example // linear regression: the least squares method

• Example of such a cost: the sum of squared residuals (the least squares approach), i.e. the sample Mean Square Error (MSE); a sketch in formulas follows below

Model

Cost function

free parameters

Example // linear regression: the least squares method
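For reference, a minimal sketch in formulas of the model and cost described above (the symbol names w_0, w_1 and the 1/N normalization are assumptions, consistent with the MSE wording on the slide):

f(x_n) = w_0 + w_1 x_n ,
\qquad
L(w_0, w_1) = \frac{1}{N} \sum_{n=1}^{N} \bigl( t_n - (w_0 + w_1 x_n) \bigr)^2 .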


Minimization of the cost function with respect to the parameters

• We want this cost (or error) to be minimal as the parameters vary

Example // linear regression: the least squares method

• We write one equation for each observation

• We obtain a system that can be written in matrix form (see the sketch below)

design matrix (N x 2)

parameter vector (2 x 1)

response vector (N x 1)

Example // linear regression: the least squares method
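A sketch of the matrix form the slide refers to, assuming the linear model f(x_n) = w_0 + w_1 x_n above (one row per observation; the symbols t, X, w are assumptions):

\underbrace{\begin{pmatrix} t_1 \\ \vdots \\ t_N \end{pmatrix}}_{\text{response, } N \times 1}
\approx
\underbrace{\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}}_{\text{design matrix } X,\ N \times 2}
\underbrace{\begin{pmatrix} w_0 \\ w_1 \end{pmatrix}}_{\text{parameters, } 2 \times 1} .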


• Note that the MSE can be written in matrix form, as shown in the sketch below

• We want the parameters that minimize the MSE

Example // linear regression: the least squares method
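A sketch of the matrix form of the MSE mentioned above, assuming the design matrix X and response vector t just defined:

L(w) = \frac{1}{N} (t - Xw)^{\top} (t - Xw),
\qquad
\hat{w} = \operatorname*{argmin}_{w} \; (t - Xw)^{\top} (t - Xw) .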

which leads to the relation

∂ det[Y] / ∂Y = det[Y] Y^(−T).   (C.37)

4. Derivative of log determinant:

∂ log[det[Y]] / ∂Y = Y^(−T).   (C.38)

5. Derivative of inverse:

∂Y^(−1)/∂x = −Y^(−1) (∂Y/∂x) Y^(−1).   (C.39)

6. Derivative of trace:

∂ tr[F[X]] / ∂X = (∂F[X]/∂X)^T.   (C.40)

More information about matrix calculus can be found in Petersen et al. (2006).

C.7 Common problems

In this section, we discuss several standard linear algebra problems that are found repeatedly in computer vision.

C.7.1 Least squares problems

Many inference and learning tasks in computer vision result in least squares problems. The most frequent context is when we use maximum likelihood methods with the normal distribution. The least squares problem may be formulated in a number of ways. We may be asked to find the vector x that solves the system

Ax = b   (C.41)

in a least squares sense. Alternatively, we may be given I smaller sets of equations of the form

A_i x = b_i,   (C.42)

and again asked to solve for x. In this latter case, we form the compound matrix A = [A_1^T, A_2^T, . . . , A_I^T]^T and the compound vector b = [b_1^T, b_2^T, . . . , b_I^T]^T, and the problem is the same as in equation C.41. We may equivalently see the same problem in an explicit least-squares form,

x̂ = argmin_x [(Ax − b)^T (Ax − b)].   (C.43)

mean square error (MSE)

Quadratic forms (recovered excerpt)

When A is an n x n symmetric matrix, the quadratic form Q(x) = xᵀAx is a real-valued function with domain Rⁿ.

[Figure: graphs of four quadratic forms on R²: (a) z = 3x1² + 7x2², (b) z = 3x1², (c) z = 3x1² − 7x2², (d) z = −3x1² − 7x2².]

DEFINITION. A quadratic form Q is:
a. positive definite if Q(x) > 0 for all x ≠ 0;
b. negative definite if Q(x) < 0 for all x ≠ 0;
c. indefinite if Q(x) assumes both positive and negative values.
Also, Q is said to be positive semidefinite if Q(x) ≥ 0 for all x, and negative semidefinite if Q(x) ≤ 0 for all x.

THEOREM (Quadratic Forms and Eigenvalues). Let A be an n x n symmetric matrix. Then the quadratic form xᵀAx is:
a. positive definite if and only if the eigenvalues of A are all positive;
b. negative definite if and only if the eigenvalues of A are all negative;
c. indefinite if and only if A has both positive and negative eigenvalues.

• Applying differentiation with respect to vectors

• that is, setting the gradient of the MSE to zero

• We thus find the stationary point: the fit, expressed through the pseudo-inverse of X

Example // linear regression: the least-squares technique
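A sketch of the step the slide refers to, in the notation used above (the formulas on the original slide are images):

\nabla_w E(w) = -\frac{2}{N}\, X^T (t - Xw) = 0
\;\Rightarrow\; X^T X\, w = X^T t
\;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T t

where (X^T X)^{-1} X^T is the pseudo-inverse of X.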


• The solution can be found in matrix form, via the pseudo-inverse of X

w = inv(X'*X)*X'*t     % solution 1
w = pinv(X)*t          % solution 2
w = X \ t              % solution 3

Example // linear regression: the least-squares technique



training data / test data

figure(1);scatter(xtrain,ttrain,'b','filled');title('Training set')
figure(2);scatter(xtest,ttest,'r','filled');title('Test set')

Example // linear regression: the least-squares technique

• We can vectorize the operations and use the \ operator

%% Fit using Matlab's built-in functions
Xtrain = xtrain; Xtest = xtest;
Xtrain1 = [ones(size(Xtrain,1),1) Xtrain];   % design matrix
w = Xtrain1 \ ttrain                         % Fit

%% Prediction
Xtest1 = [ones(size(Xtest,1),1) Xtest];
t_hat = Xtest1*w;                            % Prediction

Example // linear regression: the least-squares technique


figure(3);scatter(xtest,ttest,'r','filled');hold on;
plot(xtest, t_hat, 'k', 'linewidth', 3);title('Predizione sul test set')

Prediction on the test data

Example // linear regression: the least-squares technique
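A minimal end-to-end sketch of the above, assuming synthetic data (the variable names xtrain, ttrain, xtest, ttest match those used in the slides; the generating line and the noise level below are invented purely for illustration):

% Synthetic data: a noisy straight line (illustrative assumption)
N = 50;
xtrain = rand(N,1); ttrain = 1.5*xtrain + 0.3 + 0.1*randn(N,1);
xtest  = rand(N,1); ttest  = 1.5*xtest  + 0.3 + 0.1*randn(N,1);

% Fit on the training set (design matrix with a column of ones)
Xtrain1 = [ones(N,1) xtrain];
w = Xtrain1 \ ttrain;

% Predict on the test set and measure the test MSE
Xtest1 = [ones(N,1) xtest];
t_hat  = Xtest1*w;
mse_test = mean((ttest - t_hat).^2);
fprintf('Test MSE = %f\n', mse_test);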

Example // Clustering

The clustering problem

• Find coherent structure in the data

• Example: the Old Faithful dataset

• A single Gaussian is insufficient

• Suppose there are 2 clusters

• We want to assign each point to one of the two clusters

• A simple clustering method: k-means

K-means clustering // Example

[Figures: successive iterations of k-means on the Old Faithful data, alternating point assignments and mean updates]

• (Unlabeled) data in a feature space of dimension D

• We hypothesize K possible clusters (this is our model)

• Suppose that for each point n there exists a binary variable that equals 1 if point n belongs to cluster k

• This kind of variable is called an indicator variable, and in general it is a latent variable (we do not observe it)

K-means clustering // Algorithm: assumptions

• We need a measure of internal coherence, or compactness, for the points of a cluster: the total distance of all its points from the cluster centroid

• Number of points assigned to cluster k

• Overall quality of the clustering: the criterion to be optimized (see the sketch below)

K-means clustering // Algorithm: measuring cluster quality
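A sketch of these quantities in standard notation (the slide's own formulas are images; z_{nk} denotes the indicator variable and \mu_k the mean of cluster k):

N_k = \sum_{n=1}^{N} z_{nk}, \qquad
J = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\, \lVert x_n - \mu_k \rVert^2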


• Two sets of parameters: the assignments and the means

• 1. With the means fixed, optimize over the assignments: the criterion is minimized when each point is assigned to the nearest mean

• 2. With the assignments fixed, differentiate with respect to the means and set the derivative to zero

• Iterate until convergence

K-means clustering // Algorithm: optimization
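A sketch of the two alternating updates in the same notation (standard k-means; the slide's formulas are images):

z_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\qquad
\frac{\partial J}{\partial \mu_k} = -2 \sum_n z_{nk} (x_n - \mu_k) = 0
\;\Rightarrow\;
\mu_k = \frac{\sum_n z_{nk}\, x_n}{\sum_n z_{nk}}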

function [M,z,e] = kmeans(X,K,Max_Its)
%KMEANS Simple implementation of k-means clustering
% Input:
%   Data matrix X (N x D)
%   Number of clusters K
%   Maximum number of iterations Max_Its
% Output:
%   Matrix M (K x D) - the K means
%   Vector z (N x 1): z(n)=k indicates the cluster 1..K to which point x_n has been assigned

[N,D] = size(X);   % N - number of points, D - data dimension
I = randperm(N);   % random permutation of the integers 1:N - used
                   % to set the K initial means at random
M = X(I(1:K),:);   % M: K x D matrix of mean values - picked at random
Mo = M;

for t = 1:Max_Its
    % Build the N x K distance matrix: distance of each point X_n from the K centres
    for k = 1:K
        Dist(:,k) = sum((X - repmat(M(k,:),N,1)).^2,2);
    end
    % Minimum distance over the K centres for each point: the cluster assignments
    [~,z] = min(Dist,[],2);
    % Given the assignments, recompute the cluster centres
    for k = 1:K
        if any(z==k)
            M(k,:) = mean(X(z==k,:),1);
        end
    end
    % Build the N x K indicator matrix Z, with elements z_nk
    Z = zeros(N,K);
    for n = 1:N
        Z(n,z(n)) = 1;
    end
    % Current value of the error criterion minimized by k-means
    e = sum(sum(Z.*Dist)./N);
    fprintf('%d Error = %f\n', t, e);
    Mo = M;
end

K-means clustering // Algorithm
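A minimal usage sketch of the function above, on synthetic data (the two Gaussian blobs and the choices K=2, Max_Its=20 are illustrative assumptions):

% Two synthetic 2-D clusters
X = [randn(100,2); randn(100,2) + 4];

% Run the k-means implementation above
[M,z,e] = kmeans(X, 2, 20);

% Plot the points coloured by assignment, and the final means
figure; scatter(X(:,1), X(:,2), 20, z, 'filled'); hold on;
plot(M(:,1), M(:,2), 'kx', 'markersize', 12, 'linewidth', 3);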


K-means clustering // Image segmentation as clustering

• K regions/clusters, e.g. by colour: each point (pixel) is a vector [R G B]

• Example: K=4

• If it is used to encode/compress the image: vector quantization
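A sketch of colour-based segmentation with the k-means function above (the image file name and K=4 are illustrative; the essential step is reshaping the H x W x 3 image into an N x 3 matrix of RGB vectors):

% Load an image and arrange its pixels as an N x 3 matrix of [R G B] vectors
img = imread('peppers.png');              % any RGB image (file name is illustrative)
[H,W,~] = size(img);
X = double(reshape(img, H*W, 3)) / 255;   % N x 3, values in [0,1]

% Cluster the pixels into K colour clusters
K = 4;
[M,z] = kmeans(X, K, 20);

% Replace each pixel with the mean colour of its cluster (vector quantization)
seg = reshape(M(z,:), H, W, 3);
figure; image(seg); axis image;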


• Minimize f(x) subject to (s.t.) the constraint g(x) = 0

• Necessary condition for x0 to be a solution (see the sketch below):

• α: Lagrange multiplier

• Multiple constraints gi(x) = 0, i=1, …, m: one multiplier αi for each constraint

A bit of basic optimization // constrained optimization

• For constraints of the form gi(x) ≤ 0, the Lagrange multiplier αi must be positive

• If x0 is a solution of the problem

• then there exist αi ≥ 0, i=1, …, m, such that x0 satisfies the stationarity condition

• This function is the Lagrangian; its gradient must be set to 0

Lagrangian

A bit of basic optimization // constrained optimization
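A sketch of the conditions the two slides refer to, in standard notation (the formulas on the slides are images):

\nabla f(x_0) + \alpha \nabla g(x_0) = 0 \qquad \text{(single equality constraint)}

L(x, \alpha) = f(x) + \sum_{i=1}^{m} \alpha_i\, g_i(x),
\qquad
\nabla_x L(x_0, \alpha) = 0, \quad \alpha_i \ge 0 \text{ for the constraints } g_i(x) \le 0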


Example // linear regression: regularization

[Figure: the original curve (not known), the observed points t (affected by noise), and the approximating polynomial]

• Learning problem: find the coefficients w (the parameters) given the target points t
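A sketch of the approximating polynomial referred to above, in standard notation (M is the polynomial order; the slide's own formula is an image):

y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j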


Example // linear regression: regularization

Quadratic loss (error) function

[Figures: fits with polynomials of order 0, 1, 3 and 9]

Root-Mean-Square (RMS) Error:


Example // linear regression: regularization

• Introduce a regularization term in the cost function

• Example: a quadratic term that penalizes large coefficients

• This yields a penalized loss function in which a regularization parameter weighs a data-dependent term against a parameter-dependent term (see the sketch below)
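A sketch of the penalized loss in standard notation (the slide's formula is an image; λ is the regularization parameter). The closed-form minimizer given below is the standard ridge-regression solution, written with the design matrix X used earlier; it is stated here as an assumption, not as the slide's own derivation:

\tilde{E}(w) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, w) - t_n \bigr)^2}_{\text{data-dependent}}
\;+\; \underbrace{\frac{\lambda}{2}\, \lVert w \rVert^2}_{\text{parameter-dependent}}
\qquad\Rightarrow\qquad
\hat{w} = (X^T X + \lambda I)^{-1} X^T t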


Example // linear regression: regularization

[Figures: fits obtained with different values of the regularization parameter, compared ("vs") side by side]


Example: dimensionality reduction


Example: dimensionality reduction

• original space, M-dimensional: the data form an (N x M) matrix

• "reduced" space, D-dimensional: an (N x D) matrix, with M > D

• The best projections maximize the variance


Principal component analysis // Intuitive description

• N "whitened" data vectors

• that is, the sample mean has been subtracted from each data vector

• the variance of the projected data becomes:

• Substituting in,

• where C is the empirical covariance matrix
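A sketch of these quantities (the slide's formulas are images; w denotes the projection direction, Y the N x M matrix of centered data vectors y_n):

\mathrm{var}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T y_n)^2 = w^T C\, w,
\qquad
C = \frac{1}{N} \sum_{n=1}^{N} y_n y_n^T = \frac{1}{N}\, Y^T Y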


Example: computing the covariance matrix

Principal component analysis // Intuitive description

• The problem is to find w so as to maximize the variance

• we do this under the orthogonality (unit-norm) constraint

• an eigenvector equation is obtained (see the sketch below)

• The eigenvector / eigenvalue pair = projection direction / variance

• orthogonal directions of maximum variance
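A sketch of the constrained maximization, using the Lagrange-multiplier machinery recalled earlier (the slide's formulas are images):

\max_w\; w^T C w \quad \text{s.t. } w^T w = 1
\;\Rightarrow\;
L(w, \lambda) = w^T C w - \lambda (w^T w - 1),
\quad
\nabla_w L = 2Cw - 2\lambda w = 0
\;\Rightarrow\;
C w = \lambda w, \qquad w^T C w = \lambda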


Determining the principal components

Principal component analysis // Intuitive description

% Mean subtraction
Y = Y - repmat(mean(Y,1),N,1);            % Y: (N x M)

% Compute the covariance matrix
C = (1/N)*Y'*Y;                           % C: (M x M)

% Find eigenvectors / eigenvalues
% the columns of w correspond to the projection directions
[w,lambda] = eig(C);

% eig does not guarantee ordering: sort by decreasing eigenvalue (variance)
[~,idx] = sort(diag(lambda),'descend');
w = w(:,idx);

% Project the data onto the first D dimensions
X = Y*w(:,1:D);                           % X: (N x D) = (N x M)(M x D)

Reconstruction or back-projection

Principal component analysis // Intuitive description

• The reduced data X (N x D) are mapped back to the original space as a reconstruction (N x M) = (N x D)(D x M) (see the sketch below)
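A minimal sketch of the back-projection, continuing the code above (Yrec and mu are names introduced here for illustration):

% Back-projection of the reduced data onto the original space
Yrec = X*w(:,1:D)';                       % (N x D)(D x M) -> (N x M)
% If the mean had been stored before centring (say in mu, 1 x M), add it back:
% Yrec = Yrec + repmat(mu,N,1);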


Principal component analysis // Examples

• Input data D=

• Eigenvectors (D=48)

[Figure panels: D=10 and D=100]

recall the result obtained with SVD


Principal component analysis // SVD version

• We can restate the problem in terms of the SVD: using the symmetry of C, its definition, and the SVD of a suitably rescaled data matrix, one identifies the principal components and the corresponding variances (see the sketch below)
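A sketch of the restatement, in the notation used earlier (1/N normalization as above; the pca2 code later in these notes uses 1/(N-1) instead, which only rescales the variances):

\text{Define } Z = \frac{1}{\sqrt{N}}\, Y \;\Rightarrow\; C = Z^T Z.
\quad
\text{With the SVD } Z = U S V^T:\;
C = V S U^T U S V^T = V S^2 V^T

so the columns of V (the right singular vectors) are the principal components, and the squared singular values are the variances.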

a symmetric matrix has the special property that all of its eigenvectors are not just linearly independent but also orthogonal, thus completing our proof.

In the first part of the proof, let A be just some matrix, not necessarily symmetric, and let it have independent eigenvectors (i.e. no degeneracy). Furthermore, let E = [e1 e2 . . . en] be the matrix of eigenvectors placed in the columns. Let D be a diagonal matrix where the i-th eigenvalue is placed in the ii-th position.

We will now show that AE = ED. We can examine the columns of the right-hand and left-hand sides of the equation.

Left-hand side:  AE = [Ae1 Ae2 . . . Aen]
Right-hand side: ED = [λ1e1 λ2e2 . . . λnen]

Evidently, if AE = ED then Aei = λiei for all i. This equation is the definition of the eigenvalue equation. Therefore, it must be that AE = ED. A little rearrangement provides A = EDE⁻¹, completing the first part of the proof.

For the second part of the proof, we show that a symmetric matrix always has orthogonal eigenvectors. For some symmetric matrix, let λ1 and λ2 be distinct eigenvalues for eigenvectors e1 and e2.

λ1 e1 · e2 = (λ1 e1)^T e2 = (A e1)^T e2 = e1^T A^T e2 = e1^T A e2 = e1^T (λ2 e2)

λ1 e1 · e2 = λ2 e1 · e2

By the last relation we can conclude that (λ1 − λ2) e1 · e2 = 0. Since we have assumed that the eigenvalues are distinct, it must be the case that e1 · e2 = 0. Therefore, the eigenvectors of a symmetric matrix are orthogonal.

Let us back up now to our original postulate that A is a symmetric matrix. By the second part of the proof, we know that the eigenvectors of A are all orthonormal (we choose the eigenvectors to be normalized). This means that E is an orthogonal matrix, so by theorem 1, E^T = E⁻¹, and we can rewrite the final result:

A = E D E^T

Thus, a symmetric matrix is diagonalized by a matrix of its eigenvectors.

5. For any arbitrary m × n matrix X, the symmetric matrix X^T X has a set of orthonormal eigenvectors {v1, v2, . . . , vn} and a set of associated eigenvalues {λ1, λ2, . . . , λn}. The set of vectors {Xv1, Xv2, . . . , Xvn} then forms an orthogonal basis, where each vector Xvi is of length √λi.

All of these properties arise from the dot product of any two vectors from this set.

(Xvi) · (Xvj) = (Xvi)^T (Xvj) = vi^T X^T X vj = vi^T (λj vj) = λj vi · vj = λj δij

The last relation arises because the set of eigenvectors of X^T X is orthogonal, resulting in the Kronecker delta. In simpler terms, the last relation states:

(Xvi) · (Xvj) = λj if i = j, and 0 if i ≠ j.

This equation states that any two vectors in the set are orthogonal.

The second property arises from the above equation by realizing that the length squared of each vector is defined as:

∥Xvi∥² = (Xvi) · (Xvi) = λi

APPENDIX B: Code

This code is written for Matlab 6.5 (Release 13) from Mathworks (http://www.mathworks.com). The code is not computationally efficient but explanatory (terse comments begin with a %).

This first version follows Section 5 by examining the covariance of the data set.

function [signals,PC,V] = pca1(data)
% PCA1: Perform PCA using covariance.
%   data    - MxN matrix of input data (M dimensions, N trials)
%   signals - MxN matrix of projected data
%   PC      - each column is a PC
%   V       - Mx1 matrix of variances

[M,N] = size(data);

% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);

% calculate the covariance matrix
covariance = 1 / (N-1) * data * data';

% find the eigenvectors and eigenvalues
[PC, V] = eig(covariance);

% extract diagonal of matrix as vector
V = diag(V);

% sort the variances in decreasing order
[junk, rindices] = sort(-1*V);
V = V(rindices);
PC = PC(:,rindices);

% project the original data set
signals = PC' * data;

This second version follows section 6, computing PCA through SVD.

function [signals,PC,V] = pca2(data)
% PCA2: Perform PCA using SVD.
%   data    - MxN matrix of input data (M dimensions, N trials)
%   signals - MxN matrix of projected data
%   PC      - each column is a PC
%   V       - Mx1 matrix of variances

[M,N] = size(data);

% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);

% construct the matrix Y
Y = data' / sqrt(N-1);

% SVD does it all
[u,S,PC] = svd(Y);

% calculate the variances
S = diag(S);
V = S .* S;

% project the original data
signals = PC' * data;

