
Pattern Recognition 2020: Introduction

Ad Feelders

Universiteit Utrecht


About the Course

Lecturers: Zerrin Yumak and Ad Feelders

Teaching Assistants: Ali Katsheh and Jiayuan Hu

Course info: http://www.cs.uu.nl/docs/vakken/mpr/


About the Course

Part I (Ad Feelders): Introduction to statistical machine learning.

General principles of data analysis: overfitting, the bias-variance trade-off, model selection, regularization, the curse of dimensionality.

Linear statistical models for regression and classification.

Clustering and unsupervised learning.

Support vector machines.

Required literature: C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.


About the Course

Part II (Zerrin Yumak): Neural networks and deep learning.

Feed-forward neural networks.

Convolutional neural networks.

Recurrent neural networks.

Recommended reading:


About the Course

Practical assignment: analysis of handwritten digit data in R or Python (teams of 2 students).

Deep learning project: subject of own choice (teams of 5 students).

Online lectures in MS Teams (Wednesday and Friday).

Online support for practical assignment and deep learning project in MS Teams (Friday after the lecture, starting next week).

Grading:

Practical assignment (20%)

Deep learning project (40%)

Written exam (40%)


What is statistical pattern recognition?

The field of pattern recognition/machine learning is concerned with the automatic discovery of regularities in data through the use of computer algorithms, and with the use of these regularities to take actions such as classifying the data into different categories.

(Bishop, page 1)

(Figure: examples of handwritten digits as 28 × 28 pixel images.)


Machine Learning Approach

Use training data D = {(x1, t1), …, (xN, tN)}

of N labeled examples, and fit a model to the training data.

This model can subsequently be used to predict the class (digit) for new input vectors x.

The ability to correctly categorize new examples is called generalization.


Types of Learning Problems

Supervised Learning

Numeric target: regression.
Discrete unordered target: classification.
Discrete ordered target: ordinal classification; ranking.
…

Unsupervised Learning

Clustering.
Density estimation.
…


Example: Polynomial Curve Fitting

(Figure: training data generated from the green curve sin(2πx), for x ∈ [0, 1] and t ∈ [−1, 1].)

t = sin(2πx) + ε, with ε ∼ N(µ = 0, σ = 0.3).


Polynomial Curve Fitting

Fit a model:

y(x, w) = w0 + w1x + w2x² + … + wMx^M = ∑_{j=0}^{M} wj x^j    (1.1)

A linear function of the coefficients w = (w0, w1, …, wM).

The coefficients (or weights) w are estimated (or learned) from the data.

PS: equation numbers refer to the book of Bishop.


Error Function

We choose those values for w that minimize the sum of squared errors

E(w) = (1/2) ∑_{n=1}^{N} {y(xn, w) − tn}²    (1.2)

Why square the difference between predicted and true value?
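
For concreteness, here is a minimal sketch in Python of fitting the polynomial (1.1) by minimizing (1.2); the sample size N = 10, the seed, and the order M = 3 are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

# Generate data as on the earlier slide: t = sin(2*pi*x) + eps, eps ~ N(0, 0.3).
rng = np.random.default_rng(0)
N, M = 10, 3
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)

X = np.vander(x, M + 1, increasing=True)   # design matrix: columns 1, x, ..., x^M
w, *_ = np.linalg.lstsq(X, t, rcond=None)  # least-squares estimate of w

E = 0.5 * np.sum((X @ w - t) ** 2)         # sum-of-squares error E(w), eq. (1.2)
print("w =", w, " E(w) =", E)
```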


Error Function

(Figure: E(w) sums the squares of the vertical displacements tn − y(xn, w) of each data point from the fitted curve.)


Curves Fitted with Least Squares (in red)

(Figure: polynomials of order M = 0, 1, 3 and 9 (in red), fitted by least squares to the data, shown against the green generating curve.)


Magnitude of Coefficients

        M = 0     M = 1     M = 3          M = 9
w⋆0      0.19      0.82      0.31           0.35
w⋆1               -1.27      7.99         232.37
w⋆2                        -25.43       -5321.83
w⋆3                         17.37       48568.31
w⋆4                                   -231639.30
w⋆5                                    640042.26
w⋆6                                  -1061800.52
w⋆7                                   1042400.18
w⋆8                                   -557682.99
w⋆9                                    125201.43


Training and Test Error

(Figure: training and test ERMS plotted against polynomial order M = 0, …, 9.)

ERMS = √(2E(w⋆)/N)    (1.3)
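
A sketch of how the curve behind this figure can be reproduced, reusing the data generator from the earlier sketch; the test-set size of 100 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(N):
    x = rng.uniform(0.0, 1.0, N)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)

def e_rms(x, t, w):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))   # equals sqrt(2 E(w*)/N), eq. (1.3)

x_tr, t_tr = sample(10)    # small training set
x_te, t_te = sample(100)   # independent test set
for M in range(10):
    X = np.vander(x_tr, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, t_tr, rcond=None)
    print(M, round(e_rms(x_tr, t_tr, w), 3), round(e_rms(x_te, t_te, w), 3))
```

Training error keeps falling as M grows, while test error eventually rises: overfitting.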


Overfitting and Sample Size

(Figure: the M = 9 polynomial fitted to N = 15 (left) and N = 100 (right) data points.)

The red curve (M = 9) is much smoother for N = 100 than for N = 15. It is also closer to the true (green) curve.


Regularization

Adjusted error function

E(w) = (1/2) ∑_{n=1}^{N} {y(xn, w) − tn}² + (λ/2)‖w‖²    (1.4)

with ‖w‖² = wᵀw = w0² + w1² + … + wM².

Shrink coefficients towards zero.

Ridge regression

Neural networks: weight decay
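
A sketch of ridge regression for the adjusted error function (1.4): setting the gradient of (1.4) to zero gives the closed-form solution below. Note that, matching the definition of ‖w‖² on this slide, w0 is penalized along with the other coefficients.

```python
import numpy as np

def ridge_fit(x, t, M, lam):
    """Minimize eq. (1.4); the solution satisfies (X^T X + lam I) w = X^T t."""
    X = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ t)

# Larger lambda shrinks the coefficients towards zero, e.g. compare
# ridge_fit(x, t, 9, np.exp(-18)) with ridge_fit(x, t, 9, 1.0).
```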


Magnitude of Coefficients (M = 9)

        ln λ = −∞    ln λ = −18    ln λ = 0
w⋆0          0.35          0.35        0.13
w⋆1        232.37          4.74       -0.05
w⋆2      -5321.83         -0.77       -0.06
w⋆3      48568.31        -31.97       -0.05
w⋆4    -231639.30         -3.89       -0.03
w⋆5     640042.26         55.28       -0.02
w⋆6   -1061800.52         41.32       -0.01
w⋆7    1042400.18        -45.95       -0.00
w⋆8    -557682.99        -91.53        0.00
w⋆9     125201.43         72.68        0.01


Fitted Curves for M = 9, λ ≈ 10⁻⁸ and λ = 1.

(Figure: the regularized M = 9 fit for ln λ = −18 (left) and ln λ = 0 (right).)


RMSE versus ln λ for M = 9

(Figure: training and test ERMS as a function of ln λ, for ln λ between −35 and −20.)


Probability distribution and likelihood function

Binomial distribution with parameters N and π:

p(t) = (N choose t) π^t (1 − π)^(N−t)

Binomial distribution with N = 10 and π = 0.7:

p(t) = (10 choose t) 0.7^t 0.3^(10−t)

Probability of observing t = 8: (10 choose 8) 0.7⁸ 0.3² ≈ 0.234

Likelihood function if we observe 7 heads in 10 trials:

L(π | t = 7) = (10 choose 7) π⁷ (1 − π)³
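
These numbers are easy to check with scipy.stats.binom; viewed as a function of the success probability for fixed t, the same pmf gives the likelihood.

```python
from scipy.stats import binom

# Probability of t = 8 heads in N = 10 flips with pi = 0.7:
print(binom.pmf(8, n=10, p=0.7))   # ~0.2335, i.e. approx. 0.234

# Likelihood of pi given t = 7 heads in 10 trials: the pmf as a function of p.
# This reproduces the row t = 7 of the table on the next slides.
for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(pi, round(binom.pmf(7, n=10, p=pi), 3))
```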


Probability and Likelihood

  t     π = 0.1   π = 0.3   π = 0.5   π = 0.7   π = 0.9
  0      .349      .028      .001
  1      .387      .121      .010
  2      .194      .234      .044      .002
  3      .057      .267      .117      .009
  4      .011      .200      .205      .036
  5      .002      .103      .246      .103      .002
  6                .036      .205      .200      .011
  7                .009      .117      .267      .057
  8                .002      .044      .234      .194
  9                          .010      .121      .387
 10                          .001      .028      .349
sum        1         1         1         1         1

Probability distribution for π = 0.7 and N = 10: read down the column π = 0.7.


Probability and Likelihood

(The same table as on the previous slide.)

Likelihood function for observing t = 7 in 10 trials: read across the row t = 7.


Likelihood function

Let t = (t1, …, tN)

be N independent observations, all from the same probability distribution

p(t | θ),

where θ is the parameter vector of p (e.g. θ = (µ, σ) for the normal distribution). Then

L(θ | t) ∝ ∏_{n=1}^{N} p(tn | θ)

is the likelihood function for t.

Maximum likelihood estimation: find the value θML that maximizes L, i.e. the θML such that the observed t are more likely to have come from p(t | θML) than from p(t | θ) for any other value of θ.


Maximum Likelihood Estimation

Take the derivatives of L with respect to the components of θ and equate them to zero (normal equations):

∂L/∂θj = 0

Solve for the θj (and check the second-order condition).

Maximizing the loglikelihood function ℓ is often easier:

ℓ(θ | t) = ln L(θ | t) = ln { ∏_{n=1}^{N} p(tn | θ) } = ∑_{n=1}^{N} ln p(tn | θ)

since ln ab = ln a + ln b.

This is allowed because the ln function is strictly increasing on (0, ∞).


Likelihood function

Likelihood function for 7 heads out of 10 coin flips:

(Figure: L(π) plotted for π ∈ [0, 1]; the curve peaks at π = 0.7.)


Example: coin flipping

Random variable t with t = 1 if heads comes up, and t = 0 if tails comes up. π = P(t = 1). Probability distribution for one coin flip:

p(t) = π^t (1 − π)^(1−t)

Sequence of N coin flips:

p(t) = p(t1, t2, …, tN) = ∏_{n=1}^{N} π^tn (1 − π)^(1−tn)

which defines the likelihood when viewed as a function of π. The loglikelihood function consequently becomes

ℓ(π | t) = ∑_{n=1}^{N} tn ln π + (1 − tn) ln(1 − π)

since ln a^b = b ln a.


Example: coin flipping (continued)

In a sequence of 10 coin flips in which heads comes up seven times, we obtain

ℓ(π) = ln(π⁷(1 − π)³) = 7 ln π + 3 ln(1 − π)

To determine the maximum we take the derivative with respect to π, equate it to zero, and solve for π:

dℓ/dπ = 7/π − 3/(1 − π) = 0

which yields the maximum likelihood estimate πML = 0.7.

Note: d(ln x)/dx = 1/x.
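
The maximization can also be checked numerically; minimize_scalar and the bounds below are implementation choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

t = np.array([1] * 7 + [0] * 3)   # 7 heads in 10 flips

def neg_loglik(pi):
    # negative of l(pi | t) = sum_n t_n ln(pi) + (1 - t_n) ln(1 - pi)
    return -np.sum(t * np.log(pi) + (1 - t) * np.log(1 - pi))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # ~0.7, agreeing with the closed-form pi_ML = t.mean()
```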


Model Selection

Cross-Validation

(Figure: 4-fold cross-validation; each of the four runs holds out a different fold for validation and trains on the rest.)

Score = Quality of Fit − Complexity Penalty

For example: AIC = ln p(D | wML) − M,

where ln p(D | wML) is the maximized loglikelihood and M is the number of parameters of the fitted model.
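
A sketch of choosing the polynomial order M by k-fold cross-validation, in the spirit of the figure; the 4 folds and the reuse of the earlier polynomial-fitting setup are assumptions.

```python
import numpy as np

def cv_rms(x, t, M, folds=4, seed=0):
    """Cross-validated RMS error of an order-M polynomial fit."""
    idx = np.random.default_rng(seed).permutation(len(x))
    errs = []
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)                     # all points not in this fold
        X = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X, t[train], rcond=None)    # fit on the training folds
        y = np.vander(x[fold], M + 1, increasing=True) @ w  # predict the held-out fold
        errs.append(np.sqrt(np.mean((y - t[fold]) ** 2)))
    return float(np.mean(errs))

# Pick the M with the lowest cross-validated error, e.g.:
# best_M = min(range(10), key=lambda M: cv_rms(x, t, M))
```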


The Curse of Dimensionality

(Figure: scatterplot of labeled training points in a two-dimensional input space.)

Predict class of ×.


The Curse of Dimensionality

(Figure: the same scatterplot, with the input space divided into regular cells.)

Predict class of ×.


The Curse of Dimensionality

(Figure: a regular grid over the input space for D = 1, 2 and 3 input variables.)

The number of rectangles grows exponentially with D. If D is large, most rectangles will be empty (contain no data).
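
A quick numeric illustration; ten cells per axis and N = 1000 points are arbitrary assumptions.

```python
# With c cells per input variable, the grid has c**D cells in D dimensions,
# but a fixed sample of N points can occupy at most N of them.
N, c = 1000, 10
for D in (1, 2, 3, 5, 10):
    cells = c ** D
    print(D, cells, min(1.0, N / cells))   # upper bound on the occupied fraction
```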


Decision Theory

Suppose we have to make a decision in a situation involving uncertainty. Two steps:

1. Inference: learn p(x, t) from data. This problem is the main subject of this course.

2. Decision: given this estimate of p(x, t), determine the optimal decision. Relatively straightforward.


Decision Theory: Example

Predict whether patient has cancer from X-ray image.

Let t = 1 denote that cancer is present. Then knowledge of

p(t = 1 | x) = p(x | t = 1) p(t = 1) / p(x)

would allow us to make optimal predictions of t from x (given an appropriate loss function).


Loss Functions for Classification

Suppose we know p(x, t).
Task: given a value for x, predict the class label t.
Lkj: loss of predicting class j when the true class is k.
K: number of classes.

To minimize the expected loss, predict the class j that minimizes

∑_{k=1}^{K} Lkj p(t = k | x),    (1.81)

where j ∈ {1, …, K}.


Minimizing the Misclassification Rate

To minimize the probability of misclassification, we take (0/1 loss)

Lkj = 0 if j = k, and Lkj = 1 otherwise.

The minimum of

∑_{k=1}^{K} Lkj p(t = k | x) = ∑_{k≠j} p(t = k | x) = 1 − p(t = j | x)

is now achieved if we assign x to the class j for which p(t = j | x) is maximal.


Minimizing Expected Loss

Example loss matrix for prediction of cancer:

Loss matrix Lkj (rows: true class k; columns: predicted class j):

         j = 0   j = 1
k = 0      0       1
k = 1     10       0

Suppose p(t = 0) = 0.8 and p(t = 1) = 0.2.

The expected loss of predicting “no cancer present” is:

L00 × p(t = 0) + L10 × p(t = 1) = 0 × 0.8 + 10 × 0.2 = 2

The expected loss of predicting “cancer present” is:

L01 × p(t = 0) + L11 × p(t = 1) = 1 × 0.8 + 0 × 0.2 = 0.8

Even though the probability of cancer is “only” 0.2, the expected loss is minimized if we predict (act as if) cancer is present.
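
The expected-loss computation on this slide takes only a few lines; the array layout (rows = true class, columns = predicted class) mirrors the loss matrix above.

```python
import numpy as np

L = np.array([[0, 1],     # true class k = 0 (no cancer)
              [10, 0]])   # true class k = 1 (cancer)
p = np.array([0.8, 0.2])  # p(t = 0), p(t = 1)

expected_loss = p @ L     # expected loss of each prediction j, as in (1.81)
print(expected_loss)      # [2.0, 0.8]
print(expected_loss.argmin())   # 1: predict "cancer present"
```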


Properties of Expectation and Variance

Some useful properties:

1. E[c] = c for constant c.

2. E[cx] = c E[x].

3. E[x ± y] = E[x] ± E[y].

4. var[c] = 0 for constant c.

5. var[cx] = c² var[x].

6. var[x ± y] = var[x] + var[y] if x and y are independent.


Loss function for regression

Let t0 be a random draw from p(t | x0), and suppose we predict the value of t0 by some number y0 = y(x0). The expected squared prediction error is

E[(y0 − t0)²] = E[y0² − 2y0t0 + t0²] = y0² − 2y0 E[t0] + E[t0²],

where the expectation is taken with respect to p(t | x0).

To minimize this expression we solve

d(y0² − 2y0 E[t0] + E[t0²])/dy0 = 2y0 − 2E[t0] = 0,

which gives y0 = E[t0]. Conclusion: predict the expected value (mean)!


Minimizing expected squared prediction error

Since this reasoning applies to any value of x we might pick, we have that

y(x) = Et[t | x]    (1.89)

minimizes the expected squared prediction error.

The function Et[t | x] is called the (population) regression function.


Population Regression Function

(Figure: the regression function y(x), with the conditional distribution p(t|x0) and the prediction y(x0) shown at a particular input x0.)


Question

We have derived the result that

y(x) = Et[t | x]    (1.89)

minimizes the expected squared prediction error.

How could we use this result to construct a prediction rule y(x) from a finite data sample

D = {(x1, t1), …, (xN, tN)}?


Simple approach to regression?

(Figure: y(x0) as the mean of the targets of all training observations at x = x0.)

Predict the mean of the target values of all training observations with x = x0.


Simple approach to regression?

(Figure: y(x0) as the mean of the targets of the training observations nearest to x0.)

Predict the mean of the target values of training observations with x-value closest to x0.


Nearest-neighbor functions

Consider a regression problem with input variable x and output variable t:

for each input value x, we define a neighborhood Nk(x) containing the indices n of the k points (xn, tn) from the training data that are closest to x;

from the neighborhood function Nk(x), we construct the function

yk(x) = (1/k) ∑_{n∈Nk(x)} tn

The function yk(x) is called the k-nearest neighbor function.


An example learning problem

In a clinical study of risk factors for cardiovascular disease,

the independent variable x is a patient’s waist circumference;

the dependent variable t is a patient’s deep abdominal adipose tissue.

The researchers want to predict the amount of deep abdominal adiposetissue from a simple measurement of waist circumference.


Scatterplot of the data

For learning the relationship between x and t, measurements (xn, tn) on 109 men between 18 and 42 years of age are available:

(Figure: scatterplot of deep abdominal adipose tissue (t) against waist circumference (x).)


An example

We consider eight (consecutive) points (xn, tn) from the clinical study of risk factors for cardiovascular disease:

1. (68.85, 55.78)    5. (73.10, 38.21)
2. (71.85, 21.68)    6. (73.20, 32.22)
3. (71.90, 28.32)    7. (73.80, 43.35)
4. (72.60, 25.89)    8. (74.15, 33.41)

(Figure: scatterplot of the eight points.)


The example (continued)

With k = 2, the neighborhood of x = 73.00 equals

N2(x = 73.00) = {5, 6}

and we find

y2(x = 73.00) = (38.21 + 32.22)/2 = 35.215

With k = 5, we find y5(x = 73.00) = 33.598.
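
A sketch that reproduces these numbers from the eight points above; the function name y_knn is ours.

```python
import numpy as np

# The eight (x, t) points from the previous slide.
x = np.array([68.85, 71.85, 71.90, 72.60, 73.10, 73.20, 73.80, 74.15])
t = np.array([55.78, 21.68, 28.32, 25.89, 38.21, 32.22, 43.35, 33.41])

def y_knn(x0, k):
    """k-nearest-neighbor function: mean target over the k inputs closest to x0."""
    neighborhood = np.argsort(np.abs(x - x0))[:k]   # indices in N_k(x0)
    return t[neighborhood].mean()

print(y_knn(73.00, 2))   # 35.215
print(y_knn(73.00, 5))   # 33.598
```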


The example continued

With k = 2 and Euclidean distance, the following k-nearest neighbor function is constructed from the training data:

(Figure: the k-nearest-neighbor function for k = 2, plotted over the full range of waist circumference.)


The example continued

With k = 20 and Euclidean distance, the following k-nearest neighbor function is constructed from the training data:

(Figure: the k-nearest-neighbor function for k = 20.)


kNN: going to the extremes

(Figure: the k-nearest-neighbor function for k = 1 (left) and k = 109 (right).)


The idea of k-nearest neighbor

We recall that, for a regression problem, the best prediction for the output variable t at the input value x is the mean E[t | x]:

the nearest-neighbor function approximates the mean by averaging over the training data;

the nearest-neighbor function relaxes conditioning at a specific input value to the neighborhood of that value.

The nearest-neighbor function thus implements the idea of selecting the means for prediction directly.