Machine Learning Concepts


Professor Marie Roch, San Diego State University

Basic definitions & concepts

• Task – What the system is supposed to do, e.g. ASR: f(input speech) → list of words

• Performance measure – How well does it work?
• Experience – How does the machine learn the task?


Types of experiences: how a learner learns…

• Supervised learning – Learn class-conditional distributions $P(x \mid \omega)$; implies class labels are known

• Unsupervised learning – No labels are provided, learn P(x) and possibly group x’s into clusters.

• Reinforcement learning – Learner actions are associated with payouts for actions in environment.


Experience data set: the design matrix

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,D} \\ x_{2,1} & & & & \\ x_{3,1} & & & & \\ \vdots & & & \ddots & \\ x_{N,1} & x_{N,2} & x_{N,3} & \cdots & x_{N,D} \end{bmatrix} \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}$$

Each row of X is a feature vector; each entry of y is the corresponding label (supervised problems).

Learning sets of functions may be used if some features can be missing, e.g. a function f0 trained on all features, f1 trained for when feature 1 is missing, etc.
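As a concrete illustration (the values below are invented), a design matrix and label vector might be built in NumPy like this:

```python
import numpy as np

# A design matrix with N=4 examples and D=3 features, plus labels y.
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 2.9, 4.3],
              [5.9, 3.0, 5.1]])   # each row is one feature vector
y = np.array([0, 0, 1, 1])        # one label per row (supervised problem)

print(X.shape)                    # (4, 3): N=4 examples, D=3 features
```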

A Cardinal Rule of Machine Learning

THOU SHALT NOT TEST ON THY TRAINING DATA

Performance

• A metric that measures how well a learner is able to accomplish the task

• Metrics can vary significantly (more on these later); examples:
– loss functions such as squared error
– cross entropy


Partitioning data

• Training data – experience for the learner
• Test data – performance measurement
• Evaluation data
– Only used once all adjustments are made
– It is common to loop: train → test → adjust → train …

• Adjustments are a form of training (see 5.3)

• Evaluation provides an independent test
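A minimal sketch of such a partition using scikit-learn's train_test_split (the data here are synthetic; the 80/20 split is an arbitrary but common choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)              # 100 examples, 5 features (synthetic)
y = np.random.randint(0, 2, size=100)   # synthetic binary labels

# Hold out 20% as a test set; an evaluation set could be split off the same way.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```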

Regression

Given a set of features and observed responses, learn to predict the response for new feature vectors.


[Figure: IPCC 4th Assessment Report, Climate Change 2007 – observed changes in the number of warm days/nights with extreme temperatures (relative to the 1961-1990 norm).]

Linear regression: a simple learning algorithm

Predict response from data

• $\hat{y} = w^T x$, where $w, x \in \mathbb{R}^N$ and $\hat{y} \in \mathbb{R}$
• w is the weight vector
• Goal: maximize performance on the test set

Learn w to minimize some criterion, e.g. mean squared error (MSE):

$$\mathrm{MSE}_{\text{test}} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 \equiv \frac{1}{m}\left\lVert \hat{y}^{\text{test}} - y^{\text{test}} \right\rVert_2^2$$

$\lVert x \rVert_p$ denotes the $L^p$ norm $\left(\sum_i \lvert x_i \rvert^p\right)^{1/p}$; read Goodfellow et al. 2.5.

Linear regression

• Cannot estimate w from the test data
• Use the training data
• Minimize:

$$\mathrm{MSE}_{\text{train}} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 \equiv \frac{1}{m}\left\lVert \hat{y}^{\text{train}} - y^{\text{train}} \right\rVert_2^2$$

For convenience, we will usually omit the train subscript when describing training.

Linear regression

MSE minimized when gradient is zero

$$\begin{aligned}
\nabla_w \mathrm{MSE} &= 0 \\
\nabla_w \frac{1}{m}\left\lVert \hat{y} - y \right\rVert_2^2 &= 0 \\
\nabla_w \left\lVert \hat{y} - y \right\rVert_2^2 &= 0 \\
\nabla_w \left\lVert Xw - y \right\rVert_2^2 &= 0 \qquad \text{as } \hat{y} = Xw
\end{aligned}$$

Linear regression

$$\begin{aligned}
\nabla_w \left\lVert Xw - y \right\rVert_2^2 &= 0 \\
\nabla_w (Xw - y)^T (Xw - y) &= 0 && L^2 \text{ norm in matrix notation} \\
\nabla_w (w^T X^T - y^T)(Xw - y) &= 0 && \text{transpose distributive over addition, } (AB)^T = B^T A^T \text{ (Goodfellow et al. eqn 2.9)} \\
\nabla_w \left(w^T X^T X w - w^T X^T y - y^T X w + y^T y\right) &= 0 \\
\nabla_w \left(w^T X^T X w - 2\, w^T X^T y + y^T y\right) &= 0 && \text{as } y^T X w = (y^T X w)^T = w^T X^T y \text{, a scalar}
\end{aligned}$$

Linear regression

$$\begin{aligned}
\nabla_w \left(w^T X^T X w - 2\, w^T X^T y + y^T y\right) &= 0 \\
\nabla_w \left(w^T X^T X w - 2\, w^T X^T y\right) &= 0 && \text{as } y^T y \text{ is independent of } w \\
2\, X^T X w - 2\, X^T y &= 0 && \text{taking the derivative} \\
X^T X w &= X^T y \\
w &= (X^T X)^{-1} X^T y && \text{matrix inverse: } A^{-1} A = I
\end{aligned}$$

These are referred to as the normal equations
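A minimal NumPy sketch of the normal equations on synthetic data; np.linalg.solve is used instead of forming the inverse explicitly, a common numerical practice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # design matrix: 100 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # noisy linear responses

# Normal equations: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                      # close to [2.0, -1.0, 0.5]
```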

Linear regression

[Fig. 5.1, Goodfellow et al.: the normal equations optimize w.]

Linear regression

The regression formula forces the curve to pass through the origin. Remove this restriction:
• Add a bias term: $\hat{y} = w^T x + b$
• To use the normal equations, use a modified x:
$$x_{\text{mod}} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \\ 1 \end{bmatrix}$$
• The last term of the new weight vector w is the bias

Notes on learning

• A learner that performs well on unseen data is said to generalize well.

• When will learning on training data produce good generalization?
– Training and test data drawn from the same distribution
– Large enough training set to learn the distribution

Underfitting & Overfitting

• Underfit – model cannot learn the training data well
• Overfit – model does not generalize well
• These properties are related to model capacity

Capacity

What kind of functions can we learn?

[Fig. 5.2, Goodfellow et al.]

$$\hat{y} = w_0 x^0 + w_1 x^1 \qquad \hat{y} = w_0 x^0 + w_1 x^1 + w_2 x^2 \qquad \hat{y} = \sum_{i=0}^{D} w_i x^{D-i}, \quad D = 9$$

Here, model order affects capacity
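As an illustration of model order and capacity (invented noisy data, not the textbook's experiment), NumPy's polynomial fitting shows training error falling as the degree grows:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=20)   # noisy sinusoid

for degree in (1, 2, 9):
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    print(degree, np.mean((y_hat - y) ** 2))  # training MSE shrinks with degree
# Training error always improves with capacity, but the degree-9 fit to 20
# noisy points will generally generalize worse than a lower-order model.
```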

Shattering points

Points are shattered if a classifier can separate them regardless of their binary labels.

[Hastie et al., 2009, Fig 7.6: we can shatter the three points, but not four, with a linear classifier.]

Capacity

• Representational capacity – the best function that can be learned within a set of learnable functions
• Frequently a difficult optimization problem:
– We might learn a suboptimal solution
– This is called the effective capacity

Measuring capacity

• Model order is not always a good predictor of capacity

[Hastie et al., 2009, Fig 7.5: the label is determined by the sign of a sinusoid; increasing the frequency of the sinusoid enables ever finer distinctions…]

Measuring capacity

• Vapnik-Chervonenkis (VC) dimension
– for binary classifiers
– the largest # of points that can be shattered by a family of classifiers

• In practice, hard to estimate for deep learners… so why do we care?


Predictions about capacity

• Goal: minimize the generalization error
• Hard; perhaps minimize the difference Δ = training error - generalization error

• Learning theory:
– Models with higher capacity have a higher upper bound on Δ
– Increasing the amount of training data decreases Δ's upper bound

Recap on capacity

[Hastie et al., 2009, Fig 7.6]

The No Free Lunch Theorem

The expected performance of any classifier across all possible generalization tasks is no better than that of any other classifier.

A classifier might be better for some tasks, but no classifier is universally better than others.


N-fold cross validation

• Problem:
– More data yields better training
– Getting more data can be expensive
• Workaround:
– Partition data into N different groups (folds)
– Train on N-1 groups, test on the last group
– Rotate to obtain N different evaluations

Scikit-learn's KFold and StratifiedKFold (sklearn.model_selection) will split example indices for you, as sketched below.
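For example (synthetic data; the fold count and seed are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 examples, 2 features (synthetic)
y = np.arange(10) % 2

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Train on X[train_idx], y[train_idx]; test on X[test_idx], y[test_idx].
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
```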

Regularization

• Remember: Learners select a solution function from a set of hypothesis functions.

• Optimization picks the function that minimizes some optimization criterion

• Regularization lets us express a preference for certain types of solutions


Example: High-Dimensional Polynomial Fit

• Suppose we want small coefficients. Remember: $\hat{y} = w^T x$
• This happens when $\Omega(w) = w^T w$ is small, and results in shallower slopes
• We can define a new criterion:

$$J(w) = \mathrm{MSE} + \lambda\, \Omega(w) = \mathrm{MSE} + \lambda\, w^T w$$

where λ controls the influence of the regularizer.

9th degree polynomial fit with regularization

[Goodfellow et al., Fig. 5.5]

Point estimators

• An approximation of a quantity of interest
• Examples:
– a statistic of a distribution, e.g. $\hat{\mu}_m = \frac{1}{m}\sum_i x^{(i)}$ is an approximation of μ
– a parameter of a classifier, e.g. a mixture model weight or distribution parameter

Point estimators

In general, a point estimator is a function of the data:

$$\hat{\Theta}_m = f\!\left(x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\right)$$

and may even be a classifier function that maps data to a label (function estimation).

Bias

How far from the true value is our estimator?

$$\mathrm{bias}\!\left(\hat{\theta}_m\right) = E\!\left[\hat{\theta}_m\right] - \theta$$

Goodfellow et al. give an example with a Bernoulli distribution that we have not yet covered (read 3.9.1). Bernoulli distributions are good for modeling binary events (e.g., 42 heads in 100 coin tosses).

Bias of sample mean

$$\begin{aligned}
\mathrm{bias}(\hat{\mu}_m) &= E[\hat{\mu}_m] - \mu \\
&= E\!\left[\frac{1}{m}\sum_i x^{(i)}\right] - \mu \\
&= \frac{1}{m}\sum_i E\!\left[x^{(i)}\right] - \mu && x^{(i)} \text{ is a random variable} \\
&= \frac{1}{m} \cdot m\mu - \mu \\
&= \mu - \mu = 0 && \text{unbiased estimator}
\end{aligned}$$

Bias

• Read more examples in Goodfellow et al.
• Bias of classifier functions?
– We are trying to estimate the Bayes classifier
– Bias is the amount of error above that

Variance

• Already defined: $\mathrm{Var}(X) = E\!\left[(X - \mu)^2\right]$
• Variance of classifier functions:
– variance of a mean classification result, e.g., error rate

$$\begin{aligned}
\mathrm{Var}(\hat{\mu}_m) &= \mathrm{Var}\!\left(\frac{1}{m}\left(X_1 + X_2 + \cdots + X_m\right)\right) \\
&= \frac{1}{m^2}\left(\mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_m)\right) && \text{as } \mathrm{Var}(kX) = k^2\,\mathrm{Var}(X)^{\dagger} \text{ and the } X_i \text{ are independent} \\
&= \frac{1}{m^2}\, m\, \sigma^2 = \frac{\sigma^2}{m}
\end{aligned}$$

or equivalently, the standard error $\mathrm{SE}(\hat{\mu}_m) = \frac{\sigma}{\sqrt{m}}$, with $E[\hat{\mu}_m] = \mu$.

$^{\dagger}$ From $\mathrm{Var}(X) = E[X^2] - E[X]^2$ it can be shown that $\mathrm{Var}(kX) = k^2\,\mathrm{Var}(X)$.

Bias & Variance

• Variance gives us an idea of classifier sensitivity to different data

• Distribution of mean approaches a normal distribution (central limit theorem)

• Together, we can estimate with 95% confidence that the real mean lies within:

$$\hat{\mu}_m - 1.96\,\mathrm{SE}(\hat{\mu}_m) \;\le\; \mu \;\le\; \hat{\mu}_m + 1.96\,\mathrm{SE}(\hat{\mu}_m)$$
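A quick numerical check of this interval on synthetic data (all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=400)   # m=400 samples, true mu = 5

mu_hat = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))           # SE = sigma_hat / sqrt(m)
print(f"95% CI: [{mu_hat - 1.96 * se:.3f}, {mu_hat + 1.96 * se:.3f}]")
# The interval should contain 5.0 in roughly 95% of repeated experiments.
```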

Information Theory

A quick trip down the rabbit hole...
• Details in Goodfellow et al. 3.13
• Needed for maximum likelihood estimators

Quantity of information

• Amount of surprise that one sees when observing an event.

• If an event is rare, we can derive a large quantity of information from it.

$$I(x_i) = \log\frac{1}{\mathrm{P}(x_i)}$$

[Plot: I(x) versus Pr(x); the information falls toward 0 as Pr(x) approaches 1.]


Quantity of information

• Why use log? Suppose we want to know the information in two independent events:

$$\begin{aligned}
I(x_1, x_2) &= \log\frac{1}{\mathrm{P}(x_1, x_2)} \\
&= \log\frac{1}{\mathrm{P}(x_1)\,\mathrm{P}(x_2)} && \text{independent} \\
&= \log\frac{1}{\mathrm{P}(x_1)} + \log\frac{1}{\mathrm{P}(x_2)} \\
&= I(x_1) + I(x_2)
\end{aligned}$$


Entropy

• Entropy is defined as the expected amount of information (average amount of surprise) and is usually denoted by the symbol H.

$$\begin{aligned}
H(X) &= E[I(X)] \\
&= \sum_{x_i \in S} \mathrm{P}(x_i)\, I(x_i) && S \text{ is the set of all possible symbols} \\
&= \sum_{x_i \in S} \mathrm{P}(x_i) \log\frac{1}{\mathrm{P}(x_i)} && \text{definition of } I(x_i) \\
&= E[-\log \mathrm{P}(X)]
\end{aligned}$$

Discrete vs continuous

Discrete:
• Shannon entropy
• Use log₂
• Units called bits, or sometimes Shannons

Continuous:
• Differential entropy
• Use logₑ
• Units called nats

Example

• Assume X ∈ {0, 1} with
$$\mathrm{P}(X) = \begin{cases} p & X = 1 \\ 1 - p & X = 0 \end{cases}$$
• Then
$$H(X) = E[I(X)] = -p \log p - (1 - p)\log(1 - p)$$

[Goodfellow et al., Fig. 3.5: H(X) versus p.]

Comparing distributions

• How similar are distributions P and Q?
Recall: X~P means that X has distribution P; we denote its probability as $P_P$
• How much more information do we need to represent P's distribution using Q?

$$\begin{aligned}
E_{X\sim P}\left[I_Q(X) - I_P(X)\right] &= E_{X\sim P}\left[-\log P_Q(X) + \log P_P(X)\right] \\
&= E_{X\sim P}\left[\log \frac{P_P(X)}{P_Q(X)}\right] \\
&= \sum_x P_P(x)\, \log\frac{P_P(x)}{P_Q(x)}
\end{aligned}$$

Comparing distributions

• This is known as the Kullback-Leibler (KL) divergence from Q to P:

$$D_{KL}(P \,\|\, Q) = E_{X\sim P}\left[\log\frac{P_P(X)}{P_Q(X)}\right] = E_{X\sim P}\left[\log P_P(X) - \log P_Q(X)\right]$$

Note: $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$, so KL divergence is not a distance measure.

$D_{KL}(P \,\|\, Q) = 0$ iff $P \equiv Q$; otherwise $D_{KL}(P \,\|\, Q) > 0$.
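A sketch of the KL divergence for discrete distributions, illustrating the asymmetry noted above:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) in bits for discrete distributions on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                   # terms with p=0 contribute 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))    # the two directions differ
```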

Cross entropy H(P,Q)

• The expected information needed to represent samples from P using distribution Q
• Can be shown to be the entropy of P plus the KL divergence from Q to P:
$$H(P, Q) = H(P) + D_{KL}(P \,\|\, Q)$$
• Interesting, as we sometimes want to minimize KL divergence…
$$H(P, Q) = E_{X\sim P}\left[-\log P_Q(X)\right]$$

Cross entropy

Minimizing the cross entropy with respect to Q also minimizes the KL divergence:

$$\begin{aligned}
H(P, Q) &= H(P) + D_{KL}(P \,\|\, Q) \\
&= H(P) + E_{X\sim P}\left[\log P_P(X) - \log P_Q(X)\right] \\
&= -E_{X\sim P}\log P_P(X) + E_{X\sim P}\log P_P(X) + E_{X\sim P}\left[-\log P_Q(X)\right] \\
&= E_{X\sim P}\left[-\log P_Q(X)\right]
\end{aligned}$$

Maximum Likelihood Estimation (MLE)

• A method to estimate model parameters
• Suppose we have m independent samples $\mathbb{X} = \left\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\right\}$ drawn from a distribution
• Can we fit a model with parameters θ?

$$\theta_{ML} = \arg\max_\theta P(\mathbb{X} \mid \theta)$$

MLE

• The x⁽ⁱ⁾'s are independent, so
$$\theta_{ML} = \arg\max_\theta P(\mathbb{X} \mid \theta) = \arg\max_\theta \prod_i P\!\left(x^{(i)} \mid \theta\right)$$
• Transform to something easier, the log likelihood:
$$\theta_{ML} = \arg\max_\theta \prod_i P\!\left(x^{(i)} \mid \theta\right) = \arg\max_\theta \sum_i \log P\!\left(x^{(i)} \mid \theta\right)$$
• Traditional to use derivatives and solve for the maximum value; a numerical sketch follows.

MLE through optimization

• Let’s reframe this problem:

• Maximized when $P_{model}$ is most like $\hat{P}_{data}$
• What does this remind us of?

$$\theta_{ML} = \arg\max_\theta \sum_i \log P_{model}\!\left(x^{(i)} \mid \theta\right) = \arg\max_\theta E_{X \sim \hat{P}_{data}}\left[\log P_{model}(x \mid \theta)\right]$$

MLE through optimization

• $D_{KL}(\hat{P}_{data} \,\|\, P_{model})$ is minimized as $P_{model}$ becomes more like $\hat{P}_{data}$
• Recall
$$D_{KL}(\hat{P}_{data} \,\|\, P_{model}) = E_{X \sim \hat{P}_{data}}\left[\log \hat{P}_{data}(X) - \log P_{model}(X)\right]$$
• and we can minimize the one term of the cross entropy that depends on the model:
$$-E_{X \sim \hat{P}_{data}} \log P_{model}(X)$$

Linear regression as MLE

• Instead of predicting a value, what if we predicted a conditional probability $\hat{P}(y \mid x)$?
• Regression ($\hat{y} = w^T x$): only one possible output.
• MLE estimates a distribution: support for multiple outcomes; can think of this as a noisy prediction.

Linear regression as MLE

• Assume $Y \mid X \sim \mathcal{N}(\mu, \sigma^2)$
– learn μ such that $E\!\left[Y \mid x^{(i)}\right] = \hat{y}^{(i)}$
– σ² fixed (noise)
• Can formulate as an MLE problem:
$$\theta_{ML} = \arg\max_\theta P(Y \mid X; \theta)$$
where the parameter θ is our weight vector w.

Linear regression as MLE

$$\begin{aligned}
\theta_{ML} &= \arg\max_\theta P(Y \mid X; \theta) \\
&= \arg\max_\theta \prod_i P\!\left(y^{(i)} \mid x^{(i)}; \theta\right) && \text{if the } x^{(i)}\text{'s are independent and identically distributed} \\
&= \arg\max_\theta \sum_i \log P\!\left(y^{(i)} \mid x^{(i)}; \theta\right) && \text{taking logs} \\
&= \arg\max_\theta \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(y^{(i)} - \mu\right)^2}{2\sigma^2}} \\
&= \arg\max_\theta \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2}} && \text{as we want } E\!\left[Y \mid x^{(i)}\right] = \hat{y}^{(i)}
\end{aligned}$$

Linear regression as MLE

$$\begin{aligned}
\theta_{ML} &= \arg\max_\theta \sum_i \left(-\frac{1}{2}\log 2\pi - \log\sigma - \frac{\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2}\right) \\
&= \arg\max_\theta \left(-\frac{m}{2}\log 2\pi - m\log\sigma - \frac{\sum_i \left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2}\right)
\end{aligned}$$

Only the last term depends on θ, so this is equivalent to minimizing $\sum_i \left(\hat{y}^{(i)} - y^{(i)}\right)^2$, which has the same form as our MSE optimization:

$$\mathrm{MSE}_{\text{train}} = \frac{1}{m}\left\lVert \hat{y}^{\text{train}} - y^{\text{train}} \right\rVert_2^2$$

MLE

• MLE can only recover the distribution if the parametric distribution is appropriate for the data.

• If the data are drawn from multiple distributions with varying parameters, MLE can still estimate a distribution, but information about the underlying component distributions is lost.

Optimization

• We have seen closed form (single equation) optimization with linear regression.

• Not always so lucky… how do we optimize more complicated things?


Optimization

• Select an objective function f(x) to optimize, e.g. MSE or KL divergence
• Without loss of generality, we will always consider minimization.
Why can we do this? (Maximizing f(x) is the same as minimizing -f(x).)

Gradient descent

[Goodfellow et al., Fig. 4.1]

Critical points

[Goodfellow et al., Fig. 4.2]

Global vs. local minima

[Goodfellow et al., Fig. 4.3]

Functions on $\mathbb{R}^m \rightarrow \mathbb{R}$

• Gradient: the vector of partial derivatives
$$\nabla_x f(x) = \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} \\ \dfrac{\partial f(x)}{\partial x_2} \\ \vdots \\ \dfrac{\partial f(x)}{\partial x_m} \end{bmatrix}$$
• Moving in the gradient direction will increase f(x) if we don't go too far…
• To move to an x with a smaller f(x):
$$x' = x - \epsilon \nabla_x f(x)$$
What we pick for ε will make a difference.

IMPORTANT: f is the function to be optimized, e.g. the MSE, not the learner itself.
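A minimal gradient-descent sketch on the toy function f(x) = ‖x‖² (gradient 2x), standing in for an objective such as MSE:

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = ||x||^2
    return 2.0 * x

x = np.array([3.0, -2.0])
epsilon = 0.1                      # step size: too large diverges, too small is slow
for _ in range(100):
    x = x - epsilon * grad_f(x)    # x' = x - eps * grad f(x)
print(x)                           # approaches the minimum at [0, 0]
```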

Putting this all together
(note the change in notation: the objective is J(θ))

• Want to learn: $f_\theta\!\left(x^{(i)}\right) = y^{(i)}$
• To improve $f_\theta$, define an objective J(θ)
• Optimize J(θ):
– Gradients defined for each sample
– Average over the data set (e.g. a mini-batch) and update θ

Putting this all together

• Sample J(θ), a loss function:
$$J(\theta) = E_{(x,y)\sim \hat{p}_{data}}\left[L(x, y, \theta)\right] = \frac{1}{m}\sum_{i=1}^{m} L\!\left(x^{(i)}, y^{(i)}, \theta\right)$$
where $L(x, y, \theta) = -\log P(y \mid x, \theta)$, so
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \log P\!\left(y^{(i)} \mid x^{(i)}, \theta\right) \qquad \text{remember } D_{KL}?$$
Loss (aka cost) is a measurement of how far we are from the desired result.
• which is like our effort to get an MLE by minimizing KL divergence:
$$H(P, Q) = E_{X\sim P}\left[-\log P_Q(X)\right]$$

“There’s the rub” (Shakespeare’s Hamlet)

• Computing J(θ) is expensive: one example is not so bad, but data sets are massive…
• The sample expectation relies on having enough samples that the 1/m average estimates P(X).
• What if we only evaluated some of them…


Stochastic gradient descent (SGD)

while not converged:
    pick a minibatch of samples (x, y)
    compute the gradient
    update the estimate

A minibatch is usually up to ~100 samples; see the sketch below.
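A sketch of SGD for the linear regression objective above (synthetic data; the step size, batch size, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
epsilon, batch_size = 0.01, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)   # pick a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)   # gradient of minibatch MSE
    w = w - epsilon * grad                           # update the estimate
print(w)                                             # near [1.0, -2.0, 0.5]
```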
