Curse of Dimensionality and Statistical Models

MATH 697 AM:ST

Tuesday, September 12th

Warmup: Parameter inference

Let Y1, . . . , YN be distributed via some law f_y depending on a parameter y.

An estimator for y is any random variable of the form

Y = g(Y1, . . . , YN )

The Bias and Mean Squared Error (MSE) of Y are defined as

Bias(Y ) := E[Y ]− y, MSE(Y ) := E[|Y − y|2]

Warmup: Parameter inference

An estimator with Bias(Y ) = 0 is called unbiased.

Observe that

MSE(Y ) = Var(Y ) + (Bias(Y ))2

When using the MSE to judge an estimator, one sees there is a balance between bias and variance. Being unbiased is often a desirable property; however, some estimators are biased yet have such a small variance that their MSE is smaller in practice, and are thus preferred.

Warmup: The most basic estimators

Fix a sequence of i.i.d. variables X1, X2, . . .

The following is the (obvious) estimator for the mean

MN := (1/N) ∑_{i=1}^N Xi

and the two estimators below are used for the variance

S_N² := (1/N) ∑_{i=1}^N (Xi − MN)²          S̃_N² := (1/(N−1)) ∑_{i=1}^N (Xi − MN)²
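These two estimators give a concrete instance of the bias/variance balance from the previous slide. The following is a minimal Monte Carlo sketch (my own illustration, not from the lecture; numpy assumed): for small N the biased 1/N version typically ends up with the smaller MSE, even though only the 1/(N−1) version is unbiased.

import numpy as np

rng = np.random.default_rng(0)
N, trials, sigma2 = 10, 200_000, 1.0            # small sample, true variance 1

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
means = samples.mean(axis=1, keepdims=True)
ss = ((samples - means) ** 2).sum(axis=1)       # sum of squared deviations from M_N

for name, est in [("1/N", ss / N), ("1/(N-1)", ss / (N - 1))]:
    bias = est.mean() - sigma2
    var = est.var()
    mse = ((est - sigma2) ** 2).mean()
    print(f"{name:8s} bias={bias:+.4f}  var={var:.4f}  mse={mse:.4f}  var+bias^2={var + bias ** 2:.4f}")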

Warmup: A consequence of the Strong Law of Large Numbers

Denote the common mean and variance of the Xi by µ and σ². Then, for all large N we have, with probability ≈ 1,

MN ≈ µ,     S_N² ≈ σ²,     S̃_N² ≈ σ²

Thus, if x1, x2, . . . is a particular sample of the sequence X1, X2, . . ., then we may estimate µ and σ² via

(1/N) ∑_{i=1}^N xi,        (1/N) ∑_{i=1}^N (xi − mN)²,  . . .

Warmup – from last time

Expected (Squared) Prediction Error (EPE)

EPE(f) := E[|f(X)− Y |2]

and the conditional EPE at the point x

EPE(f , x) := E[|f(X)− Y |2 | X = x]

Warmup – from last time

Looking for the best linear predictor, we sought to minimize the EPE among f of the form f(x) = x · β, leading to the equation

E[X(X · β)] = E[Y X]

In other words, for i = 1, . . . , p, we have

∑_{j=1}^p E[XiXj] βj = E[Y Xi]

Warmup – from last time

Let (x1, y1), . . . , (xN, yN) be a sample of (X, Y). Then, the Strong Law of Large Numbers says that

E[XiXj] ≈ (1/N) ∑_{l=1}^N x_{li} x_{lj}        E[Y Xi] ≈ (1/N) ∑_{l=1}^N y_l x_{li}

Then,

XtXβ ≈ Xty

with the ≈ becoming = in the limit N →∞.

Warning: A source of confusion

A few things to always keep in mind when studying the statistical properties of a predictive model:

• What is a random variable and what isn’t?

• What are you sampling from?

• What are you taking the expectation with respect to?

Curse of Dimensionality

In view of all that we have said in the foregoing sections, the many obstacles we appear to have surmounted, what casts the pall over our victory celebration?

It is the curse of dimensionality, a malediction that has plagued the scientist from earliest days.

Richard Bellman in Adaptive Control Processes, p. 94, 1961.


Curse of Dimensionality

It is worth looking at k-NN and least squares in high dimensions.

Curse of Dimensionality: O fortuna

First, a quick observation regarding sampling.

The p-dimensional cube Qℓ(0) := [−ℓ/2, ℓ/2]^p has volume

ℓ^p

while the cube Qαℓ(0) for, say, α ∈ (0, 1) has volume

α^p ℓ^p

Curse of Dimensionality: Sors immanis. . .

The proportion of points of Qℓ(0) which lie in Qαℓ(0) is

α^p

and, for fixed α, this becomes very small when p is large.
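A quick numerical illustration (my own, not in the slides; numpy assumed): draw points uniformly from Q1(0) and check how the fraction landing in the concentric cube of side α decays like α^p as the dimension grows.

import numpy as np

rng = np.random.default_rng(1)
alpha, n_points = 0.9, 100_000

for p in (1, 2, 10, 50, 100):
    pts = rng.uniform(-0.5, 0.5, size=(n_points, p))    # uniform in Q_1(0)
    inside = np.all(np.abs(pts) <= alpha / 2, axis=1)   # falls in Q_alpha(0)?
    print(f"p={p:3d}  empirical fraction={inside.mean():.4f}  alpha^p={alpha ** p:.4f}")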

Curse of Dimensionality: . . . et inanis

In high dimensions, the mass of a cube concentrates near its boundary

One cannot escape this fate by resorting to other well-known norms; it is true of the Euclidean ball too, for instance. It is simply a consequence of the fact that volume scales like length to the power p.

Curse of Dimensionality: k-NN

This hurts k-NN in large dimensions.

Let us apply, for instance, 1-NN to y and x whose functional relation is deterministic and known, say

y = f(x) = e^{−5|x|²},   x ∈ R^p,  p ≥ 10

A quick computation shows that

f(x) ≤ 0.3 if |x| ≥ 0.5

Curse of Dimensionality: k-NN

Let us draw points X1, . . . , XN at random, uniformly in Q1(0). Then, the 1-NN estimate of y at x = 0 is

ŷ = e^{−5|Xk*|²}

k* being such that Xk* has the smallest norm among X1, . . . , XN.

Question: How large does N need to be for ŷ to have a reasonable chance (say ≥ 0.5) of being closer to 1 than to 0?

Answer: One needs to take N ≥ C^p for a certain constant C > 1.
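A small sketch of where this exponential growth comes from (my own illustration, assuming the convention Q1(0) = [−1/2, 1/2]^p, so the cube has volume 1): ŷ is closer to 1 than to 0 exactly when some sample lands in the ball of radius √(ln 2 / 5) ≈ 0.37 around the origin, and the chance of a single uniform point doing so is the volume of that ball, which shrinks very rapidly with p.

import math

# 1-NN at x = 0 gives yhat >= 1/2 exactly when the nearest sample lies in the
# ball of radius r = sqrt(ln 2 / 5) ~ 0.37, which is contained in Q_1(0).
r = math.sqrt(math.log(2) / 5)

for p in (2, 5, 10, 15, 20):
    q = math.pi ** (p / 2) * r ** p / math.gamma(p / 2 + 1)  # P(one uniform point lands in the ball)
    n_needed = math.log(2) / q                               # smallest N with 1 - (1 - q)^N >= 1/2, roughly
    print(f"p={p:2d}  P(single point hits)={q:.2e}  N needed ~ {n_needed:,.0f}")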

Curse of Dimensionality: k-NN

The moral of the story is that the training set we need for 1-NN, in some instances, must have a number of elements which grows exponentially with the dimension of the problem.

Curse of Dimensionality: Linear model with noise

Not all is lost, however: certain high-dimensional problems have been tackled in a computationally practical manner under further assumptions, e.g. linearity.

Let us take the method of least squares as an illustration of this.

Curse of Dimensionality: Linear model with noise

Consider three random variables X ∈ R^p and Y, ε ∈ R, with

Y = X · β + ε

Assume X and ε are independent, with E[ε] = 0 and Var(ε) = σ².

Curse of Dimensionality: Linear model with noise

Important: In what follows, we are going to study the statistics of i.i.d. samples of size N from this model.

So, fix N, and take (X1, Y1, ε1), . . . , (XN, YN, εN) i.i.d., distributed as X, Y, and ε, and independent of the original variables X, Y, and ε, which will be used as our test variables.

In particular, for each i we have Yi = Xi · β + εi; these are not merely real numbers and vectors, but random variables!

Curse of Dimensionality: Linear model with noise

Now, from X1, . . . , XN we construct an N × p random matrix X, defined as follows: for v ∈ R^p, let

Xv := (X1 · v, . . . , XN · v) ∈ R^N

Moreover, we construct two random vectors in RN ,

y = (Y1, . . . , YN )

ε = (ε1, . . . , εN )

Observe that the relationship between Xi, Yi, and εi becomes

y = Xβ + ε

Curse of Dimensionality: Linear model with noise

In summary, we have: a random N × p matrix X and vectors y, ε ∈ R^N with

y = Xβ + ε

where the rows of X are built from X1, . . . , XN.

We now bring in the least squares solution, which is given by

β̂ = (XtX)−1Xty

Observe: this β̂ is a random vector, and a function of the Xi, Yi, and εi. We intend to use it as an estimator for β.

Curse of Dimensionality: Linear model with noise

Let’s compute the Mean Squared Error associated to β̂:

MSE(fβ̂, x0) = E[|Y − β̂ · x0|² | X = x0]

By independence of X and ε, we have

MSE(fβ̂, x0) = E[|(β̂ − β) · x0|² | X = x0] + E[ε²]

Curse of Dimensionality: Linear model with noise

Here is where we put our random variable setup to use! Recalling that

y = Xβ + ε ∈ R^N

where ε = (ε1, . . . , εN) ∈ R^N, we have that

β̂ = (XtX)−1Xty
  = (XtX)−1XtXβ + (XtX)−1Xtε
  = β + (XtX)−1Xtε

Curse of Dimensionality: Linear model with noise

So, we have

β̂ = β + (XtX)−1Xtε

From the independence of X1, . . . , XN from ε1, . . . , εN, it follows that

E[(XtX)−1Xtε] = E[(XtX)−1Xt]E[ε] = 0

Therefore E[β̂] = β, i.e. β̂ is an unbiased estimator for β.

Curse of Dimensionality: Linear model with noise

Back to the MSE: using our newfound formula we have

E[|(β̂ − β) · x0|² | X = x0] = E[|((XtX)−1Xtε, x0)|² | X = x0]

and therefore,

MSE(β̂, x0) = E[|((XtX)−1Xtε, x0)|² | X = x0] + σ²

Curse of Dimensionality: Linear model with noise

Following the last identity with some meticulous calculations, one can see that

MSE(β̂, x0) = σ² E[((XtX)−1x0, x0)] + σ²

Further, with a bit more work, it can be shown that in fact

MSE(β̂) = (p/N) σ² + σ²

Voilà! The MSE grows only linearly in the dimension p, and the number of samples N only needs to grow linearly in p to make the MSE reasonably small.
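This scaling can be verified by simulation. A minimal Monte Carlo sketch (my own, with standard Gaussian inputs assumed; numpy assumed): redraw a training set and a test point many times and compare the empirical squared prediction error with (p/N)σ² + σ².

import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5

for p, N in [(5, 100), (20, 100), (20, 1000)]:
    beta = rng.normal(size=p)
    sq_errors = []
    for _ in range(500):                                  # new training set and test point each round
        X = rng.normal(size=(N, p))
        y = X @ beta + sigma * rng.normal(size=N)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares fit
        x0 = rng.normal(size=p)
        y0 = x0 @ beta + sigma * rng.normal()
        sq_errors.append((y0 - x0 @ beta_hat) ** 2)
    theory = (p / N) * sigma ** 2 + sigma ** 2
    print(f"p={p:2d} N={N:4d}  empirical MSE={np.mean(sq_errors):.4f}  (p/N)*s2+s2={theory:.4f}")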

Curse of Dimensionality: Linear model with noise

Thus, with extra structure, the curse of dimensionality was circumvented.

A lot of ongoing research in ML entails developing ways of getting around the curse of dimensionality in situations where linearity is not a good approximation.

Statistical models

Having met with the curse of dimensionality and seen the type of problems lying ahead for our learning methods, let’s discuss a few structures which, together with the formalism of the EPE, lead to many important algorithms.

We continue to think in terms of two random variables X and Y, for which we seek to understand an (approximate) deterministic relation of the form Y ≈ f(X). We already saw the EPE leads to

f(x) = E[Y | X = x]

Statistical models: Additive Noise Model

Let X and ε be independent random variables, with X ∈ R^p, and ε ∈ R being such that E[ε] = 0, Var(ε) = σ_ε².

We assume there is some f (the exact form of which is unknown), and consider the random variable Y given by

Y = f(X) + ε

Problem: Knowing this structure, and assuming as little as possible about f, can we estimate f by sampling X and Y?

Statistical models: Additive Noise Model

Note that for the additive noise model we have

E[Y | X = x] = f(x).

If f does not oscillate nor change too rapidly, k-NN could be quite adept at recovering f(x) from training data T = {(xi, yi)}.

(Due to the curse of dimensionality, the larger the dimension, the more severely rapid changes in f affect the computation.)

Statistical models: Restricted function classes

One way to impose further structure on our problem is by restricting the function class over which one minimizes the EPE.

We have already done this in two cases: linear regression and k-NN. The former was too restrictive, the latter too unstable.

Statistical models: Restricted function classes

In between these extremes we have classes of functions which, while still restricted, go far beyond linear functions in complexity.

Often, a vector θ is used to parametrize the space of functions under consideration. In such a case, we shall write

f(x, θ) = fθ(x).

Then minimizing EPE(fθ) becomes a (possibly nonlinear, or even non-convex) minimization problem.

Statistical models: Restricted function classes

Example: The sigmoid function with slope β,

hβ(x) = σ(x · β) = 1 / (1 + e^{−x·β}),   for β ∈ R^p

here, σ denotes the well-known sigmoid function

σ(t) = 1 / (1 + e^{−t})

Structured regression: Restricted function classes

Example: One may use linear basis expansions, which means that one first preselects and fixes a family of functions

h1(x), . . . , hM(x) : R^p → R

Then, for θ ∈ R^M, we set

fθ(x) := ∑_{m=1}^M θm hm(x)

If hm(x) = x_m are the components of the vector x, then minimizing EPE(fθ) gives back the least squares method.
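A minimal sketch of this reduction (my own illustration; numpy assumed): with a one-dimensional polynomial basis hm(x) = x^m, fitting θ is just ordinary least squares on the expanded features hm(xi).

import numpy as np

rng = np.random.default_rng(8)
xs = rng.uniform(-1, 1, size=100)
ys = np.sin(3 * xs) + 0.1 * rng.normal(size=100)

# basis h_m(x) = x^m for m = 0, ..., 3: fitting theta is least squares on the expanded features
H = np.column_stack([xs ** m for m in range(4)])
theta, *_ = np.linalg.lstsq(H, ys, rcond=None)
print(theta)   # coefficients of the fitted cubic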

Structured regression: Basis functions

Example: A single-layer, feed-forward neural net*

fθ(x) := ∑_{m=1}^M βm σ(αm · x + bm)

here the parameters allow one to change the slope and location of the sigmoids, by choosing the αm’s and bm’s, and their amplitude, by changing the βm’s.

Accordingly, we have θ = (β1, α1, b1, . . . , βM , αM , bM ).

*compare with projection pursuit regression.
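For concreteness, a small numpy sketch (my own, not from the slides) of how such a one-hidden-layer predictor is evaluated for a given θ = (β1, α1, b1, . . . , βM, αM, bM):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f_theta(x, alphas, bs, betas):
    # sum_m beta_m * sigmoid(alpha_m . x + b_m); alphas has shape (M, p)
    return betas @ sigmoid(alphas @ x + bs)

rng = np.random.default_rng(3)
p, M = 4, 3
alphas = rng.normal(size=(M, p))
bs = rng.normal(size=M)
betas = rng.normal(size=M)
x = rng.normal(size=p)
print(f_theta(x, alphas, bs, betas))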

Next class

1. More on EPE(fθ)

2. Other functionals besides squared error.

3. Further restricted function classes (often equivalent to the ones seen today): kernel methods and roughness penalization.

4. Model selection and more on bias versus variance.

Reminders: 1) Don’t forget the first problem set is due Thursday of the coming week, and the second problem set is due the Thursday after that. 2) Check your email today for a questionnaire – you must also use it to tell me who will be your project partner.

Statistical Models and Model Selection

MATH 697 AM:ST

Thursday, September 14th

Warm up

Today, let us start by discussing, briefly, two simple ideas

  • Maximum likelihood as a statistical estimation method.

• How least squares changes when one has a weighted sum.

Warm up I: Maximum Likelihood

Let Y1, . . . , YN be random variables with joint distribution

p(y1, . . . , yN ; θ)

with a parameter vector θ = (θ1, . . . , θM ), for some M .

Example: The Yi are i.i.d. Normals in R^p with Cov(Yi) = σ²I and mean θ:

p(y1, . . . , yN ; θ) = ∏_{i=1}^N (2πσ²)^{−p/2} e^{−|yi − θ|²/(2σ²)}

Warm up I: Maximum Likelihood

Given a sample y1, y2, . . . , yN , choose θ so as to maximize their probability, that is, take

θ̂ := argmax_θ p(y1, . . . , yN ; θ)

This θ̂ is sometimes called the maximum likelihood estimator. It corresponds to the idea that if y1, . . . , yN was observed, then it is a good guess to choose the θ that made this observation most likely.

This intuition is further justified by the same idea behind the Strong Law of Large Numbers: if N is large, then the above estimation should be more accurate.

Warm up I: Maximum Likelihood

If the Yi are independent, what we are maximizing is a product

p(y1, . . . , yN ; θ) = ∏_{i=1}^N p(yi; θ)

This leads naturally to the idea of log-probability, since

log p(y1, . . . , yN ; θ) = ∑_{i=1}^N log p(yi; θ)

Warm up I: Maximum Likelihood

Therefore, it is clear in this case that maximizing

∏_{i=1}^N p(yi; θ)

is exactly the same as maximizing the sum

∑_{i=1}^N log p(yi; θ)

Warm up I: Maximum Likelihood

Example: Following our previous example, given a sample y1, . . . , yN we seek the θ which maximizes

∑_{i=1}^N log p(yi; θ) = −(Np/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^N |yi − θ|²

Warm up I: Maximum Likelihood

Example (continued): But this is the same as minimizing

(1/(2σ²)) ∑_{i=1}^N |yi − θ|²

It follows that θ̂ is given by averaging the yi:

θ̂ = (1/N) ∑_{i=1}^N yi
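As a quick numerical check (my own sketch; numpy and scipy assumed available), maximizing this Gaussian log-likelihood numerically indeed returns the sample average derived above:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
sigma2, theta_true = 1.0, 2.5
y = rng.normal(theta_true, np.sqrt(sigma2), size=200)   # i.i.d. sample with p = 1

def neg_log_likelihood(theta):
    # minus the log-likelihood, up to the constant (N/2) log(2*pi*sigma^2)
    return np.sum((y - theta) ** 2) / (2 * sigma2)

theta_hat = minimize(neg_log_likelihood, x0=np.array([0.0])).x[0]
print(theta_hat, y.mean())   # the numerical maximizer matches the sample average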

Warm Up II: Weighted Least Squares

We are given the following:

Data (x1, y1), . . . , (xN , yN )

“weights” K1, . . . ,KN all positive numbers.

Then, let us see how different it is to minimize the functional

J(β) = ∑_{i=1}^N Ki |yi − xi · β|²

which yields back the usual least squares if K1 = . . . = KN = 1.

Warm Up II: Weighted Least Squares

What is different?

As usual for least squares, let us define y and X by

y = (y1, . . . , yN ), Xβ = (x1 · β, . . . , xN · β),

and let K be the N ×N diagonal matrix whose ii-th entry is Ki.

With this notation, it turns out that

J(β) = (K(Xβ − y), (Xβ − y))


Warm Up II: Weighted Least Squares

A responsible use of both linear algebra & the chain rule yields

∇J(β) = 2XtK(Xβ − y)

Thus, the minimizer(s) β solve

XtK(Xβ − y) = 0,

and, provided we can invert XtKX, the minimizer is unique and given by

β = (XtKX)−1XtKy
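A minimal numpy sketch (my own illustration) computing this formula directly; taking K = I recovers the ordinary least squares solution, and for moderate noise both sets of coefficients come out close to the truth.

import numpy as np

rng = np.random.default_rng(5)
N, p = 50, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
K = np.diag(rng.uniform(0.5, 2.0, size=N))            # positive weights K_i on the diagonal

beta_wls = np.linalg.solve(X.T @ K @ X, X.T @ K @ y)  # (X^t K X)^{-1} X^t K y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)          # the K = I case: ordinary least squares
print(beta_wls, beta_ols)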

Warm Up II: Weighted Least Squares

It is rather satisfactory to see that the formula for β

β = (XtKX)−1XtKy

arises from a minor modification of the standard formula

β = (XtX)−1Xty

Exercise: Can one easily check the invertibility of XtKX?

Statistical models: Additive Models and the Likelihood Functional

We continue to study the additive model: three random variables X, Y and ε satisfy the relation

Y = f(X) + ε

for some unknown f . X and ε are assumed independent, and

E[ε] = 0.

Our purpose, as always, is to estimate f at some x from a given sample of N observations (xi, yi), where yi = f(xi) + εi.

Statistical models: Additive Models and the Likelihood Functional

As we have seen, it follows from our assumptions that

E[Y | X] = f(X)

(and, if ε ∼ N(0, σ²), then pY|X(y) is N(f(X), σ²))

This is more or less where we were last time, before we started our discussion of some families of restricted functions.

Statistical models: Additive Models and the Likelihood Functional

Thus, picking up from last time, we restrict ourselves to a (parametrized) family of functions, labeled by a parameter θ.

This does not mean the unknown f belongs to this family, nor that we expect it to. We make no assumption on whether the function f, which we seek to learn, is among the fθ . . .

. . . we merely content ourselves with finding the “best” approximation to f among the fθ.

Of course, “best approximation” is up to us to define!

Statistical models: Additive Models and the Likelihood Functional

Remember the Bayes classifier: if Y is a categorical variable,

f(x) = argmax_g P[Y = g | X = x]

A similar idea yields a criterion for choosing a best fit f̂ for given training data, where the best fit f̂ is sought within a predetermined class of parametrized functions fθ.

Statistical models: Additive Models and the Likelihood Functional

Motivated by this, we consider a predictor for the additive model Yθ = fθ(X) + ε, where we also have

fθ a family of functions parametrized by θ

x1, . . . , xN a fixed training set ⊂ R^p

One possible choice of the “best” fθ is to choose the θ which maximizes the likelihood of the observed data. . .

Statistical models: Additive Models and the Likelihood Functional

. . .which is done as follows: given a sample y1, . . . , yN , we set

θ̂ := argmax_θ ∑_{i=1}^N log p_{Yθ | X=xi}(yi)

and then let, accordingly, f̂(x) = fθ̂(x).

Statistical models: Additive Models and the Likelihood Functional

Example: Assume that ε ∼ N(0, σ²). Then, for each yi, we have

p_{Yθ | X=xi}(yi) = (2πσ²)^{−1/2} e^{−|yi − fθ(xi)|²/(2σ²)}

It follows that the log-probability is

−(N/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^N |yi − fθ(xi)|²

Statistical models: Additive Models and the Likelihood Functional

We conclude that for the additive model with Gaussian noise:

Maximum likelihood ≡ Least Squares

What other objective functionals may arise this way?


Statistical models: Additive Models and the Likelihood Functional

Let H : R → R be such that

∫_R e^{−H(x)} dx < ∞,     ∫_R e^{−H(x)} x dx = 0.

e.g. if H(x) = H(−x) and H(x) ≥ α0|x| − β0 with α0 > 0.

Then, maximum likelihood amounts to the minimization

θ̂ = argmin_θ ∑_{i=1}^N H(yi − fθ(xi))

that is, provided ε has a Gibbs distribution Z_H^{−1} e^{−H(x)}.

Statistical models: Other objective functionals?

Of course, it is less obvious whether other objective functionals of interest arise in the same way:

(1/N) ∑_{i=1}^N |xi · β − yi|² + λ ∑_{i=1}^p βi²

(1/N) ∑_{i=1}^N |xi · β − yi|² + λ ∑_{i=1}^p |βi|

(1/N) ∑_{i=1}^N |f(xi) − yi|² + λ ∫_{−L}^{L} |f''(x)|² dx

This question we will not explore, but we will be studying these functionals, and we will return to maximum likelihood.

Structured regression

A way to get around the curse of dimensionality

1. Roughness penalty

2. Kernel methods

3. Basis functions

The third item above we discussed last class, where we looked at restricted families of the form

fθ(x) = ∑_{i=1}^M θi hi(x)

As we are about to see, these three methods are closely connected.

Structured regression: Penalizing lack of regularity

Let us revisit roughness penalization, where to N data points (xi, yi) we associate the functional

J(f) := (1/N) ∑_{i=1}^N |yi − f(xi)|² + λ ∫ |f''(x)|² dx

Structured regression: Penalizing lack of regularity

We let f̂ be the minimizer of J(f), where

J(f) := (1/N) ∑_{i=1}^N |yi − f(xi)|² + λ ∫ |f''(x)|² dx

Theorem: There exists a function K(x, y) such that

f̂(x) = (1/N) ∑_{i=1}^N K(x, xi) yi     ∀ x.

See B. Silverman, Annals of Statistics, 1984.

Structured regression: Penalizing lack of regularity

This theorem follows, in turn, from another basic fact about the minimizer f̂: it may be written as

f̂(x) = ∑_{i=1}^M θi hi(x)

where the θi can be computed explicitly from the data (xi, yi), and the hi are a certain family of piecewise polynomial functions, known as cubic splines, which depend only on the xi and not on the yi.

Structured regression: Penalizing lack of regularity

That this is so follows easily from the exercises in the first problem set. One of them corresponds to the following observation: if

g : [a, b] → R

minimizes, among all g’s with fixed values at a and b, the integral

∫_a^b |g''(t)|² dt

Then, g must be a third order polynomial in [a, b].

Structured regression: Penalizing lack of regularity

For the functional J, we have that minimizers need to be piecewise third order polynomials, with the polynomial possibly changing from the interval (xi, xi+1) to (xi+1, xi+2) (assuming that we have labeled the xi in increasing order).

A bit more work will show that this solution is also such that its second order derivative is continuous across each xi.

Structured regression: Kernels & Local Regression

Now, to motivate the kernel expression, let us introduce the idea of a local average:

RSS(fθ, x0) = ∑_{i=1}^N Kλ(x0, xi)(yi − fθ(xi))²

The idea is that Kλ(x0, xi) is very small when |x0 − xi| is sufficiently large, this being controlled by the parameter λ.

Thus, at each point x0, one does a least squares fit adapted to the points close to x0.

Structured regression: Kernels & Local Regression

Gaussian weights with variance λ

Kλ(x, y) = (2πλ)^{−p/2} e^{−|x−y|²/(2λ)}

The Epanechnikov quadratic kernel

Kλ(x, y) = D(λ^{−1}|x − y|),   where D(t) = (3/4)(1 − t²) χ[−1,1](t)

Structured regression: Kernels & Local Regression

RSS(fθ, x0) = ∑_{i=1}^N Kλ(x0, xi)(yi − fθ(xi))²

Two classes of functions

fθ(x) := θ (locally constant),

fθ(x) := θ0 + θ1 · x (locally linear)

Structured regression: Kernels & Local Regression

In the first case, fθ is constant, and

RSS(fθ, x0) = ∑_{i=1}^N Kλ(x0, xi)(yi − θ)²

This is easily seen to be a quadratic polynomial in θ; differentiating to find the minimizer θ̂, one arrives at

∑_{i=1}^N Kλ(x0, xi)(yi − θ̂) = 0   ⇒   θ̂ = ( ∑_{i=1}^N Kλ(x0, xi) yi ) / ( ∑_{i=1}^N Kλ(x0, xi) )

Structured regression: Kernels & Local Regression

This is known as the Nadaraya-Watson weighted average

θ̂ = ( ∑_{i=1}^N Kλ(x0, xi) yi ) / ( ∑_{i=1}^N Kλ(x0, xi) )
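Here is a minimal one-dimensional sketch of this weighted average (my own illustration; numpy assumed, with a Gaussian kernel playing the role of Kλ):

import numpy as np

def nadaraya_watson(x0, xs, ys, lam=0.01):
    # locally constant fit: kernel-weighted average of the y_i around x0
    w = np.exp(-np.abs(xs - x0) ** 2 / (2 * lam))   # Gaussian weights with bandwidth lam
    return np.sum(w * ys) / np.sum(w)

rng = np.random.default_rng(6)
xs = np.sort(rng.uniform(-1, 1, size=200))
ys = np.sin(3 * xs) + 0.2 * rng.normal(size=200)    # noisy additive model
print(nadaraya_watson(0.5, xs, ys), np.sin(1.5))    # estimate vs the true value f(0.5)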

Structured regression: Kernels & Local Regression

In the second case, we have, for θ = (θ0, θ1) ∈ Rp+1

RSS(fθ, x0) = ∑_{i=1}^N Kλ(x0, xi)(yi − θ0 − θ1 · xi)²

This turns out to be a quadratic polynomial in θ0 and the components of θ1; computing the gradient as we did for least squares, we obtain this time the formula

(θ̂0(x0), θ̂1(x0)) = (XtK(x0)X)−1XtK(x0)y

where now the rows of X are (1, xi) and K(x0) is the N × N diagonal matrix with entries Kλ(x0, xi).

Structured regression: Kernels & Local Regression

The corresponding prediction f̂(x0) is then given by

f̂(x0) = ∑_{i=1}^N ℓi(x0) yi

where the coefficients ℓi(x0) can be computed from x0 and the components of the matrix (XtK(x0)X)−1XtK(x0).
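And a matching sketch of the locally linear fit at a point x0 (again my own illustration with a Gaussian kernel; the column of ones supplies the intercept θ0):

import numpy as np

def local_linear_predict(x0, xs, ys, lam=0.01):
    # minimize sum_i K_lam(x0, x_i) * (y_i - t0 - t1 * x_i)^2, then evaluate the line at x0
    w = np.exp(-np.abs(xs - x0) ** 2 / (2 * lam))        # Gaussian kernel weights
    X = np.column_stack([np.ones_like(xs), xs])          # rows (1, x_i), so t0 is the intercept
    K = np.diag(w)
    t0, t1 = np.linalg.solve(X.T @ K @ X, X.T @ K @ ys)  # (X^t K(x0) X)^{-1} X^t K(x0) y
    return t0 + t1 * x0

rng = np.random.default_rng(7)
xs = np.sort(rng.uniform(-1, 1, size=200))
ys = np.sin(3 * xs) + 0.2 * rng.normal(size=200)
print(local_linear_predict(0.5, xs, ys), np.sin(1.5))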

The second week, in one slide

1. The Curse of Dimensionality means high dimensional problems are in general intractable (the required sample size can grow exponentially with the dimension); in particular, it hampers the usability of k-NN.

2. If further structure is available (e.g. linearity), the curse of dimensionality may be avoided.

3. When selecting a predictive model, one may judge it by looking at the training error and the test error. Of these two, the test error is by far the better evaluation metric.

4. A biased algorithm is often preferable to an unbiased one: the tradeoff may lead to a reduced mean squared error.

5. There are many choices to restrict the function classes for regression: roughness penalization, basis functions, kernel methods, local regression. . .