Curse of Dimensionality and Statistical Models
MATH 697 AM:ST
Tuesday, September 12th
Warmup – Parameter inference
Let $Y_1, \dots, Y_N$ be distributed via some law $f_y$ depending on $y$.
An estimator for $y$ is any random variable of the form
$$\hat Y = g(Y_1, \dots, Y_N)$$
The Bias and Mean Squared Error (MSE) of $\hat Y$ are defined as
$$\mathrm{Bias}(\hat Y) := E[\hat Y] - y, \qquad \mathrm{MSE}(\hat Y) := E[|\hat Y - y|^2]$$
Warmup – Parameter inference
An estimator with $\mathrm{Bias}(\hat Y) = 0$ is called unbiased.
Observe that
$$\mathrm{MSE}(\hat Y) = \mathrm{Var}(\hat Y) + (\mathrm{Bias}(\hat Y))^2$$
When using the MSE to judge an estimator, one sees there is a balance between bias and variance. Being unbiased is often a desirable property; however, some estimators are biased yet have such a small variance that their MSE is smaller in practice, and are thus preferred.
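For completeness, here is a one-line check of the bias-variance identity above (a standard computation, treating the parameter $y$ as deterministic):
$$\begin{aligned}
\mathrm{MSE}(\hat Y) &= E\big[\,|\hat Y - E[\hat Y] + E[\hat Y] - y|^2\,\big] \\
&= E\big[\,|\hat Y - E[\hat Y]|^2\,\big] + 2\,(E[\hat Y] - y)\,E\big[\hat Y - E[\hat Y]\big] + (E[\hat Y] - y)^2 \\
&= \mathrm{Var}(\hat Y) + 0 + (\mathrm{Bias}(\hat Y))^2.
\end{aligned}$$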
Warmup – The most basic estimators
Fix a sequence of i.i.d. variables $X_1, X_2, \dots$
The following is the (obvious) estimator for the mean
$$M_N := \frac{1}{N}\sum_{i=1}^N X_i$$
and the two estimators below are used for the variance
$$S_N^2 := \frac{1}{N}\sum_{i=1}^N (X_i - M_N)^2, \qquad \tilde S_N^2 := \frac{1}{N-1}\sum_{i=1}^N (X_i - M_N)^2$$
Warmup – A consequence of the Strong Law of Large Numbers
Denote the common mean and variance of the $X_i$ by $\mu$ and $\sigma^2$. Then, for all large $N$ we have, with probability $\approx 1$,
$$M_N \approx \mu, \qquad S_N^2 \approx \sigma^2, \qquad \tilde S_N^2 \approx \sigma^2$$
It is thus that if $x_1, x_2, \dots$ are a particular sample of the sequence $X_1, X_2, \dots$, then we may guess $\mu$ and $\sigma^2$ via
$$\frac{1}{N}\sum_{i=1}^N x_i, \qquad \frac{1}{N}\sum_{i=1}^N (x_i - m_N)^2, \;\dots$$
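As a quick illustration (a minimal numerical sketch, not part of the original slides): averaging the two variance estimators over many samples shows that $S_N^2$ is biased downward by the factor $(N-1)/N$, while $\tilde S_N^2$ is unbiased.

import numpy as np

rng = np.random.default_rng(0)
N, trials = 10, 100_000
mu, sigma2 = 1.0, 4.0                      # true mean and variance

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
m = x.mean(axis=1, keepdims=True)          # M_N for each sample
ss = ((x - m) ** 2).sum(axis=1)            # sum of squared deviations from M_N

print((ss / N).mean())        # about (N-1)/N * sigma2 = 3.6   (biased)
print((ss / (N - 1)).mean())  # about sigma2 = 4.0             (unbiased)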
Warmup –from last time
Expected (Squared) Prediction Error (EPE)
$$\mathrm{EPE}(f) := E[|f(X) - Y|^2]$$
and the conditional EPE at $x_0$
$$\mathrm{EPE}(f, x_0) := E[|f(X) - Y|^2 \mid X = x_0]$$
Warmup –from last time
Looking for the best linear predictor, we sought to minimize the EPE within $f$ of the form $x \cdot \beta$, leading to the equation
$$E[X(X \cdot \beta)] = E[Y X]$$
In other words, for $i = 1, \dots, p$, we have
$$\sum_{j=1}^{p} E[X_i X_j]\,\beta_j = E[Y X_i]$$
Warmup –from last time
Let $(x_1, y_1), \dots, (x_N, y_N)$ be a sample. Then, the Strong Law of Large Numbers says that
$$E[X_i X_j] \approx \frac{1}{N}\sum_{l=1}^{N} x_{li}x_{lj}, \qquad E[Y X_i] \approx \frac{1}{N}\sum_{l=1}^{N} y_l x_{li}$$
Then,
$$X^t X \beta \approx X^t y$$
with the $\approx$ becoming $=$ in the limit $N \to \infty$.
Warning: A source of confusion
A few things to always keep in mind when studying the statistical properties of a predictive model:
• What is a random variable and what isn’t?
• What are you sampling from?
• What are you taking the expectation with respect to?
Curse of Dimensionality
In view of all that we have said in the foregoing sections, the many obstacles we appear to have surmounted, what casts the pall over our victory celebration?
It is the curse of dimensionality, a malediction that has plagued the scientist from earliest days.
Richard Bellman in Adaptive Control Processes, p. 94, 1961.
Curse of Dimensionality
It is worth looking at k-NN and least squares in high dimensions.
Curse of Dimensionality – O fortuna
First, a quick observation regarding sampling.
The p-dimensional cube $Q_\ell(0) := [-\ell/2, \ell/2]^p$ has volume
$$\ell^p$$
while the cube $Q_{\alpha\ell}(0)$ for, say, $\alpha \in (0, 1)$ has volume
$$\alpha^p \ell^p$$
Curse of Dimensionality – Sors immanis. . .
The proportion of points of $Q_\ell(0)$ which lie in $Q_{\alpha\ell}(0)$ is
$$\alpha^p$$
and, for fixed $\alpha$, this becomes very small when $p$ is large.
Curse of Dimensionality – . . . et inanis
In high dimensions, the mass of a cube concentrates near its boundary.
One cannot escape this fate by resorting to some of the most well known norms: it is true of the Euclidean ball too, for instance. It is clearly a consequence of the fact that the volume scales like length to the power p.
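A minimal numerical sketch of this concentration (not part of the original slides): sample points uniformly in $Q_1(0)$ and count how many land in the inner cube $Q_\alpha(0)$; the empirical fraction matches $\alpha^p$ and collapses as $p$ grows.

import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.9, 100_000
for p in (2, 10, 50, 100):
    pts = rng.uniform(-0.5, 0.5, size=(n, p))        # uniform samples in Q_1(0)
    inside = (np.abs(pts) <= alpha / 2).all(axis=1)  # falls inside Q_alpha(0)
    print(p, inside.mean(), alpha ** p)              # empirical fraction vs alpha^p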
Curse of Dimensionality – k-NN
This hurts k-NN in large dimensions.
Let us apply, for instance, 1-NN to y and x whose functional relation is deterministic, and known, say
$$y = f(x) = e^{-5|x|^2}, \qquad x \in \mathbb{R}^p, \quad p \ge 10$$
A computer computation tells you that
$$f(x) \le 0.3 \quad \text{if } |x| \ge 0.5$$
Curse of Dimensionality – k-NN
Let us draw, at random, uniformly in $Q_1(0)$, points $X_1, \dots, X_N$. Then, the 1-NN estimate for y at x = 0 is
$$\hat y = e^{-5|X_{k^*}|^2}$$
$k^*$ being such that $X_{k^*}$ has the smallest norm among $X_1, \dots, X_N$.
Question: How large does N need to be for $\hat y$ to have a reasonable chance (say $\ge 0.5$) of being closer to 1 than to 0?
Answer: One needs to choose $N \ge C^p$ for a certain $C > 1$.
Curse of Dimensionality – k-NN
The moral of the story is that the training set we need for 1-NN in some instances must have a number of elements which grows exponentially with the dimension of the problem.
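A small Monte Carlo sketch of this (an illustration, assuming the setup above: $f(x) = e^{-5|x|^2}$ and points drawn uniformly in $Q_1(0)$): for each dimension p we estimate the probability that the 1-NN prediction at the origin is closer to 1 than to 0, for a few training-set sizes N.

import numpy as np

rng = np.random.default_rng(0)

def prob_good_prediction(p, N, trials=2000):
    # Estimate P( 1-NN prediction at x = 0 is closer to 1 than to 0 ),
    # i.e. P( exp(-5 * min_k |X_k|^2) >= 0.5 ).
    hits = 0
    for _ in range(trials):
        pts = rng.uniform(-0.5, 0.5, size=(N, p))   # N training points, uniform in Q_1(0)
        r2 = (pts ** 2).sum(axis=1).min()           # squared norm of the nearest point
        hits += np.exp(-5.0 * r2) >= 0.5
    return hits / trials

for p in (2, 5, 10):
    # the N needed for a good prediction grows roughly like C^p
    print(p, [(N, prob_good_prediction(p, N)) for N in (10, 100, 1000)])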
Curse of Dimensionality – Linear model with noise
Not all is lost, however: certain high dimensional problems have been tackled in a computationally practical manner under further assumptions, e.g. linearity.
Let us take the method of least squares as an illustration of this.
Curse of Dimensionality – Linear model with noise
Consider three random variables $X \in \mathbb{R}^p$, $Y, \varepsilon \in \mathbb{R}$, with
$$Y = X \cdot \beta + \varepsilon$$
Assume X and $\varepsilon$ are independent, and that $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$.
Curse of Dimensionality – Linear model with noise
Important: In what follows, we are going to study the statistics of all i.i.d. samples of size N for this model.
So, fix N, and take $(X_1, Y_1, \varepsilon_1), \dots, (X_N, Y_N, \varepsilon_N)$ i.i.d., distributed as X, Y, and $\varepsilon$, and independent from the original variables X, Y, and $\varepsilon$, which will be used as our test variables.
In particular, for each i we have $Y_i = X_i \cdot \beta + \varepsilon_i$; these are not merely real numbers and vectors, but random variables!
Curse of Dimensionality – Linear model with noise
Now, from the $X_1, \dots, X_N$ we construct an $N \times p$ random matrix, defined as follows: for $v \in \mathbb{R}^p$, let
$$Xv := (X_1 \cdot v, \dots, X_N \cdot v) \in \mathbb{R}^N$$
Moreover, we construct two random vectors in $\mathbb{R}^N$,
$$y = (Y_1, \dots, Y_N), \qquad \varepsilon = (\varepsilon_1, \dots, \varepsilon_N)$$
Observe that the relationship between the $X_i$, $Y_i$, and $\varepsilon_i$ becomes
$$y = X\beta + \varepsilon$$
Curse of Dimensionality – Linear model with noise
In summary, we have: a random $N \times p$ matrix X and vectors $y, \varepsilon \in \mathbb{R}^N$ with
$$y = X\beta + \varepsilon$$
where the rows of X are made out of $X_1, \dots, X_N$.
We now bring in the least squares solution, which is given by
$$\hat\beta = (X^tX)^{-1}X^ty$$
Observe: this $\hat\beta$ is a random vector, and a function of the $X_i$, $Y_i$, and $\varepsilon_i$. We intend to use it as an estimator for $\beta$.
Curse of Dimensionality – Linear model with noise
Let's compute the Mean Squared Error associated to $\hat\beta$
$$\mathrm{MSE}(f_{\hat\beta}, x_0) = E[|Y - \hat\beta \cdot x_0|^2 \mid X = x_0]$$
By independence of X and $\varepsilon$, we have
$$\mathrm{MSE}(f_{\hat\beta}, x_0) = E[|(\beta - \hat\beta) \cdot x_0|^2 \mid X = x_0] + E[\varepsilon^2]$$
Curse of Dimensionality – Linear model with noise
Here is where we put to use our random variable setup! Recalling that
$$y = X\beta + \varepsilon \in \mathbb{R}^N$$
where $\varepsilon = (\varepsilon_1, \dots, \varepsilon_N) \in \mathbb{R}^N$, we have that
$$\hat\beta = (X^tX)^{-1}X^ty = (X^tX)^{-1}X^tX\beta + (X^tX)^{-1}X^t\varepsilon = \beta + (X^tX)^{-1}X^t\varepsilon$$
Curse of Dimensionality – Linear model with noise
So, we have
$$\hat\beta = \beta + (X^tX)^{-1}X^t\varepsilon$$
From the independence of $X_1, \dots, X_N$ from $\varepsilon_1, \dots, \varepsilon_N$, it follows that
$$E[(X^tX)^{-1}X^t\varepsilon] = E[(X^tX)^{-1}X^t]\,E[\varepsilon] = 0$$
Therefore $E[\hat\beta] = \beta$, i.e. $\hat\beta$ is an unbiased estimator for $\beta$.
Curse of Dimensionality – Linear model with noise
Back to the MSE, using our newfound formula we have
$$E[|(\beta - \hat\beta) \cdot x_0|^2 \mid X = x_0] = E[|((X^tX)^{-1}X^t\varepsilon,\, x_0)|^2 \mid X = x_0]$$
and therefore,
$$\mathrm{MSE}(\hat\beta, x_0) = E[|((X^tX)^{-1}X^t\varepsilon,\, x_0)|^2 \mid X = x_0] + \sigma^2$$
Curse of Dimensionality – Linear model with noise
Following the last identity with some meticulous calculations, one can see that
$$\mathrm{MSE}(\hat\beta, x_0) = \sigma^2\, E[((X^tX)^{-1}x_0,\, x_0) \mid X = x_0] + \sigma^2$$
Further, with a bit more work, it can be shown that in fact
$$\mathrm{MSE}(\hat\beta) = \frac{p}{N}\,\sigma^2 + \sigma^2$$
Voila! The growth is only linear in the dimension p, and the number of samples N only needs to be linear in p to keep the MSE reasonably small.
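A Monte Carlo sanity check of this formula (a sketch under illustrative assumptions: the coordinates of X are i.i.d. standard normal and $\varepsilon \sim N(0, \sigma^2)$); the test error over fresh draws of (X, Y) should land near $\sigma^2(1 + p/N)$.

import numpy as np

rng = np.random.default_rng(0)
p, N, sigma = 20, 200, 1.0
beta = rng.normal(size=p)                    # some "true" coefficients
x_test = rng.normal(size=(50_000, p))        # test inputs, same law as X

errors = []
for _ in range(200):                         # resample the training set
    X = rng.normal(size=(N, p))
    y = X @ beta + rng.normal(scale=sigma, size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # least squares estimator
    y_test = x_test @ beta + rng.normal(scale=sigma, size=len(x_test))
    errors.append(np.mean((y_test - x_test @ beta_hat) ** 2))

print(np.mean(errors))                       # observed test MSE
print(sigma ** 2 * (1 + p / N))              # predicted: sigma^2 * (1 + p/N) = 1.1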
Thus, with extra structure, the curse of dimensionality was circumvented.
A lot of ongoing research in ML entails developing ways of getting around the curse of dimensionality in situations where linearity is not a good approximation.
Statistical models
Having met with the curse of dimensionality and seen the type of problems lying ahead for our learning methods, let's discuss a few structures which, together with the formalism of the EPE, lead to many important algorithms.
We continue to think in terms of two random variables X and Y, for which we seek to understand a deterministic relation of the form Y = f(X). We already saw the EPE leads to
$$\hat f(x) = E[Y \mid X = x]$$
Statistical models – Additive Noise Model
Let X and $\varepsilon$ be independent random variables, with $X \in \mathbb{R}^p$, and $\varepsilon \in \mathbb{R}$ being such that $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$.
We assume there is some f (the exact form of which is unknown), and consider the random variable Y given by
$$Y = f(X) + \varepsilon$$
Problem: Knowing this structure, and knowing as little as possible about f, can we estimate f by sampling X and Y?
Statistical models – Additive Noise Model
Note that for the additive noise model we have
$$\hat f(x) = E[Y \mid X = x] = f(x).$$
If f does not oscillate nor change too rapidly, k-NN could be quite adept at recovering f(x) from training data $T = \{(x_i, y_i)\}$.
(From the curse of dimensionality, the larger the dimension, the more rapid changes in f affect the computations.)
Statistical models – Restricted function classes
One way to impose further structure on our problem is by restricting the class of functions over which one minimizes the EPE.
We have already done this in two cases: linear regression and k-NN. The former was too restrictive, the latter too unstable.
Statistical models – Restricted function classes
In between these extremes we have classes of functions which, while still restricted, go far beyond linear functions in complexity.
Often, a vector $\theta$ is used to parametrize the space of functions under consideration. In such a case, we shall write
$$f(x, \theta) = f_\theta(x).$$
Then minimizing $\mathrm{EPE}(f_\theta)$ becomes a (possibly nonlinear, or even non-convex) minimization problem.
Statistical models – Restricted function classes
Example: The sigmoid function with slope $\beta$,
$$h_\beta(x) = \sigma(x \cdot \beta) = \frac{1}{1 + e^{-x \cdot \beta}}, \qquad \text{for } \beta \in \mathbb{R}^p$$
here, $\sigma$ denotes the well known sigmoid function
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
Structured regression – Restricted function classes
Example: One may use linear basis expansions, which means that one first preselects and fixes a family of functions
$$h_1(x), \dots, h_M(x) : \mathbb{R}^p \to \mathbb{R}$$
Then, for $\theta \in \mathbb{R}^M$, we set
$$f_\theta(x) := \sum_{m=1}^{M} \theta_m h_m(x)$$
If the $h_m(x) = x_m$ are the components of the vector x, then minimizing $\mathrm{EPE}(f_\theta)$ gives back the least squares method.
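A small sketch (not from the slides) of a linear basis expansion fitted by least squares, using polynomial basis functions $h_m(x) = x^m$ in one dimension as an illustrative choice; minimizing the empirical version of the EPE over $\theta$ is then an ordinary least squares problem in the expanded features.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + 0.1 * rng.normal(size=100)            # made-up training data

M = 5
H = np.column_stack([x ** m for m in range(1, M + 1)])    # columns h_m(x_i) = x_i^m
theta = np.linalg.lstsq(H, y, rcond=None)[0]              # least squares over theta

def f_theta(x_new):
    # evaluate the fitted expansion  sum_m theta_m * h_m(x_new)
    return sum(t * x_new ** (m + 1) for m, t in enumerate(theta))

print(f_theta(0.5), np.sin(3 * 0.5))                      # fitted value vs true f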
Structured regression – Basis functions
Example: A single layer, feed-forward, neural net*
$$f_\theta(x) := \sum_{m=1}^{M} \beta_m\, \sigma(\alpha_m \cdot x + b_m)$$
here the parameters allow one to change both the slope and location of the sigmoids, by choosing the $\alpha_m$'s and $b_m$'s, and their amplitude, by changing the $\beta_m$'s.
Accordingly, we have $\theta = (\beta_1, \alpha_1, b_1, \dots, \beta_M, \alpha_M, b_M)$.
*also known as basis pursuit.
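As a sketch of this parametrization (the choice of M, p, and of the parameter values below is purely illustrative):

import numpy as np

def sigma(t):
    # the sigmoid from the previous slide
    return 1.0 / (1.0 + np.exp(-t))

def f_theta(x, alphas, bs, betas):
    # single hidden layer:  sum_m  beta_m * sigma(alpha_m . x + b_m)
    return sum(beta_m * sigma(alpha_m @ x + b_m)
               for alpha_m, b_m, beta_m in zip(alphas, bs, betas))

rng = np.random.default_rng(0)
M, p = 3, 2                          # illustrative sizes
alphas = rng.normal(size=(M, p))     # slopes / directions of the sigmoids
bs = rng.normal(size=M)              # locations
betas = rng.normal(size=M)           # amplitudes

print(f_theta(np.array([0.5, -1.0]), alphas, bs, betas))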
Next class
1. More on EPE(fθ)
2. Other functionals besides squared error.
3. Further restricted function classes (often equivalent to the ones seen today): kernel methods and roughness penalization.
4. Model selection and more on bias versus variance.
Reminders: 1) Don't forget the first problem set is due Thursday of the coming week, and the second problem set is due the Thursday after that. 2) Check your email today for a questionnaire – you must also use it to tell me who will be your project partner.
Statistical Models and Model Selection
MATH 697 AM:ST
Thursday, September 14th
Warm up
Today, let us start by discussing, briefly, two simple ideas
• Maximum likelihood as a statistical estimation method.
• How least squares changes when one has a weighted sum.
Warm up I – Maximum Likelihood
Let $Y_1, \dots, Y_N$ be random variables with joint distribution
$$p(y_1, \dots, y_N; \theta)$$
with a parameter vector $\theta = (\theta_1, \dots, \theta_M)$, for some M.
Example: The $Y_i$ are i.i.d. Normals in $\mathbb{R}^p$ with $\mathrm{Cov}(Y_i) = \sigma^2 I$ and mean $\theta$,
$$p(y_1, \dots, y_N; \theta) = \prod_{i=1}^{N} \frac{1}{(2\pi\sigma^2)^{p/2}}\, e^{-\frac{|y_i - \theta|^2}{2\sigma^2}}$$
Warm up I – Maximum Likelihood
Given a sample $y_1, y_2, \dots, y_N$, choose $\theta$ so as to maximize their probability, that is, take
$$\hat\theta := \operatorname{argmax}_\theta\; p(y_1, \dots, y_N; \theta)$$
This $\hat\theta$ is sometimes called the maximum likelihood estimator. It corresponds to the idea that if $y_1, \dots, y_N$ were observed, then it is a good guess to choose the $\theta$ that made this observation most likely.
This intuition is further justified by the same idea behind the Strong Law of Large Numbers: if N is large, then the above estimation should be more accurate.
Warm up I – Maximum Likelihood
If the $Y_i$ are independent, what we are maximizing is a product
$$p(y_1, \dots, y_N; \theta) = \prod_{i=1}^{N} p(y_i; \theta)$$
This leads naturally to the idea of log-probability, since
$$\log\big(p(y_1, \dots, y_N; \theta)\big) = \sum_{i=1}^{N} \log\big(p(y_i; \theta)\big)$$
Warm up I – Maximum Likelihood
Therefore, it is clear in this case that maximizing
$$\prod_{i=1}^{N} p(y_i; \theta)$$
is exactly the same as maximizing the sum
$$\sum_{i=1}^{N} \log\big(p(y_i; \theta)\big)$$
Warm Up I – Maximum Likelihood
Example: Following our previous example, given a sample $y_1, \dots, y_N$ we seek $\theta$ which maximizes
$$\sum_{i=1}^{N} \log\big(p(y_i; \theta)\big) = -\frac{Np}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N} |y_i - \theta|^2$$
Warm Up I – Maximum Likelihood
Example (continued): But this is the same as minimizing
$$\frac{1}{2\sigma^2}\sum_{i=1}^{N} |y_i - \theta|^2$$
It follows that $\hat\theta$ is given by averaging the $y_i$
$$\hat\theta = \frac{1}{N}\sum_{i=1}^{N} y_i$$
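A quick numerical check of this example (a sketch, taking p = 1 and a known $\sigma$ for simplicity): numerically maximizing the Gaussian log-likelihood over $\theta$ lands on the sample average.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
sigma2 = 1.5 ** 2
y = rng.normal(loc=2.0, scale=1.5, size=50)        # a sample y_1, ..., y_N

def neg_log_likelihood(theta):
    # minus the Gaussian log-probability, up to an additive constant
    return np.sum((y - theta) ** 2) / (2 * sigma2)

theta_hat = minimize(neg_log_likelihood, x0=0.0).x[0]
print(theta_hat, y.mean())                          # the two agree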
Warm Up II – Weighted Least Squares
We are given the following:
Data (x1, y1), . . . , (xN , yN )
“weights” K1, . . . ,KN all positive numbers.
Then, let us see how different it is to minimize the functional
$$J(\beta) = \sum_{i=1}^{N} K_i\, |y_i - x_i \cdot \beta|^2$$
which yields back the usual least squares if K1 = . . . = KN = 1.
Warm Up II – Weighted Least Squares
What is different?
As usual for least squares, let us define y and X by
y = (y1, . . . , yN ), Xβ = (x1 · β, . . . , xN · β),
and let K be the N ×N diagonal matrix whose ii-th entry is Ki.
With this notation, it turns out that
J(β) = (K(Xβ − y), (Xβ − y))
Warm Up II – Weighted Least Squares
A responsible use of both linear algebra & the chain rule yields
$$\nabla J(\beta) = 2X^tK(X\beta - y)$$
Thus, the minimizer(s) $\hat\beta$ solve
$$X^tK(X\hat\beta - y) = 0,$$
and, provided we can invert $X^tKX$, the minimizer is unique, and given by
$$\hat\beta = (X^tKX)^{-1}X^tKy$$
Warm Up II – Weighted Least Squares
It is rather satisfactory to see that the formula for $\hat\beta$
$$\hat\beta = (X^tKX)^{-1}X^tKy$$
arises from a minor modification of the standard formula
$$\hat\beta = (X^tX)^{-1}X^ty$$
Exercise: Can one easily check the invertibility of $X^tKX$?
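A compact numerical sketch of the weighted least squares formula above (the data and weights below are made up for illustration); setting K = I recovers ordinary least squares.

import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=N)
K = np.diag(rng.uniform(0.5, 2.0, size=N))          # positive weights K_1, ..., K_N

# beta_hat = (X^t K X)^{-1} X^t K y
beta_hat = np.linalg.solve(X.T @ K @ X, X.T @ K @ y)
print(beta_hat)

# with K = I we recover ordinary least squares
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)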
Statistical models – Additive Models and the Likelihood Functional
We continue to study the additive model: three random variables X, Y and $\varepsilon$ satisfy the relation
$$Y = f(X) + \varepsilon$$
for some unknown f. X and $\varepsilon$ are assumed independent, and
$$E[\varepsilon] = 0.$$
Our purpose, as always, is to estimate f at some x from a given sample of N observations $(x_i, y_i)$, where $y_i = f(x_i) + \varepsilon_i$.
Statistical models – Additive Models and the Likelihood Functional
As we have seen, it follows from our assumptions that
$$E[Y \mid X] = f(X)$$
(and, when $\varepsilon \sim N(0, \sigma^2)$, $p_{Y\mid X}(y)$ is $N(f(X), \sigma^2)$).
This is more or less where we were last time, before we started our discussion of some families of restricted functions.
Statistical models – Additive Models and the Likelihood Functional
Thus, picking up from last time, we restrict ourselves to a (parametrized) family of functions, labeled by a parameter $\theta$.
This does not mean the unknown f belongs to this family, nor that we expect it to. We make no assumption on whether the function f, which we seek to learn, is among the $f_\theta$ . . .
. . . we merely content ourselves with finding the "best" approximation to f among the $f_\theta$.
Of course, "best approximation" is up to us to define!
Statistical models – Additive Models and the Likelihood Functional
Remember the Bayesian classifier: If Y is a categorical variable,
$$\hat f(x) = \operatorname{argmax}_g\; P[Y = g \mid X = x]$$
A similar idea yields a criterion to choose a best fit $\hat f$ for given training data, where the best fit $\hat f$ is sought within a predetermined class of parametrized functions $f_\theta$.
Statistical models – Additive Models and the Likelihood Functional
Motivated by this, we consider a predictor for the additive model $Y_\theta = f_\theta(X) + \varepsilon$ where we also have
$f_\theta$ a family of functions parametrized by $\theta$
$x_1, \dots, x_N$ a fixed training set $\subset \mathbb{R}^p$
One possible choice for the "best $f_\theta$" is to choose $\theta$ which maximizes the likelihood of the observed data. . .
Statistical models – Additive Models and the Likelihood Functional
. . . which is done as follows: given a sample $y_1, \dots, y_N$, we set
$$\hat\theta := \operatorname{argmax}_\theta \sum_{i=1}^{N} \log\big(p_{Y_\theta \mid X = x_i}(y_i)\big)$$
and then let, accordingly, $\hat f(x) = f_{\hat\theta}(x)$.
Statistical models – Additive Models and the Likelihood Functional
Example: Assume that $\varepsilon \sim N(0, \sigma^2)$. Then, for each $y_i$, we have
$$p_{Y_\theta \mid X = x_i}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{|y_i - f_\theta(x_i)|^2}{2\sigma^2}}$$
It follows that the log-probability is
$$-\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N} |y_i - f_\theta(x_i)|^2$$
Statistical models – Additive Models and the Likelihood Functional
We conclude that for the additive model with Gaussian noise:
Maximum likelihood ≡ Least Squares
What other objective functionals may arise this way?
Statistical models – Additive Models and the Likelihood Functional
Let $H : \mathbb{R} \to \mathbb{R}$ be such that
$$\int_{\mathbb{R}} e^{-H(x)}\, dx < \infty, \qquad \int_{\mathbb{R}} e^{-H(x)}\, x\, dx = 0,$$
e.g. if $H(x) = H(-x)$ and $H(x) \ge \alpha_0 |x| - \beta_0$ with $\alpha_0 > 0$.
Then, maximum likelihood amounts to minimizing
$$\hat\theta = \operatorname{argmin}_\theta \sum_{i=1}^{N} H(y_i - f_\theta(x_i))$$
that is, provided $\varepsilon$ has a Gibbs distribution $Z_H^{-1} e^{-H(x)}$.
Statistical models – Other Objective Functionals?
Of course, it is less obvious if other objective functionals of interest arise in the same way:
$$\frac{1}{N}\sum_{i=1}^{N} |x_i \cdot \beta - y_i|^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
$$\frac{1}{N}\sum_{i=1}^{N} |x_i \cdot \beta - y_i|^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
$$\frac{1}{N}\sum_{i=1}^{N} |f(x_i) - y_i|^2 + \lambda \int_{-L}^{L} |f''(x)|^2\, dx$$
This question we will not explore, but we will be studying these functionals, and will be returning to maximum likelihood.
Structured regression
A way to get around the curse of dimensionality
1. Roughness penalty
2. Kernel methods
3. Basis functions
The third item above we discussed last class, where we looked at restricted families of the form
$$f_\theta(x) = \sum_{i=1}^{M} \theta_i h_i(x)$$
As we are about to see, these three methods are closely connected.
Structured regression – Penalizing lack of regularity
Let us revisit roughness penalization, where to N data points $(x_i, y_i)$ we associate the functional
$$\mathcal{J}(f) := \frac{1}{N}\sum_{i=1}^{N} |y_i - f(x_i)|^2 + \lambda \int |f''(x)|^2\, dx$$
Structured regression – Penalizing lack of regularity
We let $\hat f$ be the minimizer of $\mathcal{J}(f)$, where
$$\mathcal{J}(f) := \frac{1}{N}\sum_{i=1}^{N} |y_i - f(x_i)|^2 + \lambda \int |f''(x)|^2\, dx$$
Theorem: There exists a function K(x, y) such that
$$\hat f(x) = \frac{1}{N}\sum_{i=1}^{N} K(x, x_i)\, y_i \quad \forall\, x.$$
See B. Silverman, Annals of Statistics, 1984.
Structured regression – Penalizing lack of regularity
This theorem follows, in turn, from another basic fact about the minimizer $\hat f$: it may be written as
$$\hat f(x) = \sum_{i=1}^{M} \theta_i h_i(x)$$
where the $\theta_i$ can be computed explicitly from the data $(x_i, y_i)$, and the $h_i$ form a certain family of piecewise polynomial functions, known as cubic splines, which depend only on the $x_i$ and not the $y_i$.
Structured regression – Penalizing lack of regularity
That this is so follows easily from the exercises in the first problem set. One of them corresponds to the following observation: if
$$g : [a, b] \to \mathbb{R}$$
minimizes, among all g's with fixed values at a and b,
$$\int_a^b |g''(t)|^2\, dt$$
then g must be a third order polynomial in [a, b].
Structured regression – Penalizing lack of regularity
For the functional $\mathcal{J}$, we have that minimizers need to be piecewise third order polynomials, with the polynomial possibly changing from the interval $(x_i, x_{i+1})$ to $(x_{i+1}, x_{i+2})$ (assuming that we have labeled the $x_i$ in increasing order).
A bit more work will show that this solution is also such that its second order derivative is continuous across each $x_i$.
Structured regression – Kernels & Local Regression
Now, to motivate the kernel expression, let us introduce the idea of a local average
$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2$$
The idea is that $K_\lambda(x_0, x_i)$ is very small when $|x_0 - x_i|$ is sufficiently large, this being controlled by the parameter $\lambda$.
Thus, one does, at each point $x_0$, a least squares fit adapted to the points close to $x_0$.
Structured regression – Kernels & Local Regression
Gaussian weights with variance $\lambda$
$$K_\lambda(x, y) = \frac{1}{(2\pi\lambda)^{p/2}}\, e^{-\frac{|x - y|^2}{2\lambda}}$$
The Epanechnikov quadratic kernel
$$K_\lambda(x, y) = D(\lambda^{-1}|x - y|), \qquad \text{where } D(t) = \tfrac{3}{4}(1 - t^2)\,\chi_{[-1,1]}(t)$$
Structured regression – Kernels & Local Regression
$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2$$
Two classes of functions
$$f_\theta(x) := \theta \quad \text{(locally constant)}, \qquad f_\theta(x) := \theta_0 + \theta_1 \cdot x \quad \text{(locally linear)}$$
Structured regression – Kernels & Local Regression
In the first case, $f_\theta$ is constant, and
$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - \theta)^2$$
This is easily seen to be a quadratic polynomial in $\theta$; differentiating to obtain the minimum $\hat\theta$, one arrives at
$$\sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - \hat\theta) = 0 \;\Rightarrow\; \hat\theta = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
Structured regression – Kernels & Local Regression
This is known as the Nadaraya-Watson weighted average
$$\hat\theta = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
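A minimal sketch of the Nadaraya-Watson average in one dimension, using the Epanechnikov kernel defined earlier (the data below are made up for illustration):

import numpy as np

def epanechnikov(t):
    # D(t) = 3/4 (1 - t^2) on [-1, 1]
    return 0.75 * (1 - t ** 2) * (np.abs(t) <= 1)

def nadaraya_watson(x0, x, y, lam):
    # locally constant fit: kernel-weighted average of the y_i near x0
    w = epanechnikov(np.abs(x0 - x) / lam)           # K_lambda(x0, x_i)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)

print(nadaraya_watson(0.25, x, y, lam=0.1), np.sin(2 * np.pi * 0.25))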
Structured regression – Kernels & Local Regression
In the second case, we have, for $\theta = (\theta_0, \theta_1) \in \mathbb{R}^{p+1}$,
$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - \theta_0 - \theta_1 \cdot x_i)^2$$
This turns out to be a quadratic polynomial in $\theta_0$ and the components of $\theta_1$; computing the gradient as we did for least squares, we obtain this time the formula
$$(\hat\theta_0(x_0), \hat\theta_1(x_0)) = (X^tK(x_0)X)^{-1}X^tK(x_0)y$$
Structured regression – Kernels & Local Regression
The corresponding prediction $\hat f(x_0)$ is then given by
$$\hat f(x_0) = \sum_{i=1}^{N} \ell_i(x_0)\, y_i$$
where the coefficients $\ell_i(x_0)$ can be computed from $x_0$ and the components of the matrix $(X^tK(x_0)X)^{-1}X^tK(x_0)$.
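And a sketch of the locally linear fit in one dimension (again with made-up data), following the closed form above; the design matrix here carries a leading column of ones so that $\theta_0$ plays the role of the intercept, an assumption about the construction that is implicit in the slides.

import numpy as np

def epanechnikov(t):
    return 0.75 * (1 - t ** 2) * (np.abs(t) <= 1)

def local_linear(x0, x, y, lam):
    # locally linear prediction  theta0(x0) + theta1(x0) * x0
    K = np.diag(epanechnikov(np.abs(x0 - x) / lam))     # K(x0), diagonal weight matrix
    X = np.column_stack([np.ones_like(x), x])           # rows (1, x_i): intercept column assumed
    theta = np.linalg.solve(X.T @ K @ X, X.T @ K @ y)   # (theta0(x0), theta1(x0))
    return theta[0] + theta[1] * x0

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)

print(local_linear(0.25, x, y, lam=0.1), np.sin(2 * np.pi * 0.25))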
The second week, in one slide
1. The Curse of Dimensionality means high dimensional problems are in general computationally intractable, and in particular hampers the usability of k-NN.
2. If further structure is available (e.g. linearity) the curse of dimensionality may be avoided.
3. When selecting a predictive model, one may judge it by looking at the training error and the test error. Of these two, the test error is by far the better evaluation metric.
4. A biased algorithm is often preferable to an unbiased one: the tradeoff may lead to a reduced mean squared error.
5. There are many choices to restrict the function classes for regression: roughness penalization, basis functions, kernel methods, local regression. . .