Curse of Dimensionality and Statistical Models
MATH 697 AM:ST
Tuesday, September 12th
Warmup – Parameter inference
Let $Y_1, \dots, Y_N$ be distributed via some law $f_y$ depending on $y$.
An estimator for $y$ is any random variable of the form
$$\hat Y = g(Y_1, \dots, Y_N)$$
The Bias and Mean Squared Error (MSE) of $\hat Y$ are defined as
$$\mathrm{Bias}(\hat Y) := E[\hat Y] - y, \qquad \mathrm{MSE}(\hat Y) := E[|\hat Y - y|^2]$$
Warmup – Parameter inference
An estimator with $\mathrm{Bias}(\hat Y) = 0$ is called unbiased.
Observe that
$$\mathrm{MSE}(\hat Y) = \mathrm{Var}(\hat Y) + (\mathrm{Bias}(\hat Y))^2$$
When using the MSE to judge an estimator, one sees there is a balance between bias and variance. Being unbiased is often a desirable property; however, some estimators are biased yet have such a small variance that their MSE is smaller in practice, and are thus preferred.
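For completeness, here is a one-line check of the bias-variance identity above (a standard computation, treating the parameter $y$ as deterministic):
$$\begin{aligned}
\mathrm{MSE}(\hat Y) &= E\big[\,|\hat Y - E[\hat Y] + E[\hat Y] - y|^2\,\big] \\
&= E\big[\,|\hat Y - E[\hat Y]|^2\,\big] + 2\,(E[\hat Y] - y)\,E\big[\hat Y - E[\hat Y]\big] + (E[\hat Y] - y)^2 \\
&= \mathrm{Var}(\hat Y) + 0 + (\mathrm{Bias}(\hat Y))^2.
\end{aligned}$$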
Warmup – The most basic estimators
Fix a sequence of i.i.d. variables $X_1, X_2, \dots$
The following is the (obvious) estimator for the mean
$$M_N := \frac{1}{N}\sum_{i=1}^N X_i$$
and the two estimators below are used for the variance
$$S_N^2 := \frac{1}{N}\sum_{i=1}^N (X_i - M_N)^2, \qquad \tilde S_N^2 := \frac{1}{N-1}\sum_{i=1}^N (X_i - M_N)^2$$
Warmup – A consequence of the Strong Law of Large Numbers
Denote the common mean and variance of the $X_i$ by $\mu$ and $\sigma^2$. Then, for all large $N$ we have, with probability $\approx 1$,
$$M_N \approx \mu, \qquad S_N^2 \approx \sigma^2, \qquad \tilde S_N^2 \approx \sigma^2$$
It is thus that if $x_1, x_2, \dots$ are a particular sample of the sequence $X_1, X_2, \dots$, then we may guess $\mu$ and $\sigma^2$ via
$$\frac{1}{N}\sum_{i=1}^N x_i, \qquad \frac{1}{N}\sum_{i=1}^N (x_i - m_N)^2, \;\dots$$
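As a quick illustration (a minimal numerical sketch, not part of the original slides): averaging the two variance estimators over many samples shows that $S_N^2$ is biased downward by the factor $(N-1)/N$, while $\tilde S_N^2$ is unbiased.

import numpy as np

rng = np.random.default_rng(0)
N, trials = 10, 100_000
mu, sigma2 = 1.0, 4.0                      # true mean and variance

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
m = x.mean(axis=1, keepdims=True)          # M_N for each sample
ss = ((x - m) ** 2).sum(axis=1)            # sum of squared deviations from M_N

print((ss / N).mean())        # about (N-1)/N * sigma2 = 3.6   (biased)
print((ss / (N - 1)).mean())  # about sigma2 = 4.0             (unbiased)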
Warmup –from last time
Expected (Squared) Prediction Error (EPE)
$$\mathrm{EPE}(f) := E[|f(X) - Y|^2]$$
and the conditional EPE at $x_0$
$$\mathrm{EPE}(f, x_0) := E[|f(X) - Y|^2 \mid X = x_0]$$
Warmup –from last time
Looking for the best linear predictor, we sought to minimize the EPE within $f$ of the form $x \cdot \beta$, leading to the equation
$$E[X(X \cdot \beta)] = E[Y X]$$
In other words, for $i = 1, \dots, p$, we have
$$\sum_{j=1}^{p} E[X_i X_j]\,\beta_j = E[Y X_i]$$
Warmup –from last time
Let $(x_1, y_1), \dots, (x_N, y_N)$ be a sample. Then, the Strong Law of Large Numbers says that
$$E[X_i X_j] \approx \frac{1}{N}\sum_{l=1}^{N} x_{li}x_{lj}, \qquad E[Y X_i] \approx \frac{1}{N}\sum_{l=1}^{N} y_l x_{li}$$
Then,
$$X^t X \beta \approx X^t y$$
with the $\approx$ becoming $=$ in the limit $N \to \infty$.
Warning: A source of confusion
A few things to always keep in mind when studying the statistical properties of a predictive model:
• What is a random variable and what isn’t?
• What are you sampling from?
• What are you taking the expectation with respect to?
Curse of Dimensionality
In view of all that we have said in the foregoing sections, the many obstacles we appear to have surmounted, what casts the pall over our victory celebration?
It is the curse of dimensionality, a malediction that has plagued the scientist from earliest days.
Richard Bellman in Adaptive Control Processes, p. 94, 1961.
Curse of Dimensionality
It is worth looking at k-NN and least squares in high dimensions.
Curse of Dimensionality – O fortuna
First, a quick observation regarding sampling.
The p-dimensional cube $Q_\ell(0) := [-\ell/2, \ell/2]^p$ has volume
$$\ell^p$$
while the cube $Q_{\alpha\ell}(0)$ for, say, $\alpha \in (0, 1)$ has volume
$$\alpha^p \ell^p$$
Curse of Dimensionality – Sors immanis. . .
The proportion of points of $Q_\ell(0)$ which lie in $Q_{\alpha\ell}(0)$ is
$$\alpha^p$$
and, for fixed $\alpha$, this becomes very small when $p$ is large.
Curse of Dimensionality – . . . et inanis
In high dimensions, the mass of a cube concentrates near its boundary.
One cannot escape this fate by resorting to some of the most well known norms: it is true of the Euclidean ball too, for instance. It is clearly a consequence of the fact that the volume scales like length to the power p.
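A minimal numerical sketch of this concentration (not part of the original slides): sample points uniformly in $Q_1(0)$ and count how many land in the inner cube $Q_\alpha(0)$; the empirical fraction matches $\alpha^p$ and collapses as $p$ grows.

import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.9, 100_000
for p in (2, 10, 50, 100):
    pts = rng.uniform(-0.5, 0.5, size=(n, p))        # uniform samples in Q_1(0)
    inside = (np.abs(pts) <= alpha / 2).all(axis=1)  # falls inside Q_alpha(0)
    print(p, inside.mean(), alpha ** p)              # empirical fraction vs alpha^p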
Curse of Dimensionality – k-NN
This hurts k-NN in large dimensions.
Let us apply, for instance, 1-NN to y and x whose functional relation is deterministic, and known, say
$$y = f(x) = e^{-5|x|^2}, \qquad x \in \mathbb{R}^p, \quad p \ge 10$$
A computer computation tells you that
$$f(x) \le 0.3 \quad \text{if } |x| \ge 0.5$$
Curse of Dimensionality – k-NN
Let us draw, at random, uniformly in $Q_1(0)$, points $X_1, \dots, X_N$. Then, the 1-NN estimate for y at x = 0 is
$$\hat y = e^{-5|X_{k^*}|^2}$$
$k^*$ being such that $X_{k^*}$ has the smallest norm among $X_1, \dots, X_N$.
Question: How large does N need to be for $\hat y$ to have a reasonable chance (say $\ge 0.5$) of being closer to 1 than to 0?
Answer: One needs to choose $N \ge C^p$ for a certain $C > 1$.
Curse of Dimensionality – k-NN
The moral of the story is that the training set we need for 1-NN in some instances must have a number of elements which grows exponentially with the dimension of the problem.
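A small Monte Carlo sketch of this (an illustration, assuming the setup above: $f(x) = e^{-5|x|^2}$ and points drawn uniformly in $Q_1(0)$): for each dimension p we estimate the probability that the 1-NN prediction at the origin is closer to 1 than to 0, for a few training-set sizes N.

import numpy as np

rng = np.random.default_rng(0)

def prob_good_prediction(p, N, trials=2000):
    # Estimate P( 1-NN prediction at x = 0 is closer to 1 than to 0 ),
    # i.e. P( exp(-5 * min_k |X_k|^2) >= 0.5 ).
    hits = 0
    for _ in range(trials):
        pts = rng.uniform(-0.5, 0.5, size=(N, p))   # N training points, uniform in Q_1(0)
        r2 = (pts ** 2).sum(axis=1).min()           # squared norm of the nearest point
        hits += np.exp(-5.0 * r2) >= 0.5
    return hits / trials

for p in (2, 5, 10):
    # the N needed for a good prediction grows roughly like C^p
    print(p, [(N, prob_good_prediction(p, N)) for N in (10, 100, 1000)])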
Curse of Dimensionality – Linear model with noise
Not all is lost, however: certain high dimensional problems have been tackled in a computationally practical manner under further assumptions, e.g. linearity.
Let us take the method of least squares as an illustration of this.
Curse of Dimensionality – Linear model with noise
Consider three random variables $X \in \mathbb{R}^p$, $Y, \varepsilon \in \mathbb{R}$, with
$$Y = X \cdot \beta + \varepsilon$$
Assume X and $\varepsilon$ are independent, and that $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$.
Curse of Dimensionality – Linear model with noise
Important: In what follows, we are going to study the statistics of all i.i.d. samples of size N for this model.
So, fix N, and take $(X_1, Y_1, \varepsilon_1), \dots, (X_N, Y_N, \varepsilon_N)$ i.i.d., distributed as X, Y, and $\varepsilon$, and independent from the original variables X, Y, and $\varepsilon$, which will be used as our test variables.
In particular, for each i we have $Y_i = X_i \cdot \beta + \varepsilon_i$; these are not merely real numbers and vectors, but random variables!
Curse of Dimensionality – Linear model with noise
Now, from the $X_1, \dots, X_N$ we construct an $N \times p$ random matrix, defined as follows: for $v \in \mathbb{R}^p$, let
$$Xv := (X_1 \cdot v, \dots, X_N \cdot v) \in \mathbb{R}^N$$
Moreover, we construct two random vectors in $\mathbb{R}^N$,
$$y = (Y_1, \dots, Y_N), \qquad \varepsilon = (\varepsilon_1, \dots, \varepsilon_N)$$
Observe that the relationship between the $X_i$, $Y_i$, and $\varepsilon_i$ becomes
$$y = X\beta + \varepsilon$$
Curse of Dimensionality – Linear model with noise
In summary, we have: a random $N \times p$ matrix X and vectors $y, \varepsilon \in \mathbb{R}^N$ with
$$y = X\beta + \varepsilon$$
where the rows of X are made out of $X_1, \dots, X_N$.
We now bring in the least squares solution, which is given by
$$\hat\beta = (X^tX)^{-1}X^ty$$
Observe: this $\hat\beta$ is a random vector, and a function of the $X_i$, $Y_i$, and $\varepsilon_i$. We intend to use it as an estimator for $\beta$.
Curse of Dimensionality – Linear model with noise
Let's compute the Mean Squared Error associated to $\hat\beta$
$$\mathrm{MSE}(f_{\hat\beta}, x_0) = E[|Y - \hat\beta \cdot x_0|^2 \mid X = x_0]$$
By independence of X and $\varepsilon$, we have
$$\mathrm{MSE}(f_{\hat\beta}, x_0) = E[|(\beta - \hat\beta) \cdot x_0|^2 \mid X = x_0] + E[\varepsilon^2]$$
Curse of Dimensionality – Linear model with noise
Here is where we put to use our random variable setup! Recalling that
$$y = X\beta + \varepsilon \in \mathbb{R}^N$$
where $\varepsilon = (\varepsilon_1, \dots, \varepsilon_N) \in \mathbb{R}^N$, we have that
$$\hat\beta = (X^tX)^{-1}X^ty = (X^tX)^{-1}X^tX\beta + (X^tX)^{-1}X^t\varepsilon = \beta + (X^tX)^{-1}X^t\varepsilon$$
Curse of Dimensionality – Linear model with noise
So, we have
$$\hat\beta = \beta + (X^tX)^{-1}X^t\varepsilon$$
From the independence of $X_1, \dots, X_N$ from $\varepsilon_1, \dots, \varepsilon_N$, it follows that
$$E[(X^tX)^{-1}X^t\varepsilon] = E[(X^tX)^{-1}X^t]\,E[\varepsilon] = 0$$
Therefore $E[\hat\beta] = \beta$, i.e. $\hat\beta$ is an unbiased estimator for $\beta$.
Curse of Dimensionality – Linear model with noise
Back to the MSE, using our newfound formula we have
$$E[|(\beta - \hat\beta) \cdot x_0|^2 \mid X = x_0] = E[|((X^tX)^{-1}X^t\varepsilon,\, x_0)|^2 \mid X = x_0]$$
and therefore,
$$\mathrm{MSE}(\hat\beta, x_0) = E[|((X^tX)^{-1}X^t\varepsilon,\, x_0)|^2 \mid X = x_0] + \sigma^2$$
Curse of Dimensionality – Linear model with noise
Following the last identity with some meticulous calculations, one can see that
$$\mathrm{MSE}(\hat\beta, x_0) = \sigma^2\, E[((X^tX)^{-1}x_0,\, x_0) \mid X = x_0] + \sigma^2$$
Further, with a bit more work, it can be shown that in fact
$$\mathrm{MSE}(\hat\beta) = \frac{p}{N}\,\sigma^2 + \sigma^2$$
Voila! The growth is only linear in the dimension p, and the number of samples N only needs to be linear in p to keep the MSE reasonably small.
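A Monte Carlo sanity check of this formula (a sketch under illustrative assumptions: the coordinates of X are i.i.d. standard normal and $\varepsilon \sim N(0, \sigma^2)$); the test error over fresh draws of (X, Y) should land near $\sigma^2(1 + p/N)$.

import numpy as np

rng = np.random.default_rng(0)
p, N, sigma = 20, 200, 1.0
beta = rng.normal(size=p)                    # some "true" coefficients
x_test = rng.normal(size=(50_000, p))        # test inputs, same law as X

errors = []
for _ in range(200):                         # resample the training set
    X = rng.normal(size=(N, p))
    y = X @ beta + rng.normal(scale=sigma, size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # least squares estimator
    y_test = x_test @ beta + rng.normal(scale=sigma, size=len(x_test))
    errors.append(np.mean((y_test - x_test @ beta_hat) ** 2))

print(np.mean(errors))                       # observed test MSE
print(sigma ** 2 * (1 + p / N))              # predicted: sigma^2 * (1 + p/N) = 1.1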
Thus, with extra structure, the curse of dimensionality was circumvented.
A lot of ongoing research in ML entails developing ways of getting around the curse of dimensionality in situations where linearity is not a good approximation.
Statistical models
Having met with the curse of dimensionality and seen the type of problems lying ahead for our learning methods, let's discuss a few structures which, together with the formalism of the EPE, lead to many important algorithms.
We continue to think in terms of two random variables X and Y, for which we seek to understand a deterministic relation of the form Y = f(X). We already saw the EPE leads to
$$\hat f(x) = E[Y \mid X = x]$$
Statistical models – Additive Noise Model
Let X and $\varepsilon$ be independent random variables, with $X \in \mathbb{R}^p$, and $\varepsilon \in \mathbb{R}$ being such that $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$.
We assume there is some f (the exact form of which is unknown), and consider the random variable Y given by
$$Y = f(X) + \varepsilon$$
Problem: Knowing this structure, and knowing as little as possible about f, can we estimate f by sampling X and Y?
Statistical models – Additive Noise Model
Note that for the additive noise model we have
$$\hat f(x) = E[Y \mid X = x] = f(x).$$
If f does not oscillate nor change too rapidly, k-NN could be quite adept at recovering f(x) from training data $T = \{(x_i, y_i)\}$.
(From the curse of dimensionality, the larger the dimension, the more rapid changes in f affect the computations.)
Statistical models – Restricted function classes
One way to impose further structure on our problem is by restricting the class of functions over which one minimizes the EPE.
We have already done this in two cases: linear regression and k-NN. The former was too restrictive, the latter too unstable.
Statistical models – Restricted function classes
In between these extremes we have classes of functions which, while still restricted, go far beyond linear functions in complexity.
Often, a vector $\theta$ is used to parametrize the space of functions under consideration. In such a case, we shall write
$$f(x, \theta) = f_\theta(x).$$
Then minimizing $\mathrm{EPE}(f_\theta)$ becomes a (possibly nonlinear, or even non-convex) minimization problem.
Statistical models – Restricted function classes
Example: The sigmoid function with slope $\beta$,
$$h_\beta(x) = \sigma(x \cdot \beta) = \frac{1}{1 + e^{-x \cdot \beta}}, \qquad \text{for } \beta \in \mathbb{R}^p$$
here, $\sigma$ denotes the well known sigmoid function
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
Structured regression – Restricted function classes
Example: One may use linear basis expansions, which means that one first preselects and fixes a family of functions
$$h_1(x), \dots, h_M(x) : \mathbb{R}^p \to \mathbb{R}$$
Then, for $\theta \in \mathbb{R}^M$, we set
$$f_\theta(x) := \sum_{m=1}^{M} \theta_m h_m(x)$$
If the $h_m(x) = x_m$ are the components of the vector x, then minimizing $\mathrm{EPE}(f_\theta)$ gives back the least squares method.
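A small sketch (not from the slides) of a linear basis expansion fitted by least squares, using polynomial basis functions $h_m(x) = x^m$ in one dimension as an illustrative choice; minimizing the empirical version of the EPE over $\theta$ is then an ordinary least squares problem in the expanded features.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + 0.1 * rng.normal(size=100)            # made-up training data

M = 5
H = np.column_stack([x ** m for m in range(1, M + 1)])    # columns h_m(x_i) = x_i^m
theta = np.linalg.lstsq(H, y, rcond=None)[0]              # least squares over theta

def f_theta(x_new):
    # evaluate the fitted expansion  sum_m theta_m * h_m(x_new)
    return sum(t * x_new ** (m + 1) for m, t in enumerate(theta))

print(f_theta(0.5), np.sin(3 * 0.5))                      # fitted value vs true f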
Structured regression – Basis functions
Example: A single layer, feed-forward, neural net*
$$f_\theta(x) := \sum_{m=1}^{M} \beta_m\, \sigma(\alpha_m \cdot x + b_m)$$
here the parameters allow one to change both the slope and location of the sigmoids, by choosing the $\alpha_m$'s and $b_m$'s, and their amplitude, by changing the $\beta_m$'s.
Accordingly, we have $\theta = (\beta_1, \alpha_1, b_1, \dots, \beta_M, \alpha_M, b_M)$.
*also known as basis pursuit.
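As a sketch of this parametrization (the choice of M, p, and of the parameter values below is purely illustrative):

import numpy as np

def sigma(t):
    # the sigmoid from the previous slide
    return 1.0 / (1.0 + np.exp(-t))

def f_theta(x, alphas, bs, betas):
    # single hidden layer:  sum_m  beta_m * sigma(alpha_m . x + b_m)
    return sum(beta_m * sigma(alpha_m @ x + b_m)
               for alpha_m, b_m, beta_m in zip(alphas, bs, betas))

rng = np.random.default_rng(0)
M, p = 3, 2                          # illustrative sizes
alphas = rng.normal(size=(M, p))     # slopes / directions of the sigmoids
bs = rng.normal(size=M)              # locations
betas = rng.normal(size=M)           # amplitudes

print(f_theta(np.array([0.5, -1.0]), alphas, bs, betas))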
Next class
1. More on EPE(fθ)
2. Other functionals besides squared error.
3. Further restricted function classes (often equivalent to the ones seen today): kernel methods and roughness penalization.
4. Model selection and more on bias versus variance.
Reminders: 1) Don't forget the first problem set is due Thursday of the coming week, and the second problem set is due the Thursday after that. 2) Check your email today for a questionnaire – you must also use it to tell me who will be your project partner.
Statistical Models and Model Selection
MATH 697 AM:ST
Thursday, September 14th
Warm up
Today, let us start by discussing, briefly, two simple ideas
• Maximum likelihood as a statistical estimation method.
• How least squares changes when one has a weighted sum.
Warm up I – Maximum Likelihood
Let $Y_1, \dots, Y_N$ be random variables with joint distribution
$$p(y_1, \dots, y_N; \theta)$$
with a parameter vector $\theta = (\theta_1, \dots, \theta_M)$, for some M.
Example: The $Y_i$ are i.i.d. Normals in $\mathbb{R}^p$ with $\mathrm{Cov}(Y_i) = \sigma^2 I$ and mean $\theta$,
$$p(y_1, \dots, y_N; \theta) = \prod_{i=1}^{N} \frac{1}{(2\pi\sigma^2)^{p/2}}\, e^{-\frac{|y_i - \theta|^2}{2\sigma^2}}$$
Warm up I – Maximum Likelihood
Given a sample $y_1, y_2, \dots, y_N$, choose $\theta$ so as to maximize their probability, that is, take
$$\hat\theta := \operatorname{argmax}_\theta\; p(y_1, \dots, y_N; \theta)$$
This $\hat\theta$ is sometimes called the maximum likelihood estimator. It corresponds to the idea that if $y_1, \dots, y_N$ were observed, then it is a good guess to choose the $\theta$ that made this observation most likely.
This intuition is further justified by the same idea behind the Strong Law of Large Numbers: if N is large, then the above estimation should be more accurate.
Warm up I – Maximum Likelihood
If the $Y_i$ are independent, what we are maximizing is a product
$$p(y_1, \dots, y_N; \theta) = \prod_{i=1}^{N} p(y_i; \theta)$$
This leads naturally to the idea of log-probability, since
$$\log\big(p(y_1, \dots, y_N; \theta)\big) = \sum_{i=1}^{N} \log\big(p(y_i; \theta)\big)$$
Warm up I – Maximum Likelihood
Therefore, it is clear in this case that maximizing
$$\prod_{i=1}^{N} p(y_i; \theta)$$
is exactly the same as maximizing the sum
$$\sum_{i=1}^{N} \log\big(p(y_i; \theta)\big)$$
Warm Up I – Maximum Likelihood
Example: Following our previous example, given a sample $y_1, \dots, y_N$ we seek $\theta$ which maximizes
$$\sum_{i=1}^{N} \log\big(p(y_i; \theta)\big) = -\frac{Np}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N} |y_i - \theta|^2$$
Warm Up I – Maximum Likelihood
Example (continued): But this is the same as minimizing
$$\frac{1}{2\sigma^2}\sum_{i=1}^{N} |y_i - \theta|^2$$
It follows that $\hat\theta$ is given by averaging the $y_i$
$$\hat\theta = \frac{1}{N}\sum_{i=1}^{N} y_i$$
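A quick numerical check of this example (a sketch, taking p = 1 and a known $\sigma$ for simplicity): numerically maximizing the Gaussian log-likelihood over $\theta$ lands on the sample average.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
sigma2 = 1.5 ** 2
y = rng.normal(loc=2.0, scale=1.5, size=50)        # a sample y_1, ..., y_N

def neg_log_likelihood(theta):
    # minus the Gaussian log-probability, up to an additive constant
    return np.sum((y - theta) ** 2) / (2 * sigma2)

theta_hat = minimize(neg_log_likelihood, x0=0.0).x[0]
print(theta_hat, y.mean())                          # the two agree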
Warm Up II – Weighted Least Squares
We are given the following:
Data (x1, y1), . . . , (xN , yN )
“weights” K1, . . . ,KN all positive numbers.
Then, let us see how different it is to minimize the functional
$$J(\beta) = \sum_{i=1}^{N} K_i\, |y_i - x_i \cdot \beta|^2$$
which yields back the usual least squares if K1 = . . . = KN = 1.
Warm Up II – Weighted Least Squares
What is different?
As usual for least squares, let us define y and X by
y = (y1, . . . , yN ), Xβ = (x1 · β, . . . , xN · β),
and let K be the N ×N diagonal matrix whose ii-th entry is Ki.
With this notation, it turns out that
J(β) = (K(Xβ − y), (Xβ − y))
Warm Up II – Weighted Least Squares
A responsible use of both linear algebra & the chain rule yields
$$\nabla J(\beta) = 2X^tK(X\beta - y)$$
Thus, the minimizer(s) $\hat\beta$ solve
$$X^tK(X\hat\beta - y) = 0,$$
and, provided we can invert $X^tKX$, the minimizer is unique, and given by
$$\hat\beta = (X^tKX)^{-1}X^tKy$$
Warm Up II – Weighted Least Squares
It is rather satisfactory to see that the formula for $\hat\beta$
$$\hat\beta = (X^tKX)^{-1}X^tKy$$
arises from a minor modification of the standard formula
$$\hat\beta = (X^tX)^{-1}X^ty$$
Exercise: Can one easily check the invertibility of $X^tKX$?
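A compact numerical sketch of the weighted least squares formula above (the data and weights below are made up for illustration); setting K = I recovers ordinary least squares.

import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=N)
K = np.diag(rng.uniform(0.5, 2.0, size=N))          # positive weights K_1, ..., K_N

# beta_hat = (X^t K X)^{-1} X^t K y
beta_hat = np.linalg.solve(X.T @ K @ X, X.T @ K @ y)
print(beta_hat)

# with K = I we recover ordinary least squares
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)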
Statistical models – Additive Models and the Likelihood Functional
We continue to study the additive model: three random variables X, Y and $\varepsilon$ satisfy the relation
$$Y = f(X) + \varepsilon$$
for some unknown f. X and $\varepsilon$ are assumed independent, and
$$E[\varepsilon] = 0.$$
Our purpose, as always, is to estimate f at some x from a given sample of N observations $(x_i, y_i)$, where $y_i = f(x_i) + \varepsilon_i$.
Statistical models – Additive Models and the Likelihood Functional
As we have seen, it follows from our assumptions that
$$E[Y \mid X] = f(X)$$
(and, when $\varepsilon \sim N(0, \sigma^2)$, $p_{Y\mid X}(y)$ is $N(f(X), \sigma^2)$).
This is more or less where we were last time, before we started our discussion of some families of restricted functions.
Statistical models – Additive Models and the Likelihood Functional
Thus, picking up from last time, we restrict ourselves to a (parametrized) family of functions, labeled by a parameter $\theta$.
This does not mean the unknown f belongs to this family, nor that we expect it to. We make no assumption on whether the function f, which we seek to learn, is among the $f_\theta$ . . .
. . . we merely content ourselves with finding the "best" approximation to f among the $f_\theta$.
Of course, "best approximation" is up to us to define!
Statistical models – Additive Models and the Likelihood Functional
Remember the Bayesian classifier: If Y is a categorical variable,
$$\hat f(x) = \operatorname{argmax}_g\; P[Y = g \mid X = x]$$
A similar idea yields a criterion to choose a best fit $\hat f$ for given training data, where the best fit $\hat f$ is sought within a predetermined class of parametrized functions $f_\theta$.
Statistical models – Additive Models and the Likelihood Functional
Motivated by this, we consider a predictor for the additive model $Y_\theta = f_\theta(X) + \varepsilon$ where we also have
$f_\theta$ a family of functions parametrized by $\theta$
$x_1, \dots, x_N$ a fixed training set $\subset \mathbb{R}^p$
One possible choice for the "best $f_\theta$" is to choose $\theta$ which maximizes the likelihood of the observed data. . .
Statistical models – Additive Models and the Likelihood Functional
. . . which is done as follows: given a sample $y_1, \dots, y_N$, we set
$$\hat\theta := \operatorname{argmax}_\theta \sum_{i=1}^{N} \log\big(p_{Y_\theta \mid X = x_i}(y_i)\big)$$
and then let, accordingly, $\hat f(x) = f_{\hat\theta}(x)$.
Statistical models – Additive Models and the Likelihood Functional
Example: Assume that $\varepsilon \sim N(0, \sigma^2)$. Then, for each $y_i$, we have
$$p_{Y_\theta \mid X = x_i}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{|y_i - f_\theta(x_i)|^2}{2\sigma^2}}$$
It follows that the log-probability is
$$-\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N} |y_i - f_\theta(x_i)|^2$$
Statistical models – Additive Models and the Likelihood Functional
We conclude that for the additive model with Gaussian noise:
Maximum likelihood ≡ Least Squares
What other objective functionals may arise this way?
Statistical models – Additive Models and the Likelihood Functional
Let $H : \mathbb{R} \to \mathbb{R}$ be such that
$$\int_{\mathbb{R}} e^{-H(x)}\, dx < \infty, \qquad \int_{\mathbb{R}} e^{-H(x)}\, x\, dx = 0,$$
e.g. if $H(x) = H(-x)$ and $H(x) \ge \alpha_0 |x| - \beta_0$ with $\alpha_0 > 0$.
Then, maximum likelihood amounts to minimizing
$$\hat\theta = \operatorname{argmin}_\theta \sum_{i=1}^{N} H(y_i - f_\theta(x_i))$$
that is, provided $\varepsilon$ has a Gibbs distribution $Z_H^{-1} e^{-H(x)}$.
Statistical models – Other Objective Functionals?
Of course, it is less obvious if other objective functionals of interest arise in the same way:
$$\frac{1}{N}\sum_{i=1}^{N} |x_i \cdot \beta - y_i|^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
$$\frac{1}{N}\sum_{i=1}^{N} |x_i \cdot \beta - y_i|^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
$$\frac{1}{N}\sum_{i=1}^{N} |f(x_i) - y_i|^2 + \lambda \int_{-L}^{L} |f''(x)|^2\, dx$$
This question we will not explore, but we will be studying these functionals, and will be returning to maximum likelihood.
Structured regression
A way to get around the curse of dimensionality
1. Roughness penalty
2. Kernel methods
3. Basis functions
The third item above we discussed last class, where we looked at restricted families of the form
$$f_\theta(x) = \sum_{i=1}^{M} \theta_i h_i(x)$$
As we are about to see, these three methods are closely connected.
Structured regression – Penalizing lack of regularity
Let us revisit roughness penalization, where to N data points $(x_i, y_i)$ we associate the functional
$$\mathcal{J}(f) := \frac{1}{N}\sum_{i=1}^{N} |y_i - f(x_i)|^2 + \lambda \int |f''(x)|^2\, dx$$
Structured regression – Penalizing lack of regularity
We let $\hat f$ be the minimizer of $\mathcal{J}(f)$, where
$$\mathcal{J}(f) := \frac{1}{N}\sum_{i=1}^{N} |y_i - f(x_i)|^2 + \lambda \int |f''(x)|^2\, dx$$
Theorem: There exists a function K(x, y) such that
$$\hat f(x) = \frac{1}{N}\sum_{i=1}^{N} K(x, x_i)\, y_i \quad \forall\, x.$$
See B. Silverman, Annals of Statistics, 1984.
Structured regression – Penalizing lack of regularity
This theorem follows, in turn, from another basic fact about the minimizer $\hat f$: it may be written as
$$\hat f(x) = \sum_{i=1}^{M} \theta_i h_i(x)$$
where the $\theta_i$ can be computed explicitly from the data $(x_i, y_i)$, and the $h_i$ form a certain family of piecewise polynomial functions, known as cubic splines, which depend only on the $x_i$ and not the $y_i$.
Structured regression – Penalizing lack of regularity
That this is so follows easily from the exercises in the first problem set. One of them corresponds to the following observation: if
$$g : [a, b] \to \mathbb{R}$$
minimizes, among all g's with fixed values at a and b,
$$\int_a^b |g''(t)|^2\, dt$$
then g must be a third order polynomial in [a, b].
Structured regression – Penalizing lack of regularity
For the functional $\mathcal{J}$, we have that minimizers need to be piecewise third order polynomials, with the polynomial possibly changing from the interval $(x_i, x_{i+1})$ to $(x_{i+1}, x_{i+2})$ (assuming that we have labeled the $x_i$ in increasing order).
A bit more work will show that this solution is also such that its second order derivative is continuous across each $x_i$.
Structured regression – Kernels & Local Regression
Now, to motivate the kernel expression, let us introduce the idea of a local average
$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2$$
The idea is that $K_\lambda(x_0, x_i)$ is very small when $|x_0 - x_i|$ is sufficiently large, this being controlled by the parameter $\lambda$.
Thus, one does, at each point $x_0$, a least squares fit adapted to the points close to $x_0$.
Structured regression – Kernels & Local Regression
Gaussian weights with variance $\lambda$
$$K_\lambda(x, y) = \frac{1}{(2\pi\lambda)^{p/2}}\, e^{-\frac{|x - y|^2}{2\lambda}}$$
The Epanechnikov quadratic kernel
$$K_\lambda(x, y) = D(\lambda^{-1}|x - y|), \qquad \text{where } D(t) = \tfrac{3}{4}(1 - t^2)\,\chi_{[-1,1]}(t)$$
Structured regression – Kernels & Local Regression
$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2$$
Two classes of functions
$$f_\theta(x) := \theta \quad \text{(locally constant)}, \qquad f_\theta(x) := \theta_0 + \theta_1 \cdot x \quad \text{(locally linear)}$$
Structured regression – Kernels & Local Regression
In the first case, $f_\theta$ is constant, and
$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - \theta)^2$$
This is easily seen to be a quadratic polynomial in $\theta$; differentiating to obtain the minimum $\hat\theta$, one arrives at
$$\sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - \hat\theta) = 0 \;\Rightarrow\; \hat\theta = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
Structured regression – Kernels & Local Regression
This is known as the Nadaraya-Watson weighted average
$$\hat\theta = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
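A minimal sketch of the Nadaraya-Watson average in one dimension, using the Epanechnikov kernel defined earlier (the data below are made up for illustration):

import numpy as np

def epanechnikov(t):
    # D(t) = 3/4 (1 - t^2) on [-1, 1]
    return 0.75 * (1 - t ** 2) * (np.abs(t) <= 1)

def nadaraya_watson(x0, x, y, lam):
    # locally constant fit: kernel-weighted average of the y_i near x0
    w = epanechnikov(np.abs(x0 - x) / lam)           # K_lambda(x0, x_i)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)

print(nadaraya_watson(0.25, x, y, lam=0.1), np.sin(2 * np.pi * 0.25))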
Structured regression – Kernels & Local Regression
In the second case, we have, for $\theta = (\theta_0, \theta_1) \in \mathbb{R}^{p+1}$,
$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - \theta_0 - \theta_1 \cdot x_i)^2$$
This turns out to be a quadratic polynomial in $\theta_0$ and the components of $\theta_1$; computing the gradient as we did for least squares, we obtain this time the formula
$$(\hat\theta_0(x_0), \hat\theta_1(x_0)) = (X^tK(x_0)X)^{-1}X^tK(x_0)y$$
Structured regression – Kernels & Local Regression
The corresponding prediction $\hat f(x_0)$ is then given by
$$\hat f(x_0) = \sum_{i=1}^{N} \ell_i(x_0)\, y_i$$
where the coefficients $\ell_i(x_0)$ can be computed from $x_0$ and the components of the matrix $(X^tK(x_0)X)^{-1}X^tK(x_0)$.
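And a sketch of the locally linear fit in one dimension (again with made-up data), following the closed form above; the design matrix here carries a leading column of ones so that $\theta_0$ plays the role of the intercept, an assumption about the construction that is implicit in the slides.

import numpy as np

def epanechnikov(t):
    return 0.75 * (1 - t ** 2) * (np.abs(t) <= 1)

def local_linear(x0, x, y, lam):
    # locally linear prediction  theta0(x0) + theta1(x0) * x0
    K = np.diag(epanechnikov(np.abs(x0 - x) / lam))     # K(x0), diagonal weight matrix
    X = np.column_stack([np.ones_like(x), x])           # rows (1, x_i): intercept column assumed
    theta = np.linalg.solve(X.T @ K @ X, X.T @ K @ y)   # (theta0(x0), theta1(x0))
    return theta[0] + theta[1] * x0

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)

print(local_linear(0.25, x, y, lam=0.1), np.sin(2 * np.pi * 0.25))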
The second week, in one slide
1. The Curse of Dimensionality means high dimensional problems are in general computationally intractable, and in particular hampers the usability of k-NN.
2. If further structure is available (e.g. linearity) the curse of dimensionality may be avoided.
3. When selecting a predictive model, one may judge it by looking at the training error and the test error. Of these two, the test error is by far the better evaluation metric.
4. A biased algorithm is often preferable to an unbiased one: the tradeoff may lead to a reduced mean squared error.
5. There are many choices to restrict the function classes for regression: roughness penalization, basis functions, kernel methods, local regression. . .