Dimension reduction (1)
Overview, PCA, Factor Analysis, EDR space, SIR
References: Applied Multivariate Analysis; http://www.stat.ucla.edu/~kcli/sir-PHD.pdf
Overview
The purpose of dimension reduction:
Data simplification
Data visualization
Reduce noise (if we can assume only the dominating dimensions are signals)
Variable selection for prediction
Overview
Outcome variable y exists (learning the association rule):
  Data separation: Classification, regression
  Dimension reduction: SIR, Class-preserving projection, Partial least squares
No outcome variable (learning intrinsic structure):
  Data separation: Clustering
  Dimension reduction: PCA, MDS, Factor Analysis, ICA, NCA…
An analogy:
PCA
Explain the variance-covariance structure among a set of random variables by a few linear combinations of the variables;
Does not require normality!
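Reminder of some results for random vectors
Proof of the first (and second) point of the previous slide.
Write the spectral decomposition $A = P \Lambda P'$, with eigenvalues $\lambda_1 \ge \dots \ge \lambda_p$ on the diagonal of $\Lambda$ and eigenvectors $e_1, \dots, e_p$ as the columns of $P$. Let
$$A^{1/2} = P \Lambda^{1/2} P', \qquad y = P' x .$$
Then
$$\frac{x'Ax}{x'x} = \frac{x'A^{1/2}A^{1/2}x}{x'PP'x} = \frac{x'P\Lambda^{1/2}P'P\Lambda^{1/2}P'x}{y'y} = \frac{y'\Lambda y}{y'y} = \frac{\sum_{i=1}^{p} \lambda_i y_i^2}{\sum_{i=1}^{p} y_i^2} \le \lambda_1 \frac{\sum_{i=1}^{p} y_i^2}{\sum_{i=1}^{p} y_i^2} = \lambda_1 .$$
The bound is attained at $x = e_1$:
$$y = P'e_1 = [1, 0, \dots, 0]', \qquad \frac{y'\Lambda y}{y'y} = \lambda_1 = \frac{e_1'Ae_1}{e_1'e_1} .$$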
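The bound in this proof can be checked numerically. The sketch below assumes the result being proved is that x'Ax / x'x never exceeds the largest eigenvalue of a symmetric matrix A and equals it at the leading eigenvector; the matrix, the seed and the helper name `rayleigh` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T  # an arbitrary symmetric positive semi-definite matrix

# eigh returns eigenvalues in ascending order, so the last one is lambda_1.
eigvals, eigvecs = np.linalg.eigh(A)
lam1, e1 = eigvals[-1], eigvecs[:, -1]

def rayleigh(A, x):
    """Quadratic form x'Ax divided by x'x."""
    return (x @ A @ x) / (x @ x)

# Evaluate the ratio at many random directions and at the leading eigenvector.
quotients = np.array([rayleigh(A, x) for x in rng.normal(size=(1000, 4))])
max_random = quotients.max()
at_e1 = rayleigh(A, e1)
```

No random direction should beat `lam1`, while `at_e1` should match it up to rounding.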
PCA
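The eigenvalues are the variance components: the variance of the kth PC is the kth eigenvalue $\lambda_k$.
Proportion of total variance explained by the kth PC: $\lambda_k / (\lambda_1 + \dots + \lambda_p)$.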
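A minimal sketch of these two facts, assuming simulated data (the covariance matrix and sample size are invented): the sample variances of the PC scores match the eigenvalues of the sample covariance matrix, and the proportions explained sum to one.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 500 observations of 3 variables, two of them correlated.
true_cov = [[4.0, 2.0, 0.0], [2.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
X = rng.multivariate_normal([0.0, 0.0, 0.0], true_cov, size=500)

S = np.cov(X, rowvar=False)              # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # ascending order
order = np.argsort(eigvals)[::-1]        # reorder so the largest comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# PC scores: centered data projected on the eigenvectors.
scores = (X - X.mean(axis=0)) @ eigvecs
score_var = scores.var(axis=0, ddof=1)   # should equal the eigenvalues

explained = eigvals / eigvals.sum()      # proportion of variance per PC
```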
PCA using the correlation matrix, instead of the covariance matrix?
This is equivalent to first standardizing all X vectors.
PCA
Using the correlation matrix prevents one X variable from dominating merely because of its scale (choice of units), for example measuring in inches instead of feet. Example:
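A small simulated illustration, with made-up data in which the first variable is multiplied by 12 to mimic a change from feet to inches. It checks that the correlation matrix equals the covariance matrix of the standardized variables, and compares how much variance the first PC claims under each choice.

```python
import numpy as np

rng = np.random.default_rng(2)
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = rng.normal(size=(200, 3)) @ mix   # hypothetical, mildly correlated data
X[:, 0] *= 12.0                       # first variable now in inches, not feet

S = np.cov(X, rowvar=False)
R = np.corrcoef(X, rowvar=False)

# Standardize each column, then take the covariance: this reproduces R.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
S_z = np.cov(Z, rowvar=False)

# Share of total variance claimed by the first PC under each matrix.
share_cov = np.linalg.eigvalsh(S)[-1] / np.trace(S)
share_cor = np.linalg.eigvalsh(R)[-1] / np.trace(R)
```

With the covariance matrix the first PC is essentially the rescaled variable, so `share_cov` is close to one; `share_cor` is not affected by the unit change.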
PCA
Selecting the number of components?
Based on the eigenvalues (% of variation explained). Assumption: the small amount of variation explained by the trailing PCs (those with the smallest eigenvalues) is noise.
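One common way to turn "% of variation explained" into a rule is a cumulative threshold. The sketch below is an illustration only: the function name `n_components_for` and the 80% default are assumptions, not something the slides prescribe.

```python
import numpy as np

def n_components_for(eigvals, threshold=0.8):
    """Smallest k whose leading k eigenvalues explain at least `threshold`
    of the total variance (hypothetical helper, for illustration)."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(vals))  # guard against rounding at threshold = 1.0

# Eigenvalues 4, 3, 2, 1 give cumulative shares 0.4, 0.7, 0.9, 1.0.
k = n_components_for([4.0, 3.0, 2.0, 1.0], threshold=0.8)
```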
Factor Analysis
If we take the first several PCs that explain most of the variation in the data, we have one form of factor model:
$$X - \mu = L F + \varepsilon$$
L: loading matrix
F: unobserved random vector (latent variables)
ε: unobserved random vector (noise)
Factor Analysis
The orthogonal factor model assumes no correlation between the factor random variables: Cov(F) = I.
The noise covariance Cov(ε) = Ψ is a diagonal matrix.
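Under the orthogonal factor model with uncorrelated unit-variance factors and independent noise, the covariance of X is L L' plus the diagonal noise covariance. The simulation below is a sketch with invented loadings, noise variances and mean (`L`, `psi`, `mu` are all hypothetical) that compares the sample covariance with that implied structure.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 100_000, 4, 2

# Hypothetical loadings, noise variances and mean vector.
L = np.array([[0.9, 0.0], [0.8, 0.1], [0.1, 0.7], [0.0, 0.6]])
psi = np.array([0.2, 0.3, 0.4, 0.5])
mu = np.array([1.0, -1.0, 0.5, 0.0])

F = rng.normal(size=(n, m))                  # factors with Cov(F) = I
eps = rng.normal(size=(n, p)) * np.sqrt(psi) # noise with diagonal covariance
X = mu + F @ L.T + eps                       # X - mu = L F + eps, row by row

implied = L @ L.T + np.diag(psi)             # covariance implied by the model
sample = np.cov(X, rowvar=False)
max_gap = np.abs(sample - implied).max()     # small for large n
```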
Factor Analysis
Rotations in the m-dimensional subspace defined by the factors make the solution non-unique: for any orthogonal matrix T, L F = (L T)(T' F), so L T is an equally valid loading matrix.
PCA is one unique solution, as the vectors are sequentially selected. The maximum likelihood estimator is another solution.
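The non-uniqueness can be seen directly: rotating the loadings by any orthogonal matrix leaves L L' and the communalities unchanged. A minimal sketch with an invented loading matrix and an arbitrary rotation angle:

```python
import numpy as np

# Hypothetical 4 x 2 loading matrix.
L = np.array([[0.9, 0.0], [0.8, 0.1], [0.1, 0.7], [0.0, 0.6]])

theta = 0.7  # arbitrary rotation angle in radians
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal: T T' = I

L_rot = L @ T

# The common-factor part of the covariance is the same for both loadings.
same_cov = np.allclose(L @ L.T, L_rot @ L_rot.T)

# Communalities (row sums of squared loadings) are also unchanged.
h2 = (L ** 2).sum(axis=1)
h2_rot = (L_rot ** 2).sum(axis=1)
```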
Factor Analysis
As we said, rotations within the m-dimensional subspace do not change the overall amount of variation explained. Rotate to make the results more interpretable:
Factor Analysis
Varimax criterion:
Find an orthogonal T such that V, computed from the rotated loadings L T, is maximized.
V is proportional to the sum, over factors, of the variances of the squared loadings. Maximizing V spreads the squared loadings out as much as possible: some become very small and some very large.
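A hedged sketch of a varimax rotation. It uses the widely circulated SVD-based iteration rather than any procedure stated in the slides, works on raw loadings (no row scaling by communalities), and takes V as the sum over factors of the variance of the squared loadings. The function names and the test matrix are invented.

```python
import numpy as np

def varimax_criterion(L):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.var(L ** 2, axis=0).sum())

def varimax(L, max_iter=500, tol=1e-8):
    """SVD-based varimax iteration on raw loadings; returns (L @ T, T)."""
    p, m = L.shape
    T = np.eye(m)
    d = 0.0
    for _ in range(max_iter):
        LR = L @ T
        # Gradient of the criterion with respect to the rotated loadings.
        G = LR ** 3 - LR @ np.diag((LR ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ G)
        T = u @ vt                      # nearest orthogonal matrix
        d_new = s.sum()
        if d_new < d * (1 + tol):       # no meaningful improvement
            break
        d = d_new
    return L @ T, T

# Hypothetical simple-structure loadings, deliberately rotated by 0.5 rad.
L_simple = np.array([[0.9, 0], [0.8, 0], [0.7, 0],
                     [0, 0.9], [0, 0.8], [0, 0.7]], dtype=float)
c, s_ = np.cos(0.5), np.sin(0.5)
L_mixed = L_simple @ np.array([[c, -s_], [s_, c]])

L_rotated, T = varimax(L_mixed)
v_before = varimax_criterion(L_mixed)
v_after = varimax_criterion(L_rotated)
```

After rotation each row of `L_rotated` should load mainly on one factor again, which is the "some small, some big" pattern the criterion rewards.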
Orthogonal simple structure rotation: Rotate the orthogonal factors around the origin until the system is maximally aligned with the separate clusters of variables.
Oblique simple structure rotation: Allow the factors to become correlated. Each factor is rotated individually to fit a cluster.
Factor Analysis
MDS
Multidimensional scaling is a dimension reduction procedure that maps the distances between observations to a lower dimensional space.
Minimize this objective function:
D: distance in the original space
d: distance in the reduced dimension space
A numerical method is used for the minimization.
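The objective function itself did not survive in the text, so the sketch below assumes the raw stress, the sum over pairs of (D_ij - d_ij)^2, and minimizes it with SMACOF-style majorization steps (the Guttman transform with unit weights). That is one possible numerical method, not necessarily the one intended; all names and data are hypothetical.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distances between all rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D, Z):
    """Assumed objective: sum over pairs i < j of (D_ij - d_ij(Z))^2."""
    iu = np.triu_indices_from(D, k=1)
    return float(((D[iu] - pairwise_dist(Z)[iu]) ** 2).sum())

def smacof_step(D, Z):
    """One Guttman transform; the stress does not increase."""
    d = pairwise_dist(Z)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(d > 0, D / d, 0.0)
    B = -ratio
    B[np.diag_indices_from(B)] = ratio.sum(axis=1)
    return B @ Z / len(Z)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))      # hypothetical data in 5 dimensions
D = pairwise_dist(X)              # distances in the original space

Z = rng.normal(size=(30, 2))      # random starting configuration in 2-D
initial_stress = stress(D, Z)
for _ in range(100):
    Z = smacof_step(D, Z)
final_stress = stress(D, Z)
```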
EDR space
Now we start talking about regression. The data are {x_i, y_i}.
Is dimension reduction on the X matrix alone helpful here? Possibly, if the dimension reduction preserves the essential structure about Y|X. This is doubtful, because such a reduction never looks at Y.
Effective Dimension Reduction: reduce the dimension of X without losing information which is essential to predict Y.
EDR space
The model: Y is predicted by a set of linear combinations of X:
$$Y = g(\beta_1' x, \beta_2' x, \dots, \beta_K' x, \varepsilon)$$
If g() is known, this is not very different from a generalized linear model.
For the purpose of dimension reduction, is there a scheme which can work on almost any g(), without knowledge of its actual form?
Under this general model, the space B generated by β1, β2, …, βK is called the e.d.r. space.
Reducing to this sub-space causes no loss of information regarding predicting Y.
Similar to factor analysis, the subspace B is identifiable, but the vectors aren’t.
Any non-zero vector in the e.d.r. space is called an e.d.r. direction.
EDR space
This equation assumes almost the weakest form, to reflect the hope that a low-dimensional projection of a high-dimensional regressor variable contains most of the information that can be gathered from a sample of modest size.
It doesn’t impose any structure on how the projected regressor variables affect the output variable.
Most regression models assume K=1, plus additional structures on g().
EDR space
The philosophical point of Sliced Inverse Regression:
the estimation of the projection directions can be a more important statistical issue than the estimation of the structure of g() itself.
After finding a good e.d.r. space, we can project data to this smaller space. Then we are in a better position to identify what should be pursued further: model building, response surface estimation, cluster analysis, heteroscedasticity analysis, variable selection, …
SIR
Sliced Inverse Regression.
In regular regression, our interest is the conditional density h(Y|X). Most important are E(Y|x) and var(Y|x).
SIR treats Y as the independent variable and X as the dependent variable.
Given Y=y, what values will X take?
This takes us from a p-dimensional problem (subject to the curse of dimensionality) back to p one-dimensional curve-fitting problems: E(x_i|y), i=1,…, p
SIR
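$\hat{\Sigma}_\eta$: covariance matrix for the slice means of x, weighted by the slice sizes
$\hat{\Sigma}_x$: sample covariance for the x_i’s
Find the SIR directions by conducting the eigenvalue decomposition of $\hat{\Sigma}_\eta$ with respect to $\hat{\Sigma}_x$:
$$\hat{\Sigma}_\eta \hat{\beta}_k = \hat{\lambda}_k \hat{\Sigma}_x \hat{\beta}_k$$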
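The procedure can be sketched in a few lines, assuming equal-count slices of the sorted y and a generalized eigenproblem solved by Cholesky whitening. The function name `sir_directions`, the slice count and the simulated model are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Sketch of SIR: eigen-decompose the weighted covariance of the slice
    means with respect to the sample covariance of X."""
    n, p = X.shape
    x_bar = X.mean(axis=0)
    S_x = np.cov(X, rowvar=False)

    # Slice the observations by the order of y (roughly equal counts).
    slices = np.array_split(np.argsort(y), n_slices)
    S_eta = np.zeros((p, p))
    for idx in slices:
        m = X[idx].mean(axis=0) - x_bar
        S_eta += (len(idx) / n) * np.outer(m, m)

    # Solve S_eta b = lambda S_x b by whitening with S_x = C C'.
    C = np.linalg.cholesky(S_x)
    C_inv = np.linalg.inv(C)
    w, V = np.linalg.eigh(C_inv @ S_eta @ C_inv.T)
    top = np.argsort(w)[::-1][:n_dirs]
    return w[top], C_inv.T @ V[:, top]

# Hypothetical single-index model: y depends on x only through beta'x.
rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 5))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = (X @ beta) ** 3 + 0.5 * rng.normal(size=2000)

eigvals, B = sir_directions(X, y, n_slices=10, n_dirs=1)
b = B[:, 0]
cosine = abs(b @ beta) / np.linalg.norm(b)  # alignment with the true direction
```

The link function is never used by the estimator, only the ordering of y, which is the point of working with the inverse regression.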
SIR and LDA
Reminder: Fisher’s linear discriminant analysis seeks a projection direction that maximizes class separation. When the underlying distributions are Gaussian, it agrees with the Bayes decision rule. It seeks to maximize:
Between-group variance:
Within-group variance:
The solution is the first eigenvector in this eigenvalue decomposition:
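The formulas are missing here, so the sketch below assumes the usual two-group setup: within-group scatter S_W, between-group scatter S_B, and the leading eigenvector of S_W^{-1} S_B as the discriminant direction. For two groups that eigenvector is proportional to S_W^{-1}(m1 - m2), which the code checks on simulated data (all values hypothetical).

```python
import numpy as np

rng = np.random.default_rng(6)
cov = np.array([[2.0, 0.8], [0.8, 1.0]])     # shared group covariance
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
X2 = rng.multivariate_normal([2.0, 1.0], cov, size=300)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
m = np.vstack([X1, X2]).mean(axis=0)

# Within-group and between-group scatter matrices.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
S_B = len(X1) * np.outer(m1 - m, m1 - m) + len(X2) * np.outer(m2 - m, m2 - m)

# Leading eigenvector of S_W^{-1} S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

# Closed form for two groups.
w_closed = np.linalg.solve(S_W, m1 - m2)
cosine = abs(w @ w_closed) / (np.linalg.norm(w) * np.linalg.norm(w_closed))
```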
If we let , the LDA agrees with SIR up to a scaling.
SIR and LDA
Multi-class LDA
Structure-preserving dimension reduction in classification.
Within-class scatter:
Between-class scatter:
Mixture scatter:
a: observations, c: class centers
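The scatter formulas are not legible in the text; the sketch below assumes the standard definitions, with `A` holding observations as rows and class centers computed as class means. Under those definitions the mixture scatter equals the sum of the within-class and between-class scatters, which the code verifies on invented three-class data.

```python
import numpy as np

rng = np.random.default_rng(7)
centers = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
# Hypothetical data: 50 observations around each of 3 class centers.
A = np.vstack([rng.normal(loc=c, size=(50, 3)) for c in centers])
labels = np.repeat(np.arange(3), 50)

c_all = A.mean(axis=0)            # global center
S_w = np.zeros((3, 3))            # within-class scatter
S_b = np.zeros((3, 3))            # between-class scatter
for k in np.unique(labels):
    A_k = A[labels == k]
    c_k = A_k.mean(axis=0)        # class center
    S_w += (A_k - c_k).T @ (A_k - c_k)
    S_b += len(A_k) * np.outer(c_k - c_all, c_k - c_all)

S_m = (A - c_all).T @ (A - c_all) # mixture (total) scatter
```

With three classes `S_b` has rank at most two, which caps the number of useful discriminant directions at two.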
Kim et al. Pattern Recognition 2007, 40:2939
Maximize:
The solution comes from the eigenvalues/eigenvectors of
When we have N<<p, Sw is singular. Let
Multi-class LDA