Nonlinear Dimensionality Reduction, John A. Lee, Michel Verleysen
1
Nonlinear Dimensionality Reduction
By: sadatnejad
Amirkabir University of Technology (Tehran Polytechnic)
The goals to be reached
2
Discover and extract information that lies hidden in the huge quantity of data.
Understand and classify the existing data
Infer and generalize to new data
Dim. Reduction- Practical Motivations
3
In essence, the world is multidimensional.
Redundancy means that the parameters or features that characterize the various units of the set are not independent of each other.
The large set of parameters or features must be summarized into a smaller set, with no or less redundancy. This is the goal of dimensionality reduction (DR), which is one of the key tools for analyzing high-dimensional data.
Fields of application: image processing, processing of sensor arrays, multivariate data analysis.
Theoretical Motivations
4
Curse of dimensionality
1- How can we visualize high-dimensional spaces?
2- Curse of dimensionality and empty space phenomenon
Theoretical Motivations
1- How can we visualize high-dimensional spaces?
5
Spatial data / Temporal data
Two-dimensional representation of a four-dimensional cube. In addition to perspective, the color indicates the depth in the fourth dimension.
6
Two plots of the same temporal data. In the first representation, data are displayed in a single coordinate system (spatial representation). In the second representation, each variable is plotted in its own coordinate system, with time as the abscissa (time representation).
Theoretical Motivations
2- Curse of dimensionality and empty space phenomenon
7
The curse of dimensionality also refers to the fact that in the absence of simplifying assumptions, the number of data samples required to estimate a function of several variables to a given accuracy (i.e., to get a reasonably low-variance estimate) on a given domain grows exponentially with the number of dimensions.
Empty space phenomenon: Because the amount of available data is generally restricted to a few observations, high-dimensional spaces are inherently sparse.
Hypervolume of cubes and spheres
8
The volume of a D-dimensional hypersphere of radius r is
V_sphere(r) = π^(D/2) r^D / Γ(D/2 + 1),
while the volume of the cube circumscribing it (with edge length 2r) is V_cube(r) = (2r)^D.
Surprisingly, the ratio V_sphere/V_cube tends to zero as D increases.
As dimensionality increases, a cube becomes more and more "spiky": the spherical body gets smaller and smaller while the number of spikes (the corners) increases.
Now, assigning the value 1/2 to r makes V_cube equal to 1, so the ratio reduces to V_sphere(1/2): the volume of a sphere inscribed in the unit cube vanishes as dimensionality increases.
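This vanishing ratio is easy to check numerically; a minimal sketch in plain Python (the helper names are illustrative), using the closed-form volume of a D-ball:

```python
import math

def sphere_volume(D, r):
    """Volume of a D-dimensional hypersphere of radius r."""
    return math.pi ** (D / 2) * r ** D / math.gamma(D / 2 + 1)

def cube_volume(D, r):
    """Volume of the cube circumscribing that sphere (edge length 2r)."""
    return (2 * r) ** D

# the ratio collapses quickly as D grows
for D in (2, 5, 10, 20):
    print(D, sphere_volume(D, 0.5) / cube_volume(D, 0.5))
```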
Hypervolume of a thin spherical shell
9
Consider a shell of relative thickness ε << 1, i.e., the region between the spheres of radius r(1 − ε) and r. Its share of the full sphere's volume is
(V_sphere(r) − V_sphere(r(1 − ε))) / V_sphere(r) = 1 − (1 − ε)^D,
which tends to 1 as D grows: in high dimensions, almost all of the volume of a sphere concentrates in a thin shell just beneath its surface.
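The shell formula can be checked the same way (shell_fraction is an illustrative helper, not from the book):

```python
def shell_fraction(D, eps):
    """Fraction of a D-ball's volume lying in a shell of relative thickness eps."""
    return 1.0 - (1.0 - eps) ** D

# even a 1%-thick shell ends up holding nearly all the volume
for D in (1, 10, 100, 1000):
    print(D, shell_fraction(D, 0.01))
```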
Tail probability of isotropic Gaussian distribution
10
Tail probability of isotropic Gaussian distribution (CONT.)
11
By computing r_0.95, defined as the radius of a hypersphere that contains 95% of the distribution, the value of r_0.95 is such that
0.95 = ∫_0^{r_0.95} f_y(r) S_sphere(r) dr,
where f_y(r) = (2π)^(−D/2) exp(−r²/2) is the value of the isotropic Gaussian density at radius r and S_sphere(r) = 2π^(D/2) r^(D−1) / Γ(D/2) is the surface of a D-dimensional hypersphere of radius r.
The radius r_0.95 grows as the dimensionality D increases.
The distribution thus appears heavy-tailed; to avoid overfitting, the training data must inevitably be sampled over a wider region of the space, and the size of the training set must therefore grow.
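The growth of r_0.95 with D can be verified numerically; a rough sketch that integrates the radial density, proportional to r^(D−1) exp(−r²/2), on a grid (radius_95 is an illustrative helper):

```python
import math

def radius_95(D, r_max=20.0, steps=20000):
    """Numerically find r_0.95, the radius of the hypersphere containing
    95% of an isotropic Gaussian distribution in dimension D."""
    dr = r_max / steps
    # unnormalized radial density evaluated at midpoints of the grid
    dens = [((i + 0.5) * dr) ** (D - 1) * math.exp(-((i + 0.5) * dr) ** 2 / 2)
            for i in range(steps)]
    total = sum(dens)
    acc = 0.0
    for i, p in enumerate(dens):
        acc += p
        if acc >= 0.95 * total:
            return (i + 1) * dr
    return r_max

for D in (1, 2, 5, 10):
    print(D, round(radius_95(D), 2))
```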
Some directions to be explored
12
In the presence of high-dimensional data, two possibilities exist to avoid or at least attenuate the effects of the above-mentioned phenomena.
- Relevance of the variables: for example, by computing the correlations between known pairs of input/output variables.
- Dependencies between the variables: among the relevant variables, one variable may bring information about another.
The new set should obviously contain a smaller number of variables but should also preserve the interesting characteristics of the initial set.
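As a sketch of the correlation check mentioned above (all variable names and constants here are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)                    # a relevant input variable
junk = rng.normal(size=500)                 # an irrelevant input variable
y = 2.0 * x + 0.1 * rng.normal(size=500)    # output driven by x only

# correlation with the output reveals which input is relevant
print(np.corrcoef(x, y)[0, 1])       # close to 1
print(np.corrcoef(junk, y)[0, 1])    # close to 0
```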
Goals of projection
13
The determination of a projection may also follow two different goals.
The first and simplest one aims to just detect and eliminate the dependencies; PCA is the prototypical example.
The second goal of a projection is not only to reduce the dimensionality, but also to retrieve the so-called latent variables, i.e., those that are at the origin of the observed ones but cannot be measured directly.
Blind source separation (BSS) in signal processing and independent component analysis (ICA) in multivariate data analysis are particular cases of latent variable separation.
About topology, spaces, and manifolds
14
From a geometrical point of view, when two or more variables depend on each other, their joint distribution does not span the whole space.
Actually, the dependence induces some structure in the distribution, in the form of a geometrical locus that can be seen as a kind of object in the space.
Dimensionality reduction aims at giving a new representation of these objects while preserving their structure.
Topology
15
In mathematics, topology studies the properties of objects that are preserved through deformations, twisting, and stretching.
For example, a circle is topologically equivalent to an ellipse, and a sphere is equivalent to an ellipsoid.
Topology
16
The knowledge of objects does not depend on how they are represented, or embedded, in space.
For example, the statement, “If you remove a point from a circle, you get a (curved) line segment” holds just as well for a circle as for an ellipse.
In other words, topology is used to abstract the intrinsic connectivity of objects while ignoring their detailed form. If two objects have the same topological properties, they are said to be homeomorphic.
Topological space
17
A topological space is a set for which a topology is specified
For a set Y, a topology T is defined as a collection of subsets of Y that obey the following properties:
1. Trivially, ∅ ∈ T and Y ∈ T.
2. Whenever two sets are in T, then so is their intersection.
3. Whenever two or more sets are in T, then so is their union.
Topological space
Geometrical view
18
From a more geometrical point of view, a topological space can also be defined using neighborhoods and Hausdorff's axioms.
The neighborhood of a point y ∈ R^D, also called an ε-neighborhood or infinitesimal open set, is often defined as the open ε-ball, i.e., the set of points lying inside a D-dimensional sphere of radius ε > 0 centered on y.
A set containing an open neighborhood is also called a neighborhood.
Topological space
Geometrical view (Cont.)
19
Then, a topological space is such that
1. To each point y there corresponds at least one neighborhood U(y), and U(y) contains y.
2. If U(y) and V(y) are neighborhoods of the same point y, then a neighborhood W(y) exists such that W(y) ⊂ U(y) ∩ V(y).
3. If z ∈ U(y), then a neighborhood V(z) of z exists such that V(z) ⊂ U(y).
4. For two distinct points, two disjoint neighborhoods of these points exist.
Manifold
20
Within this framework, a (topological) manifold M is a topological space that is locally Euclidean, meaning that around every point of M there is a neighborhood that is topologically the same as the open unit ball in R^P, where P is the dimension of the manifold.
In general, any object that is nearly "flat" on small scales is a manifold. For example, the Earth is spherical but looks flat at the human scale.
Embedding
21
An embedding is a representation of a topological object (a manifold, a graph, etc.) in a certain space, usually RD for some D, in such a way that its topological properties are preserved.
For example, the embedding of a manifold preserves open sets.
Differentiable manifold
22
Swiss roll
23
The challenge of the Swiss roll consists of finding a two-dimensional embedding that “unrolls” it, in order to avoid superpositions of the successive turns of the spiral and to obtain a bijective mapping between the initial and final embeddings of the manifold.
The Swiss roll is a smooth and connected manifold.
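A discrete sample of the Swiss roll can be generated directly from its parametric form; a minimal sketch (the parameter ranges are arbitrary choices, not from the book):

```python
import numpy as np

def swiss_roll(n=1000, seed=0):
    """Sample n points on a Swiss-roll manifold embedded in R^3.
    The two latent variables are the spiral parameter t and the height h."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n)  # position along the spiral
    h = rng.uniform(0.0, 10.0, n)                 # position along the roll axis
    X = np.column_stack((t * np.cos(t), h, t * np.sin(t)))
    return X, t, h

X, t, h = swiss_roll()
print(X.shape)
```

A nonlinear DR method should recover (t, h) as the two-dimensional embedding.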
Open box
24
For the open box manifold, the goal is to reduce the embedding dimensionality from three to two.
As can be seen, the open box is connected but neither compact (in contrast with a cube or closed box) nor smooth (there are sharp edges and corners).
Intuitively, it is not so obvious to guess what an embedding of the open box should look like. Would the lateral faces be stretched? Or torn? Or would the bottom face be shrunk? Actually, the open box helps to show the way each particular method behaves.
In practice, all DR methods work with a discrete representation of the manifold to be embedded.
25
Nonlinear Dimensionality Reduction, John A. Lee, Michel Verleysen, Chapter 2
26
Characteristics of an analysis method
Expected Functionalities
27
The analysis of high-dimensional data amounts to identifying and eliminating the redundancies among the observed variables. This requires three main functionalities; an ideal method should be able to:
1. Estimate the number of latent variables.
2. Embed data in order to reduce their dimensionality.
3. Embed data in order to recover the latent variables.
Estimation of the number of latent variables
28
Sometimes latent variables are also called degrees of freedom. The number of latent variables is often computed from a topological point of view, by estimating the intrinsic dimension(ality) of the data.
In contrast with the number of variables, which is necessarily an integer, the intrinsic dimension hides a more generic concept and may take on real values.
- P = D: there is no structure.
- P < D: a low intrinsic dimension indicates that a topological object or structure underlies the data set.
Intrinsic Dim.
29
A two-dimensional manifold embedded in a three-dimensional space. The data set contains only a finite number of points (or observations).
Without a good estimate of the intrinsic dimension, dimensionality reduction is no more than a risky bet, since one does not know to what extent the dimensionality can be reduced.
Embedding for dimensionality reduction
30
The knowledge of the intrinsic dimension P indicates that the data have some topological structure and do not completely fill the embedding space.
DR consists of re-embedding the data in a lower-dimensional space that would be better filled.
The aims are both to get the most compact representation and to make any subsequent processing of the data easier.
The main problem is how to measure or characterize the structure of a manifold in order to preserve it.
31
Possible two-dimensional embedding for the object in Fig. 2.1. The dimensionality of the data set has been reduced from three to two.
Internal characteristics
32
Behind the expected functionalities of an analysis method, less visible characteristics are hidden, though they play a key role. These characteristics are:
The model that data are assumed to follow.
The type of algorithm that identifies the model parameters.
The criterion to be optimized, which guides the algorithm.
Underlying model
33
All methods of analysis rely on the assumption that the data sets they are fed with have been generated according to a well-defined model.
For example, principal component analysis assumes that the dependencies between the variables are linear.
The type of model determines the power and/or limitations of the method.
Because of this linear model choice, PCA often delivers poor results when trying to project data lying on a nonlinear subspace.
34
Fig. 2.4. Dimensionality reduction by PCA from 3 to 2 for the data set of Fig. 2.1. Obviously, data do not fit the model of PCA, and the initial rectangular distribution cannot be retrieved.
Algorithm
35
For the same model, several algorithms can implement the desired method of analysis. For example, in the case of PCA, the model parameters are computed in closed form by using general-purpose algebraic procedures.
Most often, these procedures work quickly, without any external hyperparameter to tune, and are guaranteed to find the best possible solution (depending on the criterion; see ahead).
- PCA as a batch method
- Online or adaptive PCA algorithms
Criterion
36
The criterion probably plays the most important role among the characteristics of a method.
For example, a well-known criterion for dimensionality reduction is the mean square error between the original data and their reconstruction after projection.
Most often, the loss of information or deterioration of the data structure occurs solely in the first step (the projection), but the second step (the reconstruction) is necessary in order to have a comparison reference.
Other criteria
37
- From a statistical point of view: a projection that preserves the variance initially observed in the raw data.
- From a more geometrical or topological point of view: a projection of the object that preserves its structure, for example by preserving the pairwise distances measured between the observations in the data set.
- If the aim is latent variable separation, then the criterion can be decorrelation.
PCA: Data model
38
The random vector y = [y1, . . . , yd, . . . , yD]^T results from a linear transformation W of P unknown latent variables x = [x1, . . . , xp, . . . , xP]^T.
All latent variables are assumed to have a Gaussian distribution.
Transformation W is constrained to be an axis change, meaning that the columns wd of W are orthogonal to each other and of unit norm. In other words, the D-by-P matrix W is a matrix such that WTW = IP (but the permuted product WWT may differ from ID).
Both the observed variables y and the latent ones x are centered. In matrix form, for N observations, Y = WX, where Y is the D-by-N matrix of observations and X the P-by-N matrix of latent variables.
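The data model above can be sketched with synthetic data (a minimal illustration; the sizes D, P, N are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, N = 5, 2, 1000

X = rng.normal(size=(P, N))   # P Gaussian latent variables, one observation per column

# random D-by-P matrix with orthonormal columns, so that W^T W = I_P
W, _ = np.linalg.qr(rng.normal(size=(D, P)))

Y = W @ X                     # observed data: D variables, N observations
print(Y.shape)
```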
PCA Preprocessing
39
Before determining P and W, the observations can be centered by subtracting the expectation of y from each observation y(n).
The centering can be rewritten for the entire data set as Y ← Y − (1/N) Y 1_N 1_N^T, where 1_N denotes a column vector of N ones.
In some situations it is necessary to standardize the variables, i.e., to divide each yd by its standard deviation after centering.
PCA - standardization
40
The standardization can even be dangerous when some variable has a low standard deviation. When a variable is constant (e.g., identically zero), its standard deviation is zero; trivially, the division by zero must be avoided, and the variable should be discarded.
When noise pollutes an observed variable having a small standard deviation, the contribution of the noise to the standard deviation may be proportionally large. This means that discovering the dependency between that variable and the other ones can be difficult.
Standardization can thus be useful but should not be applied blindly: some knowledge about the data set is necessary.
After centering (and standardization if appropriate), the parameters P and W can be identified by PCA.
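The centering and standardization steps can be sketched as follows (a minimal illustration, with variables stored as rows):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(loc=3.0, scale=2.0, size=(4, 200))  # D=4 variables (rows), N=200 observations

# centering: subtract the sample mean of each variable
Yc = Y - Y.mean(axis=1, keepdims=True)

# optional standardization: divide each centered variable by its standard deviation
# (only safe when no standard deviation is close to zero)
Ys = Yc / Yc.std(axis=1, keepdims=True)
```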
Criteria leading to PCA
41
PCA can be derived from several criteria, all leading to the same method and/or results.
Minimal reconstruction error
Maximal preserved variance
PCA Criterion: Minimal reconstruction error
42
The reconstruction mean square error is
E_codec = E_y[ ||y − W W^T y||² ].
Since W^T W = I_P but W W^T is not necessarily equal to I_D, no simplification may occur in the reconstruction error.
43
However, in a perfect world, the observed vector y has been generated precisely according to the PCA model. In this case only, y can be perfectly retrieved. Indeed, if y = Wx, then
W W^T y = W W^T W x = W x = y,
and the reconstruction error is zero.
In almost all real situations, the observed variables in y are polluted by some noise, or do not fully respect the linear PCA model, yielding a nonzero reconstruction error.
The best approximation is determined by developing and minimizing the reconstruction error.
44
Developing the error gives
E_codec = E_y[ ||y||² ] − E_y[ ||W^T y||² ],
where the first term is constant. Hence, minimizing E_codec turns out to maximize the term E_y[ ||W^T y||² ].
As only a few observations y(n) are available, the latter expression is approximated by the sample mean
(1/N) Σ_n ||W^T y(n)||² = (1/N) ||W^T Y||²_F.
To maximize this last expression, Y has to be factored by singular value decomposition: Y = V Σ U^T,
45
where V and U are unitary matrices and Σ is a matrix of the same size as Y with at most D nonzero entries σ_d, called singular values, located on its first diagonal.
The D singular values are usually sorted in descending order. Substituting the factorization in the approximation of the expectation leads to
(1/N) ||W^T V Σ||²_F.
Since the columns of V and U are orthonormal vectors by construction, it is easy to see that this expression is maximized by W = V I_{D×P}
46
for a given P (ID×P is a matrix made of the first P columns of the identity matrix ID). Indeed, the above expression reaches its maximum when the P columns of W are colinear with the columns of V that are associated with the P largest singular values in Σ.
Additionally, it can be trivially proved that Ecodec = 0 for W= V.
Finally, the P-dimensional latent variables are estimated by computing the product
X̂ = W^T Y = I_{P×D} V^T V Σ U^T = I_{P×D} Σ U^T.
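The SVD route can be sketched end to end on noiseless synthetic data (a minimal illustration; the names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, N = 5, 2, 1000
W_true, _ = np.linalg.qr(rng.normal(size=(D, P)))
Y = W_true @ rng.normal(size=(P, N))     # noiseless data following the PCA model

# SVD of the (already centered) data matrix: Y = V S U^T
V, s, Ut = np.linalg.svd(Y, full_matrices=False)
W = V[:, :P]                             # P leading left singular vectors

X_hat = W.T @ Y                          # estimated latent variables
E_codec = np.mean(np.sum((Y - W @ X_hat) ** 2, axis=0))
print(E_codec)
```

For noiseless data generated by the model, the reconstruction error is zero up to machine precision.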
Maximal preserved variance and decorrelation
47
From a statistical point of view, it is assumed that the latent variables in x are uncorrelated (no linear dependencies bind them).
In practice, this means that the covariance matrix of x, defined as C_xx = E_x[ x x^T ] (provided x is centered), is diagonal.
Maximal preserved variance and decorrelation (Cont.)
48
After the axis change induced by W, it is very likely that the observed variables in y are correlated, i.e., C_yy is no longer diagonal.
The goal of PCA is then to get back the P uncorrelated latent variables in x. Assuming that the PCA model holds and the covariance of y is known, it follows that
C_yy = E_y[ y y^T ] = E[ (Wx)(Wx)^T ] = W C_xx W^T.
Maximal preserved variance and decorrelation (Cont.)
49
Since W^T W = I_P, left and right multiplication by W^T and W, respectively, leads to
C_xx = W^T C_yy W.
The covariance matrix C_yy can be factored by eigenvalue decomposition:
C_yy = V Λ V^T,
where V is a matrix of normed eigenvectors v_d and Λ is a diagonal matrix containing their associated eigenvalues λ_d, in descending order.
Maximal preserved variance and decorrelation (Cont.)
50
Because the covariance matrix is symmetric and positive semidefinite, the eigenvectors are orthogonal and the eigenvalues are nonnegative real numbers.
Substituting the decomposition gives C_xx = W^T V Λ V^T W; this matrix is diagonal only when the P columns of W are taken colinear with P columns of V, among the D available ones.
If the PCA model is fully respected, then only the first P eigenvalues in Λ are strictly larger than zero; the other ones are zero. The eigenvectors associated with these P nonzero eigenvalues must be kept:
W = V I_{D×P}.
Maximal preserved variance and decorrelation (Cont.)
51
With this choice of W, the covariance becomes C_xx = I_{P×D} Λ I_{D×P}, which shows that the eigenvalues in Λ correspond to the variances of the latent variables.
In real situations, some noise may corrupt the observed variables in y. As a consequence, all eigenvalues of C_yy are larger than zero, and the choice of P columns in V becomes more difficult.
Maximal preserved variance and decorrelation (Cont.)
52
Assuming that the latent variables have larger variances than the noise, it suffices to choose the eigenvectors associated with the largest eigenvalues.
If the global variance of y is defined as the trace of C_yy, i.e., σ²_y = Σ_{d=1}^D λ_d, then the proposed solution is guaranteed to preserve a maximal fraction of the global variance.
Maximal preserved variance and decorrelation (Cont.)
53
From a geometrical point of view, the columns of V indicate the directions in R^D that span the subspace of the latent variables.
To conclude, it must be emphasized that in real situations the true covariance of y is not known, but it can be approximated by the sample covariance Ĉ_yy = (1/N) Y Y^T.
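The eigendecomposition of the sample covariance can be sketched on synthetic data following the model (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, N = 5, 2, 2000
W_true, _ = np.linalg.qr(rng.normal(size=(D, P)))
Y = W_true @ rng.normal(size=(P, N))   # noiseless, centered data

C_hat = (Y @ Y.T) / N                  # sample covariance
lam, V = np.linalg.eigh(C_hat)         # eigh returns ascending order
lam, V = lam[::-1], V[:, ::-1]         # re-sort in descending order

# the P leading eigenvectors span the same subspace as the true mixing matrix
proj = V[:, :P] @ V[:, :P].T
print(np.allclose(proj @ W_true, W_true, atol=1e-8))
```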
PCA: Intrinsic dimension estimation
54
If the model of PCA is fully respected, then only the P largest eigenvalues of C_yy will depart from zero.
The rank of the covariance matrix (the number of nonzero eigenvalues) then trivially indicates the number of latent variables.
In practice, however, all D eigenvalues often differ from zero:
- With only a finite sample, the covariance can only be approximated.
- The data probably do not entirely respect the PCA model (presence of noise, etc.),
making the estimation of the intrinsic dimension more difficult.
PCA: Intrinsic dimension estimation (Cont.)
55
Normally, if the model holds reasonably well, large (significant) eigenvalues correspond to the variances of the latent variables, while smaller (negligible) ones are due to noise and other imperfections.
A rather visible gap should separate the two kinds of eigenvalues.
The gap can be visualized by plotting the eigenvalues in descending order: a sudden fall should appear right after the Pth eigenvalue.
If the gap is not visible, plotting minus the logarithm of the normalized eigenvalues may help. In this plot, the intrinsic dimension is indicated by a sudden ascent.
PCA: Intrinsic dimension estimation (Cont.)
56
Unfortunately, when the data dimensionality D is high, there may also be numerous latent variables showing a wide spectrum of variances.
In the extreme, the variances of latent variables can no longer be distinguished from the variances related to noise. In this case, the intrinsic dimension P is chosen so as to preserve an arbitrarily chosen fraction of the global variance. For example, if it is assumed that the latent variables bear 95% of the global variance, then P is the smallest integer such that
(Σ_{d=1}^P λ_d) / (Σ_{d=1}^D λ_d) ≥ 0.95.
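The 95% rule can be sketched as follows (intrinsic_dim is an illustrative helper, not from the book):

```python
import numpy as np

def intrinsic_dim(eigenvalues, fraction=0.95):
    """Smallest P whose leading eigenvalues hold at least `fraction` of the variance."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)

print(intrinsic_dim([5.0, 3.0, 0.2, 0.1, 0.05]))  # → 2
```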
PCA: Intrinsic dimension estimation (Cont.)
57
Sometimes the threshold is set on individual variances instead of cumulated ones.
The best way to set it consists of finding a value that separates the significant variances from the negligible ones; this turns out to be equivalent to the visual methods.
More complex methods exist to set the frontier between the latent and noise subspaces, such as Akaike’s information criterion (AIC), the Bayesian information criterion (BIC), and the minimum description length (MDL).
These methods determine the value of P on the basis of information-theoretic considerations.
Examples and limitations of PCA
58
Gaussian variables and linear embedding: if the columns of the mixing matrix W are neither orthogonal nor normed, PCA cannot exactly retrieve the true latent variables.
PCA Result
59
Nonlinear embedding
60
PCA is unable to completely reconstruct the curved object displayed in Fig. 2.7
PCA Result
61
Non-Gaussian distributions
62
PCA Result
63