Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Principal Components Analysis
with application to remote sensing image analysis
D G Rossiter
Cornell University, Section of Soi & Cropl Sciences
April 12, 2020
PCA 1
Topic: Factor Analysis
A generic term for methods that consider the inter-relations between a set ofvariables.
• Often the set of predictors which might be used in a multiple linear regression.
– Multivariate observations on same objects (e.g., soil samples)– Remote sensing: a set of co-registered images of a scene∗ all bands of one image∗ bands of multiple (co-registered) sensors∗ one band or band product (e.g., NDVI) of a time-series of images
• This is an analysis of the structure of the multivariate feature space coveredby a set of variables.
D G Rossiter
PCA 2
Uses of factor analysis in remote sensing
1. Discover relations between images, and possible groupings of them
2. Discover groupings of pixels in a set of images (→ classification)
3. Interpret the resulting groupings in terms of processes
4. Diagnose multi-collinearity, since images are usually correlated
• determine which images are most correlated• quantify redundancy, find the most informative subset of images
5. For data reduction for model inputs; two approaches:
• Identify representative images for a minimum data set• Compute synthetic images
D G Rossiter
PCA 3
Topic: Principal Components Analysis (PCA)
• The simplest form of factor analysis; a data reduction technique.
• Gives insight into the relation between a set of variables within a dataset
– This is completely data driven; different sets of observations from the samepopulation will give different relations
– So, a data mining approach
• Gives insight into the relation between a set of pixels in the multivariatespace spanned by the set of images
D G Rossiter
PCA 4
What does PCA do?
1. The vector space made up of the original observations (e.g., stack of pixelvalues in a set of images) is projected onto another vector space;
2. The new space has the same dimensionality as the original1, i.e., there are asmany variables in the new space as in the old;
3. In this space the new synthetic images, also called principal components areorthogonal to each other, i.e. completely uncorrelated;
4. The synthetic images are arranged in decreasing order of varianceexplained; and the total variance is unchanged;
5. The contribution of each original variable to each synthetic image is given;
6. Each observation can be re-projected into the new (PC) space, by its value ofthe synthetic images
1unless the original was rank-deficientD G Rossiter
PCA 5
Mathematics: the data matrix
PCA is a direct calculation from a matrix constructed from the multivariatedataset.
X: centred data matrix (i.e., difference from mean), where:
• rows are observations (e.g., pixels, observation locations, sampledindividuals)
• columns are variables (e.g., reflectance in a band, pixel values) measured ateach observation
• may scale by dividing values by the variable’s sample standard deviation
– standardized vs. unstandardized, see below
This gives the location of each observation in multivariate attribute space.
D G Rossiter
PCA 6
Mathematics: the covariance or correlation matrix
The data matrix X is used to build a matrix that shows the relation between dataitems:
• C = XTX : the covariance (unscaled) or correlation (scaled) matrix
• this is symmetric and positive (semi-)definite, so has all real roots
D G Rossiter
PCA 7
Why can the correlation matrix be misleading?
• The individual pairwise correlations do not take into account the degree towhich both of the variables may be correlated to others
• In the case where both are highly correlated to a several others this is apparentcorrelation, which may not reflect a real process
• Solution: compute partial correlations
– the bivariate correlation between the two residuals from linear regression ofeach variable on all the others, less the one with which to pair
– this accounts for the “lurking” effect of other variables, and shows whatcorrelation remains that can not be otherwise accounted for
– (see example below)
D G Rossiter
PCA 8
Mathematics: Eigen decomposition
The key insight is that the Eigen decomposition2 of C orders the syntheticvariables into descending amounts of variance, and ensures they are orthogonal(Hotelling 1933).
• Decompose a square, symmetric positive-definite matrix, e.g., the correlationmatrix C formed from a data matrix such that AC = λC
• Eigenvalues: a diagonal matrix λ; off-diagonals 0, i.e., no covariances, soorthogonal; Eigenvectors: the transformation matrix A
• The eigenvectors provide a coördinate transformation such that the matrixmultiplied by the diagonal eigenvalues matrix is the same as multiplication bya matrix made up of the eigenvectors
• Eigenvectors span an orthogonal vector space onto which we can project theoriginal data.
2 (German eigen ≈ English “own, belonging to oneself”)D G Rossiter
PCA 9
Computation
• |C− λI| = 0: a determinant to find the eigenvalues of the correlation matrix
– these are sometimes called the characteristic values– their relative magnitude is the proportion of the original covariance
explained
• Then the axes of the new space, the eigenvectors γj (one per dimension) arethe solutions to (C− λjI)γj = 0
• Obtain synthetic variables by projection: Y = PC where P is the row-wisematrix of eigenvectors (rotations).
D G Rossiter
PCA 10
Details
In practice the system is solved by the Singular Value Decomposition (SVD) of thedata matrix.
This is equivalent but more stable than directly extracting the eigenvectors of thecorrelation matrix.
Accessible explanations with worked examples:
• Davis, J. C. (2002). Statistics and data analysis in geology. New York: JohnWiley & Sons.
• Legendre, P., & Legendre, L. (1998). Numerical ecology. Oxford: ElsevierScience.
D G Rossiter
PCA 11
Standardized vs. unstandardized – 1
Standardized each variable (e.g., reflectance in a band) has its mean subtracted(so x.j = 0) and is divided by its sample standard deviation (so σ(x.j) = 1);
• All variables (e.g., bands) are equally important, no matter their absolutevalues or spreads;
• Gives equal weight to all variables;• This is usually what we want if variables are measured on different scales.
– e.g., multivariate measurements of soil constituents– e.g., co-registered images from different sensors
D G Rossiter
PCA 12
Standardized vs. unstandardized – 2
Unstandardized use the original variables, in their original scales ofmeasurement; generally the means are also subtracted to centre the variables
• Variables with wider spreads (often due to measurement scale) are moreimportant, since they contribute more to the original variance
• This preserves the importance of variables with more variance = moreinformation
• E.g., sensor with different radiometric resolutions (so wider range of numericvalues); higher resolution will have more weight
• Bands with more variability will have more weight – maybe we want this.
D G Rossiter
PCA 13
Potential difference between un/standardized!
Example: variables with three orders of magnitude difference in standarddeviation (in original measurement scale):
First, unstandardized PCs:
k.a k.b p.a caco3.a caco3.b sand.b206.1333028 172.1004094 53.2891699 25.5945396 24.9265695 20.3849822
p.b clay.b clay.a sand.a silt.b silt.a19.1486608 15.1824385 15.0226513 14.7909094 13.1101435 11.3642023
cec.b cec.a oc.a oc.b ph.b ph.a1.3351728 0.9138645 0.7702299 0.7265857 0.4260936 0.4153947
dens.b dens.a0.2721703 0.2313593
Importance of components:PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 250.7955 100.7590 47.69829 34.18739 27.84061 16.31755Proportion of Variance 0.8092 0.1306 0.02927 0.01504 0.00997 0.00343Cumulative Proportion 0.8092 0.9398 0.96909 0.98413 0.99410 0.99753
PC1 is numerically very large; K values have much larger standard deviations(higher absolute values of all measurements).
D G Rossiter
PCA 14
Second, standardized PCs
dens.a silt.a ph.a cec.a caco3.a p.a dens.b sand.b clay.b oc.b1 1 1 1 1 1 1 1 1 1
cec.b caco3.b p.b sand.a clay.a oc.a k.a silt.b ph.b k.b1 1 1 1 1 1 1 1 1 1
Importance of components:PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.4798 1.8233 1.4605 1.36289 1.16650 1.08224 0.85127Proportion of Variance 0.3075 0.1663 0.1067 0.09289 0.06805 0.05857 0.03624Cumulative Proportion 0.3075 0.4738 0.5805 0.67336 0.74141 0.79998 0.83622
Same standard deviation (by design), much less total variance explained by PC1.
D G Rossiter
PCA 15
Difference in variance represented by PCs
Standardized
Var
ianc
es
01
23
45
6Unstandardized
Var
ianc
es
020
000
4000
060
000
D G Rossiter
PCA 16
Difference in biplots
−0.3 −0.2 −0.1 0.0 0.1 0.2
−0.
3−
0.2
−0.
10.
00.
10.
2
PC1
PC
2
12
3
4
5
67
89
10
11 12131415
1617
18
19202122
23
24
25
27
2829
30
31
32
33
34
3536
37
38
39
40
41
42
43
44
454647
48
49
50
51
52
53
5455
56
57585960
61
6263
6465
66
6768
69
70
71
72
73
74
75
7677
7879
8081
82
83
84
85
8687
88
−10 −5 0 5
−10
−5
05
dens.a
sand.a
silt.a
clay.a
oc.a
ph.acec.a
caco3.a
p.a
k.a
dens.b
sand.b
silt.b
clay.b
oc.b
ph.b
cec.b
caco3.b
p.bk.b
−0.4 −0.2 0.0 0.2 0.4
−0.
4−
0.2
0.0
0.2
0.4
PC1
PC
2
12
3
4
5678 9
10111213
14
15
1617
181920
21222324
25
272829303132333435
36
37
38
39
40
41
42
4344
45464748
49
5051
5253
54
5556
57
5859
606162
63
6465
666768
69
707172
7374
75
76
777879
808182
8384
85
868788
−1500 −500 0 500 1000 1500
−15
00−
500
050
010
0015
00
dens.asand.asilt.aclay.aoc.aph.acec.acaco3.ap.a
k.a
dens.bsand.bsilt.bclay.boc.bph.bcec.bcaco3.bp.b
k.b
Standardized UnstandardizedMore or less equal loadings (lengths of arrows) K dominates loadings
(see below for explanation of biplots).D G Rossiter
PCA 17
Topic: Remote sensing examples
These will help visualize the transformation from original space into PC space.
D G Rossiter
PCA 18
Example: Time series of diurnal temperature differences
• -ý_Ï¿ Pei county in JiangSu province, PRC
• MODIS3 daily land-surface temperature product4
• Images are diurnal temperature differences (day – night) between two MODISproducts, units are ∆°C
• 30 Oct – 03 Nov 2000 (Julian days 304 ff.); Soil is drying after a heavy rain
• Objective: try to relate DTD to soil texture and organic matter
• Reference: Zhao, M.-S., Rossiter, D. G., Li, D.-C., Zhao, Y.-G., Liu, F., & Zhang,G.-L. (2014). Mapping soil organic matter in low-relief areas based on landsurface diurnal temperature difference and a vegetation index. EcologicalIndicators, 39, 120–133.5
3http://modis.gsfc.nasa.gov4https://modis.gsfc.nasa.gov/data/dataprod/mod11.php5http://doi.org/10.1016/j.ecolind.2013.12.015
D G Rossiter
http://modis.gsfc.nasa.govhttps://modis.gsfc.nasa.gov/data/dataprod/mod11.phphttp://doi.org/10.1016/j.ecolind.2013.12.015
PCA 19
Original images
3820000
3825000
3830000
3835000
3840000
3845000CK_DTD_304 CK_DTD_306
480000485000490000495000500000
CK_DTD_307 CK_DTD_308
9
10
11
12
13
14
15
16
17 ∆°C(day - night)
Low DTD on first day afterrain; increases overall thendecreases (closer day/nightair T?)
But: similar overall patternof high vs. low DTD(redundancy)
D G Rossiter
PCA 20
Reading in the dataset to R
## read images as a raster stackrequire(raster)(list
PCA 21
Simple and partial correlations
> with(stackDTD.df, cor(CK_DTD_307, CK_DTD_308)) # simple correlation of two days[1] 0.9029475> # compute residuals from linear model on other days> r307 r308 cor(r307, r308) # partial correlation = simple correlation of residuals[1] 0.5054682> plot(CK_DTD_308 ~ CK_DTD_307, data=stackDTD.df); plot(r308 ~ r307)
●●
●
●●
●
●
●●
●
●
●
●●
●●
●●
●●
●●
●●●●
●●
●
●
●
●●
●
●
●●
●
●●
●●
●●
●●
●
●
●●
●
●●●●
●
●
●
●●
●
●
●
●●●
● ●
●●●●
●
●
●
●●
●
●
● ●●
●
●
●
●
●●
●
●●
●●
●
●
●
●
●●●
●
●
●
●●
●●
●●●●
●
●
●●
●
●
●●●
●
●●
●
●●
●●
●
●
●●
●
●
●
●●●●
●
●
●
●●●●
●●
●
●
●
●●
●●●
●●
●
●●
●
●● ●●
●●
●
●
●
●●
●●
●●
●
●●●
●
●
●
●
●●
●
●●●
●●
●●
●
●●●
●●●
●●●●
●●
●●
●●●
●
●●
●●
●●
●
●
●
●
●
●●
●●●●
●●
●
●●
●●●
●●
●
●
●●
●●
●
●
●
●●
●
●
●●●●
●●●
●
●●●●
●
●●
●
●●●
●●
●
●
● ●
●
●
●●
●
●●●
●
●
●
●
●
●
●●
●●●●●
●●
●
●
●
● ●
●●
●
●
●●
●
●●
●●
●
●
●
●
●●
●
●●●
●●
●●●
●●
●
●
●●
●●
●●●
●●
●
●●
●●
●●●●●●
●●●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●●●
●●●●
●● ●●●● ●
●●
●●●
●
●●
●● ●
●●
●
●
●
●
●
●
●
●●●●●●● ●
●●●
●●
●
●
●●●
●
●
●
●●
●●
●●●
●●●●●●●
●
●●
●●●
●
●
●
●●
●●
●
●
●●
●
●
●
●●
●
●
●●● ●
●
●
●●●●
●●●●●
●
●
●
●●
●●●
●●
●●●●
●●●●
●●●
●●●
●●●
●
●
●
●
●●●●
●●●●●
●
●
●
●●●
●●●
●
●●●
●
●●
●●●●
●
●
●●
●●
●●
●
●
●●
●●
●
●
●
●
●
● ●
●
●●
●●
●
●
●
●●●●
●
●●
●●
●●
●●
●
●
●
●
●●
●
●●●
●
●
●●
●
●●
●●
●●●●●
●●●●
●
●
● ●
●
●● ●
●●
●●
●●
●
●●
●
●
●●●●●●
●
●●
●●
●
●●●● ●
●●
●
●
●●
●
●
●●
●
●●●
●●●
●●●
●
●
●
●●●
●●
●●
●
●
●
●
●
●
●●●
●●
●●●●
●●
●
●●
●
●
●
●●●●
●
●
●
●●●
●
●
●●
●●●●
●●●
●●
●●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●●●●
●●●●
●
●
●
●●
●
●●●●●
● ● ●
●
●●
●
●
●●
●●
●
●●●●●●
●
12 13 14 15 16
1011
1213
1415
Simple
CK_DTD_307
CK
_DT
D_3
08
●
●
●
●●
●
●
●
●● ●
●
●
●
●●●●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●●
●
●
●
●
●
● ●
●
●●
●●
●
●
● ●
●
●
●
●
●●
● ●●
●●
●
●
●
●
●●
●
●
● ●
● ●
●
●● ●
●●
●●
●●
●
●
●●●
●●
●
●●
●
●
●
●
●
●●
●
●●
●●
●●
●●
●● ●
●●
●
●
●
●
●
●
●
●
● ●●
●●
●
●
●●
●
●
●●
●
●●●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●●
●●
●●●
●
●●
●
●
●
●●
●
●
●
● ●
●
●●
●●
● ●
●●
●
● ●●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●●● ●●
●
●
● ●●
●
●●
●●
●●
●●
●
●
●
●
●
●
●●
●
●●●●
●
●
●
●
●●
●
●●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●● ●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
● ● ●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●● ●●
●●●
●
●●●
●
●
●
●
● ●
●
●●
●
●
●
●
●●
●
●
●
●
●●●●
●
●
●●
●
●●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●●
●●
●●
●
●
●●
●
●
●
●
● ●●●
●
●
●●●
● ●
●●
●
●●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●●
●●
●●
●
●
●●
●●
●
●
●
●●
●
●
●
●●
●●
●
●
●●
●
●
●●
●●●
●
● ●
●●
●
●●
●●
●●
●
●
●●
●
●●
●
●●
●●
●
●
●
●●
●
●
●●
●
●
●●
●●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●● ●
● ●
●
●
●
●●
●●
●●
●
●●
● ●
●●
● ●
● ●
●
●
●
●
●
●●
●
●
●
●
●●
●●●
●●
●
●●
●
●●●
●●
●
●
●●●
● ●●
●
● ●
●
●
● ● ●
●
●
●●●
●●
●
● ●●
●
●
●●
●
●
● ●
●
●
●
●
●
●
● ●
●●
●
●
●●
●●●
●
● ●● ●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●●
●●●
●
●●
●
●
●●
● ●
●
●
●
●●
●
●●
●
●●
●
●●●● ●
●●
●●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●● ●
●●
●●
●
●
●
●
●
●
●●
●●
●
●●
●●
●
●
●
−0.5 0.0 0.5 1.0−
1.5
−0.
50.
51.
01.
52.
0
Partial
r307
r308
Partial correlation accounts for correlations of each with the other days
D G Rossiter
PCA 22
PCA processing in R
In this example we use unstandardized PCA because the four DTD images are onthe same scale from the same sensor and represent the same phenomenon.
## PCA -- unstandardized, use original DTD, ignore coordinatespca
PCA 23
Structure of prcomp object
> str(pca)List of 5$ sdev : num [1:4] 1.823 0.404 0.323 0.208$ rotation: num [1:4, 1:4] -0.459 -0.59 -0.466 -0.474 -0.675 .....- attr(*, "dimnames")=List of 2.. ..$ : chr [1:4] "CK_DTD_304" "CK_DTD_306" "CK_DTD_307" "CK_DTD_308".. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"$ center : Named num [1:4] 10.6 13.4 12.9 11.8..- attr(*, "names")= chr [1:4] "CK_DTD_304" "CK_DTD_306" "CK_DTD_307" "CK_DTD_308"$ scale : logi FALSE$ x : num [1:784, 1:4] -6.17 -5.71 -4.64 -4.83 -4.43 .....- attr(*, "dimnames")=List of 2.. ..$ : NULL.. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"- attr(*, "class")= chr "prcomp"
rotation are the eigenvectors
x are the PC scores (location of each pixel in the PC space)
center are the image means (subtracted from all values)
D G Rossiter
PCA 24
PCA results – Importance of components
> summary(pca)Importance of components:
PC1 PC2 PC3 PC4Standard deviation 1.8232 0.40406 0.3230 0.20810Proportion of Variance 0.9145 0.04492 0.0287 0.01191Cumulative Proportion 0.9145 0.95938 0.9881 1.00000
The four DTD images are highly-correlated, 92% of the information is incommon
I.e., over the four days the same areas tend to have narrow and wide DTD ranges
D G Rossiter
PCA 25
PCA results – rotations
> pc$rotationPC1 PC2 PC3 PC4
DTD304 -0.4447 0.5893 0.5870 0.3322DTD306 -0.6084 0.3379 -0.5200 -0.4952DTD307 -0.4717 -0.3857 -0.3677 0.7025DTD308 -0.4578 -0.6243 0.4998 -0.3884
PC1 “intensity” of the phenomenon over all days – all signs the same (arbitrary),similar magnitudes
PC2 contrast between first two and second two days
PC3 contrast between middle two and end two days
PC4 Third and first days, contrasted with second and fourth days
Note PCs are orthogonal (independent)
D G Rossiter
PCA 26
Screeplot
pca
Var
ianc
es
0.0
0.5
1.0
1.5
2.0
2.5
3.0
D G Rossiter
PCA 27
Biplots
These show scores of the observations as synthetic variables in the 2-PC space(observation numbers)Loadings of each variable in the 2-PC space shown by the arrows (longer = higherloading).
−0.10 −0.05 0.00 0.05 0.10
−0.
10−
0.05
0.00
0.05
0.10
PC1
PC
2
1
2
3
45
67
8
9
10
11
12
13
14
151617
1819
20
21
2223
24
25
26
27
28
29
30
31
32
33
34
35
36
37
383940
4142
43
44
45
4647
48
49
50
51
52
53
54
55
56
57
58
5960
61
62
63
6465
66
6768
69
70
71
72
73
7475
7677
7879
80
81
82 83
84
85
8687
88
89
90
91
92
93
94
959697
98
99
100
101
102
103104
105
106
107
108
109110
111
112
113114
115
116 117
118
119
120121
122
123
124125
126127
128
129
130
131
132
133
134
135
136137
138
139
140
141 142
143144145
146
147
148
149
150
151
152
153
154155
156
157158
159
160
161
162
163164
165
166
167
168
169
170
171
172
173 174
175
176177
178
179
180181
182183
184
185
186
187
188
189
190
191
192
193
194
195
196
197198199
200
201202
203
204
205
206
207208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228229
230231
232233
234
235
236
237
238239
240
241
242243
244
245
246
247
248
249
250
251
252
253
254
255256
257
258
259
260
261
262
263
264265
266
267
268
269
270271
272
273
274275
276
277
278279
280
281
282
283
284285
286
287288
289
290
291
292
293
294295
296
297
298
299
300
301
302
303304
305306
307308
309
310
311
312
313
314
315
316
317
318
319
320321
322
323
324
325
326327
328
329
330331
332333
334335
336
337
338
339
340
341
342
343344345
346347
348
349
350
351352
353354355
356357
358
359
360
361362363
364
365
366
367
368
369
370
371372
373
374375
376377
378
379
380381382383
384
385
386387
388
389390
391
392
393
394
395396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411412
413414
415
416417418
419
420421
422
423
424
425
426
427
428429
430
431
432
433434
435
436
437
438
439
440441
442
443444445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468469
470
471
472473474
475
476
477478
479
480
481
482
483
484
485486
487
488
489
490
491
492
493
494
495
496
497
498
499500
501
502
503504
505506
507
508
509
510
511
512
513
514515
516
517
518
519
520
521
522
523524
525
526
527
528
529
530
531
532533
534
535
536
537
538539
540541
542
543
544
545546
547
548
549
550
551
552
553
554
555
556
557
558559
560561562
563
564
565
566
567
568
569
570571
572
573
574575
576577578
579580
581582
583584
585
586
587588
589
590
591
592
593594
595
596597
598
599 600601
602603
604
605
606
607608
609610611
612613
614615
616
617
618
619
620
621
622
623624625
626
627
628
629
630
631632
633
634
635
636637
638639640
641
642643644
645
646647
648649650
651652
653
654655
656
657658
659 660
661
662
663
664
665666
667
668
669
670
671672
673
674
675
676
677
678
679680
681682
683
684685
686
687
688
689
690
691
692693694695
696697
698
699
700
701
702703
704705
706
707
708
709710
711
712713
714
715
716
717
718719720721
722
723
724
725726
727
728
729
730731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747748
749750
751
752
753
754755756
757758
759
760761
762
763
764
765
766
767
768769
770
771
772
773
774775776
777778
779780
781
782
783
784
−30 −20 −10 0 10 20 30
−30
−20
−10
010
2030
CK_DTD_304
CK_DTD_306
CK_DTD_307
CK_DTD_308
−0.10 −0.05 0.00 0.05 0.10
−0.
10−
0.05
0.00
0.05
0.10
PC2
PC
3
1
2
3
4
5
6
7
8
9 10
11
12
13
14
1516
17
18
19
2021
2223
24
25
26
27
28
29
30
31
32
33
34
35
3637
3839
40
41
42
4344
45
46
47
48
49
50
51
52
53
54
55
56 57
5859
60
61
62
63
64
65
66
67
68
6970
71
72 7374
75
76
77
78
79
8081
8283
84
85
86
87
88
89
90
91
9293
94
95
96
97
98
99 100101
102
103104 105
106
107
108
109
110 111
112
113
114
115
116
117
118119
120
121
122
123124
125126
127 128
129 130
131
132
133
134135
136
137138
139
140
141
142
143
144145
146
147148
149
150
151
152
153
154
155
156
157
158
159160
161
162
163
164 165
166
167
168
169170
171
172
173
174
175
176
177
178
179 180181
182183
184
185
186
187
188
189 190
191
192
193
194
195
196
197
198
199
200201
202
203
204
205
206207208
209
210
211
212213
214
215
216
217
218
219220
221
222
223
224
225
226
227
228
229
230231
232
233
234
235
236
237
238
239
240
241
242
243
244245246
247
248249
250 251252
253
254
255
256
257
258
259
260
261
262
263
264
265
266267 268
269
270
271
272273
274
275
276
277278
279
280
281282 283
284
285
286
287
288
289290
291
292
293294
295
296 297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320321
322
323
324
325
326
327328
329
330331
332
333 334
335
336
337
338
339
340341
342
343
344
345
346
347
348
349
350
351352
353
354
355
356
357
358359360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377378379 380
381
382
383384
385
386
387
388
389
390391392393
394
395
396
397398
399
400
401
402
403
404
405
406
407
408
409
410
411412
413414
415
416
417
418419420421
422 423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438439
440
441
442
443
444
445
446447
448
449
450
451452
453
454
455
456457
458
459
460 461
462
463464
465
466467
468
469
470
471
472
473
474
475476
477
478479
480481
482
483
484485
486
487
488
489
490491
492
493
494
495
496
497498 499
500
501
502
503504
505
506
507
508
509
510
511512 513
514
515
516517
518519
520
521
522
523
524
525526 527
528
529
530
531532
533 534
535
536
537538
539
540
541
542
543544 545
546
547
548
549
550551
552553
554
555
556
557 558
559560
561
562
563564
565
566
567
568
569
570
571 572
573
574575
576
577
578
579580
581
582
583584
585
586
587588589
590
591
592
593
594
595596
597 598
599
600
601
602603
604605606
607608 609
610
611612
613
614615 616617
618619
620
621
622
623624
625
626
627
628
629
630631632
633
634
635
636
637
638639640
641
642643
644
645
646
647648
649
650
651
652
653
654
655656
657
658
659660
661
662663
664
665
666
667
668
669
670
671672
673
674
675676
677678
679
680681
682
683684
685
686
687
688689
690691
692
693
694
695
696697698
699
700
701702
703
704705
706
707
708
709710711
712
713
714 715
716717
718
719
720
721722723
724
725726
727
728
729730731
732
733
734
735
736737
738739
740
741742
743
744
745
746
747
748
749
750 751752753
754
755
756
757758
759
760761
762
763764
765766
767
768
769
770
771
772
773
774
775776
777
778
779
780781
782
783
784
−5 0 5
−5
05
CK_DTD_304
CK_DTD_306
CK_DTD_307
CK_DTD_308
−0.10 −0.05 0.00 0.05 0.10
−0.
10−
0.05
0.00
0.05
0.10
PC3
PC
4
1
2
3
4
5 6
7
8
9
10
11
121314
15
1617 18
19
202122
23
24
25
26
27
28
29
30
31
32
33
34
35
363738
3940
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
6263
64
65
66
67
68
69
70 71
72
7374
7576
77
78
79 80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101102
103104
105
106
107
108
109
110
111
112
113
114 115116
117
118
119
120121
122
123
124
125
126
127
128
129130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147148
149
150151
152
153
154
155
156
157158
159
160
161162
163
164
165
166
167
168
169
170
171
172173
174175
176177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195196
197
198
199
200
201
202
203
204
205
206
207
208209
210
211
212213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230231
232
233
234
235
236
237
238
239
240
241
242
243
244245
246
247
248
249
250
251
252253
254
255
256
257
258
259
260
261
262
263
264
265
266
267268
269
270
271
272273
274275
276
277278
279
280
281282
283
284 285286
287
288
289290
291 292
293
294
295296
297
298
299
300
301
302
303
304
305
306
307
308
309
310311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330331
332333
334
335
336
337
338
339
340
341
342
343
344
345
346347
348349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371372
373
374375
376
377
378
379
380381
382
383
384
385
386387
388
389390391
392
393
394
395
396
397398
399
400
401
402
403 404
405
406407
408
409
410 411412
413414
415
416
417 418
419420421
422
423 424425
426427
428
429
430431
432
433
434
435
436
437
438439
440
441442
443444
445
446
447
448
449
450451
452453
454
455
456
457
458
459
460
461
462
463
464465
466
467 468469
470
471472
473
474
475
476
477
478
479
480
481
482483484
485
486487
488
489
490
491
492
493
494
495496
497
498
499
500501
502
503
504
505
506
507 508
509510
511
512513
514
515516
517
518
519
520
521
522
523
524525
526
527528
529
530
531
532
533
534
535536
537538
539
540
541
542
543
544
545
546
547
548
549
550
551552
553
554555
556
557558559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581582
583
584
585
586
587
588
589590
591 592
593
594
595
596
597
598
599
600
601
602
603
604
605
606607608
609
610611
612
613
614
615
616
617
618
619
620
621
622
623
624625
626627
628
629630
631
632
633
634
635
636
637638
639
640
641
642
643
644
645
646
647
648
649
650651
652653
654
655
656
657 658 659
660
661662
663
664665
666
667
668
669
670
671
672
673674
675
676
677678
679
680
681
682
683
684
685
686687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709710
711712
713
714
715
716
717
718719720
721
722
723724
725
726
727
728
729730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747 748
749
750
751
752
753754
755
756
757
758
759
760
761762
763
764765
766
767
768
769
770
771
772
773
774
775
776
777778
779
780
781
782
783
784
−4 −2 0 2 4
−4
−2
02
4
CK_DTD_304
CK_DTD_306
CK_DTD_307
CK_DTD_308
First PC is overall intensity across all 4 dates; other PCs show contrasts betweendates
D G Rossiter
PCA 28
PCs (synthetic bands) – same stretch
3820000
3825000
3830000
3835000
3840000
3845000PC1 PC2
480000485000490000495000500000
PC3 PC4
−6
−4
−2
0
2
Note decreasinginformation content asPC number increases
There is still someinformation in PC2, 3, 4that contrasts with theoverall pattern
D G Rossiter
PCA 29
PCs (synthetic bands) – individual stretch
480000 485000 490000 495000 500000
3820
000
3830
000
3840
000
PC1
x
y
480000 485000 490000 495000 500000
3820
000
3830
000
3840
000
PC2
xy
480000 485000 490000 495000 500000
3820
000
3830
000
3840
000
PC3
x
y
480000 485000 490000 495000 500000
3820
000
3830
000
3840
000
PC4
x
yThis allows to visualizecontrasts within a PC
D G Rossiter
PCA 30
Another example of synthetic images
Cordillera de la Costa, W branch Río Aragua, Aragua state, Venezuela
D G Rossiter
PCA 31
Conclusion
• PCA is a powerful data reduction technique
• It is linear – so if variables are highly skewed or multi-model, apply atransformation
• It can also reveal the inter-relation between variables (e.g., reflectances invarious spectral bands)
• It is completely data-driven and is scene-specific
• An important choice is between standardized and unstandardized
D G Rossiter
PCA 32
Topic: PCA of long time series of imagery to reveal seasonality
Groundbreaking paper, 200+ citations:
Eastman, J., & Fulk, M. (1993). Long Sequence Time-Series Evaluation UsingStandardized Principal Components (reprinted from PhotogrammetricEngineering and Remote-Sensing , Vol 59, Pg 991–996, 1993).Photogrammetric Engineering and Remote Sensing, 59(8), 1307–1312.
• Sensor was AVHRR
• Images were monthly maximum NDVI, 1986-1988 (three full years)
D G Rossiter
PCA 33
Synthetic images
Green = “+” score, Red= “-”
Sign is arbitrary, herechosen to show highNDVI in green.
D G Rossiter
PCA 34
Loadings
PC1 All bands contribute ≈equally:(overall intensity)
PC2 shows annual movement ofIntertropical Convergence Zone
PC3/4 show deviations from PC2seasonality
D G Rossiter
PCA 35
Interpretation
Synthetic bands images summarizing all original images, according to the PCs
Loadings relation between original images (dates, x-axis) and contribution to thePC (y-axis).
• Note decreased absolute loadings at higher PCs, this is because theyrepresent less of the total variability of the 36 images
• PC1: ≈ equal contribution of all dates (overall vegetation vigour averaged overthe 3 years)
• PC2: + correlation in N. hemisphere summer, - in winter (seasonality) –Intertropical Convergence Zone
• PC3/4: deviations from PC2 seasonality – note time lag of greening/senescence
• PC5: detecting sensor drift
D G Rossiter
PCA 36
Topic: PCA for a multi- to hyper-variate dataset
A small dataset of 30 soil properties observed at 87 locations in the Lake Valenciabasin, Venezuela
Also recorded coördinates (not used here) and geomorphic region
Main interest is to see the inter-relation between variables (grouping, contrasts)
• Could we only measure surface soil properties w.o. loss of informtion?
• Could we omit some (expensive?) measurements w.o. loss of informtion?
> dim(dlv.r)[1] 87 33> names(dlv.r)[1] "utm.n" "utm.e" "r.geo" "dens.a" "vcs.a" "cs.a" "ms.a" "fs.a"[9] "vfs.a" "sand.a" "silt.a" "clay.a" "oc.a" "ph.a" "cec.a" "caco3.a"
[17] "p.a" "k.a" "dens.b" "vcs.b" "cs.b" "ms.b" "fs.b" "vfs.b"[25] "sand.b" "silt.b" "clay.b" "oc.b" "ph.b" "cec.b" "caco3.b" "p.b"[33] "k.b"
D G Rossiter
PCA 37
Geomorphic regions explain some variation
> ## boxplot of bulk density of the subsoil, by geomorphic region> require(ggplot2)> qplot(x=as.factor(r.geo), y=dens.b, data=dlv.r, geom=c("boxplot", "point"), shape=I(3))
0.50
0.75
1.00
1.25
1.50
2 3 6 7as.factor(r.geo)
dens
.b
D G Rossiter
PCA 38
Computing standardized PCs
> pc summary(pc)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8Standard deviation 3.1651 1.9973 1.8312 1.48272 1.33402 1.14400 1.06427 1.01107Proportion of Variance 0.3339 0.1330 0.1118 0.07328 0.05932 0.04362 0.03776 0.03408Cumulative Proportion 0.3339 0.4669 0.5787 0.65195 0.71127 0.75490 0.79265 0.82673
PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16Standard deviation 0.87600 0.79861 0.76065 0.71565 0.68304 0.60664 0.5797 0.52760Proportion of Variance 0.02558 0.02126 0.01929 0.01707 0.01555 0.01227 0.0112 0.00928Cumulative Proportion 0.85231 0.87357 0.89285 0.90992 0.92548 0.93774 0.9489 0.95822
PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24Standard deviation 0.50731 0.43862 0.42948 0.38868 0.33886 0.3146 0.28247 0.25723Proportion of Variance 0.00858 0.00641 0.00615 0.00504 0.00383 0.0033 0.00266 0.00221Cumulative Proportion 0.96680 0.97322 0.97936 0.98440 0.98823 0.9915 0.99418 0.99639
PC25 PC26 PC27 PC28 PC29 PC30Standard deviation 0.20782 0.1647 0.14647 0.11458 0.04917 0.03113Proportion of Variance 0.00144 0.0009 0.00072 0.00044 0.00008 0.00003Cumulative Proportion 0.99783 0.9987 0.99945 0.99989 0.99997 1.00000
12 PCs (out of 30) explained 90% of the standardized variance of 30standardized variables
D G Rossiter
PCA 39
Screeplot
> screeplot(pc, npcs=16)
pcV
aria
nces
02
46
810
D G Rossiter
PCA 40
Synthetic variablesThe standardized variables are multiplied by these factors to make up the synthetic variables (PCs)
> print(pc$rotation[,1:4])PC1 PC2 PC3 PC4
dens.a 0.18958475 0.3139849696 -0.096232326 0.088356551vcs.a 0.07924278 0.0403831655 0.230370094 -0.048419791cs.a 0.16208849 -0.1202754319 0.363705662 -0.078683279ms.a 0.25194385 -0.1284850503 0.210080111 0.008324169fs.a 0.28293845 -0.1363107368 -0.014925221 0.054220372vfs.a 0.27325614 -0.0910810904 -0.064745942 0.041049247sand.a 0.22462929 -0.2282354263 -0.122083768 0.004537925silt.a -0.04055455 -0.0339314591 0.094532854 -0.128362638clay.a -0.19025200 0.2451180696 0.037298357 0.096288845oc.a -0.22319928 -0.0981217660 0.117350816 -0.103240123ph.a -0.02002915 -0.1607898185 -0.341449972 -0.032629362cec.a -0.11061555 -0.3136921769 0.084300743 0.073849593caco3.a -0.16458980 -0.3322511108 -0.087971037 -0.214771086p.a -0.04422812 -0.1751203044 0.041317218 0.493486269k.a -0.14509102 -0.1129631442 0.068684059 0.389384538dens.b 0.21263155 0.2860476925 -0.081790307 0.119293244vcs.b 0.09246395 0.0124449952 0.416921421 -0.096156354cs.b 0.14040571 -0.0003804787 0.390813767 -0.054448951ms.b 0.23184462 -0.0820839309 0.213227766 -0.002106178fs.b 0.26409667 -0.1415713026 -0.048437418 0.072954308vfs.b 0.27090087 -0.0808317887 -0.116662538 0.078520254sand.b 0.25024710 -0.1834095557 -0.092293422 0.045293178silt.b -0.16691593 -0.0321539033 0.149702956 -0.016192942
D G Rossiter
PCA 41
clay.b -0.19685387 0.2769034074 -0.000293272 -0.046612720oc.b -0.18170694 -0.1227445391 0.191119295 0.086771096ph.b 0.07542964 -0.1271823250 -0.326496078 -0.036204018cec.b -0.15933624 -0.2679633668 -0.019482653 -0.050358586caco3.b -0.14353695 -0.3045767375 -0.026160011 -0.290825175p.b -0.07072839 -0.0569701860 0.100738984 0.438684722k.b -0.15217588 -0.1233052242 -0.006211632 0.404872772
D G Rossiter
PCA 42
Biplot
> plot.it
PCA 43
dens
.a
vcs.a
cs.a ms.afs.avfs.a
sand.a
silt.a
clay.a
oc.a
ph.a
cec.
a
caco
3.a
p.a
k.a
dens
.b
vcs.bcs.b
ms.bfs.b
vfs.b
sand.b
silt.b
clay.b
oc.b ph.b
cec.b
caco
3.b
p.b
k.b
−5.0
−2.5
0.0
2.5
−5 0 5PC1 (33.4% explained var.)
PC
2 (1
3.3%
exp
lain
ed v
ar.)
2 3 6 7
– normal ellipse of each group in PC space; 1 s.d. by default– distance between the points ≈ Mahalanobis distance; inner product between thevariables ≈ correlation– graph produced with ggbiplot package
D G Rossiter
PCA 44
Interpretation of PC1 and 2
• strong correlations between top/sub for most properties
• sand fractions highly correlated
• clay opposite sand
• bulk density opposite CEC/OC/silt/CaCO3
• PC1 +dense, sandy, low silt, low OC
• PC2 +low fertility, base saturation, carbonates
• But these are not exactly aligned with the PCs
• Geomorphic regions cluster points in PC1/2 space
• especially 2 (piedmont) and 7 (recently-emerged lake sediments)D G Rossiter
PCA 45
dens.a
vcs.acs.ams.a
fs.avfs.asand.a
silt.a
clay
.a
oc.a
ph.a
cec.a
caco
3.a
p.a
k.a
dens.b
vcs.bcs.b
ms.b
fs.bvfs.bsand.bsilt.b
clay
.b
oc.b
ph.b
cec.
bca
co3.
b
p.bk.b
−2.5
0.0
2.5
5.0
−2.5 0.0 2.5 5.0PC3 (11.2% explained var.)
PC
4 (7
.3%
exp
lain
ed v
ar.)
2 3 6 7
– PC3 contrasts top and subsoil sand– PC4 contrasts K, P with CaCO3
D G Rossiter
PCA 46
Topic: Factor analysis: beyond PCA
Limitations of PCA:
• PCA is a data reduction technique
• It does not impose any structure on the PCs or synthetic variables, they comeout directly from the eigen decomposition of the correlation or covariancematrix
• So, the resulting PCs or synthetic variables may not be easily interpretable
– the loadings may not line up with any of the axes – see Lake Valenciaexample
– Clear associations of variables but none lined up with a PC
D G Rossiter
PCA 47
Latent variable analysis – concept
• “Factor analysis” as used in social sciences
• Hypothesis: the set of observed variables is a measureable expression ofsome (smaller) number of latent variables
– These can not be directly measured,– They influence a number of the observed variables.– Example: “math ability”, “ability to think abstractly” are assumed to exist
(based on external evidence or theory) and measured with various tests(observed variables).
• The set of observed variables can be analyzed with PCA and then (1) reduced;(2) rotated into interpretable components
D G Rossiter
PCA 48
Latent variable analysis – computation
• from k observed variables hypothesize p < k latent variables (“factors”)
• decompose k× k variance-covariance matrix Σ of the original variables into:1. a p × k loadings matrix Λ (the k columns are the original variables, the p
rows are the latent variables);2. and a k× k diagonal matrix of unexplained variances per original variable
(its uniqueness) Ψ , so Σ = Λ′Λ+ Ψ• In PCA p = k, there is no Ψ , and all variance is explained by the synthetic
variables; there is only one way to do this.
• In factor analysis, the loadings matrix Λ is not unique– it can be multiplied by any k× k orthogonal matrix, known as rotations.– The factor analysis algorithm finds a rotation to satisfy user-specified
conditions; e.g., varimax.
D G Rossiter
PCA 49
Latent variables – example
Lake Valencia, topsoils only; 15 observed variablesHypothesize two latent variables (1) particle-size distribution (texture), (2)reaction (pH, carbonates)
> (fa
PCA 50
Latent variables – interpretation
• uniqueness: Ψ , noise left over after factors are fitted, i.e., variable isunexplained
– here P, K, pH, CaCO3 most unique → more factors are needed
• variance explained as in PCA. Note in PCA k = p and all variance is eventuallyexplained
• loadings as in PCA
– Factor 1 is associated with all sand fractions (coarse-textured soils) and bulkdensity, opposed to clay concentration and organic C
– Factor 2 is associated with silt opposed to clay– So both latent variables are most associated with particle-size distribution;
not according to our original hypothesis – this should be modified
D G Rossiter
PCA 51
Latent variables – plot
Factor 1 (most variance explained) was rotated to show the maximum contrast(varimax rotation)
D G Rossiter
PCA 52
Latent variables – DTD images example
Hypothesis: one factor, “intensity”
Uniquenesses:CK_DTD_304 CK_DTD_306 CK_DTD_307 CK_DTD_308
0.181 0.047 0.073 0.171
Loadings:Factor1
CK_DTD_304 0.905CK_DTD_306 0.976CK_DTD_307 0.963CK_DTD_308 0.910
Factor1SS loadings 3.527Proportion Var 0.882
Strong correlation with each image, little uniqueness, 88% of variance explained.
Compare with PCA results.
D G Rossiter
PCA 53
Single “best” image under the hypothesis of one processCompare with PCA synthetic band 1.
D G Rossiter
PCA 54
Conclusion: PCA vs. latent variable analysis
PCA pure data reduction, maybe can interpret
Latent variable analysis aim is to produce interpretable variables representingthe latent process
D G Rossiter
1: Factor Analysis1.1. Uses of factor analysis in remote sensing
2: Principal Components Analysis2.1. What does PCA do?2.2. Mathematics: the data matrix2.3. Mathematics: the covariance or correlation matrix2.4. Mathematics: Eigen decomposition2.5. Computation2.6. Standardized vs. unstandardized -- 1
3: Remote sensing examples3.1. Example: Time series of diurnal temperature differences3.2. Original images3.3. Reading in the dataset to R3.4. Simple and partial correlations3.5. PCA processing in R3.6. Structure of prcomp object3.7. PCA results -- Importance of components3.8. PCA results -- rotations3.9. Screeplot3.10. Biplots3.11. PCs (synthetic bands) -- same stretch3.12. PCs (synthetic bands) -- individual stretch3.13. Another example of synthetic images
4: PCA of Long time series to reveal seasonality4.1. Synthetic images4.2. Loadings4.3. Interpretation
5: PCA for a multi- to hyper-variate dataset5.1. Geomorphic regions explain some variation5.2. Computing standardized PCs5.3. Screeplot5.4. Synthetic variables5.5. Biplot5.6. Interpretation of PC1 and 2
6: Factor analysis6.1. Latent variable analysis -- concept6.2. Latent variable analysis -- computation6.3. Latent variables -- example6.4. Latent variables -- interpretation6.5. Latent variables -- plot6.6. Latent variables -- DTD images example6.7. Conclusion: PCA vs. latent variable analysis