29
Technical note: Optimal partitioning of soil transects with R D G Rossiter October 16, 2013 Contents 1 Example transect 1 2 Split moving-window 3 2.1 Functions for the SMW method .................. 4 2.2 Options for the SMW method ................... 4 2.2.1 Standardized principal components ........... 5 2.2.2 Original variables ...................... 5 2.2.3 Window width ........................ 6 2.2.4 Window parity ........................ 7 2.2.5 Mullion ............................. 7 2.2.6 What Mahanalobis distance to report? ......... 8 2.3 Analyzing the example transect ................. 8 2.3.1 Using PCs ........................... 8 2.3.2 Using original variables .................. 11 2.4 Visualizing the boundaries ..................... 12 3 When SMW fails to find a solution 14 4 Maximum Level Variance 16 References 17 A Example transect 18 B Split moving-window functions 19 B.1 PCA ................................... 19 Version 2.1 Copyright © 2004, 2009, 2013 D G Rossiter. All rights re- served. Reproduction and dissemination of the work as a whole (not parts) freely permitted if this original copyright notice is included. Sale or placement on a web site where payment must be made to access this document is strictly prohibited. To adapt or translate please contact the author (http://www.itc.nl/personal/rossiter).

Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

Technical noteOptimal partitioning of soil transects with R

D G Rossiter

October 16 2013

Contents

1 Example transect 1

2 Split moving-window 321 Functions for the SMW method 422 Options for the SMW method 4

221 Standardized principal components 5222 Original variables 5223 Window width 6224 Window parity 7225 Mullion 7226 What Mahanalobis distance to report 8

23 Analyzing the example transect 8231 Using PCs 8232 Using original variables 11

24 Visualizing the boundaries 12

3 When SMW fails to find a solution 14

4 Maximum Level Variance 16

References 17

A Example transect 18

B Split moving-window functions 19B1 PCA 19

Version 21 Copyright copy 2004 2009 2013 D G Rossiter All rights re-served Reproduction and dissemination of the work as a whole (notparts) freely permitted if this original copyright notice is included Saleor placement on a web site where payment must be made to access thisdocument is strictly prohibited To adapt or translate please contact theauthor (httpwwwitcnlpersonalrossiter)

B2 Find boundaries 19B3 Plot the transect 21

C Insight into Mahalanobis distance 22

Index of R concepts 26

ii

This note describes procedures to identify soil boundaries along a tran-sect where soil samples have been taken at regular intervals It is basedon the work of Webster [3ndash6] and used as an example in the text of Davis[1 pp 234-243]

Two partitioning methods are described by Webster [4] Split moving-window (SMW) (sect2) and Maximum Level Variance (MLV) (sect4) Only thefirst is developed in this technical note

Note The code in this document was tested with R version 301 (2013-05-16) and packages from that version or later running on Mac OS X1075 The text and graphical output you see here was written as aNoWeb file including both R code and regular LATEX source and then runthrough the excellent knitr package Version 141 [7] on R to generatea LATEX document that includes formatted R code input and the resultsof running the code both text results and graphs Then the LATEX docu-ment was compiled into the PDF version you are now reading If you runthe R code from this document your output may be slightly different ondifferent versions and on different platforms

1 Example transect

We illustrate optimal partitioning with a transect of 42 stations spacedat 25 m intervals in an undisturbed forest near Juruena Mato Grosso inthe southern Amazon collected by Steven Jirka of Cornell University aspart of the LBA1 project Both field and laboratory measurements weremade for illustrative purposes we use only the sand and clay contentmeasured in g kg-1 of three layers These are organized as an R dataframe with the rows being the stations and the columns the variablesThe sample dataset is provided as a comma-separated value file trcsvand is reproduced in Appendix sectA

Task 1 Read the transect from the CSV file into R and display itsstructure bullgt transect lt- readcsv(trcsv)gt str(transect)

dataframe 42 obs of 6 variables$ sandA num 438 449 560 549 428 $ clayA num 283 316 216 261 472 $ sandB num 349 349 449 516 404 $ clayB num 372 405 305 272 429 $ sandC num 483 616 616 516 271 $ clayC num 239 205 172 239 463

Task 2 Plot the six variables along the transect bullgt plot(transect$sandA type=l ylim=c(0900)+ main=Particle-size fraction along transect+ xlab=station ylab=g kg-1+ sub=black A blue B green C solid sand dashed clay)gt lines(transect$sandB type=l col=blue)

1 Large-scale Biosphere-Atmosphere Experiment in Amazonia see httpearthobservatorynasagovStudyLBA

1

gt lines(transect$sandC type=l col=green)gt lines(transect$clayA type=l col=1 lty=2)gt lines(transect$clayB type=l col=blue lty=2)gt lines(transect$clayC type=l col=green lty=2)

0 10 20 30 40

020

040

060

080

0

Particleminussize fraction along transect

black A blue B green C solid sand dashed claystation

g kg

minus1

It is evident that there is spatial autocorrelation ie nearby stations arelikely to be similar It also appears that the sequence can be broken upinto several more homogeneous sections A clear difference is that thebeginning of the sequence has fairly equal sand and clay whereas afterabout station 15 there is much more sand than clay However there is abreak in this pattern near station 21 where again the clay increases

It is also evident that there is much redundant information this is be-cause an increase in one particle-size fraction must be compensated by adecrease in one of the others Further the sand and clay contents in thethree horizons at each station are similar We can check the redundancyand reduce the number of variables with principal components analysis(PCA) Although all the variables are in the same units of measure usingstandardized components gives equal weight to all variables regardlessof their ranges

Task 3 Compute the standardized principal components of the sixvariables and plot the scores of the first two PCs along the transect bullgt pc lt- prcomp(transect center=T scale=T retx=T)$xgt plot(pc[1] type=l col=blue+ main=PCs of particle-size fraction along transect+ xlab=station ylab=PC score+ sub=solid PC1 dashed PC2+ ylim=c(-45 45))gt lines(pc[2] type=l lty=2 col=blue)gt abline(h=0)

2

0 10 20 30 40

minus4

minus2

02

4

PCs of particleminussize fraction along transect

solid PC1 dashed PC2station

PC

sco

re

The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect

For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences

2 Split moving-window

The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as

D2 = (xs minus xd)TWminus1(xs minus xd) (1)

where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2

1 Variables with lower pooled within-half residual variances are givenmore weight

2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)

2 See numeric example in Appendix D

3

The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves

If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance

E2 = (xs minus xd)T I(xs minus xd) (2)

where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances

Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero

21 Functions for the SMW method

I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

These must be called in the following order

1 smwpc (optional) extract PCs

2 smwdw SMW analysis of PCs or original variables

3 smwgraph visualise results of SMW analysis

22 Options for the SMW method

The analyst must make several important choices that have a large effecton the presumed boundaries

1 Whether to use the original variables or some number of their stan-dardized principal components

2 Whether to use Euclidean or Mahalanobis distances

3 How wide a window to use

4 Window parity ie whether to use an odd or even window

5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary

6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported

4

221 Standardized principal components

Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)

[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components

PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631

gt str(pc)

dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252

Here we see that the two PCs explain most of the variance

The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast

The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables

222 Original variables

Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)

dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031

3 rather than the variance-covariance matrix

5

$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642

Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)

sandA clayA sandB clayB sandC clayC0 0 0 0 0 0

gt var(trscale)

sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000

223 Window width

The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect

This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))

Autocorrelations of series pc[ 1] by lag

0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331

10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131

6

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Serial autominuscorrelation PC1

The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8

If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect

224 Window parity

If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45

225 Mullion

Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note

7

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 2: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

B2 Find boundaries 19B3 Plot the transect 21

C Insight into Mahalanobis distance 22

Index of R concepts 26

ii

This note describes procedures to identify soil boundaries along a tran-sect where soil samples have been taken at regular intervals It is basedon the work of Webster [3ndash6] and used as an example in the text of Davis[1 pp 234-243]

Two partitioning methods are described by Webster [4] Split moving-window (SMW) (sect2) and Maximum Level Variance (MLV) (sect4) Only thefirst is developed in this technical note

Note The code in this document was tested with R version 301 (2013-05-16) and packages from that version or later running on Mac OS X1075 The text and graphical output you see here was written as aNoWeb file including both R code and regular LATEX source and then runthrough the excellent knitr package Version 141 [7] on R to generatea LATEX document that includes formatted R code input and the resultsof running the code both text results and graphs Then the LATEX docu-ment was compiled into the PDF version you are now reading If you runthe R code from this document your output may be slightly different ondifferent versions and on different platforms

1 Example transect

We illustrate optimal partitioning with a transect of 42 stations spacedat 25 m intervals in an undisturbed forest near Juruena Mato Grosso inthe southern Amazon collected by Steven Jirka of Cornell University aspart of the LBA1 project Both field and laboratory measurements weremade for illustrative purposes we use only the sand and clay contentmeasured in g kg-1 of three layers These are organized as an R dataframe with the rows being the stations and the columns the variablesThe sample dataset is provided as a comma-separated value file trcsvand is reproduced in Appendix sectA

Task 1 Read the transect from the CSV file into R and display itsstructure bullgt transect lt- readcsv(trcsv)gt str(transect)

dataframe 42 obs of 6 variables$ sandA num 438 449 560 549 428 $ clayA num 283 316 216 261 472 $ sandB num 349 349 449 516 404 $ clayB num 372 405 305 272 429 $ sandC num 483 616 616 516 271 $ clayC num 239 205 172 239 463

Task 2 Plot the six variables along the transect bullgt plot(transect$sandA type=l ylim=c(0900)+ main=Particle-size fraction along transect+ xlab=station ylab=g kg-1+ sub=black A blue B green C solid sand dashed clay)gt lines(transect$sandB type=l col=blue)

1 Large-scale Biosphere-Atmosphere Experiment in Amazonia see httpearthobservatorynasagovStudyLBA

1

gt lines(transect$sandC type=l col=green)gt lines(transect$clayA type=l col=1 lty=2)gt lines(transect$clayB type=l col=blue lty=2)gt lines(transect$clayC type=l col=green lty=2)

0 10 20 30 40

020

040

060

080

0

Particleminussize fraction along transect

black A blue B green C solid sand dashed claystation

g kg

minus1

It is evident that there is spatial autocorrelation ie nearby stations arelikely to be similar It also appears that the sequence can be broken upinto several more homogeneous sections A clear difference is that thebeginning of the sequence has fairly equal sand and clay whereas afterabout station 15 there is much more sand than clay However there is abreak in this pattern near station 21 where again the clay increases

It is also evident that there is much redundant information this is be-cause an increase in one particle-size fraction must be compensated by adecrease in one of the others Further the sand and clay contents in thethree horizons at each station are similar We can check the redundancyand reduce the number of variables with principal components analysis(PCA) Although all the variables are in the same units of measure usingstandardized components gives equal weight to all variables regardlessof their ranges

Task 3 Compute the standardized principal components of the sixvariables and plot the scores of the first two PCs along the transect bullgt pc lt- prcomp(transect center=T scale=T retx=T)$xgt plot(pc[1] type=l col=blue+ main=PCs of particle-size fraction along transect+ xlab=station ylab=PC score+ sub=solid PC1 dashed PC2+ ylim=c(-45 45))gt lines(pc[2] type=l lty=2 col=blue)gt abline(h=0)

2

0 10 20 30 40

minus4

minus2

02

4

PCs of particleminussize fraction along transect

solid PC1 dashed PC2station

PC

sco

re

The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect

For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences

2 Split moving-window

The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as

D2 = (xs minus xd)TWminus1(xs minus xd) (1)

where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2

1 Variables with lower pooled within-half residual variances are givenmore weight

2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)

2 See numeric example in Appendix D

3

The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves

If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance

E2 = (xs minus xd)T I(xs minus xd) (2)

where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances

Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero

21 Functions for the SMW method

I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

These must be called in the following order

1 smwpc (optional) extract PCs

2 smwdw SMW analysis of PCs or original variables

3 smwgraph visualise results of SMW analysis

22 Options for the SMW method

The analyst must make several important choices that have a large effecton the presumed boundaries

1 Whether to use the original variables or some number of their stan-dardized principal components

2 Whether to use Euclidean or Mahalanobis distances

3 How wide a window to use

4 Window parity ie whether to use an odd or even window

5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary

6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported

4

221 Standardized principal components

Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)

[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components

PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631

gt str(pc)

dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252

Here we see that the two PCs explain most of the variance

The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast

The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables

222 Original variables

Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)

dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031

3 rather than the variance-covariance matrix

5

$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642

Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)

sandA clayA sandB clayB sandC clayC0 0 0 0 0 0

gt var(trscale)

sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000

223 Window width

The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect

This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))

Autocorrelations of series pc[ 1] by lag

0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331

10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131

6

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Serial autominuscorrelation PC1

The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8

If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect

224 Window parity

If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45

225 Mullion

Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note

7

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 3: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

This note describes procedures to identify soil boundaries along a tran-sect where soil samples have been taken at regular intervals It is basedon the work of Webster [3ndash6] and used as an example in the text of Davis[1 pp 234-243]

Two partitioning methods are described by Webster [4] Split moving-window (SMW) (sect2) and Maximum Level Variance (MLV) (sect4) Only thefirst is developed in this technical note

Note The code in this document was tested with R version 301 (2013-05-16) and packages from that version or later running on Mac OS X1075 The text and graphical output you see here was written as aNoWeb file including both R code and regular LATEX source and then runthrough the excellent knitr package Version 141 [7] on R to generatea LATEX document that includes formatted R code input and the resultsof running the code both text results and graphs Then the LATEX docu-ment was compiled into the PDF version you are now reading If you runthe R code from this document your output may be slightly different ondifferent versions and on different platforms

1 Example transect

We illustrate optimal partitioning with a transect of 42 stations spacedat 25 m intervals in an undisturbed forest near Juruena Mato Grosso inthe southern Amazon collected by Steven Jirka of Cornell University aspart of the LBA1 project Both field and laboratory measurements weremade for illustrative purposes we use only the sand and clay contentmeasured in g kg-1 of three layers These are organized as an R dataframe with the rows being the stations and the columns the variablesThe sample dataset is provided as a comma-separated value file trcsvand is reproduced in Appendix sectA

Task 1 Read the transect from the CSV file into R and display itsstructure bullgt transect lt- readcsv(trcsv)gt str(transect)

dataframe 42 obs of 6 variables$ sandA num 438 449 560 549 428 $ clayA num 283 316 216 261 472 $ sandB num 349 349 449 516 404 $ clayB num 372 405 305 272 429 $ sandC num 483 616 616 516 271 $ clayC num 239 205 172 239 463

Task 2 Plot the six variables along the transect bullgt plot(transect$sandA type=l ylim=c(0900)+ main=Particle-size fraction along transect+ xlab=station ylab=g kg-1+ sub=black A blue B green C solid sand dashed clay)gt lines(transect$sandB type=l col=blue)

1 Large-scale Biosphere-Atmosphere Experiment in Amazonia see httpearthobservatorynasagovStudyLBA

1

gt lines(transect$sandC type=l col=green)gt lines(transect$clayA type=l col=1 lty=2)gt lines(transect$clayB type=l col=blue lty=2)gt lines(transect$clayC type=l col=green lty=2)

0 10 20 30 40

020

040

060

080

0

Particleminussize fraction along transect

black A blue B green C solid sand dashed claystation

g kg

minus1

It is evident that there is spatial autocorrelation ie nearby stations arelikely to be similar It also appears that the sequence can be broken upinto several more homogeneous sections A clear difference is that thebeginning of the sequence has fairly equal sand and clay whereas afterabout station 15 there is much more sand than clay However there is abreak in this pattern near station 21 where again the clay increases

It is also evident that there is much redundant information this is be-cause an increase in one particle-size fraction must be compensated by adecrease in one of the others Further the sand and clay contents in thethree horizons at each station are similar We can check the redundancyand reduce the number of variables with principal components analysis(PCA) Although all the variables are in the same units of measure usingstandardized components gives equal weight to all variables regardlessof their ranges

Task 3 Compute the standardized principal components of the sixvariables and plot the scores of the first two PCs along the transect bullgt pc lt- prcomp(transect center=T scale=T retx=T)$xgt plot(pc[1] type=l col=blue+ main=PCs of particle-size fraction along transect+ xlab=station ylab=PC score+ sub=solid PC1 dashed PC2+ ylim=c(-45 45))gt lines(pc[2] type=l lty=2 col=blue)gt abline(h=0)

2

0 10 20 30 40

minus4

minus2

02

4

PCs of particleminussize fraction along transect

solid PC1 dashed PC2station

PC

sco

re

The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect

For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences

2 Split moving-window

The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as

D2 = (xs minus xd)TWminus1(xs minus xd) (1)

where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2

1 Variables with lower pooled within-half residual variances are givenmore weight

2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)

2 See numeric example in Appendix D

3

The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves

If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance

E2 = (xs minus xd)T I(xs minus xd) (2)

where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances

Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero

21 Functions for the SMW method

I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

These must be called in the following order

1 smwpc (optional) extract PCs

2 smwdw SMW analysis of PCs or original variables

3 smwgraph visualise results of SMW analysis

22 Options for the SMW method

The analyst must make several important choices that have a large effecton the presumed boundaries

1 Whether to use the original variables or some number of their stan-dardized principal components

2 Whether to use Euclidean or Mahalanobis distances

3 How wide a window to use

4 Window parity ie whether to use an odd or even window

5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary

6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported

4

221 Standardized principal components

Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)

[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components

PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631

gt str(pc)

dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252

Here we see that the two PCs explain most of the variance

The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast

The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables

222 Original variables

Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)

dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031

3 rather than the variance-covariance matrix

5

$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642

Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)

sandA clayA sandB clayB sandC clayC0 0 0 0 0 0

gt var(trscale)

sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000

223 Window width

The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect

This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))

Autocorrelations of series pc[ 1] by lag

0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331

10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131

6

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Serial autominuscorrelation PC1

The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8

If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect

224 Window parity

If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45

225 Mullion

Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note

7

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 4: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

gt lines(transect$sandC type=l col=green)gt lines(transect$clayA type=l col=1 lty=2)gt lines(transect$clayB type=l col=blue lty=2)gt lines(transect$clayC type=l col=green lty=2)

0 10 20 30 40

020

040

060

080

0

Particleminussize fraction along transect

black A blue B green C solid sand dashed claystation

g kg

minus1

It is evident that there is spatial autocorrelation ie nearby stations arelikely to be similar It also appears that the sequence can be broken upinto several more homogeneous sections A clear difference is that thebeginning of the sequence has fairly equal sand and clay whereas afterabout station 15 there is much more sand than clay However there is abreak in this pattern near station 21 where again the clay increases

It is also evident that there is much redundant information this is be-cause an increase in one particle-size fraction must be compensated by adecrease in one of the others Further the sand and clay contents in thethree horizons at each station are similar We can check the redundancyand reduce the number of variables with principal components analysis(PCA) Although all the variables are in the same units of measure usingstandardized components gives equal weight to all variables regardlessof their ranges

Task 3 Compute the standardized principal components of the sixvariables and plot the scores of the first two PCs along the transect bullgt pc lt- prcomp(transect center=T scale=T retx=T)$xgt plot(pc[1] type=l col=blue+ main=PCs of particle-size fraction along transect+ xlab=station ylab=PC score+ sub=solid PC1 dashed PC2+ ylim=c(-45 45))gt lines(pc[2] type=l lty=2 col=blue)gt abline(h=0)

2

0 10 20 30 40

minus4

minus2

02

4

PCs of particleminussize fraction along transect

solid PC1 dashed PC2station

PC

sco

re

The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect

For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences

2 Split moving-window

The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as

D2 = (xs minus xd)TWminus1(xs minus xd) (1)

where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2

1 Variables with lower pooled within-half residual variances are givenmore weight

2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)

2 See numeric example in Appendix D

3

The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves

If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance

E2 = (xs minus xd)T I(xs minus xd) (2)

where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances

Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero

21 Functions for the SMW method

I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

These must be called in the following order

1 smwpc (optional) extract PCs

2 smwdw SMW analysis of PCs or original variables

3 smwgraph visualise results of SMW analysis

22 Options for the SMW method

The analyst must make several important choices that have a large effecton the presumed boundaries

1 Whether to use the original variables or some number of their stan-dardized principal components

2 Whether to use Euclidean or Mahalanobis distances

3 How wide a window to use

4 Window parity ie whether to use an odd or even window

5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary

6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported

4

221 Standardized principal components

Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)

[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components

PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631

gt str(pc)

dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252

Here we see that the two PCs explain most of the variance

The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast

The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables

222 Original variables

Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)

dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031

3 rather than the variance-covariance matrix

5

$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642

Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)

sandA clayA sandB clayB sandC clayC0 0 0 0 0 0

gt var(trscale)

sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000

223 Window width

The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect

This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))

Autocorrelations of series pc[ 1] by lag

0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331

10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131

6

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Serial autominuscorrelation PC1

The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8

If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect

224 Window parity

If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45

225 Mullion

Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note

7

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 5: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

0 10 20 30 40

minus4

minus2

02

4

PCs of particleminussize fraction along transect

solid PC1 dashed PC2station

PC

sco

re

The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect

For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences

2 Split moving-window

The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as

D2 = (xs minus xd)TWminus1(xs minus xd) (1)

where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2

1 Variables with lower pooled within-half residual variances are givenmore weight

2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)

2 See numeric example in Appendix D

3

The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves

If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance

E2 = (xs minus xd)T I(xs minus xd) (2)

where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances

Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero

21 Functions for the SMW method

I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

These must be called in the following order

1 smwpc (optional) extract PCs

2 smwdw SMW analysis of PCs or original variables

3 smwgraph visualise results of SMW analysis

22 Options for the SMW method

The analyst must make several important choices that have a large effecton the presumed boundaries

1 Whether to use the original variables or some number of their stan-dardized principal components

2 Whether to use Euclidean or Mahalanobis distances

3 How wide a window to use

4 Window parity ie whether to use an odd or even window

5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary

6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported

4

221 Standardized principal components

Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)

[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components

PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631

gt str(pc)

dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252

Here we see that the two PCs explain most of the variance

The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast

The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables

222 Original variables

Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)

dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031

3 rather than the variance-covariance matrix

5

$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642

Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)

sandA clayA sandB clayB sandC clayC0 0 0 0 0 0

gt var(trscale)

sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000

223 Window width

The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect

This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))

Autocorrelations of series pc[ 1] by lag

0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331

10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131

6

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Serial autominuscorrelation PC1

The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8

If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect

224 Window parity

If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45

225 Mullion

Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note

7

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 6: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves

If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance

E2 = (xs minus xd)T I(xs minus xd) (2)

where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances

Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero

21 Functions for the SMW method

I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

These must be called in the following order

1 smwpc (optional) extract PCs

2 smwdw SMW analysis of PCs or original variables

3 smwgraph visualise results of SMW analysis

22 Options for the SMW method

The analyst must make several important choices that have a large effecton the presumed boundaries

1 Whether to use the original variables or some number of their stan-dardized principal components

2 Whether to use Euclidean or Mahalanobis distances

3 How wide a window to use

4 Window parity ie whether to use an odd or even window

5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary

6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported

4

221 Standardized principal components

Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)

[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components

PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631

gt str(pc)

dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252

Here we see that the two PCs explain most of the variance

The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast

The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables

222 Original variables

Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)

dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031

3 rather than the variance-covariance matrix

5

$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642

Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)

sandA clayA sandB clayB sandC clayC0 0 0 0 0 0

gt var(trscale)

sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000

223 Window width

The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect

This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))

Autocorrelations of series pc[ 1] by lag

0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331

10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131

6

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Serial autominuscorrelation PC1

The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8

If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect

224 Window parity

If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45

225 Mullion

Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note

7

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 7: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

221 Standardized principal components

Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)

[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components

PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631

gt str(pc)

dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252

Here we see that the two PCs explain most of the variance

The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast

The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables

222 Original variables

Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)

dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031

3 rather than the variance-covariance matrix

5

$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642

Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)

sandA clayA sandB clayB sandC clayC0 0 0 0 0 0

gt var(trscale)

sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000

223 Window width

The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect

This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))

Autocorrelations of series pc[ 1] by lag

0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331

10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131

6

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Serial autominuscorrelation PC1

The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8

If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect

224 Window parity

If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45

225 Mullion

Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note

7

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 8: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642

Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)

sandA clayA sandB clayB sandC clayC0 0 0 0 0 0

gt var(trscale)

sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000

223 Window width

The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect

This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))

Autocorrelations of series pc[ 1] by lag

0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331

10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131

6

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Serial autominuscorrelation PC1

The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8

If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect

224 Window parity

If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45

225 Mullion

Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note

7

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 9: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Serial autominuscorrelation PC1

The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8

If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect

224 Window parity

If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45

225 Mullion

Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note

7

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 10: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1

226 What Mahanalobis distance to report

The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output

23 Analyzing the example transect

The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity

231 Using PCs

The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating

Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584

There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38

Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull

8

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 11: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold

Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)

[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409

Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results

Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)

[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639

Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate

Using a narrow window changes the picture radically

Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)

[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 6 4764

9

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 12: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions

Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 14 206992 15 109723 25 94034 24 5239

With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)

Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)

[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2

1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329

And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast

Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-

10

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 13: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect

232 Using original variables

The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same

Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull

Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same

gt d lt- smwdw(transect wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756

gt d lt- smwdw(pc wid = 9)

[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2

1 14 229532 15 98113 24 83184 17 75715 25 7160

Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority

11

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 14: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

24 Visualizing the boundaries

The results returned by smwdw can be visualized with the smwgraphfunction

Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)

0 10 20 30 40

05

1015

20

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source

The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature

12

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 15: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))

0 10 20 30 40

05

1015

2025

Boundaries

window8 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

7

1415

16 24

25

26 38

0 10 20 30 40

05

1015

20

Boundaries

window8 PC 1 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

14

15

24

25

0 10 20 30 40

02

46

8

Boundaries

window8 PC 2 onlystation

squa

red

dist

ance

bet

wee

n ha

lves

67

1718

19

23

24

25

26

0 10 20 30 40

05

1015

20

Boundaries

window9 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17 2425

0 10 20 30 40

05

1015

2025

Boundaries

window10 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

16 1824

2526

0 10 20 30 40

05

1015

2025

30

Boundaries

window10 mullion1 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

13

14

15

16

17

2425

26

0 10 20 30 40

010

020

030

040

0

Boundaries

window6 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

0 10 20 30 40

010

2030

40

Boundaries

window7 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

6

14

24

0 10 20 30 40

05

1015

Boundaries

window11 PCs 2station

squa

red

dist

ance

bet

wee

n ha

lves

14

15

17

23

24

25

0 10 20 30 40

05

1015

Boundaries

window9 PCs 2 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

2425

26

0 10 20 30 40

050

100

200

300

Boundaries

window9 Original vars 6station

squa

red

dist

ance

bet

wee

n ha

lves

6

9 1417

24

25

28 35

0 10 20 30 40

010

0000

2000

0030

0000

Boundaries

window9 Original vars 6 Euclideanstation

squa

red

dist

ance

bet

wee

n ha

lves

12

13

14

15

23

24

25

26

13

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 16: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

3 When SMW fails to find a solution

The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like

Position 3 Singular pooled covariance matrix not a boundary

This implies a non-stable analytical context Among the causes are

bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero

bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly

Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries

Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull

The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]

Note We use setseed so your results will be identical to these notesin practice you would not use this

gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)

Autocorrelations of series runif(42) by lag

0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097

10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040

14

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 17: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

0 5 10 15

minus0

20

00

20

40

60

81

0

Lag

AC

F

Autocorrelation of a uniform random variable

Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)

Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)

[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components

PC1 PC2x -07071 07071y 07071 07071

gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)

15

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 18: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

0 10 20 30 40

minus2

minus1

01

2

black PC1 blue PC2station

There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance

Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)

0 10 20 30 40

010

2030

40

Boundaries

station

squa

red

dist

ance

bet

wee

n ha

lves

12

1415

16

17

1920 29

This illustrates the danger of relying on automatic procedures

4 Maximum Level Variance

The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a

16

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 19: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries

I have not yet implemented this the mathematics are presented in theappendix to Webster [4]

References

[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1

[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416

[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1

[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17

[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980

[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1

[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1

17

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 20: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

A Example transect

sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199

18

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 21: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)

[1] smwdw smwgraph smwpc

B1 PCA

gt print(smwpc)

function(transect npc=2)

cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))

standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)

frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])

print some diagnosticsprint(paste(PCA selected component

ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))

print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))

print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)

return(pcs)

B2 Find boundaries

gt print(smwdw)

function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)

silently correct unreasonable arguments

mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)

check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)

if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)

length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(

Error width must be positive and less than 12 of transect lengthlen))

return(NULL) number of variables

nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left

boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths

hwid lt- boffset - ((wid2)) - mull collect ds with their centre

19

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 22: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)

move left boundary of window along transectfor (l in 1(len-wid))

right boundary of windowr lt- l + wid

option Mahalanobis distance extract halves

winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector

if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)

else

winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)

if (ident==F)

pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))

store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-

mahalanobis(winlmean winrmean cov(pool))silent=TRUE)

if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid

Singular pooled covariance matrix not a boundarysep=))

option Euclidean distanceelse

d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))

sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))

Notes on this code

1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array

2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov

3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-

20

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 23: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened

B3 Plot the transect

gt print(smwgraph)

function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary

pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)

the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves

type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)

main boundaries -- compare to imposed maximum if any otherwise within-transect maximum

dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)

21

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 24: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

C Insight into Mahalanobis distance

This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window

We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age

Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence

We consider the distance between the windows in bivariate space formedby the girth and height

Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

gt (left lt- test[14 12])

Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278

gt (right lt- test[58 12])

Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean

Girth Height-05816 -13732

Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights

The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W

22

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 25: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

gt sqrt(sum((lmean - rmean)^2))

[1] 1491

gt sqrt(mahalanobis(lmean rmean diag(2)))

[1] 1491

Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance

Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)

1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062

gt (rightresid lt- t(right) - rmean)

5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618

gt (resid lt- (rbind(t(leftresid) t(rightresid))))

Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962

gt apply(resid 2 sd)

Girth Height02085 08952

gt (W lt- cov(resid))

Girth HeightGirth 004348 002947Height 002947 080137

gt var(test[ 12])

Girth HeightGirth 10000 05193Height 05193 10000

So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The

23

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 26: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling

This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance

We can now examine the contribution of each variable to the distance

Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)

Girth Height-05816 -13732

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 235855 -08674Height -08674 12798

gt Winv[1 1]Winv[2 2]

Girth1843

gt (tmp lt- drop(Winv diff))

Girth Height-12525 -1253

gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))

[1] 9005

gt sqrt(dsquared)

[1] 3001

gt sqrt(mahalanobis(lmean rmean W))

[1] 3001

Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable

24

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 27: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull

These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])

Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569

gt (right lt- test[2629 12])

Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278

gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)

Girth Height-07728 -07847

gt sqrt(sum((lmean - rmean)^2))

[1] 1101

gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)

Girth Height02303 03728

gt (W lt- cov(resid))

Girth HeightGirth 005306 -00384Height -003840 01390

gt (Winv lt- solve(W diag(2)))

[1] [2]Girth 23559 6509Height 6509 8993

gt Winv[1 1]Winv[2 2]

Girth262

gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))

25

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 28: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

[1] 275

gt sqrt(dsquared)

[1] 5244

gt sqrt(mahalanobis(lmean rmean W))

[1] 5244

In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244

26

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts
Page 29: Technical note: Optimal partitioning of soil transects ... · 2.2.1 Standardized principal components Function smw.pccomputes the standardized principal components from the correlation

Index of R Concepts

acf 6 14apply 21

class 22cov 21

datasets package 23

knitr package 1

mahalanobis 21 23 25mean 21

runif 14

scale 5setseed 14source 4 12stats package 6

t 21trees dataset 23try 21

27

  • 1 Example transect
  • 2 Split moving-window
    • 21 Functions for the SMW method
    • 22 Options for the SMW method
      • 221 Standardized principal components
      • 222 Original variables
      • 223 Window width
      • 224 Window parity
      • 225 Mullion
      • 226 What Mahanalobis distance to report
        • 23 Analyzing the example transect
          • 231 Using PCs
          • 232 Using original variables
            • 24 Visualizing the boundaries
              • 3 When SMW fails to find a solution
              • 4 Maximum Level Variance
              • References
              • A Example transect
              • B Split moving-window functions
                • B1 PCA
                • B2 Find boundaries
                • B3 Plot the transect
                  • C Insight into Mahalanobis distance
                  • Index of R concepts