Empirical Orthogonal Functions and Clusters

M. Benno Blumenthal

6 June 2008

1 Outline

Contents

1 Outline
2 Introduction
    2.1 EOF and Cluster Comparison
3 Analyzing Test Data
    3.1 Clusters
    3.2 Test Data EOFs
4 Real World Data
    4.1 Eritrea Climatological Malaria
        4.1.1 Computing Eritrea Clusters
        4.1.2 Computing Eritrea EOFs
    4.2 Madagascar Highlands Malaria
        4.2.1 Computing Madagascar Highlands Clusters
        4.2.2 Computing Madagascar Highlands EOFs
5 Summary

Abstract

Empirical Orthogonal Functions and Clustering are both used as data reduction schemes and are in fact more closely related than may appear at first glance.

2 Introduction

Historically, Clustering and Empirical Orthogonal Functions (EOFs) are distinctly different data reduction schemes, usually applied in different fields. More recent work has established precise connections between the two approaches, which lets us better understand the differences between them and use them as alternative views of the same dataset. Here we will apply the two techniques to three sample datasets of increasing complexity, both to gain a better understanding of the datasets and to see how the techniques can help extract simpler models from complex data.

2.1 EOF and Cluster Comparison

EOF                                                              Cluster
Optimally truncate spectrum by keeping N most-energetic modes    Choose number of clusters
Each point has a weight for each mode                            Each point belongs to exactly one cluster
Each mode has an (orthogonal) time series                        Each cluster has a time series
Evaluate variance fit                                            Evaluate variance fit

As presented in earlier talks, both EOFs and Clusters help us characterize a large dataset by pulling out essential features. Our measure of essential in both cases is variance or energy, and many different terms are used to say the same thing. In particular, we call it an L2 norm (which means the measure is a sum of squares), which can be interpreted as distance (normally root-mean-square, but just using mean-square to judge separation gives the same results). So if we were analyzing malaria case data by time and district, for example, we would consider two districts as “close” if they tended to have the same number of cases of malaria for each and every time period, i.e. the mean square difference in the number of cases is small.
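As a small illustration of this distance measure, the sketch below compares monthly case counts for three districts. The district names and the numbers are invented for the example; they do not come from any dataset used in this handout.

```python
def mean_square_difference(a, b):
    """Mean of the squared differences between two equal-length time series."""
    if len(a) != len(b):
        raise ValueError("time series must cover the same time periods")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical monthly case counts (invented for illustration).
district_a = [10, 12, 30, 55, 20, 11]
district_b = [11, 13, 28, 57, 19, 10]   # follows district_a closely
district_c = [40, 42, 41, 39, 40, 43]   # very different behavior

d_ab = mean_square_difference(district_a, district_b)
d_ac = mean_square_difference(district_a, district_c)
print(d_ab, d_ac)
```

Under this measure districts a and b are "close" and district c is far from both, which is the sense of closeness that both EOFs and clusters work with.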

One facet of data reduction is having fewer objects to think about. For EOFs, this is done by truncating, i.e. only keeping the N most-energetic modes. In Cluster analysis, one supplies the number of clusters to use.

The second facet is how the objects kept relate to the original domain, e.g. districts. In the case of EOFs, each district has a weight for each EOF mode kept: that weight can be zero or not. In the case of clusters, each district belongs to exactly one cluster, so the corresponding weight is either zero or one, and it is one for only one cluster.

The third facet is the time behavior. For EOFs, different modes have orthogonal time behaviors, i.e. they are constructed to be different. Each cluster, however, has some time behavior, with no enforcement of orthogonality. In other words, for clusters, the fact that two clusters are distinct could be because their time behavior is different, or it could be because the amplitude of their variation is different.

Finally, both schemes can be evaluated by checking how much of the variance is fit by the analysis, and how much variance remains unexplained (i.e. the variance of the EOF modes that are dropped in the truncation, or the within-cluster variance).

3 Analyzing Test Data

The first dataset is test data – it has been constructed with a random number generator to have the desired statistical properties, so that we can use our techniques to analyze the data and have confidence in obtaining the correct answer.

The data has been constructed to look like time series from a set of districts that naturally cluster into four groups. The districts are labeled according to which cluster they belong to, so it is easy to verify the cluster analysis returns the intended clusters.

3.1 Clusters

Test Data

(link)

home .ciph .TestCluster .testdata

This code and link access the data from the course data area. The dataset page shows the structure of the dataset: a two-dimensional dataset that is a function of district (vspace) and time (T).

Figure 1 shows the test data plotted as a function of district and time. It is clearly banded, with each district label indicating the band to which it belongs.

Applying Kmeans clustering

Applying the function k-means136 requires specifying space, time, and the number of clusters.

(link)

home .ciph .TestCluster .testdata

[vspace][T]4 k-means136

Following the link shows the detailed description of the resulting dataset, and gives a link to the cluster viewer, which makes it easy to examine the clusters.
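For readers who want to see what a k-means calculation of this kind involves, here is a minimal NumPy sketch of the textbook (Lloyd) algorithm applied to synthetic banded data. It is not the implementation behind k-means136; the synthetic patterns, the noise level, and the farthest-point initialisation are all assumptions made for the illustration.

```python
import numpy as np

def kmeans(data, k, n_iter=50):
    """Plain Lloyd's k-means on the rows of `data` (one row per district)."""
    # Deterministic farthest-point initialisation (an assumption of this sketch):
    # start from row 0, then repeatedly add the row farthest from chosen centres.
    centres = [data[0]]
    for _ in range(1, k):
        d2 = np.min([((data - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(data[np.argmax(d2)])
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        # Assign each district to the nearest centre (squared L2 distance).
        d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Each centre becomes the mean time series of its member districts.
        new_centres = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Synthetic stand-in for the test data: 35 districts in four groups, 60 times.
rng = np.random.default_rng(0)
t = np.arange(60)
patterns = np.array([0.5 * t, -0.3 * t, 10 * np.sin(t / 5.0), np.zeros(60)])
truth = np.repeat(np.arange(4), [10, 10, 5, 10])
data = patterns[truth] + rng.normal(scale=0.5, size=(35, 60))

labels, centres = kmeans(data, 4)
print(labels)
```

Because the synthetic groups are well separated, each group ends up in its own cluster, mirroring what the cluster viewer shows for the handout's test data. The cluster numbering itself is arbitrary.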

Test Clusters

Figure 2 shows the cluster viewer’s presentation of the test clusters. On the lower left is a block diagram which indicates which district belongs to which cluster. Clearly the districts are sorted by label, which was the intended answer.

Figure 1: Test data (districts w2_1 to z2_10 on the vspace axis, plotted against Time, Jan 1960 to Jan 2010)

k-means results

cluster ids    NC     var_fit fraction    var_misfit fraction
1              10     0.8764161           0.002578947
2              5      0.09217408          5.2143703E-04
3              10     0.02717125          4.6343994E-04
4              10     6.0636725E-04       6.8470697E-05
total          35     0.9963678           0.003632295

(link)

Figure 2: Test Data Clusters (block diagram of districts w2_1 to z2_10 by cluster, with the k-means results table)

On the upper left are the time series, color-coded to match the block diagram. Clearly the two most energetic clusters have strong trends; the weakest cluster has very little energy.

The lower right gives a table which shows the variance fit and the misfit within each cluster. These numbers are summed at the bottom, giving the variance fit by all the clusters together, as well as the variance not explained by the four clusters. Clearly the clusters account for almost all the energy, as constructed.
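The totals at the bottom of the k-means results table in Figure 2 can be checked by re-adding the per-cluster rows. The sketch below does only that arithmetic, using the values as printed in the figure; it makes no assumption about how the cluster viewer computes them.

```python
import math

# (cluster id, number of districts NC, var_fit fraction, var_misfit fraction)
rows = [
    (1, 10, 0.8764161, 0.002578947),
    (2, 5, 0.09217408, 5.2143703e-04),
    (3, 10, 0.02717125, 4.6343994e-04),
    (4, 10, 6.0636725e-04, 6.8470697e-05),
]

n_districts = sum(r[1] for r in rows)
var_fit = sum(r[2] for r in rows)
var_misfit = sum(r[3] for r in rows)

print(n_districts, round(var_fit, 7), round(var_misfit, 9))
# Fit and misfit together account for all of the variance.
print(math.isclose(var_fit + var_misfit, 1.0, abs_tol=1e-6))
```

The fit and misfit fractions add up to one to within rounding, which is why a small total misfit means the four clusters describe the test data almost completely.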

3.2 Test Data EOFs

Test Data - EOFs

Now we look at the same test case with EOFs.

Clusters are groups of similar points, but it is also true that the clusters are as different as possible.

EOFs are the minimal (fewest modes) representation of the most energy (variance).

As it turns out, the EOFs correspond to the cluster differences.

Computing Test Data EOFs

• Get test data

(link)

home .ciph .TestCluster .testdata

• remove spatial mean using the function average and the function sub

(link)

dup

[vspace]average

sub

• compute EOFs using the function svd

(link)

[vspace][T]svd

svd results (EOF 1)

total variance    normalized eigenvalues    singular values    cumulative normalized
1567172.          0.9727746                 1234.708           0.9727746

(link)

Figure 3: Test Data EOF1 (spatial pattern over districts w2_1 to z2_10, scale -4 to 4)

Following the links in turn lets you see the intermediate results in this calculation, and following the last link shows the details for the dataset containing the EOF analysis. In particular it gives a link to the svd viewer, which makes it easier to examine the different modes of the EOF analysis.
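The same two steps, removing the spatial mean and then computing the EOFs, can be sketched in NumPy as follows. This is an illustration on random stand-in data, not the Data Library's svd function; the array layout (districts as rows, times as columns) and the variable names are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n_district, n_time = 6, 40
data = rng.normal(size=(n_district, n_time))   # rows: districts, columns: time

# Remove the spatial mean: at each time, subtract the average over districts.
anomalies = data - data.mean(axis=0, keepdims=True)

# SVD: columns of U are spatial structures, rows of Vt are time series.
U, s, Vt = np.linalg.svd(anomalies, full_matrices=False)

# Variance bookkeeping in the style of the svd results table.
total_variance = (anomalies ** 2).sum()
normalized_eigenvalues = s ** 2 / total_variance
cumulative = np.cumsum(normalized_eigenvalues)

print(normalized_eigenvalues.round(4))
print(cumulative.round(4))
```

The rows of Vt are mutually orthogonal, which is the property the handout refers to when it says each mode has an orthogonal time series, and the cumulative column reaches one when all modes are kept.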

Test Data: First EOF

Figure 3 shows the svd viewer results for the first EOF of the test data.

The upper left shows the time series corresponding to the first EOF – it shows some oscillations about a clear trend with time.

The lower left shows the corresponding spatial pattern. In this case the first cluster points in one direction, while the second and fourth clusters point in the opposite direction. Note that the analysis is arbitrarily signed – flipping the time series and flipping the spatial structure would give an equally valid answer.

The upper right shows the fraction of variance explained by this mode in the context of all the EOF modes. Clearly this has by far the most variance in the system.

The lower right is a table which shows the variance information again. The total variance is given in the original data units squared, the normalized eigenvalue gives the fraction of that variance explained by the first EOF, and the singular value is the square root of the variance explained, in the original units of the data. The last column is the sum from the first to the current EOF, which is a repeat since this is the first.

svd results (EOF 2)

total variance    normalized eigenvalues    singular values    cumulative normalized
1567172.          0.01896751                172.4104           0.9917421

(link)

Figure 4: Test Data EOF2 (spatial pattern over districts w2_1 to z2_10, scale -4 to 4)
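The relationship among the columns of the svd results table can be checked with the values reported for the first three EOFs of the test data (Figures 3 to 5). The sketch below recomputes each singular value as the square root of total variance times normalized eigenvalue, and accumulates the normalized eigenvalues; only the printed table values are used.

```python
import math

total_variance = 1567172.0

# (normalized eigenvalue, reported singular value, reported cumulative normalized)
modes = [
    (0.9727746, 1234.708, 0.9727746),
    (0.01896751, 172.4104, 0.9917421),
    (0.001034712, 40.26873, 0.9927768),
]

computed = []
running = 0.0
for fraction, reported_sv, reported_cum in modes:
    singular_value = math.sqrt(total_variance * fraction)
    running += fraction
    computed.append((singular_value, running))
    print(f"{singular_value:.3f} (reported {reported_sv}), "
          f"{running:.7f} (reported {reported_cum})")
```

The recomputed values agree with the reported columns to the precision shown in the tables.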

Test Data: Second EOF

Figure 4 shows the corresponding view for the second EOF. Not surprisingly, the time series is quite different (remember that the EOFs are constructed so that the time series are orthogonal). The amplitude steadily increases in time, but it is all centered on zero.

The spatial structure shows that this is primarily the difference between the third cluster and the first two. The remaining figure and table show that this mode has much less variance than the first.

svd results (EOF 3)

total variance    normalized eigenvalues    singular values    cumulative normalized
1567172.          0.001034712               40.26873           0.9927768

(link)

Figure 5: Test Data EOF3 (spatial pattern over districts w2_1 to z2_10, scale -4 to 4)

Test Data: Third EOF

Figure 5 shows the next most energetic EOF. It is primarily the difference between the second and fourth clusters, though it picked up some noise from the other two clusters, which is not surprising given how little variance the mode contains.

EOFs are the Cluster Differences

So in fact the EOFs are the differences between continuously-varying (fuzzy) clusters.

So svd is an upper limit to how well clusters can fit a dataset. This is also important because svd algorithms are robust.

Ding and He 2004: K-means clustering via principal component analysis
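The upper-limit statement can be illustrated numerically. With the mean over districts removed, the variance fit by K clusters cannot exceed the variance fit by the K-1 most energetic EOFs, whatever the partition. The sketch below checks this on random stand-in data with an arbitrary partition; the data, the partition, and the variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_district, n_time, k = 30, 50, 4
data = rng.normal(size=(n_district, n_time))    # rows: districts, columns: time
data -= data.mean(axis=0, keepdims=True)        # remove the mean over districts

# Any partition into k clusters will do for the bound; this one is arbitrary.
labels = np.arange(n_district) % k

# Variance fit by the clusters: replace each district by its cluster mean.
fitted = np.zeros_like(data)
for j in range(k):
    fitted[labels == j] = data[labels == j].mean(axis=0)
cluster_fit = (fitted ** 2).sum()

# Variance fit by the k-1 most energetic EOFs.
s = np.linalg.svd(data, compute_uv=False)
eof_fit = (s[: k - 1] ** 2).sum()

print(cluster_fit <= eof_fit)
```

The inequality holds because the matrix of cluster means has rank at most K-1 once the overall mean is removed, and the truncated svd is the best possible fit of that rank.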

4 Real World Data

Real World Data

That example was constructed to have four clusters, so the cluster model worked very well, and the EOF model precisely corresponded to the cluster separations.

In the real world, the cluster model is not such a perfect fit, and one would expect the EOF model to converge much faster.

4.1 Eritrea Climatological Malaria

For the first real-world case, we look at a climatology computed from Eritrean malaria data. By climatology we mean that we computed results for an average year: in this case monthly data was averaged together to get an average January, an average February, an average March, etc. This might be done for a number of reasons. If there is a strong seasonal dependence in the data, it might be best to analyze and understand that before attempting to understand the interannual variability. If there are gaps in the data, computing an average year is one way to analyze part of the signal and avoid the gaps. And some analysis software, such as ADDATI, is limited to relatively small problems, and looking at an average year may reduce the size of the problem to the point where it can be handled.
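As an illustration of what computing an average year means, the NumPy sketch below builds a climatology from a monthly series with one missing month. The series is invented (eight hypothetical years of data for one district); it is not the Eritrean data.

```python
import numpy as np

# Hypothetical monthly series for one district: 8 years x 12 months,
# with a seasonal cycle, a slow trend, and one missing month.
n_years = 8
months = np.arange(n_years * 12)
series = 20 + 15 * np.sin(2 * np.pi * (months % 12) / 12) + 0.1 * months
series[17] = np.nan                      # a gap in the record

# Average all Januaries together, all Februaries together, and so on,
# ignoring missing months.
climatology = np.nanmean(series.reshape(n_years, 12), axis=0)

print(climatology.round(2))
```

The result has twelve values, one per calendar month, and contains no gaps even though the monthly record did, which is one of the reasons given above for working with a climatology.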

4.1.1 Computing Eritrea Clusters

Computing Eritrea clusters is a three-step process:

1. Open the dataset we wish to work with, and get the variable we want to compute with

(link)

home .ciph .Eritrea .malaria .climatology96-03

malaria_incidence

2. Compute clusters with the function k-means136

(link)

[district][T]5 k-means136

3. Add the district location information from the original dataset with the function add_variable

(link)

MOH_SubZobas .the_geom add_variable

So computing clusters on the Eritrea climatological malaria data is quite similar to the steps taken to compute clusters on the test data. The first step accesses the dataset – here we access the dataset and pull out the incidence so that both are available. The next step is to compute the clusters. There is one additional step, which is to copy the district location information from the original dataset to the cluster results so that the cluster viewer has the additional information available for spatial plotting.

Again, following the links will give more details of what is computed at each step. Following the last link, in particular, gives the different parts of the cluster analysis, and provides a link to the cluster viewer, a convenient summary of the cluster analysis.

Eritrea Climatological Malaria Clusters

Figure 6 shows the results for the 5 cluster analysis. The lower left shows a map of the clusters. The upper left shows the time series color-coded to match the map.

The lower right gives a table of variance fit and not fit by each cluster. Compared to the test case, less variance is fit; however, much of the variance is still explained.

The link at the top of the page links back to the dataset of cluster results. This is a convenient place for trying variations on the analysis such as choosing different numbers of clusters.


4.1.2 Computing Eritrea EOFs

• Open the dataset we wish to work with, and get the variable we want to compute with

(link)

home .ciph .Eritrea .malaria .climatology96-03

malaria_incidence

Note that we accessed the dataset and then accessed the malaria incidence, leaving them both available. This is because our last step is to copy the district shapes into the EOF results so that the results can be conveniently mapped.

• remove the time mean

(link)

dup [T] average sub

Note that we have removed the time mean, unlike our analysis of the test data where we removed the district mean. Alternatively, we could have not removed either mean, or removed both. By following the links, one can find an editable version of the program which one can change to see the alternative analyses.

• Compute EOFs with the function svd

(link)

[district][T] svd

• Add the district location information from the original dataset with the function add_variable

(link)

MOH_SubZobas .the_geom add_variable
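The choice of which mean to remove, discussed in the second step above, is easy to see on a tiny example. The NumPy sketch below uses an invented two-district, three-time array (districts as rows, times as columns, an assumption of the sketch) and removes first the time mean and then the district mean.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 6.0, 8.0]])       # rows: districts, columns: time

# Remove the time mean: each district loses its own average over time.
time_mean_removed = data - data.mean(axis=1, keepdims=True)

# Remove the district mean: each time loses the average over districts.
district_mean_removed = data - data.mean(axis=0, keepdims=True)

print(time_mean_removed)
print(district_mean_removed)
```

With the time mean removed, what remains is how each district varies about its own average level; with the district mean removed, what remains is how districts differ from one another at each time. The EOFs of the two anomaly fields therefore emphasize different aspects of the same data.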

Eritrea: EOF 1

Figure 7 shows the first EOF for the Eritrea Climatology. The upper left shows the time dependence for this pattern: a pronounced peak in October balanced by a smaller opposite value most of the rest of the year. Clearly our choice to remove the time mean has distorted the results for this mode.

The corresponding spatial pattern is given in the lower left. Most of western Eritrea follows the same pattern, with two districts particularly strong. There is a more central district, however, which shows the opposite sign. Clearly there is going to be another mode which strongly features that district to finish modelling that signal.

The plot in the upper right, and the table on the lower right, show that this mode accounts for most of the signal.

Eritrea: EOF 2

Figure 8 shows the second EOF for Eritrea. The spatial pattern shows that this is all about the central district that had the opposite behavior from the western portion. In fact, it is off scale: following the link and changing the limits makes it clearer that it is a dark blue. The time dependence shows a broad peak centered more in July. So this mode changes the time dependence of the central district.

Clearly in this case the clusters are a cleaner representation than EOFs with the time mean removed.

Eritrea: EOF 3

Figure 9 shows the third EOF. The time dependence shows peaks of opposite sign before and after October: this in effect alters the shape of the October peak for the districts in the pattern. Since this mode has much less variance, it is a small change to what is understood by looking at the first EOF.

I strongly suggest following the links and looking at the results with the district mean removed. This gives results that are easier to understand spatially; see Figure 10.


4.2 Madagascar Highlands Malaria

For our third example of analyzing data, we look at malaria data from the Madagascar Highlands. This is the same dataset we showed in our data quality session, and you may recall that several iterations were required to get it into a usable state.

One reason for choosing only the Highlands is that malaria is endemic in the lowlands, so one has little potential for understanding and forecasting epidemics there. It would have been nice to extract that from the analysis, but that opportunity did not arise.

In this case, we have a number of years of data available. This has two important implications:

1. this dataset is much larger than our last example, and would not be easily analyzed with ADDATI, and

2. this dataset has a number of gaps, so some additional steps are required before our k-means clustering algorithm can be applied.

4.2.1 Computing Madagascar Highlands Clusters

• Get the Geolocated Data.

(link)

home .ciph .Madagascar .malaria .geolocated

incid

• fill data gaps using svd.

This data has gaps, so it requires some preprocessing before we can compute clusters. One method of filling small gaps is to use a variation of the svd.

(link)

home .ciph .Madagascar .malaria .geolocated

incid

[district][T]svd: gappy-data :svd

The function svd is a bit unusual in that if there are gaps in the data, it proceeds to calculate the covariance with as much of the data as is available. In the case of only a few gaps, this results in an estimate of the covariance that is essentially correct, but with a few small eigenvalues that cannot belong to a covariance (i.e. they are negative). The gappy-data option truncates the corresponding modes from the results before they are returned.

We can then reconstruct the time series from the structures and time series to have no gaps

(link)

home .ciph .Madagascar .malaria .geolocated

incid

[district][T]svd: gappy-data :svd

a: .Ss

:a: .Ts :a

mul

[ev]0.0 sum

What the code from a: to sum does is to extract the spatial structures and the time series, multiply them together, and sum over all eigenvalues. If there were no missing data, this would reconstruct the original time series, since no modes were dropped. In this case, we have eliminated the smallest eigenvalues, so the original data is approximated by this new set. This new set, of course, has no gaps.

• find the clusters, and add the locations.

(link)

[district][T] 5 k-means136

the_geom add_variable

The choice of five clusters was somewhat arbitrary. You are encouraged to follow the links and try alternative choices to see the impact on the results.
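The reconstruction step used above for gap filling (multiply each spatial structure by its time series, then sum over modes) can be sketched in NumPy on gap-free stand-in data. The scaling convention here, with the singular values folded into the structures, is an assumption of the sketch and may differ from how the Data Library stores .Ss and .Ts.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(5, 30))              # rows: districts, columns: time

U, s, Vt = np.linalg.svd(data, full_matrices=False)
structures = U * s                           # shape (district, ev), scaled by singular value
time_series = Vt                             # shape (ev, T)

# One (district, T) field per mode, then sum over the mode index ev.
per_mode = structures.T[:, :, None] * time_series[:, None, :]
full = per_mode.sum(axis=0)                  # all modes kept
truncated = per_mode[:3].sum(axis=0)         # only the 3 most energetic modes

print(np.allclose(full, data))
print(((data - truncated) ** 2).sum(), (s[3:] ** 2).sum())
```

Keeping every mode reproduces the data exactly, while dropping modes leaves a misfit equal to the summed variance of the dropped modes. This is the sense in which the truncated reconstruction approximates the original data.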


4.2.2 Computing Madagascar Highlands EOFs

• Access Madagascar Highlands malaria dataset and extract incidence

(link)

home .ciph .Madagascar .malaria .geolocated

incid

• Remove district mean

(link)

dup

[district]average

sub

Note that this time we have chosen to remove the district mean. Alternatively, one could remove the time mean or not remove any mean at all. By following the link and editing the program one can try those alternatives.

• Compute EOFs using the function svd with the gappy-data option

(link)

[district][T]svd: gappy-data :svd

The gappy-data option simplifies the results in the case where the data has gaps by truncating the results to eliminate modes that are clearly not accurately calculated in the covariance estimate. One’s knowledge of uncertainty in the system may lead one to truncate the system still further.

• Copy geolocation information from the original dataset to the EOF results

(link)

the_geom add_variable

This additional step allows the svd viewer to display the EOF structures on a map.
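To make the idea behind the gappy-data option more concrete, the following NumPy sketch estimates a covariance from data with gaps, using for each pair of districts only the times where both have data, and then keeps only the modes with positive eigenvalues. This is an assumed reading of the description above, written for illustration; it is not the Data Library's svd code, and the data and gap positions are invented.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(6, 40))              # rows: districts, columns: time
data[1, 5] = np.nan                          # a few invented gaps
data[3, 20] = np.nan
data[4, 21] = np.nan

valid = ~np.isnan(data)
filled = np.where(valid, data, 0.0)

# For each pair of districts, sum products over the times both have data,
# and divide by the number of such shared times.
counts = valid.astype(float) @ valid.astype(float).T
cov = (filled @ filled.T) / counts

# Eigen-decomposition of the symmetric estimate (eigh returns ascending order).
eigvals, eigvecs = np.linalg.eigh(cov)
keep = eigvals > 0                           # drop modes a true covariance cannot have
kept_vals = eigvals[keep][::-1]              # most energetic first
kept_vecs = eigvecs[:, keep][:, ::-1]

print(kept_vals.round(3))
```

Because each entry of the estimate is built from a slightly different set of times, the matrix is not guaranteed to be a valid covariance, which is how negative eigenvalues can arise when there are gaps. Whether any appear depends on the data.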

svd results (EOF 1)

total variance    normalized eigenvalues    singular values    cumulative normalized
12.17603          0.7204713                 2.961838           0.7204713

(link)

Figure 12: Madagascar EOF 1 (map from 10.84876S to 25.59876S and 42.79166E to 50.12499E, scale -4 to 4)

Madagascar Highlands Malaria EOF 1

Figure 12 shows the first EOF for the Madagascar malaria, fitting 72% of the signal. In this case we have a time series from 1993-2005, showing both seasonality and interannual variability, much more information than our previous analyses.

The corresponding spatial pattern is dominated by two districts, but a number of other districts are also part of the pattern, in two clumps of opposite sign.

Madagascar Highlands Malaria EOF 2

Figure 13 shows the second EOF, with 11% of the variance. The spatial pattern shows this is mostly about the difference between the two strongest districts, showing a gentle long-term oscillation with more difference in the first and third quarters of the record.

Madagascar Highlands Malaria EOF 3

Figure 14 shows the third EOF with 4% of the variance. The figure in the upper right shows that a number of modes have approximately the same variance, so that the structures and corresponding time behavior are not strongly distinct, and noise in the system will tend to mix up the structures. Looking at individual structures in this variance range is not likely to be fruitful.

5 Summary

EOF/K-means clustering differences

• EOFs and clusters are alternative views of complex data

• choice of mean to remove changes what is easy to see

• usually fewer EOFs are needed to fit a system.

• the complete cluster solution tends to be easier to understand.

EOF rotations, such as are computed by the function varimax, are an intermediate approach.