
Clustering analysis of vegetation data

Valentin Gjorgjioski¹, Sašo Džeroski¹ and Matt White²

¹ Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, Slovenia

² Arthur Rylah Institute for Environmental Research, Dept. of Sustainability & Environment, Heidelberg 3084, VIC, Australia

1 Introduction

Vegetation may be described as the plant life of a region. The study of patterns and processes in vegetation at various scales of space and time is useful in understanding landscapes, ecological processes and environmental history, and in predicting ecosystem attributes such as productivity. Generalized vegetation descriptions, maps and other graphical representations of vegetation types have become fundamental to land use planning and management. They are widely used as biodiversity surrogates in conservation assessments and can provide a useful summary of many non-vegetation landscape elements such as animal habitats, agricultural suitability, and the location and abundance of timber and other forest resources.

We use clustering or classification of vegetation data to obtain such descriptions, maps and other representations. Clustering vegetation data is a well-known machine learning problem which aims to partition the data set into subsets, so that the data in each subset share some common trait. A summary of vegetation classification methods can be found in the numerous texts that focus on this discipline [6,3]. In our work, we deal with vegetation data that is organized in a relational model. To apply classical machine learning approaches, we first need to preprocess the data. We preprocess the data using simple aggregation techniques and then analyze it with several approaches: predictive clustering trees [1], k-means and hierarchical agglomerative clustering. These algorithms were applied and satisfactory results were obtained.

The rest of the paper is organized as follows. First, we discuss the dataset and the problem in detail. We then describe the preprocessing needed to make the data suitable for classical data mining approaches, and in the following section we describe our data mining setup and experiments. Next, we present the results of the experiments, and finally we conclude with a discussion and proposals for further work.

2 Dataset and problem description

The problem is to produce a classification and clustering of vegetation properties, which is an easier problem to solve than the classification of the vegetation in general. We aim to solve the easier problem first and later advance to the more general one. Mapping such a classification over the whole landscape is also desired, so we aim to build a predictive clustering model that can later be mapped onto the landscape.

The data has been collected from across the State of Victoria, Australia, an area of approximately 22,000,000 hectares. The State is relatively varied climatically and geologically and supports some 4,000 indigenous vascular plant species. The landscape is divided into quadrats of approximately 30x30 meters, referred to as sites later in this paper. For this study we have about 30,000 sites and about 4,600 species. Each of these sites has ordinal categories which represent the abundance of each species. Furthermore, for each site we have environmental (climatic, radiometric, topographic) and spectral variables, extracted for the same locations from a 'stack' of data themes stored in a GIS. On the other hand, additional information is known about the species: their physiognomy (leaf type, plant size and general architecture), phenology (flowering time) and phylogeny (i.e., genus, family). Figure 1 depicts the relationship between site properties and species properties.

Fig. 1. Species and site properties

We have relational data with one-to-many relationships. To handle it, we aggregate the data with simple aggregation techniques. We give more details about this in the next section.

3 Preprocessing

First, we convert the ordinal categories of abundance to numeric values with the help of a domain expert. We use the following mapping:

– 1 (0-5%) as 2.5
– 2 (5-25%) as 15
– 3 (25-50%) as 37.5
– 4 (50-75%) as 62.5
– 5 (75-100%) as 87.5
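For illustration, this conversion amounts to a small lookup table that maps each ordinal class to the midpoint of its percentage range. The sketch below (in Python; the names are ours, not part of the original data set) shows one possible implementation.

# A minimal sketch of the ordinal-to-cover conversion; the names
# "cover_class" and COVER_CLASS_TO_PCT are illustrative, not from the data.
COVER_CLASS_TO_PCT = {
    1: 2.5,   # 0-5%
    2: 15.0,  # 5-25%
    3: 37.5,  # 25-50%
    4: 62.5,  # 50-75%
    5: 87.5,  # 75-100%
}

def to_percent_cover(cover_class):
    # Map an ordinal cover-abundance class to the midpoint of its range.
    return COVER_CLASS_TO_PCT[cover_class]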


Next, we remove measurements of exotic species and of species with very low cover (0.5), as suggested by the experts. After cleaning the data, we aggregate the cover abundance of the species in a given site by species properties. For every property of a species, we aggregate over each of its values; this is done for every site. Essentially, we generate a new feature for every value of a given nominal property. An example of feature generation for the autflow property and value 1 is given in Figure 2; as an algorithm, it is presented in Algorithm 1.

autflow1(S_i) = ( Σ_{S_p ∈ S_i, autflow(S_p) = 1} cover(S_i, S_p) ) / cover(S_i),   given that   cover(S_i) = Σ_{S_p ∈ S_i} cover(S_i, S_p)

Fig. 2. Example of feature generation for the autflow property, given the site S_i and species S_p

Algorithm 1 Function that computes the value of the new feature for each site

function generateFeature(attribute, aValue)
    for each site S do
        sum := 0; sum1 := 0
        for each species Sp in S do
            sum += speciesAbundance(Sp)
            if getAValue(Sp, attribute) == aValue then
                sum1 += speciesAbundance(Sp)
        setFeatureValue(S, feature, sum1 / sum)
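To make the aggregation concrete, the following Python sketch computes the same per-site feature, assuming the cleaned data is available as (site, species, cover) records and a dictionary of species properties; all names here are illustrative, not taken from the original database.

from collections import defaultdict

def generate_feature(records, species_props, attribute, a_value):
    # records: iterable of (site, species, cover) tuples after cleaning
    # species_props: dict mapping species -> dict of its properties
    # Returns {site: fraction of total cover contributed by species
    #          whose `attribute` equals `a_value`}.
    total = defaultdict(float)     # cover(S_i)
    matching = defaultdict(float)  # cover restricted to the given property value
    for site, species, cover in records:
        total[site] += cover
        if species_props[species].get(attribute) == a_value:
            matching[site] += cover
    return {site: matching[site] / total[site] for site in total if total[site] > 0}

# e.g. autflow1 = generate_feature(records, species_props, "autflow", 1)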

4 Methodology

We use three approaches:

– Predictive Clustering Trees for multi-target prediction (PCTs)
– K-means clustering
– Hierarchical agglomerative clustering (HAC)

4.1 Predictive Clustering Trees

Predictive modeling aims at constructing models that can predict a target property of an object from a description of the object. Predictive models are learned from sets of examples, where each example has the form (D, T), with D being an object description and T a target property value. While a variety of representations ranging from propositional to first order logic have been used for D, T is almost always considered to consist of a single target attribute called the class, which is either discrete (classification problem) or continuous (regression problem).

Clustering [2], on the other hand, is concerned with grouping objects into subsets of objects (called clusters) that are similar w.r.t. their description D. There is no target property defined in clustering tasks. In conventional clustering, the notion of a distance (or, conversely, similarity) is crucial: examples are considered to be points in a metric space, and clusters are constructed such that examples in the same cluster are close according to a particular distance metric. A centroid (or prototypical example) may be used as a representative for a cluster. The centroid is the point with the lowest average (squared) distance to all the examples in the cluster, i.e., the mean or medoid of the examples.

Predictive clustering [1] combines elements from both prediction and clustering. It is implemented in the Clus system, which can be obtained at http://www.cs.kuleuven.be/~dtai/clus.
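As a rough illustration of the idea behind PCTs (not the Clus implementation itself), a candidate test in the tree can be scored by how much it reduces the summed variance of the target attributes. The Python sketch below, with illustrative names, shows this heuristic under that assumption.

import numpy as np

def total_variance(targets):
    # Sum of per-target variances; `targets` is an (examples x targets) array.
    if len(targets) == 0:
        return 0.0
    return float(np.var(targets, axis=0).sum())

def variance_reduction(targets, test_mask):
    # Score of a boolean test: variance before the split minus the
    # size-weighted variance of the two subsets it induces.
    n = len(targets)
    left, right = targets[test_mask], targets[~test_mask]
    return total_variance(targets) - (
        len(left) / n * total_variance(left)
        + len(right) / n * total_variance(right)
    )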

4.2 K-means Clustering

We describe k-means clustering very briefly in this section. The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the mean point, or centroid, of each set, and constructs a new partition by associating each point with the closest centroid. The centroids are then recalculated for the new clusters, and the algorithm proceeds by alternating these two steps until convergence, which is reached when the points no longer switch clusters (or, alternatively, when the centroids no longer change).
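The two alternating steps can be sketched as follows; this is a minimal Python illustration of the procedure described above, not the implementation used in the experiments.

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    # points: 2D array (examples x attributes)
    rng = np.random.default_rng(seed)
    # Start from k points chosen at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    assignment = None
    for _ in range(max_iter):
        # Step 1: associate each point with its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # convergence: no point switched clusters
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = points[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignment, centroids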

4.3 Hierarchical Agglomerative Clustering

In this section, we briefly discuss Hierarchical Agglomerative Clustering (HAC) (see, e.g., [4]). HAC is one of the most widely used clustering approaches. It produces a nested hierarchy of groups of similar objects, based on a matrix containing the pairwise distances between all objects. HAC repeats the following three steps until all objects are in the same cluster:

1. Search the distance matrix for the two closest objects or clusters.
2. Join the two objects (clusters) to produce a new cluster.
3. Update the distance matrix to include the distances between the new cluster and all other clusters (objects).

There are four well-known HAC algorithms: single-link, complete-link, group-average, and centroid clustering, which differ in the cluster similarity measure they employ. We decided to use single-link HAC because it is usually considered to be the simplest approach and has the smallest time complexity; furthermore, this approach can do much better clustering than PCTs, and a comparison with more sophisticated approaches was out of the scope of this work. Single-link HAC computes the distance between two clusters as the distance between the closest pair of objects. The HAC implementation that we use has a computational cost of O(N²), with N the number of objects, and for efficiency it uses a next-best-merge array [4].

An important drawback of single-link HAC is that it suffers from the chaining effect [4], which in some cases may result in undesirable elongated clusters. Because the merge criterion is strictly local (it only takes the two closest objects into account), a chain of points can be extended over a long distance without regard to the overall shape of the emerging cluster.
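For readers who want to reproduce a comparable single-link clustering, the following sketch uses SciPy's hierarchical clustering routines; the implementation used in this paper relies on a next-best-merge array [4], so this is only an illustration, and the data and variable names are ours.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.rand(100, 8)                       # stand-in for the aggregated site features
Z = linkage(X, method="single")                  # merge history, i.e. the nested hierarchy
labels = fcluster(Z, t=6, criterion="maxclust")  # cut the hierarchy into 6 clusters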

5 Experimental setup

In the experimental setup, the attributes obtained by aggregation are the target attributes, while the properties of the sites are used as descriptive attributes in the data mining task. To allow a better comparison and analysis of the results, we use several experimental setups. First, we take only selected attributes; using this selected set, we experiment with PCTs under a size constraint [5] of 6 clusters. Next, experiments were performed using all of the available attributes, constraining PCTs to 6 and 12 clusters, HAC to 6 clusters, and setting k = 6 for k-means clustering.

6 Results

Here, we present only the results from the PCTs; the Appendix gives the k-means and PCT results in detail. We would like to emphasize that HAC produced very unbalanced clusters (four clusters of size 1, one cluster of size 2, and one cluster of size 29,673), which are useless in this case. In Figure 3 we give a map colored in six colors according to the tree generated by Clus, which is shown in Figure 4.

The expert provided excellent feedback about this clustering and its visualization. Maps for the other results have not been produced yet; this is planned as further work. The first tree with 6 clusters was obtained using only the selected aggregated attributes, while the subsequent experiments were performed using all of the available aggregated attributes. The tree given in Figure 5 is the result of the PCT algorithm with the constraint set to 12 leaves, using the aggregation of all available attributes.

A description of the clusters in terms of size and the lifelook and sprflow attributes is given in Figure 6. We can conclude that the elements are well distributed across the clusters: there are neither too small nor too large clusters. In terms of the lifelook attribute, the clusters are well separated. We can notice that the two major lifelook types differ between clusters and together represent more than 30% of all elements in a cluster. Considering that there are 27 lifelook types, this percentage is far from small. On the other hand, in terms of the sprflow attribute, the clusters are impure, with small exceptions in some clusters (A, B, F).

Fig. 3. Map colored in six colors according to the tree in Figure 4

Detailed k-means clustering results are presented in the Appendix. For the k-means clustering, we calculate four standard statistics (average, standard deviation, minimum and maximum) over the descriptive attributes in each cluster, in order to describe the clusters and later visualize them on a map.
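Assuming the descriptive attributes are held in a pandas DataFrame and the k-means labels in an array, these per-cluster statistics can be computed with a single grouped aggregation, as sketched below; the data and names here are illustrative only.

import numpy as np
import pandas as pd

# `sites`: one row per site with numeric descriptive attributes;
# `labels`: cluster label of each site from k-means (stand-in values below).
sites = pd.DataFrame(np.random.rand(100, 4), columns=["dem", "twi", "solar", "salinity"])
labels = np.random.randint(0, 6, size=len(sites))

# Average, standard deviation, minimum and maximum per cluster and attribute.
summary = sites.assign(cluster=labels).groupby("cluster").agg(["mean", "std", "min", "max"])
print(summary)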

7 Discussion and further work

This study shows how different dealing with vegetation data, and more generally with environmental data, is from classical data mining problems. In this work we focused mainly on aggregation and data preprocessing, and then applied classical algorithms. The results obtained are very promising. We continue to work on this problem by adapting classical algorithms to use hierarchical information about the species, and by mining directly over the species, without aggregation. In that case, we consider the species to be complex/structured data, and we propose developing new (or adapting classical) algorithms to handle these types of problems.


Fig. 4. Clus Tree with 6 leaves

Fig. 5. Clus Tree with 12 leaves

References

1. H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees. In 15th Int'l Conf. on Machine Learning, pages 55–63, 1998.
2. L. Kaufman and P.J. Rousseeuw, editors. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
3. P. Legendre and L. Legendre. Numerical Ecology. Elsevier, Amsterdam, 1998.
4. C.D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2007.
5. J. Struyf and S. Džeroski. Constraint based induction of multi-objective regression trees. In 4th Int'l Workshop on Knowledge Discovery in Inductive Databases: Revised Selected and Invited Papers, volume 3933 of LNCS, pages 222–233. Springer, 2006.
6. D. Sun, R.J. Hnatiuk, and V.J. Neldner. Review of vegetation classification and mapping systems undertaken by major forested land management agencies in Australia. Australian Journal of Botany, 45(6):929–948, 1997.


– Cluster A: size 1425; lifelook=SS 19%, lifelook=MT 18%; sprflow=1 76%
– Cluster B: size 4740; lifelook=S 15%, lifelook=MTG 15%; sprflow=1 79%
– Cluster C: size 5668; lifelook=H 36%, lifelook=S 17%; sprflow=1 65%
– Cluster D: size 2250; lifelook=S 20%, lifelook=MTG 19%; sprflow=1 53%
– Cluster E: size 9350; lifelook=T 21%, lifelook=S 15%; sprflow=1 65%
– Cluster F: size 6226; lifelook=T 20%, lifelook=S 17%; sprflow=1 75%

Fig. 6. Description of the clusters in terms of size and the lifelook and sprflow attributes

A Appendix: Extended results

For each cluster we provide the size of the cluster, four statistics on the descriptive attributes, and part of the prototype of the cluster. For each prototype, we show just its most important values.


Cluster A, Size: 1398
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min 16.0 816.0 634.0 -9999.0 -9999.0 -168.0 321.0 3109.0 3.19 1.0

Max 117.0 1034.0 801.0 6.0 5.7 0.0 459.0 3225.0 12.28 7.0

Avg 60.3 906.8 708.6 -102.7 -96.9 -132.3 389.6 3164.1 12.0 1.3

StdDev 18.6 43.4 17.6 1030.7 995.9 15.9 31.5 32.2 0.9 1.2

Prototype:

lifelook=SS 19% leaftype= 34% sprflow=1 76% sumflow=1 58% autflow= 64% winflow= 52% hitecat=1 36% aquatic= 99% fleshyf= 91% fleshyl= 80%

Cluster B, Size: 7205
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min 9.0 816.0 512.0 -9999.0 -9999.0 -999.0 -412.0 1759.0 3.19 1.0

Max 1652.0 1065.0 825.0 12.66 12.75 2117.0 839.0 3108.0 16.6 7.0

Avg 118.48 892.67 706.45 -1071.37 -1069.44 -338.39 426.22 2727.4 12.2 1.16

StdDev 106.51 50.79 23.06 3099.43 3095.61 295.01 113.07 235.63 0.89 0.81

Prototype:

lifelook=MTG 19% leaftype= 57% sprflow=1 81% sumflow=1 73% autflow= 68% winflow= 75% hitecat=2 31% aquatic= 94% fleshyf= 95% fleshyl= 97%

Cluster C, Size: 1461
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min -32.0 816.0 411.0 -9999.0 -9999.0 -632.0 341.0 2203.0 3.19 1.0

Max 8.0 1121.0 849.0 5.88 4.57 393.0 832.0 2635.0 31.12 6.0

Avg 3.26 923.43 704.34 -2172.46 -2146.75 -357.64 534.17 2497.91 9.59 2.91

StdDev 3.45 50.72 26.38 4128.19 4108.44 205.62 60.34 73.63 3.93 2.26

Prototype:

lifelook=H 19% leaftype= 48% sprflow=1 75% sumflow=1 66% autflow= 58% winflow= 78% hitecat=1 35% aquatic= 91% fleshyf= 90% fleshyl= 78%

Cluster D, Size: 2297
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min 1261.0 513.0 0.0 2.43 2.81 516.0 -571.0 1648.0 15.49 1.0

Max 1945.0 814.0 1411.0 4.92 5.94 2594.0 4.0 2367.0 17.44 2.0

Avg 1547.77 638.37 632.17 3.81 4.03 1675.2 -324.96 1848.13 16.33 1.0

StdDev 156.09 68.2 241.09 0.43 0.52 426.98 107.0 114.38 0.44 0.04

Prototype:

lifelook=S 20% leaftype= 54% sprflow=1 53% sumflow=1 83% autflow= 80% winflow= 91% hitecat=2 33% aquatic= 99% fleshyf= 92% fleshyl= 100%

Cluster E, Size: 10999
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min -5.0 500.0 0.0 -9999.0 -9999.0 20.0 -343.0 1932.0 3.19 1.0

Max 1260.0 815.0 1410.0 6.4 7.1 1548.0 703.0 3022.0 15.65 6.0

Avg 545.57 661.52 667.96 -449.54 -449.99 468.73 194.1 2450.21 13.51 1.01

StdDev 338.27 63.96 206.22 2081.78 2081.68 373.71 225.06 156.69 1.01 0.22

Prototype:

lifelook=T 21% leaftype=Scle 39% sprflow=1 65% sumflow=1 66% autflow= 74% winflow= 82% hitecat=2 21% aquatic= 99% fleshyf= 94% fleshyl= 100%

Cluster F, Size: 6318
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min -14.0 -999.0 0.0 -9999.0 -9999.0 -999.0 -303.0 2226.0 3.19 1.0

Max 991.0 815.0 1279.0 10.13 11.45 19.0 820.0 3207.0 14.79 6.0

Avg 196.96 728.0 693.16 -845.94 -835.29 -207.96 396.6 2595.02 12.4 1.14

StdDev 173.5 66.18 136.0 2789.66 2772.97 180.05 164.57 173.6 1.22 0.76

Prototype:

lifelook=T 20% leaftype= 40% sprflow=1 75% sumflow=1 71% autflow= 70% winflow= 74% hitecat=2 24% aquatic= 98% fleshyf= 95% fleshyl= 99%

Fig. 7. PCT results obtained using all attributes


Cluster A, Size: 9526
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min -14.00 504.00 0.00 -9999.00 -9999.00 -999.00 -511.00 1690.00 3.19 1.00

Max 1832.00 1052.00 1411.00 12.66 11.45 2451.00 807.00 3225.00 17.14 7.00

Avg 326.15 745.13 684.78 -490.16 -481.44 81.36 318.00 2575.04 12.71 1.14

Stddev 363.46 113.72 157.22 2168.67 2148.75 509.74 223.63 274.69 1.70 0.81

Prototype:

lifelook=S 28% leaftype=Scle 54% sprflow=1 78% sumflow=1 71% autflow= 70% winflow= 64% hitecat=3 20% aquatic= 100% fleshyf= 92% fleshyl= 98%

Cluster B, Size: 1065
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min -8.00 518.00 50.00 -9999.00 -9999.00 -991.00 -520.00 1718.00 3.19 1.00

Max 1770.00 1043.00 1259.00 7.98 7.32 2329.00 820.00 3222.00 17.00 7.00

Avg 205.83 809.66 705.02 -1178.99 -1170.29 -103.91 418.85 2554.12 12.33 1.25

Stddev 317.27 104.38 106.69 3230.88 3219.50 480.67 209.42 248.76 1.68 0.99

Prototype:

lifelook=LTG 31% leaftype= 64% sprflow=1 92% sumflow=1 87% autflow= 86% winflow= 79% hitecat=3 54% aquatic= 86% fleshyf= 98% fleshyl= 99%

Cluster C, Size: 3853
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min -21.00 520.00 0.00 -9999.00 -9999.00 -999.00 -547.00 1648.00 3.19 1.00

Max 1917.00 1050.00 1399.00 9.36 10.89 2565.00 816.00 3224.00 17.39 7.00

Avg 283.31 824.00 699.99 -553.75 -549.31 -91.13 325.03 2697.50 12.50 1.32

Stddev 402.76 128.56 120.16 2296.16 2285.90 563.98 221.85 360.00 1.95 1.15

Prototype:

lifelook=H 28% leaftype= 62% sprflow=1 83% sumflow=1 65% autflow= 69% winflow= 66% hitecat=1 52% aquatic= 95% fleshyf= 95% fleshyl= 89%

Cluster D, Size: 7116
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min -32.00 -999.00 0.00 -9999.00 -9999.00 -999.00 -547.00 1664.00 3.19 1.00

Max 1933.00 1121.00 1365.00 9.49 12.75 2594.00 839.00 3223.00 31.12 7.00

Avg 416.11 781.09 690.27 -1311.49 -1309.03 120.37 282.20 2498.21 12.93 1.22

Stddev 522.05 123.49 135.60 3380.95 3377.75 744.98 297.18 325.69 2.10 0.95

Prototype:

lifelook=MTG 27% leaftype= 69% sprflow=1 80% sumflow=1 87% autflow= 62% winflow= 89% hitecat=2 44% aquatic= 95% fleshyf= 97% fleshyl= 98%

Cluster E, Size: 2530
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min -14.00 515.00 0.00 -9999.00 -9999.00 -998.00 -502.00 1660.00 3.19 1.00

Max 1901.00 1107.00 1325.00 6.99 7.21 2428.00 809.00 3224.00 17.33 7.00

Avg 474.33 705.91 675.17 -450.20 -451.05 437.84 232.34 2489.47 13.21 1.12

Stddev 389.11 107.48 185.48 2083.72 2083.53 574.12 256.40 264.30 1.51 0.74

Prototype:

lifelook=T 28% leaftype= 40% sprflow= 57% sumflow= 68% autflow= 92% winflow= 85% hitecat=4 30% aquatic= 99% fleshyf= 89% fleshyl= 99%

Cluster F, Size: 5589
dem   twi   solar   thinvk   thk   eff rain   mint jul   max feb   grnd dpth   salinity

Min -10.00 511.00 0.00 -9999.00 -9999.00 -998.00 -571.00 1648.00 3.19 1.00

Max 1945.00 1044.00 1407.00 8.78 6.74 2577.00 771.00 3223.00 17.44 7.00

Avg 566.27 699.89 661.53 -498.66 -499.11 405.52 172.94 2432.93 13.47 1.11

Stddev 507.74 106.96 194.85 2185.88 2185.78 692.12 301.15 299.51 1.79 0.68

Prototype:

lifelook=T 17% leaftype= 34% sprflow= 52% sumflow=1 63% autflow= 74% winflow= 89% hitecat=2 28% aquatic= 99% fleshyf= 94% fleshyl= 98%

Fig. 8. K-means results obtained using all attributes and k=6