City Typologies: Classi cation and Characterization · City Typologies: Classi cation and Characterization ... With the typologies and characterisation done, ... turing, Finance,

City Typologies:

Classification and Characterization

Bruno Miguel Fonseca de [email protected]

Instituto Superior Tecnico, Lisboa, Portugal

October 2014

Abstract

The aim of this thesis is to: create a typology of cities based on indicators from the economic sector;characterise it with several other indicators of consumption and emissions and predict the typology ofa random city through a decision tree.

Available for this analysis were 298 global cities and 23 indicators that were processed through variousclustering algorithms (K-Means, DBSCAN, EM and hierarchical) in order to obtain the typologies.These were then characterised in two ways. First using only the variables from the economical sectorand second using the variables from population, economy, emissions and consumptions. This was doneso one could construct typologies based on economic sectors and to avoid problems with the missingvalues in the second set of variables. With the typologies and characterisation done, the correlationsbetween variables, with and without highlighting the typologies, were analysed and discussed. Finallythe decision trees were constructed.

The results will show: the existence of five typologies using K-Means clustering; a characterisationof the typologies based on 8 economical sector and 15 other variables; several important correlationsbetween indicators; the relevancy of the typologies between indicators correlations and two decisiontrees that predict the city’s typology with good accuracy levels.Keywords: Typology, City, Classification, K-Means, Indicator

1. Introduction

World population has been increasing at an incred-ible pace for a few decades now and, depending onthe estimates, it may very well keep rising for a fewmore [1]. Accompanying this factor, is also the in-creasing migration of populations from rural areasto cities [2]. The combined factors are responsiblefrom producing enormous cities without precedentand for increasing the urban population from 13%in 1900 to 67% in 2050 [1].

Cities are also the home to the majority of hu-man activity and it is there that most of the min-eral and human resources are concentrated [2]. Thisconcentration of resources creates poles of creativ-ity, wealth creation, invention and investment [3]but they also are responsible for much of the pro-duction of waste, greenhouse emissions and watercontamination of the planet.

Therefore, cities play a key role in the issue ofsustainability. Not only are they a major cause forthe deterioration of the environment, but also inthem relies the human capital to create solutionsfor those same problems.

If now one considers the constant increasing in

urban population, it is possible to see how impor-tant this subject is today and will be in the future.

1.1. state of the art

Cities are the subject of many papers, however citytypology has always remained elusive much due tothe enormous lack of available data.

The concept of city typologies first began withcity hierarchization. In this area one should noticethe important work of Peter Hall in 1966 [4] with isbook ”The World Cities”, Friedmann’s in 1986 withis ”World City Hypothesis” [5] and Sassen in 1991with ”The Global City” [6]. However these workscreate a hierarchy of cities and not a typology. Thishierarchization shows the degrees of importance ofeach city on the world stage, based on the amount ofconnections between firms and services in differentcities. This type of research is still very active to-day. The Globalisation and World Cities (GaWC)Research Network provides several data bases andresearch papers on different forms of hierarchiza-tion.

A different approach, called a Metabolic Profile,can be used if now one considers a city as a livingorganism that consumes resources, produces activ-

1

ity and expels waste. This approach was first usedby Abel Wolman in 1965 [7]. After him, severalother authors [8] [9] [10], used also the concept ofurban metabolism to create metabolic profiles forthe cities. These works did establish some differ-ences between cities and shown that different typesof activity are related to different types of consump-tion and emissions, thus different metabolic profiles.

Alongside with Hierarchization and MetabolicProfiles there are also City Rankings. There arecountless number of reports involving city indica-tors: PWC - Cities of Opportunity, Siemens GreenIndex, Sustainability Cities Index 2010, Mercer -Cost of Living, UN-HABITAT - State of the WorldCities, ATKearney - 2014 Global Cities Index andEmerging Cities Outlook, and many other.

Each of these reports reflects on a certain aspectof the city as the name properly indicates. Theprocess is quite simple, for each report a certainamount of variables are selected and converted intoone single index. This index will reflect the stateof the city, the higher in the rank the better thecity is. This does not divide the cities into groups,although it could be done in a very simplistic wayby choosing good, medium and bad cities. Thesereports also differ widely among themselves, withregard to variables and methodologies used. Thismeans that depending on the data and the method-ologies, a city that is well placed in one rank maynot be in another.

The approach used in the thesis is different fromthese three options, but has a little bit on every one.

With the development of the technology, statis-tical tool and data collection in general, there isa new option. One can construct and characterisecity typologies based on several variables, with thehelp of data mining process. These typologies canthen be used to search for geographical distributionsand possible connections. Each typology will havecertain unique characteristics, which is the same assaying profiles, and ultimately cities within typolo-gies can be compared in search for a certain level ofefficiency, this will show which cities are performingbetter and producing a kind of Ranking.

In all the literature that was consulted, the onlyattempt to create a typology of cities based on indi-cators of consumption and emissions was performedby [11]. There is another work [12] that also ties thisbut it is not so complete. The author uses countryvalues of resource consumption and emission, to es-timated values for the corresponding cities via ZipfLaw []. Then it uses a decision tree model that clas-sifies the cities. This work appears to be the first ofits kind and therefore presents some conceptual is-sues. In particular, the improper use of the decisiontree.

One of the major breaktrough was the report

from Brooking Institute that standardizes the val-ues for the economical sector for several globalcities. This allows for a general view of how theeconomic sectors of each city are distributed andconstruct a typology of cities based on such divi-sions.

1.2. ObjectivesThe main objective is to divide the available 298cities into different typologies based on their eco-nomical sectors. This division will be made troughseveral clustering algorithms (K-Means, DBSCAN,EM and Hierarchical) in order to identify the mostadequate.

Furthermore, one shall proceed to the character-isation of each cluster, first with the variables fromthe economical sector and secondly with variablesfrom population, economy, consumptions and emis-sions. Thus, creating a detailed profile for each ty-pology.

Another objective, is to observe the correlationsbetween indicators, with and without the typologieshighlighted, so to infer about the importance of thetypologies for the global correlation.

Lastly, the construction of two decision trees, willenable the prediction of the typology of a city, usingeconomical or consumption and emissions variables,without the need to reuse the clustering algorithm.

1.3. LimitationThis thesis provides a photograph of what can beperceived as a typology today. It does not give, ortry, to produce a general definition of a typologiesfor cities that can evolve with time. The character-isations presented here do not reflect the typologiesfor cities in future nor do they reflect what theyhave been in the past.

2. DataTo accurately construct a city typology, one must besupported by reliable data. As was mentioned, theuse of national data to estimate values for cities isnot enough, because cities have a unique speed andstate of development when compared to the country[2]. This emphasises the need for real data collectedfrom cities. In order to obtain the best representa-tion of cities and indicators, several sources wereconsulted resulting in a database of 298 cities and23 indicators.

2.1. Variables and citiesThe 298 selected cities are mostly country capitals,state capitals, cities with large populations or sim-ply, cities that generate a great deal of influence orimportance at national level. The geographical dis-tribution is as homogeneous as possible, althoughthere is a significant discrepancy between industri-alised and developing countries. The majority of

2

cities come from Europe, North America, Japan andChine, while South America and Africa are the con-tinents with least representation.

Considering accessibility and relevance to thework, a total of 23 variables were chosen for theanalysis of the cities. These variables were dividedinto 2 groups. The first with 15 variables, dealswith the specificity of the city and contains informa-tion about city area, population, density, Gross Do-mestic Product, Waste Production, Recycling Rate,Water consumption, non-revenue water, PM10 an-nual emissions, CO2 emissions (total, energy andtransport). For simplicity this group shall be knownas PECE. The second group, has 8 variables thatcontain the percentages of the economical sectors(ES) of the city: Commodities, Local, Manufac-turing, Finance, Construction, Tourism, Transportand Utilities.

Since the purpose of this thesis is to create a ty-pology based on the activity of the city, and not onthe intakes and outtakes, the second group will bethe one responsible for creating the city typologiesand determine their differences from a productivestand point. While the first group will be responsi-ble for identifying and characterising the differencesbetween typologies.

2.2. SourcesInformation is a tradable good, so it was not sur-prising that the best databases were only accessi-ble at a fee. Among these, one can mention CityBenchmarking Data, Oxford Economics, Mercer orMckinsey’s Cityscope 2.0, just to name a few.

Thus, it was necessary to choose from sourcesthat were open to all, like municipal sites, statis-tic offices, specialized sites, reports, articles, someopen databases and others in order to obtain thenecessary information.

Some of the most important sources were: Brook-ing Institution, Eurostat (New Cronos, Urban Au-dit, Eurostat Yearbook) Canada and USA onlinegovernment sites, China (China Yearbook), ODCE(Metropolitan Explorer), UN-HABITAT (Urban-info v.2, State of the World’s Cities 2012/2013),World Health Organization (PM10 report), UN-DATA, IIASA, Siemens (Siemens Green Index),CDP (CDP Cities 2012 Global Report), PWC(Cities of Oportunities), KPMG, C40 (C40 cities),Global City Indicators .

2.3. Problems and LimitationsAlthough there were several sources for the collec-tion of data, some problems were encountered re-garding the collection, accessibility and standard-isation of data and the limits to which it can beapplied [13].

Most of the available sources and data are for de-veloped countries, mostly Europe, USA and Japan.

This makes any analysis a little biased towardsthese economies and dismisses the reality in devel-oping countries.

The number of missing data was also an issue.Although, from the 23 variables, the 8 from theES had no missing values, the remaining 15 PECE,were in average only 50% complete. The reasonfor this average value, lies not only with the lackof available data but also with the lack of time forcollecting it. This process is sometimes quite stren-uous, as it is often necessary to look individually,each variable for each city.

Another issue had to do with the definition of acity, since currently there is no convention in howto define its area. This implies that the data recov-ered from the cities, may vary substantially fromcountry to country depending on their definition.Some reports try to provide a standardised modelfor these areas [14] [15] [16], However they are farfrom being the norm.

City area is not the only indicator suffering fromlack of definition. Some variables change theirs fromcountry to country or even from city to city. Waterconsumption may or may not include the consump-tion by the industrial and commercial sectors, andthe rate of recycling is confusing because in someplaces, burned waste is considered energy recycling.

3. AlgorithmsThe process of Data Mining allows to categorise andsummarise large quantities of data into useful infor-mation, in order to observe correlations that other-wise would not be possible. This process is com-posed of several steps like data preparation, clean-ing, clustering algorithms, selection and interpreta-tions of the results [17].

In the case of the clustering algorithms, there areseveral options available depending on the size andtype of data or even on the speed and accuracydesired [18].

3.1. Hierarchical VS Partitional clusteringClustering algorithms can be divided into two maingroups: Hierarchical and Partitional [19].

Hierarchical algorithms repeat, at each cycle, theprocess of either merging smaller clusters into largerones or dividing larger clusters to smaller ones. Ei-ther way, it produces a hierarchy of clusters calleda dendrogram that is a very clear visual represen-tation of the merging process.

Partitional algorithms creates several partitionsthat are then evaluated and then rearranged, opti-mising the initial partitions until a stopping criteriais archived.

Hierarchical and Partitional Clustering have keydifferences in running time, assumptions, input pa-rameters and resultant clusters. Partitional clus-tering is faster and more accurate than hierarchical

3

clustering but requires stronger assumptions suchas number of clusters and the initial centers. Hi-erarchical clustering requires only a similarity mea-sure, does not require any input parameters, thedendrogram returns a much more meaningful andsubjective division of clusters, and is more suitablefor categorical data as long as a similarity measurecan be defined accordingly [20].

3.2. K-MeansK-Means is a partitional clustering method, wheredata is divided into K clusters and where each ob-ject is assigned to precisely one cluster. The algo-rithm works in this manner: first select K pointsas initial centroids; Assign all points to the closestcentroid; Recompute the centroid of each cluster;Repeat 2 and 3 until the centroids don’t change[19].

What this really does, by moving the centers, isminimise the sum of the squared distance betweenpoints and the cluster ”center” [20][38] or in math-ematical form to minimise the expression:

k∑i=1

∑−→x ∈ci

‖(−→xj−−→cij)‖2 =

k∑i=1

∑−→x ∈ci

d∑j=1

(xj−cij)2 (1)

where −→x is a point, −→ci is the ”center” of the clus-ter, d is the dimension of −→x and −→ci , and, xj and cijare the components of −→x and −→ci .

In the absence of numerical problems, this proce-dure always converges to a solution, which is typ-ically a local minimum due to the dependency ofthe initial values. [](Selim and Ismail, 1984). Thus,this dependency becomes a key step of the basic K-Means procedure because it greatly influences theend results. Since, in most cases, the best optionis unknown, choosing the initial points may requiretrying a few options.

K-Means is easy to interpret, simple to imple-ment, has a good speed of convergence and goodadaptability to sparse data. In the case of a samplewith m instances, N attributes, T iterations and Kclusters the complexity is given by: O(T ∗K∗m∗N).Its linearity makes it more advantageous in compar-ison to other clustering methods which have non-linear complexity like hierarchical. However this isan algorithm that is very sensitive tho the initialvalues; works best with data that have isotropicclusters and can be somewhat slower due to therecalculations of all distances in each step.

3.3. DBSCANDBSCAN [21], is a density based hierarchical algo-rithm that creates clusters by connecting regions ofhigh density of points, separated by regions of lowdensity. This algorithm is particularly efficient incases were the distribution of data is not circular or

has strange shapes and is also very good in dealingwith noise and outliers. However, if the density ofpoints in each clusters is not constant, the algorithmstarts to perform very poorly.

3.4. Expectation Maximisation (EM)

The EM algorithm [20] [22] is similar to K-Means,but instead of unequivocally assigning each pointto one cluster it computes the probability of clus-ter membership based on one or more probabilitydistributions. In other words, each data point be-longs to all clusters with a certain probability ineach one. The process by which the algorithm iscapable of assigning probabilities is somewhat long,and is further described in [23].

This algorithm applies especially for those caseswhere clusters might not be completely separated,implying a certain level of overlap. In those cases,it is necessary to determine to which extend eachpoint belongs to each cluster.

3.5. Hierarchic

The hierarchical algorithm [19] [20] produces a se-ries of nested clusters, ranging from clusters of in-dividual points at the bottom to an all-inclusivecluster at the top. This process is best describedby a diagram, called a dendrogram, which displaysgraphically the order in which points are merged. Inorder to agglomerate points or clusters, one mustfirst define what is the proximity between them.There are several options for this depending on theobjective, being the most common the Single link-age, Complete linkage, Group Average Linkage andWard’s Method [20]

The algorithm is characterised by its versatility.It maintains good performance on data sets con-taining non-isotropic clusters if the clusters are wellseparated. Another advantage is that it does notassume any particular number of clusters. Instead,any desired number of clusters can be obtained bycutting the dendrogram horizontally at the desiredlevel.

However, since no global objective function is be-ing optimised, this means that merging decisionsare final. This means that even though decisionsat a local level may be good, the final global resultmay be poor.

3.6. Decision Tree

Decision tree model, is a tree based graphic that al-lows for the classification of previously unclassifieddata points. For this purpose it uses dependablevariables (inputs) in correlation with an indepen-dent one (target or classification), to allow the cre-ation of paths that classify the points [20].

Decision Trees are generated by recursive parti-tioning. Depending on the splitting method, in eachstep the method divides one variable creating two or

4

more groups. In general, the recursion stops whenall instances have the same label value, i.e. the sub-set is pure or if a certain stopping criteria is reached.

Several level of complexity can be obtained bythis method. However, to avoid the problem of over-fitting one should consider the tool called pruning.This option removes paths that do not add to thediscriminative power of the decision tree, makingthe resulting nodes more simple and easy to under-stand at the cost of some accuracy.

It is important to mention that this method re-quires a previous classification to work properly.Thus, in this thesis, the K-Means was first used tocreate the classification, and then the decision treegave the model based on that information. This, ofcourse, involves some drawbacks.

This model is important to avoid the recalcula-tion of the K-Means algorithm every time one newcity is added to the database and to provide a sim-ple model that can be easily interpreted.

4. Results

Each algorithm listed above produced a set of re-sults. These were analysed individually and thencompared, so one could choose the best algorithm,taking into account the characteristics of the dataand the quality of the results.

4.1. K-Means

This method has two main difficulties: its great de-pendency on the initialisation of points and the im-possibility to determine ” priori” the optimal num-ber of clusters.

To minimise these issues, one explored two dif-ferent ways of initialising the K points: Maximisingthe Initial Distance (MD) and Constant Intervals(CI) [24]. For each initialisation the algorithm wasrun several times, varying the value of clusters (K)from 2 to 20. In each run, two values were collected:total distance and distance within clusters. Thesevalues were then used in the determination of thebest initialisation and value of K.

Figure 1: Relative distance decrease in K clusters

The results showed that although both initiali-

sations produced very close results, CI produced amore specialised set of clusters, i.e, cities that hadtheir predominant characteristic more evident. Af-ter all considerations, the best result was obtainedfor 5 clusters with CI.

Figure 2: Average values for the economical sectorsof the typologies

Each cluster was analysed for its characteristicsand given the name of its predominant variable,thus obtaining the names: Finance, Local, Man-ufacturing, Commodities and Balanced. The clus-ter Balanced earned its name by not presenting anypredominant variable.

Furthermore, analysing both the average valuesand the variables distribution for each cluster, en-abled the construction of a characterisation for eachtypology.

4.2. DBSCANDBSCAN has used in hopes of improving the clus-tering process, since the natural shape of the clus-ters was unknown and this algorithm is more flexi-ble in the presence of non circular shapes.

DBSCAN uses only two parameters, distance (ε)and core points (MinPts), to produce the clusters.In order to obtain the best combinations of both,they were chosen in a grid pattern, varying ε be-tween 4 and 7,5 in intervals of 0,5 and MinPts from3 to 7 in intervals of 1. Since this algorithm doesnot depend on initial points, to each combinationof parameters there is only one possible result.

The best results were obtained for ε=5 withMinPts= 4 , 5 and 6 and for ε=5.5 with MinPts=6.For the remaining cases, either there were too manyclusters with few points or too few clusters with oneexceedingly large.

DBSCAN presented some difficulties. Most of theclusters were too small and in every run, of the algo-rithm, there were always a considerable amount ofunclustered points. Most of these difficulties comein part due to the different densities of points and

5

because the clusters were not compact and spacedout. Ultimately, this algorithm only served to get asense of the shape and distribution of data.

4.3. Expectation Maximisation (EM)Due to the presence of a Balanced cluster, that rep-resents a group of non specialised cities, the possi-bility of overlapping clusters arose.

These overlapping areas could be the intersectionof the other four typologies or any combination ofintersections between two or three clusters. The ob-jective was, check to which extent clusters sharedpoints between each other and if could be possibleto obtain only four clusters, where Balanced clus-ter would correspond to the overlapping area of theother four.

Two programs were used to produce the results,Statistica and KNIME. Expectation Maximisationproduced only slightly different results dependingon the program that was used and the chosen pa-rameters. Most of these results remained consistentwith the characterisation of the clusters producedwith K-Means.

The cluster of Commodities was the most af-fected by this algorithm, getting completely mis-characterised if too much overlapping was forced,and disappearing completely if only 4 clusters wereconsidered. This is partly explained due to the factthat the cluster is very small and very specialised.Meanwhile, the cluster Balanced suffered very fewchanges, when the value of K changed, reinforcingthe premise that it exists by its own merit and notas a result of overlapping clusters.

The presence of several cities with significantprobability of existing in more than one cluster,evinced in both results from both programs, indi-cate the possibility of a transitional phase betweenclusters, and shown the possibility of a continuousevolution path, from one typology to another.

However, the algorithm proved to be somewhatvolatile, due to its dependency on the initial values.Because of that, the most important informationthat was retrieved from the algorithm, was the factthat for small values of overlapping, the result fromK-Means is almost recover with the added informa-tion about the presence of several cities with mixprobabilities.

4.4. HierarchicalIn the hierarchical algorithm, several options wereavailable to choose from. However, since the objec-tive was to obtain clusters with close characteristicand small deviations, the Complete Linkage was thebest choice.

However, since this algorithm suffers from the in-ability to adjust or optimise the results once thepoints are clustered, which can cause severe errors,there were little expectations for this algorithm.

Despite of this, one found that the results wereactually very good. In fact, for the case with 7clusters, in reality 5 clusters plus 2 clusters of iso-lated points, the values were remarkably similar tothe K-Means case. The overall characteristics of theproduced clusters were the same as in K-Means andEM with only small deviations.

4.5. Best Result

In sum, one should notice that the results from K-Means, EM and Complete Linkage, although notabsolutely equal, all agree among themselves aboutthe characteristic variables and about the averagevalues of the economic variables in each cluster.Nonetheless K-Means was chosen as the best op-tion because it was the one that proved to be thebest fit and the one who best performed in rela-tion to the other algorithms. Although there is notan absolute certainty about the results obtained bythis algorithm, when comparing these results withother algorithms it is possible to observe some con-sistency between them, thus solidifying the results.

4.6. Decision Tree

The results from K-Means opened the possibility toassign a label to each city (Finance, Manufacturing,Commodities, Local and Balanced), enabling theconstruction of a decision tree that identifies thetypology of a city through a simple diagram. Thiswill provide an approximation model that can beeasily used each time a new city is added. Thisavoids the need for countless recalculations of theK-Means algorithm each time a new city need aclassification. It also provides a simple way for acommon person to use and understand these resultswithout requiring computational skills or extensiveknowledge about algorithms.

It should be mentioned that this process is some-what flawed due to its dependency on the K-Meansclassification. Two tree models were constructed.The first model was completely expanded, showingall the possible paths necessary to identify unequiv-ocally the typology of a city (overfitting), while thesecond was reduced in size sacrificing some accuracybut making the model much simpler (pruning).

5. Results 2

With the typologies created in the previous section,one may now proceed to identify differences amongtypologies based on the indicators of the PECE vari-ables. This was done by comparing average valuesand variable distribution for each cluster with thenew variables. After this, one analysed the cor-relations between variables, with and without thetypologies highlighted in search of different correla-tions. Lastly, the creation of a new decision tree,allows to predict the typology, now based in thePECE variables.

6

5.1. Characterisation of typologiesConsidering the average values of the PECE vari-ables for each cluster, several important differencesbetween the typologies have appeared.

Figure 3: Average values for the PECE group ofindicators, per typology

It was found that the typology Local tops in 7out of 15 variables. These cities have the mostGDP per capita, the biggest consumption of waterand energy, produce the most waste and emit moreCO2 Total and CO2 Transport. By opposition theManufacturing typology appears on the bottom of5 variables, and in 3 that are very close of beinglast. This typology is the one with the least GDPper capita, waste, energy consumption, unemploy-ment and general CO2 emissions. Balanced andFinanced typologies, almost coincide in 9 variables,and the only variable where they differ significantlyis in the percentage of Non-Revenue Water. In mostcases they are located between Local (above) andManufacturing (bellow).

Analysing now the distributions of the variables,it was seen that in the majority of cases, the resultsreflected the same reality as the average values, con-firming that the typologies with most and least pro-duction of waste, water consumption or CO2 emis-sions remained the same.

5.2. Correlations with typologiesThe purpose was to identify the most relevantglobal correlations and observe if the introducingof typologies, would produce the same type of cor-relation.

Analysing first the global correlations, one foundthat most of them follow a power law or a logarithmlaw. This may indicate that some variables, may beinfluence by a scale economy like outlined in [3].

The introduction of the typologies showed a clearsubdivision of the global correlation. Althoughmost of the typologies follow the same trend lineas the global correlation, they clearly differ between

Figure 4: Correlation between CO2 and GDP, in-cluding typologies

them, like the cases of GDP Vs Area or CO2 Trans-port Vs PM10. In some cases the sub-correlationsdid not even follow the expected trend line, likePM10 Vs GDP or Water consumption Vs Density.These cases show that a global correlation, is fre-quently composed of several sub correlations thatdo not necessarily share the same trend line.

Also very remarkable, were the cases where theintroduction of the typologies did not produce anyeffect, as Unemployment Vs GDP, Non-RevenueWater Vs GDP and Non-Revenue Water Vs Re-cycling are such cases. This clearly indicates thatsome correlations are immune to the typology of thecity.

Figure 5: Correlation between Non-Revenue Waterand GDP, including typologies

5.3. Decision TreeThe program used to construct the decision treewas the IBM SPSS Statistics 20, because it offersa wide range of choices and parameters to optimisethe results. The program was run for three growingmethods: CRT, CHAID and exhaustive CHAID.The method that produced the best results, withhigh accuracy and low number of nodes, was theCHAID [25].

The advantage of this method is that it includesramifications for the cases of missing data, mak-

7

ing the tree much more understandable and usable,since without this option, it would be very trickyto select a path, if a random variable had no valueavailable.

The method had full use of the 15 PECE vari-ables. Even though it only used PM10, CO2 Trans-port, CO2 Energy, Energy, Unemployment Rateand GDP it obtained a 73,5% of correct predictionwith 26 nodes. There is a curious fact about thismodel. It is possible to verify that the cities of theManufacturing typology are the ones with the leastamount of indicators available. Thus, it is not dif-ficult to understand how the model assumes thatcities with no data are from this type. This showsone of the weaknesses of this decision tree and rein-forces the need to work with databases that are ascomplete as possible.

5.4. Conclusion

In conclusion, with the typologies that were pre-viously created, it was possible to define charac-teristics for each typology and observe differencesbetween them based on the PECE variables. Thisapproach was necessary, not only because it mademore sense to first use the economical sector to es-tablish the typologies, but also because the highlevel of missing values in the PECE variables, wouldseverely impair the creation of the typologies.

The first results, compare the average values ofeach typologies withing each variable. They showedseveral differences between the typologies, espe-cially in variables like the PM10 and CO2 Trans-port emissions and Energy and water consumption.These results were further confirmed by comparingthe variable average values with their distributions.The majority of cases showed that the mean valueswere in agreement with the distributions and onlyin the cases of GDP and Energy were there foundsome small deviations.

The correlation between variables, with and with-out the typologies inserted, showed some very in-teresting cases. For the most part, the introduc-tion of typologies showed that a global correlationis composed of several sub correlations belongingto each of the typologies that may or may not fol-low the same distribution as the global correlation.The most evident cases of differentiation betweentypologies were found in CO2 Transport VS Areaand PM10 VS GDP per capita. There were alsocases where the introduction of typologies did notproduced any kind of differentiation. This may indi-cate that some correlations are immune to differenttypologies and exist above them. These were thecases of Unemployment rate VS GDP per capitaand Non-Revenue Water and Recycling Rate.

The use of the tree model here is debatable, giventhe conditions that lead to the creation of typolo-

gies in the first place. However, since there is noother way to circumvent these issues, and despiteonly working in a qualitative way, the decision treethat was constructed, is the best tool that one has,currently available to classify a city based on theirpopulation, economy, emissions and consumptions.The level of precision of the decision tree, dependson the objectives of the work, however the modelthat was produced, gave a reasonable balance be-tween accuracy and complexity.

The number of missing values is of the most im-portance. One should expect that a more completedata base will produce a more accurate set of re-sults and possibly enable the construction of thetypologies based in more than just the variables.

6. Conclusions

In this thesis, there were two main objectives: firstand foremost, the creation of a typology for citiesand second the characterisation of these typologiesso they could be differentiated.

The first objective was achieved by combiningthe variables from the economic sector with sev-eral clustering algorithms. This was a slow pro-cess that required several attempts, but that pro-duced a solid result, where cities were divided intofive groups (Commodities, Balanced, Manufactur-ing, Local and Balanced), now called typologies.

These several algorithms had also served to iden-tify some characteristics of the data, that are rele-vant for the clustering process. Most interestinglywas to verify that the clusters were not equallydense, which causes problems for density based clus-tering algorithms, and that they overlapped in somepoints. Nonetheless, from all the clustering algo-rithms, it was the K-Means that proved to be themost reliable.

With the typologies defined, by comparison be-tween them, one found that each one had certaincharacteristics that made them easily identifiable.The most important was that each typology had acharacteristic variable with a very high value thatdid not repeat in any other typology. Thus, oneproved that it is possible to separate the cities intoseveral groups, based on their economical activi-ties. Doing it, then, allowed for the characterisa-tion of each cluster based on those same variables.Next, the first decision tree model was created andhelped to classify each city based on the econom-ical variables. Despite some conceptual problems,this model can predict the typology of a city withover 90% accuracy in a tree with only 10 nodes,thus avoiding new recalculations of the clusteringalgorithm.

In the second main objective of this thesis, the re-maining variables related to GDP, population, con-sumptions and emissions were used to characterise

8

even further each typology. The two derived results,average values and variable distributions, showedsignificant differences between typologies making itpossible to determine some patterns of consumptionand emissions for each typology. For example it,was found that cities from the Local typology had,in average, more GDP per capita but they also hadmore waste production, water and energy consump-tion and CO2 emissions, while on the opposite sideof the spectrum were the manufacturing cities withlow values of these variables.

Looking at the correlation between variables,they presented several interesting cases. Alone,global correlation showed strong interaction be-tween several variables like PM10 and GDP percapita or CO2 Transport and PM10. These wereenough to identify some very important character-istics of the cities but the introduction of the typolo-gies enriched much more this analysis. It was seenthat although many typologies follow the global cor-relation, they display different rates between themand in some cases, the typologies do not even fol-low the global correlation. A few special cases evenshowed no differentiation when typologies were in-volved showing that those variables were transverseto all typologies. Thus, showing that beneath theglobal correlations between variables, there is spe-cific reality for each typology that normally is hid-den.

The second decision tree help to improve even fur-ther the classification of cities. Using now the con-sumption and emissions variables, the model pre-dicts with little above 70% accuracy the typologyof the city. This model, even though suffers fromthe same issues as the first decision tree, becomesvery important due to the existence of missing data.Currently, this model is the only tool available topredict the typology of a city though emission andconsumptions variables, that even allows for somemissing data.

At the end of this thesis both main objectiveswere completed. It is now possible to classify a city,using one of two options and there is a character-isation of each typology based on their economicsectors and in their consumptions and emissions.

6.1. Future Work

There were several considerations and paths thatwere not explored in this thesis, some because ofdata availability, most because of lack of time andsome due to technical difficulties.

It was seen with Expectation Maximisation thatsome clusters overlap, and this may in fact be a im-portant stage of transition for cities, making thisa subject that should be deeper investigated. Also,the number of algorithms should be expanded to in-clude more recent one like CLIQUE, CLARANS or

BIRCH in hopes of add more information or betterresults.

One remarks that an important connectionshould be done with the work produced by GeorgeWest. It was found that certain variables follow apower law when the city size scales up or down.It would be very interesting, since this thesis showthat not all typologies follow global correlations inthe same way, to see if these power laws would bealso applied for all typologies or if they only ap-ply when considering cities in general. A creationof a level of efficiency for the cities should be in-vestigated. With the typologies it is now possibleto predict the values for consumptions and emis-sions taking into account several other indicators.It should be possible to calculate the efficiency of acity by comparing its values with the ones expectedfor a city like that.

Moreover, it would be very interesting to observethese cities for a considerable period of time, notonly to understand their evolution but also to seehow the typologies would evolve.

Another very important improvement would bethe introduction of more indicators like, numberof inhabitants per house, average utilisation of mo-torised vehicles, number of PhDs, and others thatcould be analysed in blocks so not to overburdenthe process of classification. Associated with thenumber of variables, the reduction of the number ofmissing values should also be a priority in order toobtain more accurate results..

Finally, it would be interesting to use the remain-ing 150 cities, for which only the values for the vari-ables of consumption and emissions were available,to determine their typologies.

Acknowledgements

I would like to thank my girlfriend, for her incredi-ble support and patience, Andre Pina for his inputsand help in crucial moments, professor Paulo Fer-rao For giving me the opportunity to work in thisamazing field and Carlos Silva for his help in thelater stages of the thesis.

References

[1] United Nations. World urbanization prospects2014, 2014.

[2] UN-Habitat. State af the world’scities. report HS/080/12E, UN-Habitat,https://portal.gs.com, 2012.

[3] Dirk Helbing Christian Khnert Lus M. A. Bet-tencourt, Jos Lobo and Geoffrey B. West.Growth, innovation, scaling, and the pace oflife in cities. Proceedings of the NationalAcademy of Sciences, 104(17):7301–7306, Mar2007.

9

[4] Peter Hall. The world cities. McGraw-Hill,1966.

[5] Jonh Friedmann. The World City Hypothesis,volume 17, pages 69–83. 1986.

[6] Saskia Sassen. The Global City: New York,London, Tokyo. Princeton University Press,1991.

[7] Abel Wolman. The metabolism of cities. Sci-entific American, 213(3):179–190, Sep 1965.

[8] John Cuddihy Christopher Kennedy andJoshua Engel-Yan. The changing metabolismof cities. Journal of Industrial Ecology, 11(2),2007.

[9] Marina Fischer-Kowalski. Societys metabolism- the intellectual history of materials flow anal-ysis, part i, 1860-1970. Journal of IndustrialEcology, 2(1):61–78, Jan 1998.

[10] Heinz Schandl Krausmann Fridolin, MarinaFischer-Kowalski and Nina Eisenmenger. Theglobal socio-metabolic transition: past andpresent metabolic profiles and their future tra-jectories. Journal of Industrial Ecology, 12(5-6):637–656, 2009.

[11] Artessa Nicola D. Saldivar-Sali. Master’s the-sis.

[12] KPMG. City typology as the basis for policy.report, KPMG, 2010.

[13] Christa Anderson. City sustainability indica-tors. World Bank.

[14] OCDE. Redifining ”Urban” - A new way tomesure metropolitan areas. OCDEpublishing,2012. 10.1787/9789264174108-en.

[15] M. Piacentini and K. Rosina. Mea-suring the environmental performance ofmetropolitan areas with geographic infor-mation sources. OECD Regional Devel-opment Working Papers, 2012/05, 2012.http://dx.doi.org/10.1787/5k9b9ltv87jf-en.

[16] Eurostat. Eurostat regional yearbook 2013. Eu-rostat, 2013. ISSN 1830-9674.

[17] Usama Fayyad, Gregory Piatetsky-Shapiro,and Padhraic Smyth. From data mining toknowledge discovery in databases. AI Maga-zine, 17(3), 1996.

[18] Vladimir Estivill-Castro. Why so many clus-tering algorithms: a position paper. In ACMSIGKDD Explorations Newsletter, volume 4,pages 65–75. ACM New York, NY, USA, June2002. ISSN: 1931-0145.

[19] Eui-Hong Han. An introduction to clusteranalysis for data mining, February 2000.

[20] Lior Rokach and Oded Maimon. ClusteringMethods, chapter 15, pages 321–352. SpringerUS, 2005.

[21] Martin Ester, Hans peter Kriegel, Jrg S, andXiaowei Xu. A density-based algorithm fordiscovering clusters in large spatial databaseswith noise. pages 226–231. AAAI Press, 1996.

[22] A. P. Dempster; N. M. Laird; D. B. Rubin.Maximum likelihood from incomplete data viathe em algorithm. Journal of the Royal Statis-tical Society. Series B (Methodological), 39(1),1977.

[23] Ted Pedersen. The em algorithm: Selectedreadings, June 2001. Good starting point inEM.

[24] Statistica. STATISTICA Electronic Manual.StatSoft, 2005.

[25] IBM. CHAID and Exhaus-tive CHAID Algorithms. IBM.ftp://ftp.software.ibm.com/software/analytics/spss/support/Stats/Docs/Statistics/Algorithms/13.0/TREE-CHAID.pdf.

10

Documents

City Typologies: Classi cation and Characterization · City Typologies: Classi cation and Characterization ... With the typologies and characterisation done, ... turing, Finance,