Distances and Other Dissimilarity Measures in Chemometrics

Roberto Todeschini, Davide Ballabio and Viviana Consonni
University of Milano-Bicocca, Milan, Italy
1 Introduction
2 Theoretical Background
  2.1 Notation and Symbols
  2.2 Axiomatic Rules for Dissimilarity Measures
  2.3 From Distance to Similarity
  2.4 Weighted Distances
  2.5 Data Pretreatment
3 Definitions of Distance and Similarity Measures
  3.1 Distances for Real-valued Data
  3.2 Distances for Ranked Data
  3.3 Distances for Frequency Data
  3.4 Binary Similarity Measures
  3.5 Mixed-type Distances
4 Meta-Distances
5 Distances Between Sets
  5.1 Hausdorff Distance
  5.2 Linkage Metrics
  5.3 Procrustes Analysis
  5.4 Canonical Measure of Distance
6 Distance Measures on Graphs
7 Multivariate Comparison of Real-Valued Distances
  7.1 Comparison of Real-valued Distance Measures in Unsupervised Analysis
  7.2 Effects of Distance Measures on Similarity-based Classification
8 Comparison of Binary Similarity Coefficients
Acknowledgments
Abbreviations and Acronyms
Related Articles
References
Several similarity/diversity measures for data mining and chemometrics are presented and discussed with respect to the different data types they are applied to. After a short presentation of the axioms for dissimilarity and similarity functions, their relationships, and the required data pretreatment, the theoretical definitions and formulas of distance and similarity measures for real-valued, binary, ranked, frequency, and mixed-type data are provided, along with the main concepts on distances between sets and meta-distances. Simple examples of calculation are given, and extended comparisons are performed on the distances defined for real-valued and binary data.
1 INTRODUCTION
One can easily suppose that the concepts of similarity and dissimilarity among objects, events, situations, and so on have always been basic concepts of human reasoning, which is strongly based on the concept of analogy.
The first explicit traces of the word distance can be found in the writings of Aristotle (384 BC – 322 BC), who, in his Metaphysics, used the word distance to mean 'It is between extremities that distance is greatest' or 'things which have something between them, that is, a certain distance'. In addition, 'distance' has the sense of 'dimension' [as in 'space has three dimensions, length, breadth and depth' (Aristotle, Physics)].

Euclid, one of the most important mathematicians of ancient history (c. 323 BC – 286 BC), used the word distance only in his third postulate of the Elements: 'Every circle can be described by a centre and a distance.' The word used in this axiom – διαστηματι – still has a very general meaning.
The mathematization of the concepts of dissimilarity, diversity, distance, and their dual terms such as similarity and nearness dates back to the development of mathematics in the twentieth century.
The distance we use in everyday life is the Euclidean distance applied to 2-D or 3-D spaces, but several other distance measures exist. Every distance has its own characteristics, advantages, and drawbacks. Distances are used to measure the similarity among objects represented by a large number of parameters, which is the usual situation in analytical chemistry, where objects are characterized by several signals, parameters, and so on.
Distance, sometimes called farness, is a numerical description of how far apart entities are. In data mining, entities commonly are objects or variables. The concept of distance is a concrete way of describing what it means for elements of some space to be 'close to' or 'far away from' each other.
The numerical value of a similarity/diversity measure depends on three main components: (i) the description of the objects (i.e. the selected variables), (ii) the weighting scheme of the description elements, and (iii) the selected distance or similarity measure.
Encyclopedia of Analytical Chemistry, Online © 2006–2015 John Wiley & Sons, Ltd. This article was published in the Encyclopedia of Analytical Chemistry in 2015 by John Wiley & Sons, Ltd. DOI: 10.1002/9780470027318.a9438
[Figure 1 here: three branches of distance applications – distances between samples (k-NN methods, optimal design of experiments, cluster analysis, principal component analysis, multidimensional scaling, minimum spanning tree), distances between a sample and a reference point (classification methods (DA), applicability domain methods, leverage), and distances between variable sets (canonical correlation analysis, canonical distance measures, Procrustes analysis).]
Figure 1 Flowchart of the different applications of distance measures in data mining and modeling.
Distance and similarity measures play a fundamental role in chemometrics, as schematically shown in Figure 1, where the different methods are arbitrarily divided into (i) methods that calculate a distance between entities such as objects or variables, (ii) methods that calculate a distance between an object and a reference point, and (iii) methods that calculate a distance between two sets of objects (or variables).
Given the data, the main problem in using several of these methods is the choice of an appropriate distance, a choice that is complicated by the huge number of different possibilities. Indeed, chemometric users are often not aware of the new sources of information that the use of alternative distance/similarity measures can highlight.
Similarity/diversity measures are the core of almost all the methods for cluster analysis: the k-means method assigns an object by measuring its distance from each cluster centroid, which is assumed as the representative point of the cluster; analogously, the Jarvis–Patrick method is based on a neighbor table where, for each object to be assigned, the nearest neighbors are listed; hierarchical agglomerative methods use distance measures, called linkage metrics, which are able to quantify the similarity between groups of objects (i.e. the clusters). Kohonen maps (or self-organizing maps, SOM) exploit distance measures to assign objects to the map neurons and then to evaluate object relationships on the map projection: each object, represented by a p-dimensional vector (p is the number of variables) scaled in the range [0, 1], is compared with each neuron of the map, whose weights are represented on the same scale, and is assigned to the neuron for which the Euclidean distance is minimum (the winner neuron). Then, the learning process proceeds, and the information received in the winner neuron from the object is spread out to the neighbors of that neuron, smoothing the information proportionally to the topological distance from the winner neuron.
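The winner-neuron assignment described above can be sketched in a few lines (a minimal illustration; the map size, the neuron weights, and the object values are hypothetical, not taken from the article):

```python
import numpy as np

# Hypothetical 4-neuron map with p = 3 weights per neuron, all in [0, 1],
# and one range-scaled object x on the same scale.
weights = np.array([
    [0.1, 0.2, 0.9],
    [0.8, 0.7, 0.1],
    [0.5, 0.5, 0.5],
    [0.9, 0.1, 0.3],
])
x = np.array([0.7, 0.6, 0.2])

def winner_neuron(x, weights):
    """Return the index of the neuron with minimum Euclidean distance to x."""
    distances = np.sqrt(((weights - x) ** 2).sum(axis=1))
    return int(np.argmin(distances))

best = winner_neuron(x, weights)
```

For this object, the second neuron (index 1) is the winner, since its weights (0.8, 0.7, 0.1) are closest to x in the Euclidean sense.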
Principal component analysis (PCA), which is the most common approach for exploratory data analysis, generates a metric space where the distances between the object pairs are the classical Euclidean distances. Other common techniques for exploratory data analysis, such as multidimensional scaling (MDS) and minimum spanning tree (MST), are based on algorithms able to elaborate distance (or similarity) matrices representing the internal similarity/diversity relationships of the objects of a data set.
In classification, one of the best known methods for nonlinear problems is the k-nearest neighbor (k-NN) method, which is based on the calculation of some distance measure between the target and the training objects to identify the first k neighbors and evaluate the class membership of the target. Another well-known method is discriminant analysis (DA), which exploits the Mahalanobis distance, able to consider the whole covariance structure of each class, to evaluate the distance between the object and the class centroid.
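A minimal sketch of the k-NN assignment just described, using the Euclidean distance and majority voting (the training set and class labels are invented for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, y_train, k=3):
    """Assign x_new to the majority class among its k nearest
    training objects, using the Euclidean distance."""
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Hypothetical two-class training set in two variables.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = ["A", "A", "A", "B", "B", "B"]

pred = knn_predict(np.array([0.15, 0.15]), X_train, y_train, k=3)
```

The target at (0.15, 0.15) lies among the three class-A objects, so its three nearest neighbors all belong to class A and the method assigns it there.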
The concept of the model applicability domain (AD) is increasingly gaining importance in the modeling
field, where the evaluation of whether a given model (classification or regression) is suitable to give reliable predictions for new objects is of outstanding importance. Several methods have been proposed to date to decide whether a new object can be considered inside the AD of a model, and most of them are based on predefined distance thresholds.
Some books, reviews, and papers(1–10) are dedicated to presenting, analyzing, and comparing similarity/diversity measures.
In this article, the problem of evaluating similarity and dissimilarity relationships between objects is addressed, with particular attention paid to the use of distance and similarity measures in data mining. The similarity/diversity measures commonly used in other scientific fields, such as, for example, those based on probability distributions and information content, were not considered in this study. Moreover, the article is mainly focused on similarity/diversity measures between objects, although some measures, such as the Pearson, Spearman, and Kendall correlation measures, are usually applied to evaluate relationships between variables.
In the framework of data mining, distance and similarity measures are commonly distinguished on the basis of data types, that is, real-valued data, binary data, frequency or ranked data, and mixed-type data. Therefore, after a brief introduction to the theoretical background and the data pretreatment required before the calculation of distances, the most common distance/similarity measures are presented according to this general classification, which accounts for the data type. In Sections 7 and 8, some applications of the different distance and similarity measures to real and simulated data sets are discussed with the aim of evaluating and comparing the different information provided by each measure.
2 THEORETICAL BACKGROUND
2.1 Notation and Symbols
A chemical data set is usually constituted of a number of objects (experiments) and a number of parameters, which have been measured on each object. Therefore, the data set is arranged in a numerical matrix (two-way array): each row represents an object of the data set, whereas columns represent the chemical parameters. Numerical matrices are denoted as X (n × p), where n is the number of objects and p the number of variables. The single element of the data matrix X is denoted as xij and represents the value of the jth variable for the ith object. Scalars are indicated by italic lower-case characters (e.g. xij) and vectors by bold lower-case characters (e.g. x).
2.2 Axiomatic Rules for Dissimilarity Measures
A function D : X × X → R, where X is a set, must satisfy a certain number of properties (axioms) to be considered a distance.

Let X be a set. A function D : X × X → R is called a distance (or dissimilarity) on X if, for all x, y ∈ X, the following three axioms hold:

Ax.1 : Dxy ≥ 0    nonnegativity
Ax.2 : Dxx = 0    reflexivity
Ax.3 : Dxy = Dyx    symmetry

A function D : X × X → R is called a quasi-distance on X if, for all x, y ∈ X, only the first two axioms hold:

Ax.1 : Dxy ≥ 0    nonnegativity
Ax.2 : Dxx = 0    reflexivity

A function D : X × X → R is called a metric on X if, for all x, y, z ∈ X, the following four axioms hold:

Ax.1 : Dxy ≥ 0    nonnegativity
Ax.2′ : Dxy = 0 iff x = y    strong reflexivity
Ax.3 : Dxy = Dyx    symmetry
Ax.4 : Dxy ≤ Dxz + Dzy    triangle inequality

A function D : X × X → R is called a semimetric (or pseudo-metric) on X if, for all x, y, z ∈ X, the following four axioms hold:

Ax.1 : Dxy ≥ 0    nonnegativity
Ax.2 : Dxx = 0    reflexivity
Ax.3 : Dxy = Dyx    symmetry
Ax.4 : Dxy ≤ Dxz + Dzy    triangle inequality

A function D : X × X → R is called a quasi-metric on X if, for all x, y, z ∈ X, the following three axioms hold:

Ax.1 : Dxy ≥ 0    nonnegativity
Ax.2′ : Dxy = 0 iff x = y    strong reflexivity
Ax.4 : Dxy ≤ Dxz + Dzy    triangle inequality

A function D : X × X → R is called an ultrametric on X if, for all x, y, z ∈ X, the following four axioms hold:
[Figure 2 here: decision flowchart – starting from Ax.1 (Dxy ≥ 0) and Ax.2 (Dxx = 0), dissimilarity functions split into quasi-distances (no symmetry) and distances (Ax.3: Dxy = Dyx); quasi-distances satisfying Ax.2′ (Dxy = 0 iff x = y) and Ax.4 (Dxy ≤ Dxz + Dzy) are quasi-metrics; distances satisfying Ax.4 split into semimetrics (without Ax.2′) and metrics (with Ax.2′); metrics also satisfying Ax.4′ (Dxy ≤ max{Dxz, Dzy}) are ultrametrics; distances failing Ax.4 are nonmetrics.]
Figure 2 Flowchart of the relationships among the different classes of dissimilarity functions.
Ax.1 : Dxy ≥ 0    nonnegativity
Ax.2′ : Dxy = 0 iff x = y    strong reflexivity
Ax.3 : Dxy = Dyx    symmetry
Ax.4′ : Dxy ≤ max{Dxz, Dzy}    ultrametric inequality

Axiom 2′ is a stronger condition than Axiom 2, just as Axiom 4′ is stronger than Axiom 4.
The different classes of dissimilarity functions can be distinguished according to the axioms they satisfy, as shown in Figure 2 and Table 1. Note that all dissimilarity functions must fulfill at least the two basic requirements of nonnegativity (Ax. 1) and reflexivity (Ax. 2) and are further distinguished into distances and quasi-distances depending on whether the property of symmetry (Ax. 3) is fulfilled or not. Obviously, the class of quasi-distances is the largest one and includes all the distances. Distances can be further distinguished into metric and nonmetric distances according to the property of triangle inequality (Ax. 4): if the triangle inequality is not fulfilled, then a distance is nonmetric; otherwise, it is a metric if the property of strong reflexivity (Ax. 2′) is also fulfilled. If strong reflexivity does not hold, that is, if there can be pairs of objects x ≠ y for which Dxy = 0, then a distance cannot properly be considered a metric and is called a semimetric distance. The class of quasi-metrics includes all the quasi-distances that fulfill the properties of triangle inequality (Ax. 4) and strong reflexivity (Ax. 2′) and differs from the metric distances in that quasi-metrics need not fulfill the property of symmetry (Ax. 3). Obviously, the class of quasi-metrics includes all the metric distances.
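The classification of a given dissimilarity function can be explored empirically by checking the axioms on a sample of points (a sketch, not a proof: passing the checks on a finite sample does not establish an axiom in general). For instance, the squared Euclidean distance satisfies symmetry and strong reflexivity but violates the triangle inequality, so it is a nonmetric distance:

```python
import itertools
import numpy as np

def check_axioms(D, points, tol=1e-12):
    """Empirically check symmetry (Ax. 3), strong reflexivity (Ax. 2')
    and the triangle inequality (Ax. 4) on a finite sample of points."""
    pairs = list(itertools.product(points, repeat=2))
    sym = all(abs(D(x, y) - D(y, x)) <= tol for x, y in pairs)
    refl = all((D(x, y) <= tol) == np.allclose(x, y) for x, y in pairs)
    tri = all(D(x, y) <= D(x, z) + D(z, y) + tol
              for x, y, z in itertools.product(points, repeat=3))
    return {"symmetry": sym, "strong_reflexivity": refl, "triangle": tri}

pts = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
euclid = lambda x, y: float(np.sqrt(((x - y) ** 2).sum()))
sq_euclid = lambda x, y: float(((x - y) ** 2).sum())

metric_report = check_axioms(euclid, pts)       # all axioms pass -> metric
nonmetric_report = check_axioms(sq_euclid, pts) # triangle fails -> nonmetric
```

On the three sample points, the squared Euclidean distance gives D(0, 2) = 4 > D(0, 1) + D(1, 2) = 2, which exposes the triangle-inequality violation.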
2.3 From Distance to Similarity
A function S : X × X → R on X is called a similarity if, for all x, y ∈ X, the following properties hold:

Ax.1 : Sxy ≥ 0    nonnegativity
Ax.2 : Sxy ≤ Sxx    identity
Table 1 Different classes of dissimilarity measures and their axioms

Dissimilarity function            Ax. 1  Ax. 2  Ax. 2′  Ax. 3  Ax. 4  Ax. 4′
Quasi-distances                     •      •
Distances                           •      •              •
Semi-metrics or pseudo-metrics      •      •              •      •
Quasi-metrics                       •      •      •              •
Metrics                             •      •      •       •      •
Ultrametrics                        •      •      •       •      •      •
Ax.3 : Sxy = Syx    symmetry
The most common similarity measures also satisfy a condition of closure, that is, 0 ≤ Sxy ≤ 1. The value 1 indicates the maximum similarity, whereas the value 0 indicates the maximum dissimilarity. In this framework, correlation measures can be considered a special case of similarity measures; however, correlation values can be negative, i.e. in [−1, +1], and in this case they cannot properly be regarded as similarity measures owing to violation of the first axiom implying nonnegativity.
Most of the similarity measures can be derived from the distance measures by appropriate transform functions, the choice of which mostly depends on whether the distance measure is bounded or not.

Some distances, by definition, are intrinsically bounded by the upper value of 1, and others can be bounded in the same range by normalization on the number of variables and/or by applying some scaling procedures.
For example, given a data set constituted by n objects and p variables, after range scaling, any distance that ranges between 0 and p can be normalized between 0 and 1 simply by averaging over p. For these [0, 1] bounded distances DN, it is possible to derive the corresponding similarity measures by the following equations:

S(1)xy = 1 − DNxy
S(2)xy = 1 − (DNxy)²
S(3)xy = √(1 − (DNxy)²)

The obtained similarity measure is naturally bounded in [0, 1], as usually required for a similarity function.
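As a small numerical illustration (the vectors are invented), an averaged Manhattan distance on range-scaled data is bounded in [0, 1] and can be fed to the three transforms:

```python
import numpy as np

def manhattan_norm(x, y):
    """Average (normalized) Manhattan distance: for range-scaled data each
    term lies in [0, 1], so dividing the sum by p bounds D_N in [0, 1]."""
    return np.abs(x - y).mean()

x = np.array([0.2, 0.9, 0.5])
y = np.array([0.4, 0.6, 0.1])
d_n = manhattan_norm(x, y)      # (0.2 + 0.3 + 0.4) / 3 = 0.3

s1 = 1.0 - d_n                  # S(1) = 1 - D_N
s2 = 1.0 - d_n ** 2             # S(2) = 1 - D_N^2
s3 = np.sqrt(1.0 - d_n ** 2)    # S(3) = sqrt(1 - D_N^2)
```

All three transforms return values in [0, 1], with S(2) and S(3) compressing the dissimilar end less aggressively than S(1).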
When dealing with unbounded distances, a direct procedure to transform them into similarity measures can be obtained by specific transform functions conceived in such a way that the similarity values are bounded between 0 and 1. The best known similarity transforms for unbounded distances D are the following:

S(4)xy = 1 / (1 + Dxy)
S(5)xy = 1 − Dxy / Dmax
S(6)xy = e^(−Dxy)

where Dmax represents the maximum distance between all the possible n·(n − 1)/2 pairs of objects in the data set.
The similarity transform of type (4) is the simplest one, but its main drawback is that the similarity is compressed toward zero when distances significantly larger than all the others are present. Indeed, independently of the adopted scaling procedure, the presence of several variables increases the distance value for the majority of the distance functions. Moreover, this transform (4) should not be used for normalized distances DN, the maximum distance in this case being equal to 1, which in turn would lead to a minimum similarity of 0.5 instead of 0. To overcome this drawback, the transform (4) should be modified as

S(4′)xy = 2 · (S(4)xy − 0.5)

Transform function (6) suffers from the same drawback, the minimum value being 0.368 (i.e. 1/e), which is achieved when the normalized distance equals its maximum value of 1. In this case, the following scaling should be adopted to have values properly ranging from 0 to 1:

S(6′)xy = (S(6)xy − 0.368) / (1 − 0.368)
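The four transforms just discussed can be sketched as follows; note that using exp(−1) in place of the rounded 0.368 makes S(6′) map a normalized distance of 1 exactly to 0:

```python
import numpy as np

def s4(d):
    """S(4) = 1 / (1 + D), for unbounded distances D."""
    return 1.0 / (1.0 + d)

def s4_prime(d_norm):
    """S(4') = 2 * (S(4) - 0.5): rescales S(4) so that a normalized
    distance of 1 maps to similarity 0 instead of 0.5."""
    return 2.0 * (s4(d_norm) - 0.5)

def s6(d):
    """S(6) = exp(-D)."""
    return np.exp(-d)

def s6_prime(d_norm):
    """S(6') = (S(6) - 1/e) / (1 - 1/e); the 0.368 in the text is 1/e rounded."""
    return (s6(d_norm) - np.exp(-1.0)) / (1.0 - np.exp(-1.0))
```

With these definitions, S(4′) and S(6′) both map a normalized distance of 0 to similarity 1 and a normalized distance of 1 to similarity 0, as intended.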
For the transform function (5), it is worthy of note that there will always be at least one pair of objects x and y with similarity equal to zero, that is, the pair of objects having the maximum distance will have similarity equal to 0 regardless of the actual value of their distance. Indeed, this similarity transform gives relative similarity values, depending on the most distant pair of objects.

As most of the similarity measures are derived from distance measures by appropriate transform functions,
dissimilarity measures D can be derived from similarity measures S using any monotonically decreasing transformation of S.

Examples of transformations used to obtain dissimilarity measures from similarities are
D(1)xy = 1 − Sxy
D(2)xy = √(1 − Sxy)
D(3)xy = √(2 · (1 − S²xy))
D(4)xy = arccos(Sxy)
D(5)xy = −ln Sxy

where S is assumed to vary in the range [0, 1].
2.4 Weighted Distances
Weighted distances are obtained by weighting each jthvariable by a user-defined weight wj usually under theconstraint
p∑jD1
wj D 1
It can be noted that if one would like to have all thevariables with the same importance on the distance, thenall the weights would have the same value wjD 1/p.
In other cases, the weights can be user-defined andestablished by additional a priori information about thevariable.
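As an illustration, a weighted Euclidean distance built under this constraint might look as follows; placing the weights on the squared differences inside the square root is a common convention assumed here, not a formula given in the article. With equal weights wj = 1/p, the result coincides with the average Euclidean distance of Table 4:

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance (one common convention, assumed here):
    D_w = sqrt( sum_j w_j * (x_j - y_j)^2 ), with the weights summing to 1."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    assert abs(w.sum() - 1.0) < 1e-12, "weights must sum to 1"
    return float(np.sqrt((w * (x - y) ** 2).sum()))

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 5.0]
p = 3

# Equal weights w_j = 1/p give every variable the same importance; the result
# then equals the plain Euclidean distance divided by sqrt(p).
d_equal = weighted_euclidean(x, y, np.full(p, 1.0 / p))
d_eucl = float(np.sqrt(np.sum((np.array(x) - np.array(y)) ** 2)))
```

Unequal weights, for instance from a priori knowledge of measurement precision, simply shift importance toward the more trusted variables.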
2.5 Data Pretreatment
When dealing with data mining, data pretreatment is often necessary to allow a fair comparison of variables defined by different measurement scales, units, and so on.

The most relevant step of data pretreatment is data scaling, whose procedures squeeze all the variables into a comparable scale so that distance/similarity measures between objects can equally exploit the information present in all the variables, independently of their numerical scale.

In general, distance measures defined for real-valued data require that all the variables are comparable, that is, defined on the same numerical scale. If the variables are defined on different scales, distances between the objects suffer from distortion effects simply because of the different numerical scales, rather than reflecting a comparable contribution of all the variables used to define the data. Thus, in the case of real-valued data, data pretreatment is necessary before the calculation of distances.

Data scaling procedures transform each variable, separately, onto a comparable numerical scale. Let n be the total number of objects of a data set, p the number of variables, xij the value of the jth variable for the ith object, and x′ij the corresponding scaled value. The quantities Lj, Uj, x̄j, and sj are the minimum, maximum, average, and standard deviation of the jth variable, respectively.
Thus, the most common scaling procedures are the following:

Scaling to maximum:  x′ij = xij / Uj,  with x′ij ≤ 1 and Uj = maxi(xij)
Range scaling:  x′ij = (xij − Lj) / (Uj − Lj),  with 0 ≤ x′ij ≤ 1, Lj = mini(xij), Uj = maxi(xij)
Scaling to unitary variance:  x′ij = xij / sj,  giving s′j = 1
Autoscaling:  x′ij = (xij − x̄j) / sj,  giving x̄′j = 0 and s′j = 1
Pareto scaling:  x′ij = (xij − x̄j) / √sj,  giving x̄′j = 0 and s′j = √sj
Logarithmic centering:  x′ij = log(xij) − (1/n) · Σ_{i=1}^{n} log(xij)
The mean and the standard deviation of the scaled data change, for each jth variable, according to the following relationships:

x̄′j = α · x̄j + β    and    s′j = α · sj

where α and β are two parameters that assume different values depending on the type of scaling procedure, as shown in Table 2.
Some scaling procedures, for instance mean centering, do not solve the problem of variable comparability.

Additional data pretreatment is usually required for more complex data, such as spectra and compositional data, for which row scaling is preliminarily applied. The column scaling procedures previously defined are still performed after the preliminary row scaling. Row centering is a classical scaling for spectra, whereas the log-ratio transform is suggested for compositional data.
Table 2 α and β parameters of different scaling procedures

Scaling procedure             α              β
Mean centering                1              −x̄j
Scaling to maximum            1/Uj           0
Range scaling                 1/(Uj − Lj)    −Lj/(Uj − Lj)
Scaling to unitary variance   1/sj           0
Autoscaling                   1/sj           −x̄j/sj
Pareto scaling                1/√sj          −x̄j/√sj
Example 1 Scaling versus nonscaling
This simple example aims to show the role and importance of data scaling when dealing with the analysis of similarity/diversity relationships between objects. In Table 3, data concerning five objects described by three variables are collected, in the original scale and after range scaling. Consider the two objects 1 and 2. If the classical Euclidean distance is calculated using the three original variables, a distance value of 10.05 is obtained. If the Euclidean distance between the two objects is calculated considering only the first variable, the distance value is 10, not much different from the previous one. This means that the first variable alone contributes up to 99.5% of the value of the distance between objects 1 and 2 and that the other two variables do not much influence the result. If we now make the same calculation considering the range-scaled values instead of the original ones, the Euclidean distance between objects 1 and 2 based on all three variables becomes 0.414, and the contribution of the first variable to the distance value decreases to 12%. Therefore, if distances are calculated on the raw data set in the presence of variables with different scales, variables with large variances have the highest weight in the distance calculation, thus hiding the contribution of variables characterized by smaller variances.
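Example 1 can be reproduced in a few lines; the 12% figure corresponds to the first variable's share of the squared scaled distance:

```python
import numpy as np

# Table 3 data: five objects, three variables.
X = np.array([
    [100, 2, 0.2],
    [ 90, 1, 0.1],
    [120, 4, 0.0],
    [ 70, 3, 0.5],
    [ 50, 3, 0.3],
], dtype=float)

# Range scaling, column by column.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def euclidean(x, y):
    return float(np.sqrt(((x - y) ** 2).sum()))

d_raw = euclidean(X[0], X[1])                    # ~10.05, dominated by x1
d_scaled = euclidean(X_scaled[0], X_scaled[1])   # ~0.414

# Share of the squared distance due to the first variable.
share_raw = (X[0, 0] - X[1, 0]) ** 2 / d_raw ** 2
share_scaled = (X_scaled[0, 0] - X_scaled[1, 0]) ** 2 / d_scaled ** 2
```

On the raw data, essentially the whole squared distance comes from x1; after range scaling, its share drops to roughly 12%, and the other two variables become visible.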
3 DEFINITIONS OF DISTANCE AND SIMILARITY MEASURES
Distance and similarity measures can differ depending on the kind of data they are applied to: real-valued data, binary data, ranked data, and frequency data. These data types are distinguished on the basis of the variables used to describe objects. Variables such as signal intensity, biological activity, pressure, temperature, reaction speed, concentration, and counts are measured quantitatively on an interval or ratio scale; they are quantitative variables
Table 3 Original and range-scaled data of Example 1 on data pretreatment

Object   Original data (x1, x2, x3)   Scaled data (x1, x2, x3)
1        100,  2,  0.2                0.714, 0.333, 0.400
2         90,  1,  0.1                0.571, 0.000, 0.200
3        120,  4,  0.0                1.000, 1.000, 0.000
4         70,  3,  0.5                0.286, 0.667, 1.000
5         50,  3,  0.3                0.000, 0.667, 0.600
and belong to the class of real-valued data. Quantitative variables can be either continuous or discrete: in the former case, the values that a variable can assume form an infinite or uncountably infinite set; in the latter case, the set of values is either finite or countably infinite, and the values are usually integers. Variables such as color, shape, and texture provide a classification of the objects into categories describing the quality of an object; these are qualitative variables and are measured on a nominal scale. If these variables allow an ordering or ranking of the objects, then the variables are said to be measured on an ordinal scale. If a nominal variable allows just two values, e.g. yes/no, present/absent, on/off, and so on, then it is referred to as a binary variable, which is usually coded as 1/0.(11)

In the following sections, distance/similarity measures are introduced on the basis of the data types they are defined for.
3.1 Distances for Real-valued Data
Real-valued data have all variables represented by real values, such as concentrations, signal intensities of spectra, and quantitative physical or chemical measures. Several dissimilarity measures have been defined in the literature for real-valued data, the best known ones being the Euclidean, Manhattan, and Mahalanobis distances.

Real-valued distances can be divided into two main classes based on the range they cover: unbounded distances, which range from zero to infinity, and bounded distances, which range from zero to a fixed finite value, that is, distances having a maximum value limited by an upper bound. The most common unbounded distances are collected in Table 4 and bounded distances in Table 5: Dxy is the general symbol representing the dissimilarity measure between any pair of objects x and y; the symbols xj and yj indicate the values of the jth variable for the p-dimensional objects x and y, respectively. The last column of Tables 4 and 5 includes the formulas to calculate average dissimilarity measures, whose values are independent of the number p of variables describing the objects.
A geometrical representation of the best known unbounded distances, that is, the Euclidean (R1), Manhattan (R2), and Lagrange (R3) distances, is shown in Figure 3.

[Figure 3 here: two points x and y in the plane, joined by a right triangle with legs a and b; the Euclidean, Manhattan, and Lagrange paths between x and y are drawn.]
Figure 3 Geometrical representation of Euclidean, Manhattan, and Lagrange distances.

Table 4 Unbounded distances for real-valued data

R1 Euclidean:  D^EUC_xy = √( Σ_{j=1}^{p} (xj − yj)² );  range 0 ≤ D^EUC_xy < ∞;  average D^EUC_xy / √p
R2 Manhattan or city-block:  D^MAN_xy = Σ_{j=1}^{p} |xj − yj|;  range 0 ≤ D^MAN_xy < ∞;  average D^MAN_xy / p
R3 Lagrange:  D^LAG_xy = maxj |xj − yj|;  range 0 ≤ D^LAG_xy < ∞;  average D^LAG_xy
R4 Minkowski:  D^MIN_xy = [ Σ_{j=1}^{p} |xj − yj|^q ]^(1/q), q > 0;  range 0 ≤ D^MIN_xy < ∞;  average D^MIN_xy / p^(1/q)
R5 Bhattacharyya:  D^BHA_xy = √( Σ_{j=1}^{p} (√xj − √yj)² ), x, y ≥ 0;  range 0 ≤ D^BHA_xy < ∞;  average D^BHA_xy / √p
R6 Mahalanobis:  D^MAH_xy = √( (x − y)ᵀ · S⁻¹ · (x − y) );  range 0 ≤ D^MAH_xy < ∞;  average D^MAH_xy / √p
R7 Locally centered Mahalanobis:  D^LCM_xy = (1/p) · √( (x − y)ᵀ · S(y)⁻¹ · (x − y) );  range 0 ≤ D^LCM_xy < ∞;  average D^LCM_xy

The Euclidean distance between the points x and y corresponds to the shortest path joining the two points (√(a² + b²)); the Manhattan distance, also called taxi distance, is the sum of the shortest paths along each dimension (a + b); finally, the Lagrange distance, also called Chebyshev distance, is the maximum among the shortest paths along each dimension (a).
The Minkowski distance (R4) represents a family of distance measures for which the higher the value of q, the greater the importance given to large differences. The Euclidean (R1), Manhattan (R2), and Lagrange (R3) distances are special cases of the Minkowski distance (R4), corresponding to different values of the exponent q of this power distance: the Euclidean distance is obtained for q = 2, the Manhattan distance for q = 1, and the Lagrange distance for q → ∞.
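A one-line Minkowski implementation makes these special cases easy to verify numerically (illustrative vectors; already at q = 50 the value is practically the maximum absolute difference):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance (R4): [ sum_j |x_j - y_j|^q ]^(1/q), q > 0."""
    return float((np.abs(x - y) ** q).sum() ** (1.0 / q))

x = np.array([1.0, 5.0, 2.0])
y = np.array([4.0, 1.0, 2.0])

d_man = minkowski(x, y, 1)    # Manhattan: 3 + 4 + 0 = 7
d_euc = minkowski(x, y, 2)    # Euclidean: sqrt(9 + 16) = 5
d_q50 = minkowski(x, y, 50)   # close to the Lagrange distance: max = 4
```

As q grows, the largest absolute difference dominates the sum, which is why the Lagrange (Chebyshev) distance is the q → ∞ limit.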
Among the dissimilarity measures defined in Table 4, the Mahalanobis distance (R6) is a relative dissimilarity function, which measures the dissimilarity between two objects x and y not only on the basis of the two considered objects but also accounting for the information on the whole structure of the data set, by means of the data covariance matrix S. The Mahalanobis distance can be thought of as a reliable distance measure when correlation between variables exists, i.e. it is able to underweight correlated variables that would otherwise give redundant information.

Moreover, note that the Mahalanobis distance is an extended version of the Euclidean distance: if the covariance matrix S is replaced by the identity matrix I, then the Mahalanobis distance reduces to the Euclidean distance.
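This reduction is easy to verify numerically (illustrative vectors):

```python
import numpy as np

def mahalanobis(x, y, S):
    """Mahalanobis distance (R6): sqrt( (x - y)^T S^-1 (x - y) )."""
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

d_mah_identity = mahalanobis(x, y, np.eye(2))    # S = I
d_euc = float(np.sqrt(((x - y) ** 2).sum()))     # plain Euclidean
```

With S = I, the quadratic form collapses to the plain sum of squared differences, so the two values coincide.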
Proportional to the Mahalanobis distance DMAH is the leverage, which is often used in the framework of regression diagnostics and outlier detection, defined as

hic = xi · (XcᵀXc)⁻¹ · xiᵀ = (D^MAH_ic)² / (n − 1)

where Xc is the model matrix centered on the centroid c of the data and n is the number of objects in the data set; hic measures the dissimilarity between the object xi and the data centroid or, in other words, gives the 'distance' of the ith object from the center of the model represented by the data matrix X.
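The proportionality between the leverage and the squared Mahalanobis distance from the centroid can be checked numerically (random data for illustration; with S = XcᵀXc/(n − 1), the relation h = D²/(n − 1) holds exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # hypothetical data set: n = 20, p = 3
n = X.shape[0]
Xc = X - X.mean(axis=0)             # center on the data centroid
S = (Xc.T @ Xc) / (n - 1)           # covariance matrix

i = 0
h_i = float(Xc[i] @ np.linalg.inv(Xc.T @ Xc) @ Xc[i])  # leverage of object i
d2_mah = float(Xc[i] @ np.linalg.inv(S) @ Xc[i])       # squared Mahalanobis
                                                       # distance from centroid
```

Since S⁻¹ = (n − 1)·(XcᵀXc)⁻¹, the two quadratic forms differ exactly by the factor n − 1.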
Recently, a variant of the Mahalanobis distance, called the locally centered Mahalanobis distance (R7), has been proposed in the literature.(12) What makes this function different from the classical Mahalanobis distance is the way the dissimilarity between objects x and y is evaluated, as the data covariance matrix is centered on one of the two objects instead of the data centroid. Obviously, two different values are obtained depending on whether the covariance matrix is centered in y (i.e. S(y)) or in x (i.e. S(x)), and, therefore, the property of symmetry (Axiom 3) is violated, with the consequence that the locally centered Mahalanobis (LCM) function cannot properly be regarded as a distance but as a quasi-distance (i.e. D^LCM_xy ≠ D^LCM_yx).

In order to make the LCM symmetric, and thus compliant with Axiom 3 of distances, two symmetrization procedures can be applied:

D^MSA_xy = (D^LCM_xy + D^LCM_yx) / 2
Table 5 Bounded distances on real-valued data. r_xy is the Pearson correlation. All sums run over j = 1, …, p

Equation  Distance                            Definition                                                          Range                 Average
R8        Canberra                            D_xy^CAN = Σ_j |x_j − y_j| / (|x_j| + |y_j|)                        0 ≤ D_xy^CAN ≤ p      D̄_xy^CAN = D_xy^CAN / p
R9        Clark or coefficient of divergence  D_xy^CLA = √( Σ_j [(x_j − y_j) / (|x_j| + |y_j|)]² )                0 ≤ D_xy^CLA ≤ √p     D̄_xy^CLA = D_xy^CLA / √p
R10       Wave-Edge                           D_xy^WE = Σ_j [1 − min(x_j, y_j) / max(x_j, y_j)]                   0 ≤ D_xy^WE ≤ p       D̄_xy^WE = D_xy^WE / p
R11       Lance–Williams or Bray–Curtis       D_xy^LW = Σ_j |x_j − y_j| / Σ_j (|x_j| + |y_j|)                     0 ≤ D_xy^LW ≤ 1       D̄_xy^LW = D_xy^LW
R12       Soergel                             D_xy^SOE = Σ_j |x_j − y_j| / Σ_j max(x_j, y_j)                      0 ≤ D_xy^SOE ≤ 1      D̄_xy^SOE = D_xy^SOE
R13       Jaccard–Tanimoto distance           D_xy^JT = √( 1 − Σ_j x_j·y_j / (Σ_j x_j² + Σ_j y_j² − Σ_j x_j·y_j) )  0 ≤ D_xy^JT ≤ 1    D̄_xy^JT = D_xy^JT
R14       Pearson distance                    D_xy^PEA = 1 − r_xy                                                 0 ≤ D_xy^PEA ≤ 2      D̄_xy^PEA = D_xy^PEA
R15       Correlation distance                D_xy^COR = (1 − r_xy) / 2                                           0 ≤ D_xy^COR ≤ 1      D̄_xy^COR = D_xy^COR
R16       Cosine or angular distance          D_xy^CD = 1 − Σ_j x_j·y_j / √(Σ_j x_j² · Σ_j y_j²)                  0 ≤ D_xy^CD ≤ 1       D̄_xy^CD = D_xy^CD
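A few of the bounded distances of Table 5 (R8, R11, R12, and R16) can be sketched in a few lines of Python; the function names are our own, not part of any library:

```python
import math

def canberra(x, y):
    # R8: sum of |x_j - y_j| / (|x_j| + |y_j|), skipping terms with zero denominator
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)

def lance_williams(x, y):
    # R11 (Bray-Curtis): sum of |x_j - y_j| over sum of (|x_j| + |y_j|)
    return (sum(abs(a - b) for a, b in zip(x, y))
            / sum(abs(a) + abs(b) for a, b in zip(x, y)))

def soergel(x, y):
    # R12: sum of |x_j - y_j| over sum of max(x_j, y_j)
    return (sum(abs(a - b) for a, b in zip(x, y))
            / sum(max(a, b) for a, b in zip(x, y)))

def cosine_distance(x, y):
    # R16: one minus the cosine similarity coefficient
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return 1.0 - num / den

x = [0.2, 0.5, 0.9]
y = [0.1, 0.6, 0.4]
```

For these two vectors, R11 and R12 fall in [0, 1] as the Range column states, while the Canberra value is bounded by p = 3 before averaging.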
    D_xy^MSG = √(D_xy^LCM × D_yx^LCM)

the former being the arithmetic mean (MSA) of the two dissimilarity values and the latter their geometric mean (MSG). For the sake of simplicity, hereinafter the symbol D_xy^MU will be used to indicate the LCM function centered in x in place of the symbol D_xy^LCM, whereas the symbol D_xy^ML will replace the symbol D_yx^LCM to indicate the LCM function centered in y. Note that the symbol MU refers to the elements of the upper triangular submatrix of the dissimilarity matrix containing LCM values between all the pairs of objects and the symbol ML to the elements of the lower triangular submatrix of the same matrix.
Specific distance measures for real-valued data are also derived from similarity measures (Table 5). Among these, the Jaccard–Tanimoto distance (R13) is derived from the well-known Jaccard–Tanimoto similarity coefficient S^JT:

    S_xy^JT = Σ_{j=1}^p x_j·y_j / ( Σ_{j=1}^p x_j² + Σ_{j=1}^p y_j² − Σ_{j=1}^p x_j·y_j )
and the angular distance (R16) is derived from the cosine similarity coefficient S^CC:

    S_xy^CC = Σ_{j=1}^p x_j·y_j / √( Σ_{j=1}^p x_j² · Σ_{j=1}^p y_j² )
It is interesting to note that the Euclidean and Jaccard–Tanimoto distances are intrinsically related. Indeed, the squared Euclidean distance can be rewritten as

    (D_xy^EUC)² = Σ_{j=1}^p (x_j − y_j)² = Σ_j x_j² + Σ_j y_j² − 2·Σ_j x_j·y_j

and the squared Jaccard–Tanimoto distance as

    (D_xy^JT)² = 1 − Σ_j x_j·y_j / ( Σ_j x_j² + Σ_j y_j² − Σ_j x_j·y_j )
               = ( Σ_j x_j² + Σ_j y_j² − 2·Σ_j x_j·y_j ) / ( Σ_j x_j² + Σ_j y_j² − Σ_j x_j·y_j )
               = (D_xy^EUC)² / ( Σ_j x_j² + Σ_j y_j² − Σ_j x_j·y_j )

Hence, the squared Jaccard–Tanimoto distance can be viewed as a normalized squared Euclidean distance.
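The identity relating the two squared distances can be verified numerically; this is a minimal check on an arbitrary pair of vectors:

```python
# Verify (D_JT)^2 = (D_EUC)^2 / (sum x^2 + sum y^2 - sum x*y) on a sample pair
x = [1.0, 2.0, 3.0]
y = [2.0, 0.5, 2.5]

sxx = sum(a * a for a in x)                      # sum of x_j^2
syy = sum(b * b for b in y)                      # sum of y_j^2
sxy = sum(a * b for a, b in zip(x, y))           # sum of x_j * y_j

d_euc_sq = sum((a - b) ** 2 for a, b in zip(x, y))   # equals sxx + syy - 2*sxy
d_jt_sq = 1.0 - sxy / (sxx + syy - sxy)              # squared Jaccard-Tanimoto distance

assert abs(d_euc_sq - (sxx + syy - 2 * sxy)) < 1e-12
assert abs(d_jt_sq - d_euc_sq / (sxx + syy - sxy)) < 1e-12
```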
The Pearson (R14) and correlation (R15) distances, usually applied to measure variable correlation but here applied to pairs of objects, are derived from the Pearson correlation coefficient r_xy, the best-known bivariate measure of the degree of association between two objects x and y, defined as

    r_xy = Σ_{j=1}^p (x_j − x̄)·(y_j − ȳ) / √( Σ_{j=1}^p (x_j − x̄)² · Σ_{j=1}^p (y_j − ȳ)² )    −1 ≤ r_xy ≤ +1

where x̄ and ȳ are the means of vectors x and y, respectively.
From the correlation coefficient, the squared Pearson distance was also defined as

    D_xy^SQP = 1 − r_xy²

where pairs of objects with correlation equal to either −1 or +1 are considered similar to each other, their D_xy^SQP being 0. In order to obtain a suitable measure in the range [0, 1], the Pearson distance (R14) was scaled, giving the correlation distance (R15):

    D_xy^COR = D_xy^PEA / 2
It should be noted that the correlation distance, unlike the classical distances, does not account for systematic differences between objects, as it measures the association between the object profiles. Moreover, when objects are described by only two variables, the correlation coefficient always equals +1 or −1, so the distances collapse onto just two values; the correlation distance cannot be calculated at all for data described by a single variable.
A distance measure is scaling invariant if the following relationship is fulfilled:

    D(x, y) = D(α·x + β, α·y + β)

where α and β are the scaling parameters. More specifically, a distance has the property of

1. translation invariance, i.e. invariance to translation with respect to the origin, if D(x, y) = D(x + β, y + β), with α = 1 in the scaling invariance expression;
2. scale invariance, i.e. invariance to dilation, if D(x, y) = D(α·x, α·y), with β = 0 in the scaling invariance expression.
Among the distances collected in Tables 4 and 5, the classical Mahalanobis, LCM, Pearson, and correlation distances are scaling invariant, as both conditions (translation invariance and scale invariance) hold. The Euclidean, Manhattan, Lagrange, and Minkowski distances are invariant to translation, as any constant added to both x and y vanishes when the difference between values is calculated, whereas they are not invariant to dilation. On the contrary, most of the bounded distances of Table 5 are invariant to dilation but not to translation, as they are based on the ratio of two quantities.
Figure 4 Profiles of the distance values between two objects obtained for increasing shift parameter β values (distances not invariant to translation). Plotted distances: Canberra (CAN), Lance–Williams (LW), Clark (CLA), Soergel (SOE), Bhattacharyya (BHA), Wave-Edge (WE), Jaccard–Tanimoto (JT), and cosine distance (CD).
Figure 5 Profiles of the distance values between two objects obtained for increasing dilation parameter α values (distances not invariant to dilation). Plotted distances: Euclidean, Manhattan, Lagrange, and Bhattacharyya.
Example 2 Invariance properties of distances
The invariance properties of distances have been further investigated by a simple exercise. Consider two objects x and y described by range-scaled variables in the interval [0, 1]. Then, the shift parameter β is added to all the variable values of both x and y and varied between 0 and 10 with step 1. For each different value of the shift parameter, the distance between the two objects x and y is calculated using the different distance functions of Tables 4 and 5. This exercise can be repeated, keeping the shift parameter β constant at value 0 and multiplying the variables of both objects x and y by the dilation parameter α. The dilation parameter is varied from 1 to 2 with step 0.25 and, for each different value, the distances of Tables 4 and 5 are calculated. Finally, the results of this calculation are shown in Figures 4 and 5, where one can easily see how sensitive the different distance functions are to the shift and dilation parameters. The Canberra, Lance–Williams, Clark, Soergel, Wave-Edge, Jaccard–Tanimoto, angular, and Bhattacharyya distances are clearly not invariant to translation (Figure 4). For these distance functions, indeed, the origin of the axes is significant and, therefore, if the two objects are moved away from the origin of the axes, the distance between them tends toward zero.
The Euclidean, Manhattan, Lagrange, and Bhattacharyya distances do not fulfill the property of dilation invariance and, as can be seen in Figure 5, they increase significantly as the value of α increases: the Euclidean, Manhattan, and Lagrange distances increase by a factor α and the Bhattacharyya distance by a factor √α.
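The exercise above can be reproduced on a single (β, α) pair; a minimal sketch with two representative distances (the helper functions are our own, and the vectors are assumed strictly positive so the Canberra terms are well defined):

```python
import math

def euclidean(x, y):
    # R1: square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def canberra(x, y):
    # R8: sum of |x_j - y_j| / (|x_j| + |y_j|); assumes positive values
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y))

x = [0.2, 0.5, 0.9]
y = [0.1, 0.6, 0.4]

beta = 5.0   # shift: Euclidean unchanged, Canberra shrinks toward zero
xs = [v + beta for v in x]
ys = [v + beta for v in y]
assert abs(euclidean(xs, ys) - euclidean(x, y)) < 1e-12
assert canberra(xs, ys) < canberra(x, y)

alpha = 2.0  # dilation: Canberra unchanged, Euclidean scales by alpha
xd = [alpha * v for v in x]
yd = [alpha * v for v in y]
assert abs(canberra(xd, yd) - canberra(x, y)) < 1e-12
assert abs(euclidean(xd, yd) - alpha * euclidean(x, y)) < 1e-12
```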
Table 6 Distance measures for ranked data

Equation  Distance                    Definition
P1        Spearman distance           D_xy^SPE = Σ_{j=1}^p (r_xj − r_yj)²
P2        Kendall distance            D_xy^KEN = Σ_{j=1}^{p−1} Σ_{k=j+1}^p |δ_jk|, summing only the pairs for which δ_jk < 0
P3        Mahalanobis rank distance   D_xy^MRD = 2 · Σ_{j=1}^p (r_xj − r_yj)² / (r_xj + r_yj)
P4        Bhattacharyya rank distance D_xy^BHA = √( Σ_{j=1}^p (√r_xj − √r_yj)² )
Among the considered distances, the only one that fulfills neither invariance property is the Bhattacharyya distance, although it shows only minimal sensitivity to both the shift and dilation parameters.
3.2 Distances for Ranked Data
When dealing with ordinal data, it is implicit that the data can be ordered and a rank can be assigned to each entity; for this kind of data, specific similarity measures are required.
The most common association measures between two rankings or permutations are the Spearman ρ coefficient and the Kendall τ coefficient. The Spearman ρ coefficient for two ranked vectors x and y is defined as

    ρ_xy = 1 − 6 · Σ_{j=1}^p (r_xj − r_yj)² / (p³ − p)    −1 ≤ ρ_xy ≤ +1

where r_xj and r_yj indicate the rank of the jth variable for the two entities x and y in the interval [1, p]. This expression of the Spearman rank correlation corresponds to the Pearson correlation calculated on the ranks, i.e.
    ρ_xy = Σ_{j=1}^p (r_xj − r̄_x)·(r_yj − r̄_y) / √( Σ_{j=1}^p (r_xj − r̄_x)² · Σ_{j=1}^p (r_yj − r̄_y)² )
The Kendall τ coefficient is defined as

    τ_xy = 2 · Σ_{j=1}^{p−1} Σ_{k=j+1}^p δ_jk / [p·(p − 1)]    −1 ≤ τ_xy ≤ +1

where negative δ_jk values indicate rank discordance of x and y, whereas positive values indicate rank concordance:

    δ_jk = −1 if (r_xj − r_xk)·(r_yj − r_yk) < 0; +1 otherwise
Distance measures derived from both the Spearman rank correlation and the Kendall coefficient, along with the Mahalanobis rank distance and the Bhattacharyya rank distance, are collected in Table 6.
Example 3 Similarity/diversity measures for ranked data
In order to better explain how to calculate similarity/diversity measures for ranked data, a simple example is reported here. Suppose we have two objects x and y, each described by rankings from 1 to 4. Table 7 shows the four ranks for the objects. The above-mentioned similarity/diversity measures are then calculated as

    Spearman ρ coefficient: ρ_xy = 1 − 6·(0² + 3² + 2² + 1²)/(4³ − 4) = −0.4
    Spearman ρ distance: D_xy^SPE = 0² + 3² + 2² + 1² = 14
    Mahalanobis rank distance: D_xy^MRD = 2·(0 + 1.80 + 0.67 + 0.33) = 5.6
    Bhattacharyya rank distance: D_xy^BHA = √[(√3 − √3)² + (√4 − √1)² + (√2 − √4)² + (√1 − √2)²] = 1.23
Table 7 Two objects x and y, each characterized by four ranks ranging between 1 and 4

      a   b   c   d
x     3   4   2   1
y     3   1   4   2
Table 8 Distance measures for frequencies. All sums run over j = 1, …, p

Equation  Distance                    Definition                                                                      Range
F1        Tanimoto distance           D_xy^T = 1 − Σ_j min(f_xj, f_yj) / [Σ_j f_xj + Σ_j f_yj − Σ_j min(f_xj, f_yj)]  0 ≤ D_xy^T ≤ 1
F2        Modified Tanimoto distance  D_xy^MT = 1 − 2·Σ_j min(f_xj, f_yj) / [Σ_j f_xj + Σ_j f_yj]                     0 ≤ D_xy^MT ≤ 1
    Kendall τ coefficient: τ_xy = 2·(−1 − 1 + 1 − 1 − 1 + 1)/[4·(4 − 1)] = 2·(−2)/12 = −0.33
    Kendall τ distance: D_xy^KEN = |−1| + |−1| + |−1| + |−1| = 4
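The ranked-data measures of Example 3 can be reproduced with a short sketch (function names are our own):

```python
def spearman_rho(rx, ry):
    # Spearman rank correlation: 1 - 6 * sum of squared rank differences / (p^3 - p)
    p = len(rx)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / (p ** 3 - p)

def spearman_distance(rx, ry):
    # P1: sum of squared rank differences
    return sum((a - b) ** 2 for a, b in zip(rx, ry))

def kendall_tau(rx, ry):
    # delta_jk = -1 for discordant pairs, +1 otherwise; tau = 2*sum / (p*(p-1))
    p = len(rx)
    s = 0
    for j in range(p - 1):
        for k in range(j + 1, p):
            s += -1 if (rx[j] - rx[k]) * (ry[j] - ry[k]) < 0 else 1
    return 2 * s / (p * (p - 1))

# Ranks of Table 7
rx, ry = [3, 4, 2, 1], [3, 1, 4, 2]
```

For these ranks, spearman_rho gives −0.4, spearman_distance gives 14, and kendall_tau gives −1/3, matching the worked example.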
3.3 Distances for Frequency Data
Frequencies are integer numbers denoting the occurrences of repeated events, cases, and so on. The two best-known measures of dissimilarity for frequency data are collected in Table 8.
The two distances are derived from the Tanimoto similarity and the modified Tanimoto similarity, respectively, defined as

    S_xy^T = Σ_{j=1}^p min(f_xj, f_yj) / [ Σ_{j=1}^p f_xj + Σ_{j=1}^p f_yj − Σ_{j=1}^p min(f_xj, f_yj) ]

and

    S_xy^MT = 2 · Σ_{j=1}^p min(f_xj, f_yj) / [ Σ_{j=1}^p f_xj + Σ_{j=1}^p f_yj ]

where f_xj is the number of occurrences of the jth event (the variable) for the object x and f_yj the number of occurrences of the same jth event for the object y.
Example 4 Similarity/diversity measures for frequency data

Consider the data collected in Table 7 and suppose these data are frequencies instead of ranks; the distance measures defined earlier for frequency occurrences are then calculated as

    D_xy^T = 1 − (3 + 1 + 2 + 1) / [(3 + 4 + 2 + 1) + (3 + 1 + 4 + 2) − (3 + 1 + 2 + 1)] = 1 − 7/13 = 0.46

    D_xy^MT = 1 − 2·(3 + 1 + 2 + 1) / [(3 + 4 + 2 + 1) + (3 + 1 + 4 + 2)] = 1 − 14/20 = 0.30
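The calculations of Example 4 can be checked with a minimal sketch (function names are our own):

```python
def tanimoto_distance(fx, fy):
    # F1: 1 - sum(min) / (sum fx + sum fy - sum(min))
    m = sum(min(a, b) for a, b in zip(fx, fy))
    return 1 - m / (sum(fx) + sum(fy) - m)

def modified_tanimoto_distance(fx, fy):
    # F2: 1 - 2*sum(min) / (sum fx + sum fy)
    return 1 - 2 * sum(min(a, b) for a, b in zip(fx, fy)) / (sum(fx) + sum(fy))

# Frequencies taken from Table 7 (read as counts instead of ranks)
fx, fy = [3, 4, 2, 1], [3, 1, 4, 2]
```

These reproduce the values 1 − 7/13 ≈ 0.46 and 1 − 14/20 = 0.30 of the worked example.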
3.4 Binary Similarity Measures
Binary variables, whose values are either one or zero (presence/absence, yes/no, active/inactive, etc.), are very common in data mining. They can describe, for example, the presence/absence of a signal at a given wavelength of a spectrum or the presence/absence of a specific functional group or molecular fragment in a molecule; they can define whether a compound is active or inactive and, in general, whether some feature or attribute is observed or not.

To deal with binary variables, several similarity coefficients have been proposed in the literature, and they can all be described as follows. Let two objects be described by the binary vectors x and y, each comprising p variables with values 0/1. The common association coefficients are calculated from the data reported in a frequency table
Table 9 Frequency table of the four possible combinations of values 0 and 1 for two binary samples x and y

          y = 1   y = 0
x = 1       a       b     a + b
x = 0       c       d     c + d
          a + c   b + d     p

(Table 9), where a, b, c, and d are the frequencies of the events (x = 1 and y = 1), (x = 1 and y = 0), (x = 0 and y = 1), and (x = 0 and y = 0), respectively, in the pair of binary vectors describing the two objects; p is the total number of attributes (i.e. variables), equal to a + b + c + d, which is the length of each binary vector.
The frequency table can be read as follows: a is the number of common presences of the attributes, d the number of common absences in x and y, a + b the number of attributes present in x, and a + c the number of attributes present in y. The diagonal entries a and d hence give information about the similarity between the two vectors, whereas the entries b and c give information about their dissimilarity.
Example 5 Frequency table for binary data
A simple example is presented to show how the frequencies a, b, c, and d are calculated. Let two objects be represented by the vectors x and y, each described by 10 binary variables (i.e. p = 10):

    x : 1 1 0 1 1 0 0 0 1 1
    y : 1 0 1 0 1 0 0 0 1 1

Then, a, b, c, and d take the following values:

          y = 1     y = 0
x = 1     a = 4     b = 2     a + b = 6
x = 0     c = 1     d = 3     c + d = 4
          a + c = 5  b + d = 5  p = 10
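Counting the four frequencies from a pair of binary vectors is a one-liner per entry; a minimal sketch (function name is our own), using the vectors of Example 5:

```python
def binary_counts(x, y):
    # a: common presences, b: present only in x, c: present only in y, d: common absences
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

x = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]
y = [1, 0, 1, 0, 1, 0, 0, 0, 1, 1]
a, b, c, d = binary_counts(x, y)          # (4, 2, 1, 3)

smc = (a + d) / (a + b + c + d)           # simple matching (B1)
jt = a / (a + b + c)                      # Jaccard-Tanimoto (B3)
```

For these vectors, the counts are (4, 2, 1, 3), so the simple matching coefficient is 0.7 and the Jaccard–Tanimoto coefficient 4/7.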
Binary coefficients are usually distinguished into (i) symmetric coefficients, which use both a and d, i.e. the double-zero state (d) for two objects is treated in exactly the same way as any other pair of values; they should be used when the zero state is a valid basis for comparing two objects; (ii) asymmetric coefficients, which, conversely, ignore such double-zero attributes in the similarity calculation; and (iii) correlation-based coefficients, defined in the range [−1, +1], which account for the difference between the occurrence frequency of concordances (i.e. ad) and the occurrence frequency of discordances (i.e. bc).
The most common binary similarity coefficients are listed in Table 10. Most of these coefficients are naturally defined in the range [0, 1]. Coefficients having ranges other than [0, 1] can be rescaled using the following linear transform:

    S′ = (S + α) / β

where S is the original similarity value, S′ the rescaled function in the range [0, 1], and α and β numerical parameters whose values are reported in Table 10 (where, obviously, α = 0 and β = 1 indicate that no transformation is required to obtain the desired range). Table 10 also collects the mathematical conditions that must be applied to make each binary coefficient valid for any combination of the a, b, c, and d frequencies.
The most common binary similarity coefficients are the Jaccard–Tanimoto coefficient (B3), which emphasizes the number of common presences a, neglecting the number of common absences d, and the simple matching coefficient (B1), which accounts for both the presence and the absence of common features.
A weighted version of the Jaccard–Tanimoto coefficient (B3) is the Tversky similarity coefficient, defined as

    S_xy^TV = a / (a + γ·b + δ·c)    0 ≤ S_xy^TV ≤ 1

where γ and δ are user-defined parameters. In particular, equal values of γ and δ provide a symmetrical contribution of the two dissimilarity frequencies b and c, as, for instance, in the Jaccard–Tanimoto coefficient (B3), for which γ = δ = 1, in the Gleason coefficient (B4), where γ = δ = 1/2, in the Sokal–Sneath coefficient (B12), where γ = δ = 2, and in the Jaccard coefficient (B14), where γ = δ = 1/3. Different values of γ and δ provide an asymmetrical contribution, as, for example, in the Dice–Wallace, Post–Snijders coefficient (B31), for which γ = 1 and δ = 0; this coefficient can be interpreted as the fraction of object x that is in common with object y.
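The Tversky family can be sketched in one function (name is our own); the special parameter choices listed above then recover B3, B4, and B31:

```python
def tversky(a, b, c, gamma, delta):
    # S_TV = a / (a + gamma*b + delta*c); gamma = delta gives symmetric coefficients
    return a / (a + gamma * b + delta * c)

# Frequencies from Example 5: a = 4, b = 2, c = 1
s_jt = tversky(4, 2, 1, 1.0, 1.0)     # Jaccard-Tanimoto (B3): a/(a+b+c)
s_gle = tversky(4, 2, 1, 0.5, 0.5)    # Gleason (B4): 2a/(2a+b+c)
s_di1 = tversky(4, 2, 1, 1.0, 0.0)    # Dice-Wallace/Post-Snijders (B31): a/(a+b)
```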
The coefficient CT5 (B43) is the normalized versionof a measure derived from a Bayesian analysis of the
Table 10 Binary similarity coefficients. In the column 'Conditions', den indicates the denominator of the function

Equation  Similarity coefficient             Definition                                                   α       β              Conditions
B1   Sokal–Michener, simple matching    S_xy^SM = (a + d)/p                                          0       1              None
B2   Rogers–Tanimoto                    S_xy^RT = (a + d)/(p + b + c)                                0       1              None
B3   Jaccard–Tanimoto                   S_xy^JT = a/(a + b + c)                                      0       1              a = 0 → s = 0
B4   Gleason–Dice–Sorensen              S_xy^GLE = 2a/(2a + b + c)                                   0       1              a = 0 → s = 0
B5   Russell–Rao                        S_xy^RR = a/p                                                0       1              None
B6   Forbes                             S_xy^FOR = p·a/[(a + b)(a + c)]                              0       p/a            den = 0 ∨ a = 0 → s = 0
B7   Simpson                            S_xy^SIM = a/min{(a + b), (a + c)}                           0       1              den = 0 ∨ a = 0 → s = 0
B8   Braun-Blanquet                     S_xy^BB = a/max{(a + b), (a + c)}                            0       1              a = 0 → s = 0
B9   Driver–Kroeber–Ochiai cosine       S_xy^DK = a/√[(a + b)(a + c)]                                0       1              den = 0 → s = 0
B10  Baroni-Urbani–Buser                S_xy^BU1 = [√(ad) + a]/[√(ad) + a + b + c]                   0       1              d = p → s = 1
B11  Kulczynski                         S_xy^KUL = (1/2)·[a/(a + b) + a/(a + c)]                     0       1              a = 0 → s = 0
B12  Sokal–Sneath                       S_xy^SS1 = a/(a + 2b + 2c)                                   0       1              a = 0 → s = 0
B13  Sokal–Sneath                       S_xy^SS2 = (2a + 2d)/(p + a + d)                             0       1              None
B14  Jaccard                            S_xy^JA = 3a/(3a + b + c)                                    0       1              a = 0 → s = 0
B15  Faith                              S_xy^FAI = (a + 0.5·d)/p                                     0       1              None
B16  Mountford                          S_xy^MOU = 2a/(ab + ac + 2bc)                                0       2              den = 0 → s = a/p
B17  Michael                            S_xy^MIC = 4·(ad − bc)/[(a + d)² + (b + c)²]                 1       2              a = p ∨ d = p → s = 1; b + c = 0 → s = 1
B18  Rogot–Goldberg                     S_xy^RG = a/(2a + b + c) + d/(2d + b + c)                    0       1              a = p ∨ d = p → s = 1
B19  Hawkins–Dotson                     S_xy^HD = (1/2)·[a/(a + b + c) + d/(b + c + d)]              0       1              a = p ∨ d = p → s = 1
B20  Yule                               S_xy^YU1 = (ad − bc)/(ad + bc)                               1       2              a = p ∨ d = p ∨ bc = 0 → s = 1
B21  Yule                               S_xy^YU2 = [√(ad) − √(bc)]/[√(ad) + √(bc)]                   1       2              a = p ∨ d = p ∨ bc = 0 → s = 1
B22  Fossum                             S_xy^FOS = p·(a − 0.5)²/[(a + b)(a + c)]                     0       (p − 0.5)²/p   den = 0 → s = 0
B23  Dennis                             S_xy^DEN = (ad − bc)/√[p·(a + b)(a + c)]                     √p/2    √p             a = p ∨ d = p → s = 1; den = 0 → s = 0
B24  Cole                               S_xy^CO1 = (ad − bc)/[(a + c)(c + d)]                        p − 1   p              a = p ∨ d = p → s = 1; den = 0 → s = 0
B25  Cole                               S_xy^CO2 = (ad − bc)/[(a + b)(b + d)]                        p − 1   p              a = p ∨ d = p → s = 1; den = 0 → s = 0
Table 10 (Continued)

Equation  Similarity coefficient        Definition                                                                 α     β    Conditions
B26  Dispersion                    S_xy^DIS = (ad − bc)/p²                                                    1/4   1/2  a = p ∨ d = p → s = 1
B27  Goodman–Kruskal               S_xy^GK = [2·min(a, d) − b − c]/[2·min(a, d) + b + c]                      1     2    a = p ∨ d = p → s = 1
B28  Sokal–Sneath                  S_xy^SS3 = (1/4)·[a/(a + b) + a/(a + c) + d/(b + d) + d/(c + d)]           0     1    a = p ∨ d = p → s = 1; a = 0 ∧ d = 0 → s = 0
B29  Sokal–Sneath                  S_xy^SS4 = a/√[(a + b)(a + c)] · d/√[(b + d)(c + d)]                       0     1    a = p ∨ d = p → s = 1; a = 0 ∨ d = 0 → s = 0
B30  Pearson–Heron                 S_xy^PHI = (ad − bc)/√[(a + b)(a + c)(c + d)(b + d)]                       1     2    a = p ∨ d = p → s = 1; b = p ∨ c = p → s = 0; den = 0 → s = 0
B31  Dice–Wallace, Post–Snijders   S_xy^DI1 = a/(a + b)                                                       0     1    a = 0 → s = 0
B32  Dice–Wallace, Post–Snijders   S_xy^DI2 = a/(a + c)                                                       0     1    a = 0 → s = 0
B33  Sorgenfrei                    S_xy^SOR = a²/[(a + b)(a + c)]                                             0     1    a = 0 → s = 0
B34  Cohen                         S_xy^COE = 2·(ad − bc)/[(a + b)(b + d) + (a + c)(c + d)]                   1     2    a = p ∨ d = p → s = 1; den = 0 → s = 0
B35  Peirce                        S_xy^PE1 = (ad − bc)/[(a + b)(c + d)]                                      1     2    a = p ∨ d = p → s = 1; b = p ∨ c = p → s = 0
B36  Peirce                        S_xy^PE2 = (ad − bc)/[(a + c)(b + d)]                                      1     2    a = p ∨ d = p → s = 1; b = p ∨ c = p → s = 0
B37  Maxwell–Pilliner              S_xy^MP = 2·(ad − bc)/[(a + b)(c + d) + (a + c)(b + d)]                    1     2    a = p ∨ d = p → s = 1; den = 0 → s = 0
B38  Harris–Lahey                  S_xy^HL = a·(2d + b + c)/[2·(a + b + c)] + d·(2a + b + c)/[2·(b + c + d)]  0     p    a = p ∨ d = p → s = 1; den = 0 → s = 0
B39  Consonni–Todeschini           S_xy^CT1 = ln(1 + a + d)/ln(1 + p)                                         0     1    None
B40  Consonni–Todeschini           S_xy^CT2 = [ln(1 + p) − ln(1 + b + c)]/ln(1 + p)                           0     1    None
B41  Consonni–Todeschini           S_xy^CT3 = ln(1 + a)/ln(1 + p)                                             0     1    None
B42  Consonni–Todeschini           S_xy^CT4 = ln(1 + a)/ln(1 + a + b + c)                                     0     1    None
B43  Consonni–Todeschini           S_xy^CT5 = ln[(1 + ad)/(1 + bc)]/ln(1 + p²/4)                              1     2    None
B44  Austin–Colwell                S_xy^AC = (2/π)·arcsin(√[(a + d)/p])                                       0     1    None
Table 11 Some distance measures derived from binary variables

Equation  Distance                        Definition                        Range
C1        Hamming distance                D_xy^HAM = b + c                  [0, p]
C2        Root squared Hamming distance   D_xy^HSR = √(b + c)               [0, √p]
C3        Tanimoto distance               D_xy^TAN = (b + c)/p              [0, 1]
C4        Root squared Tanimoto distance  D_xy^TSR = √[(b + c)/p]           [0, 1]
C5        Watson nonmetric distance       D_xy^WAT = (b + c)/(2a + b + c)   [0, 1]
C6        Soergel binary distance         D_xy^SBD = (b + c)/(a + b + c)    [0, 1]
probability, which was defined as(13)

    w_xy = ln{ [p(x = 1 ∧ y = 1) × p(x = 0 ∧ y = 0)] / [p(x = 0 ∧ y = 1) × p(x = 1 ∧ y = 0)] } = ln(ad/bc)    −∞ < w_xy < +∞

where p(·) is the probability of the event.

While binary similarity coefficients are mainly based on the number a of common presences of the attributes and the number d of common absences in x and y, binary distances account for the entries b and c, which give information about the dissimilarity of the two objects. The most popular binary distance measures (Table 11) are the Hamming distance, defined as

    D_xy^HAM = b + c    0 ≤ D_xy^HAM ≤ p

and the Tanimoto distance, its normalized counterpart, defined as

    D_xy^T = (b + c)/p    0 ≤ D_xy^T ≤ 1
In the literature, their root squared versions have also been proposed and used for particular applications. Two further binary distance measures are the Watson nonmetric distance

    D_xy^WAT = (b + c)/(2a + b + c)    0 ≤ D_xy^WAT ≤ 1

and the Soergel binary distance

    D_xy^SOE = (b + c)/(a + b + c)    0 ≤ D_xy^SOE ≤ 1
Comparing the distances for binary and for real-valued continuous variables, it is easy to see that the Hamming distance (C1) coincides with the Manhattan distance (R2), the root squared Hamming distance (C2) with the Euclidean distance (R1), the Tanimoto distance (C3) with the average Manhattan distance, and the root squared Tanimoto distance (C4) with the average Euclidean distance. Moreover, the Watson nonmetric distance (C5) corresponds to the Lance–Williams distance (R11) and is the complement of the Gleason coefficient (B4); the Soergel binary distance (C6) corresponds to the Soergel distance (R12) and is the complement of the Jaccard–Tanimoto coefficient (B3).
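These correspondences are easy to verify numerically: on binary vectors, Σ|x_j − y_j| and Σ(x_j − y_j)² both reduce to b + c. A minimal check, reusing the vectors of Example 5:

```python
import math

x = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]
y = [1, 0, 1, 0, 1, 0, 0, 0, 1, 1]

# b: present only in x; c: present only in y
b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)

manhattan = sum(abs(u - v) for u, v in zip(x, y))
euclidean = math.sqrt(sum((u - v) ** 2 for u, v in zip(x, y)))

assert manhattan == b + c                           # Manhattan = Hamming (C1)
assert abs(euclidean - math.sqrt(b + c)) < 1e-12    # Euclidean = root squared Hamming (C2)
```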
Among the mathematical properties of the binary similarity coefficients, particular attention has to be paid to their metricity, that is, whether a similarity coefficient can be transformed into a metric distance. By definition, metric distances comply with the triangle inequality; dissimilarity measures that do not comply with the triangle inequality are nonmetric, and quasi-metric if the symmetry condition is not fulfilled either.

After transformation into distances, it is easy to see that several similarity coefficients are nonmetric, as it is possible for two objects x and y to have a distance value larger than the sum of their distances from a third object z. It follows that these similarity measures cannot be used directly to project objects into a metric space, unless a suitable transformation has been applied to convert them into metric distances. Moreover, it is important to remember that no transformation induces a metric distance if the similarity measure does not fulfill the symmetry condition (i.e. if S_xy ≠ S_yx). For some binary similarity coefficients this condition is not fulfilled, for example, when the parameters b and c do not enter the definition symmetrically: in this case, swapping the two objects exchanges the values of b and c. This happens, for instance, for the coefficients CO1 (B24), CO2 (B25), DI1 (B31), and DI2 (B32). The properties of binary similarity coefficients are further
discussed in a subsequent paragraph of this article, where a multivariate comparison of the similarity coefficients of Table 10 on a simulated data set is carried out.
3.5 Mixed-type Distances
In real cases, the variables describing the data can be 'mixed', i.e. a mixture of numeric values and counts (variables defined on interval or ratio scales), rankings (variables defined on ordinal scales), and categorical and binary attributes (variables defined on nominal scales). Therefore, mixed-type distances, referred to by the general symbol D^MT, should be used when a data set contains variables of different types: nominal (n), binary (b), ordinal (o), and real-valued (r) variables. In these cases, to evaluate the proximities of pairs of objects, the following general equation can be used:

    D_xy^MT = w_n·D_xy^n + w_b·D_xy^b + w_o·D_xy^o + w_r·D_xy^r

where D^n is the distance contribution calculated considering only nominal variables, D^b the distance contribution calculated considering only binary variables, D^o the distance contribution calculated considering only ordinal variables, and D^r the distance contribution calculated considering only real-valued variables; w_n, w_b, w_o, and w_r are user-defined weights for the different types of distance contributions.
A general similarity measure proposed to deal with mixed-type data is the Gower coefficient, defined as

    S_xy^GOW = Σ_{j=1}^p s_xy,j / Σ_{j=1}^p δ_xy,j

where s_xy,j is the similarity of the objects x and y calculated for the jth variable and δ_xy,j a comparison index, which is 1 when the jth variable can be used to compare x and y, and 0 otherwise.
For nominal variables, the similarity contribution is calculated as

    s_xy,j = 1 if x_j = y_j; 0 otherwise

For binary variables, the similarity contribution and the variable counter are calculated as

    s_xy,j = 1 if x_j = 1 ∧ y_j = 1; 0 otherwise
    δ_xy,j = 0 if x_j = 0 ∧ y_j = 0; 1 otherwise

For real-valued variables, the similarity contribution is calculated as

    s_xy,j = 1 − |x_j − y_j| / (U_j − L_j)

where U_j and L_j are the upper and lower values of the jth variable in the data.
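The three Gower contributions can be sketched in one function; a minimal implementation under the assumptions above (the function name, the `kinds` codes, and the `ranges` argument are our own conventions):

```python
def gower(x, y, kinds, ranges=None):
    """Gower similarity; kinds[j] is 'n' (nominal), 'b' (binary), or 'r' (real);
    ranges[j] = (L_j, U_j) must be given for each real-valued variable."""
    num = den = 0.0
    for j, kind in enumerate(kinds):
        if kind == 'n':                       # nominal: exact match
            num += 1.0 if x[j] == y[j] else 0.0
            den += 1.0
        elif kind == 'b':                     # binary: double zero is ignored (delta = 0)
            if x[j] == 0 and y[j] == 0:
                continue
            num += 1.0 if x[j] == 1 and y[j] == 1 else 0.0
            den += 1.0
        else:                                 # real-valued: range-normalized difference
            lo, up = ranges[j]
            num += 1.0 - abs(x[j] - y[j]) / (up - lo)
            den += 1.0
    return num / den

# One nominal, one binary, one real-valued variable
s = gower(['red', 1, 0.5], ['red', 0, 0.7],
          kinds=['n', 'b', 'r'], ranges={2: (0.0, 1.0)})
```

Here the contributions are 1 (matching category), 0 (one-sided presence, still counted), and 1 − 0.2 = 0.8, giving S^GOW = 1.8/3 = 0.6.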
The Park distance is a distance measure for mixed-type data that requires all the variables to be scaled in the range [0, 1]. To define it, the following rules hold:

• nominal variables are considered as binary variables;
• binary variables are kept unchanged;
• ordinal variables taking values between [1, k] are scaled as x′_j = x_j/k;
• real-valued variables are range scaled between [0, 1].

The Park distance is then calculated as the average Euclidean distance:

    D_xy^PAR = √[ Σ_{j=1}^p (x_j − y_j)² / p ]
It is interesting to note that the Jaccard–Tanimoto distance is defined both for binary and for real-valued variables. Hence, when only these two kinds of variables are present in the data set, the Jaccard–Tanimoto function is a suitable measure of the proximity between objects.
4 META-DISTANCES
The concept of meta-distance introduces higher-order levels of similarity/diversity measures. The concept was proposed by Buscema,(10) who measured the connection between two variables j and k in what was called, in that specific context, the atemporal target diffusion model (ATDM). The connection strength between variables was defined as

    s_jk = Σ_{i=1}^n [ x_ij·x_ik × Σ_{m≠j,k}^p ( (1 + ε) + (x_ij − x_im) ) / ( (1 + ε) + (x_ik − x_im) ) ]

where i runs over all the n objects, j and k denote the jth/kth pair of variables, and m runs over all the p − 2 variables other than j and k. The variables were range scaled between [0, 1] and the term ε was set equal to
0.0001 to avoid singularities. The connection w_jk between two variables was then defined as

    w_jk = s_jk · e^(−D_jk / p^α)

where D_jk is any distance measure. The contribution made by the distance is determined by the variable weighting parameter α, which was set to 0.1 in the quoted work.
In practice, the ATDM weights how much each variable depends on any other, while also considering the contexts of the other variables. We can then say that this function weights the association of any pair of variables with an approximation of the highest order of relationship.
Analogously, the meta-distance (or its dual concept of meta-similarity) between two objects x and y is here defined as

    D_xy^META = D_xy · e^(−α·S_xy^META)

where D_xy is any distance measure, called the primary distance, which is contracted by a meta-similarity factor S^META able to catch higher-order proximity aspects; this factor considers the similarity of the objects x and y with all the other n − 2 objects different from x and y:
    S_xy^META = (1/p) · Σ_{j=1}^p [ 1/(n − 2) · Σ_{z≠x,y} δ_j(z) ]    0 ≤ S_xy^META ≤ 1

    δ_j(z) = 1 if (1 − ε) ≤ t ≤ 1, with t = [1 + min(d_xz,j, d_yz,j)] / [1 + max(d_xz,j, d_yz,j)]; 0 otherwise
where d_xz,j is any dissimilarity measure between x and z calculated considering only the jth variable, and the threshold t is a user-defined parameter near to 1. The Kronecker delta δ_j(z) measures the meta-similarity between the two objects x and y considering their proximity relationships with all the other n − 2 objects, that is, counting how many times the two objects x and y have similar distances from another object z; if the distances were exactly the same, the distance ratio would be equal to 1. Note that, by definition, the largest distance is put in the denominator of the ratio so as to obtain values equal to or smaller than 1. The threshold 1 − ε also allows one to count all the cases for which there are only small differences between the two distances d_xz,j and d_yz,j.
There are several different possibilities for obtaining a meta-distance, depending on the choice of the primary distance D_xy and of the dissimilarity function d_xz,j used to calculate the meta-similarity contraction factor S^META. In this study, we focused on the meta-distance measure derived from the Jaccard–Tanimoto distance D^JT as the primary distance and the Manhattan distance for the meta-similarity; the α parameter was arbitrarily set equal to 2, thus allowing a maximum reduction of 87% (e^−2 = 0.135) of the primary distance when S^META = 1.
Mathematically, this meta-distance, called the contracted Jaccard–Tanimoto distance, is defined as

    D_xy^CJT = D_xy^JT · e^(−2·S_xy^MAN)    0 ≤ D_xy^CJT ≤ 1

    D_xy^JT = [ 1 − Σ_{j=1}^p x_j·y_j / ( Σ_{j=1}^p x_j² + Σ_{j=1}^p y_j² − Σ_{j=1}^p x_j·y_j ) ]^(1/2)

    S_xy^MAN = (1/p) · Σ_{j=1}^p [ 1/(n − 2) · Σ_{z≠x,y} δ_j(z) ]

    δ_j(z) = 1 if 0.95 ≤ [1 + min(|x_j − z_j|, |y_j − z_j|)] / [1 + max(|x_j − z_j|, |y_j − z_j|)] ≤ 1; 0 otherwise

The value of 0.95 was obtained with ε = 0.05 and is assumed as the threshold below which the two distances of x and y from z are considered different.
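The contracted Jaccard–Tanimoto meta-distance can be sketched as follows; this is a minimal implementation under the choices stated above (Manhattan per-variable dissimilarities, α = 2, ε = 0.05), with function names of our own:

```python
import math

def jt_distance(x, y):
    # Primary distance: Jaccard-Tanimoto distance (R13)
    sxy = sum(a * b for a, b in zip(x, y))
    return math.sqrt(1 - sxy / (sum(a * a for a in x)
                                + sum(b * b for b in y) - sxy))

def manhattan_meta_similarity(data, xi, yi, eps=0.05):
    # S_MAN: fraction of (variable, third-object) pairs for which x and y
    # lie at nearly the same distance from z (ratio >= 1 - eps)
    x, y = data[xi], data[yi]
    p, n = len(x), len(data)
    count = 0
    for j in range(p):
        for zi, z in enumerate(data):
            if zi in (xi, yi):
                continue
            dxz, dyz = abs(x[j] - z[j]), abs(y[j] - z[j])
            if (1 + min(dxz, dyz)) / (1 + max(dxz, dyz)) >= 1 - eps:
                count += 1
    return count / (p * (n - 2))

def contracted_jt(data, xi, yi, alpha=2.0):
    # D_CJT = D_JT * exp(-alpha * S_MAN); at most an 87% contraction for alpha = 2
    return (jt_distance(data[xi], data[yi])
            * math.exp(-alpha * manhattan_meta_similarity(data, xi, yi)))

# Four range-scaled objects described by two variables
data = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
```

By construction the meta-similarity lies in [0, 1], so the contracted distance never exceeds the primary Jaccard–Tanimoto distance.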
5 DISTANCES BETWEEN SETS
In the framework of similarity/diversity evaluation, the comparison between two sets of objects described by the same variables (linkage metrics) or between two sets of variables describing the same objects (Procrustes analysis and canonical measure of distance) is also important. A special case is the comparison of different models (regression or classification models) for the same set of objects.

Before discussing some specific approaches to measuring proximities between sets, a general distance measure between sets, namely the Hausdorff distance, is briefly introduced.
5.1 Hausdorff Distance
Let {X, D} be a metric space, i.e. a set X equipped with a metric D. A distance measure between two nonempty subsets A and B of a metric space {X, D} is called
Encyclopedia of Analytical Chemistry, Online © 2006–2015 John Wiley & Sons, Ltd.This article is © 2015 John Wiley & Sons, Ltd.This article was published in the Encyclopedia of Analytical Chemistry in 2015 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470027318.a9438
Table 12 Linkage metrics used in agglomerative hierarchical clustering

Equation  Linkage metric    Definition
L1        Average linkage   $D^{AL}_{AB} = \sum_{a=1}^{M}\sum_{b=1}^{N} d_{ab}\,/\,(M\cdot N)$
L2        Single linkage    $D^{SL}_{AB} = \min_{a,b}(d_{ab})$
L3        Complete linkage  $D^{CL}_{AB} = \max_{a,b}(d_{ab})$
L4        Centroid linkage  $D^{CEN}_{AB} = (c_A - c_B)^2$
L5        Median linkage    $D^{MED}_{AB} = (med_A - med_B)^2$
L6        Ward linkage      $D^{WL}_{AB} = \sqrt{\frac{M\cdot N}{M+N}}\cdot (c_A - c_B)^2$
Hausdorff distance, defined as

$$D^{HAU}_{AB} = \max\left\{\sup_{a\in A}\Big(\inf_{b\in B} d_{ab}\Big),\ \sup_{b\in B}\Big(\inf_{a\in A} d_{ab}\Big)\right\}$$
where $\inf_{b\in B} d_{ab}$ is the distance between any point a of A and the set B, and $\inf_{a\in A} d_{ab}$ the distance between any point b of B and the set A.
Examples of calculation of Hausdorff distances are

D^HAU([1, 7], [3, 6]) = max[sup(inf{2, 5}, inf{4, 1}), sup(inf{2, 4}, inf{5, 1})] = max[sup{2, 1}, sup{2, 1}] = 2

D^HAU([1], [3, 6]) = max[sup(inf{2, 5}), sup(inf{2}, inf{5})] = max[sup{2}, sup{2, 5}] = 5

D^HAU([1, 7], [1, 4, 5, 7]) = max[sup(inf{0, 3, 4, 6}, inf{6, 3, 2, 0}), sup(inf{0, 6}, inf{3, 3}, inf{4, 2}, inf{6, 0})] = max[sup{0, 0}, sup{0, 3, 2, 0}] = 3
Here, it must be highlighted that, in general, the Hausdorff distance is a semi-metric because the property of strong reflexivity (Axiom 2′) is not fulfilled. Indeed, D^HAU_AB = 0 does not imply A = B, but simply that the closures of A and B coincide; however, if both A and B are closed sets, then D^HAU_AB = 0 implies A = B, the strong reflexivity condition holds, and the Hausdorff distance becomes a metric.
5.2 Linkage Metrics
Linkage metrics are distances between two sets of objectsdescribed by the same variables; these kinds of distances
are typically used in cluster analysis to evaluate clusterproximities.
Let A = {a_1, a_2, …, a_M} be a set of M objects and B = {b_1, b_2, …, b_N} a set of N objects, each described by the same p variables; c_A = {x̄_1, x̄_2, …, x̄_p}_A and c_B = {x̄_1, x̄_2, …, x̄_p}_B are the p-dimensional centroids of the two sets (i.e. the vectors of the average values of the p variables describing the objects, calculated separately over the objects of each set); med_A and med_B are the corresponding medians of the two sets.
The most common linkage metrics are collected inTable 12.
In general, linkage metrics are ultrametrics, that is, the ultrametric inequality (Axiom 4′) holds, which states that the distance D_xy between two objects is smaller than or equal to the maximum of the distances between each of the two objects and any other object of the set, that is,

Ax. 4′: $D_{xy} \le \max\{D_{xz}, D_{zy}\}$
The algorithms of agglomerative hierarchical clusteringexploit these linkage metrics to produce the dendrogramof a data set.
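The linkage metrics of Table 12 are one-line reductions of the M × N matrix of pairwise distances between the two sets. A minimal Python sketch with Euclidean d_ab (the function names are ours):

```python
import numpy as np

def pairwise(A, B):
    """M x N matrix of Euclidean distances d_ab between the objects of A and B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_linkage(A, B):
    return pairwise(A, B).min()          # L2: min over all pairs

def complete_linkage(A, B):
    return pairwise(A, B).max()          # L3: max over all pairs

def average_linkage(A, B):
    return pairwise(A, B).mean()         # L1: mean over all M*N pairs

def centroid_linkage(A, B):
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    return np.sum((cA - cB) ** 2)        # L4: squared centroid distance
```

With A the 2-point set {(0,0), (0,1)} and B the set {(3,0), (4,0)}, single linkage gives 3 (the closest pair) and centroid linkage gives 12.5.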
5.3 Procrustes Analysis
Procrustes analysis is a statistical method to compare two data sets comprising the same objects but described by different sets of variables. The two data sets could be, for instance, the sets of variables of two different classification or regression models obtained from the same set of objects.
Procrustes analysis determines a linear transformation, based on translation, reflection, orthogonal rotation, and scaling, of the points in the first data set to best conform them to the points in the second data set.(14,11) The Procrustes goodness-of-fit criterion provides a measure of the dissimilarity between the two data sets, it being
the sum of squared differences between points after translation, dilation, and rotation of one data set with respect to the other one; it equals 0 if the two data sets coincide, whereas it equals 1 if the data structures are completely dissimilar.
5.4 Canonical Measure of Distance
Canonical measure of distance (CMD) is a dissimilarityfunction proposed to compare two data sets with the sameobjects but two different sets of variables as for Procrustesanalysis.
Let A and B be the two different data sets. The simplest way to measure the distance between these two data sets disregards the actual variable values and simply consists in counting the number of variables not shared by the two data sets. This function is the Hamming distance (Table 11, C1), defined for two sets A and B as

$$D^{HAM}_{AB} = b + c$$

where b is the number of variables in A but not in B and c the number of variables present in B but not in A. The Hamming distance usually has an upward bias, as it overestimates the actual distance between two variable sets because variable correlation is not accounted for.
The CMD(15) overcomes this drawback and is defined as

$$D^{CMD}_{AB} = p_A + p_B - 2\cdot\sum_{j=1}^{M}\sqrt{\lambda_j} \qquad 0 \le D^{CMD}_{AB} \le (p_A + p_B)$$
where p_A and p_B are the numbers of variables in sets A and B, respectively, λ_j the jth eigenvalue of the symmetrical cross-correlation matrix, and M the number of nonvanishing eigenvalues.
The cross-correlation matrix contains the pairwise correlation coefficients between the variables of the two sets; it is an unsymmetrical matrix C_AB of size (p_A × p_B) or C_BA of size (p_B × p_A). The symmetrical cross-correlation matrix is derived by the following inner product:

$$Q_A = C_{AB}\times C_{BA} \quad\text{or}\quad Q_B = C_{BA}\times C_{AB}$$

where Q_A and Q_B are two different square symmetrical matrices, one of size p_A × p_A and the other of size p_B × p_B. Although these symmetrical matrices are different, their M nonzero eigenvalues coincide, M being the minimum of the ranks of Q_A and Q_B.
The canonical measure of correlation was also derivedfrom the nonvanishing eigenvalues λ of the symmetrical
Table 13 Pairwise correlations between variables x1, x2, x3, and x4

      x1     x2     x3     x4
x1    1      0.979  0.061  0.475
x2    0.979  1      0.194  0.593
x3    0.061  0.194  1      0.240
x4    0.475  0.593  0.240  1
cross-correlation matrices as the following:

$$\rho^{CMC}_{AB} \equiv CMC_{AB} = \frac{\sum_{j=1}^{M}\sqrt{\lambda_j}}{\sqrt{p_A\cdot p_B}} \qquad 0 \le \rho^{CMC}_{AB} \le 1$$
where the numerator measures the inter-set common variance and the denominator is its theoretical maximum value. This index is related to the multidimensional correlation structure between two sets of variables. If no correlation exists between any pair of variables from the two sets, then CMC = 0 and the CMD index reduces to the Hamming distance.
The CMD function fulfills the first three main axioms for a distance measure; however, the triangle inequality does not always hold; thus, the CMD between sets is a nonmetric distance.
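Under the definitions above, CMD and CMC can be sketched in a few lines, assuming, as in the text, that the λ_j are the nonvanishing eigenvalues of Q_A = C_AB × C_BA (the function name is ours):

```python
import numpy as np

def cmd_cmc(XA, XB):
    """CMD and CMC between two variable sets describing the same objects.
    XA, XB: (n_objects x p_A) and (n_objects x p_B) data matrices."""
    pA, pB = XA.shape[1], XB.shape[1]
    # pairwise correlations between the variables of the two sets: C_AB
    C = np.corrcoef(XA.T, XB.T)[:pA, pA:]      # size pA x pB
    Q = C @ C.T                                # Q_A = C_AB x C_BA, symmetric
    lam = np.linalg.eigvalsh(Q)
    lam = lam[lam > 1e-12]                     # keep nonvanishing eigenvalues
    s = np.sqrt(lam).sum()
    cmd = pA + pB - 2.0 * s
    cmc = s / np.sqrt(pA * pB)
    return cmd, cmc
```

As a sanity check, comparing a variable set with itself gives CMD = 0 and CMC = 1, matching case 1 of Table 14 below.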
Example 6 Similarity/diversity measures between sets
An example of calculation of proximity measures between sets is presented. Several data sets, each obtained by combining four variables in different ways, were generated. Table 13 shows the variable pairwise correlations; CMD and CMC measures are given in Table 14, along with the Hamming distance and the Procrustes goodness-of-fit criterion.
From the results of Table 14, the first consideration is that the CMD index removes the strong degeneration of the Hamming distance, allowing a better distinction among the different cases. It is also apparent that conclusions drawn from the Hamming distance are quite different from conclusions based on the CMD index. Consider, for instance, case 2, which refers to the comparison of two data sets having two common variables (x3 and x4) and the third variable different. The Hamming distance equals 2, meaning that there is some difference between the two data sets; on the contrary, the CMD index is very near zero, meaning that the two sets are actually the same. This is a consequence of the large correlation (i.e. 0.979) between variables x1 and x2. Similar considerations can be made for all the other cases shown in Table 14.
Table 14 Hamming distance D^HAM, Procrustes distance D^PR, CMD, and CMC indices for different variable sets; variable pairwise correlations are collected in Table 13

ID  Set A           Set B           pA  pB  b  c  DHAM  DPR    CMD    CMC
1   x1, x2, x3, x4  x1, x2, x3, x4  4   4   0  0  0     0      0      1
2   x1, x3, x4      x2, x3, x4      3   3   1  1  2     0.004  0.028  0.995
3   x1, x2, x3, x4  x1, x3, x4      4   3   1  0  1     0.017  0.150  0.989
4   x1, x2, x3, x4  x2, x3, x4      4   3   1  0  1     0.019  0.176  0.985
5   x1, x2, x3, x4  x1, x2, x3      4   3   1  0  1     0.338  0.591  0.925
6   x1, x2, x3      x1, x3, x4      3   3   1  1  2     0.412  0.651  0.892
7   x1, x2, x3      x2, x3, x4      3   3   1  1  2     0.412  0.681  0.887
8   x1, x2, x3, x4  x1, x2, x4      4   3   1  0  1     0.109  0.819  0.892
9   x1, x2, x3, x4  x2, x3          4   2   2  0  2     0.377  0.927  0.897
10  x1, x2, x3, x4  x1, x3          4   2   2  0  2     0.431  0.993  0.885
11  x1              x1, x2, x4      1   3   0  2  2     0.491  1.045  0.853
12  x1, x2, x3, x4  x1, x4          4   2   2  0  2     0.144  1.100  0.866
13  x1, x2, x3, x4  x2, x4          4   2   2  0  2     0.147  1.127  0.861
14  x1              x1, x2, x3      1   3   0  2  2     0.309  1.199  0.809
15  x3              x4              1   1   1  1  2     0.942  1.520  0.240
16  x1              x2, x3, x4      1   3   1  3  4     0.662  1.821  0.629
17  x1              x1, x2, x3, x4  1   4   0  3  3     0.559  2.042  0.740
18  x1, x2, x4      x3, x4          3   2   2  1  3     0.384  2.272  0.432
19  x1, x2          x3, x4          2   2   2  2  4     0.740  2.291  0.427
20  x3              x1, x2, x4      1   3   1  3  4     0.956  3.371  0.182
pA, pB, b, and c are the terms defined in the text.
Table 15 List of the data sets used to compare distances for real-valued data

ID  Data set   Objects  Variables  Classes
1   Iris       150      4          3
2   Wines      178      13         3
3   Perpot     100      2          2
4   Sulfa      50       7          2
5   Thiophene  24       3          3
6   Itaoils    572      8          9
7   Blood      784      4          2
8   Diabetes   768      8          2
6 DISTANCE MEASURES ON GRAPHS
A graph is usually denoted as G = (V, E), where V is a set of vertices and E a set of elements representing the binary relationships between pairs of vertices; unordered vertex pairs are called edges. Several systems can be represented by graphs, for instance, social networks, collaboration networks, communication networks, bibliometric networks, and so on. In the field of chemistry, graphs are used to represent molecules and are specifically referred to as molecular graphs, where vertices and edges are interpreted as atoms and chemical bonds, respectively. A molecular graph depicts the connectivity of atoms in a molecule irrespective of metric parameters such as equilibrium interatomic distances among nuclei, bond angles, and torsion angles, which represent the 3-D molecular geometry.
Distances between vertices and edges are mainlycalculated in terms of topological and detour distances.
The topological distance D^TOP_xy is the number of edges along the shortest path between the vertices v_x and v_y, i.e. the length of the geodesic between v_x and v_y.

The detour distance Δ^DET_xy is the 'opposite' of the topological distance, it being the length of the longest path between the vertices v_x and v_y, i.e. the maximum number of edges that separate the two vertices.

It can be noted that the topological and detour distances coincide for acyclic graphs, there being only one path connecting any pair of vertices, while they can differ when at least one cycle is present in the graph.
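Both graph distances can be sketched directly: the topological distance via breadth-first search, and the detour distance via exhaustive enumeration of simple paths, which is affordable for small molecular graphs (adjacency-list representation; the function names are ours):

```python
from collections import deque

def topological_distance(adj, s, t):
    """Length of the shortest path (geodesic) between vertices s and t (BFS)."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float("inf")                      # disconnected vertices

def detour_distance(adj, s, t):
    """Length of the longest simple path between s and t (exhaustive DFS)."""
    best = -1
    def dfs(u, seen, length):
        nonlocal best
        if u == t:
            best = max(best, length)
            return
        for v in adj[u]:
            if v not in seen:
                dfs(v, seen | {v}, length + 1)
    dfs(s, {s}, 0)
    return best
```

On a five-membered ring, vertices two bonds apart have topological distance 2 but detour distance 3 (the long way around the cycle); on a path graph, the two distances coincide, as stated above for acyclic graphs.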
7 MULTIVARIATE COMPARISON OFREAL-VALUED DISTANCES
In order to perform a multivariate comparison amongthe real-valued distances, eight benchmark data setswere considered. The list of these data sets together
with the number of objects and variables is provided in Table 15. For each data set, the partition of objects into different classes was used to evaluate the effects of distance measures on similarity-based classification.(16–23)
The following 18 distance measures were consideredfor this analysis:
• nine unbounded distances (Table 4), namely Euclidean (R1), Manhattan (R2), Lagrange (R3), Bhattacharyya (R5), Mahalanobis (R6), and four different LCM distances (R7); the latter are the unsymmetrical LCM centered in x, the unsymmetrical LCM centered in y, the symmetrical LCM based on the arithmetic mean, and the symmetrical LCM based on the geometric mean;
• eight bounded distances (Table 5), namely Canberra (R8), Clark (R9), Wave-Edge (R10), Lance–Williams (R11), Soergel (R12), Jaccard–Tanimoto (R13), Correlation (R15), and Cosine (R16);
• one meta-distance, that is, the contracted Jaccard–Tanimoto distance, derived from the Jaccard–Tanimoto distance as the primary distance
Figure 6 Projections of the 150 Iris objects by means of multidimensional scaling based on 18 distance measures. The differentcolors represent the classes.
and Manhattan distance as the meta-similaritycontraction factor.
7.1 Comparison of Real-valued Distance Measures inUnsupervised Analysis
This study was undertaken with the aim of investigatinghow the different distance measures influence the mutualrelationships among the objects of a data set and,therefore, how their graphical visualization and the resultsof unsupervised analysis can change accordingly.
A visual example of how the 18 considered distances induce different similarity/diversity relationships among the objects, and thus define different geometries of the data, is shown in Figure 6 for the Iris data set. The plots of Figure 6 are the projections of the data into a 2-D space obtained by means of the MDS technique, a suitable multivariate method to account for the mutual relationships of the object distances by reproducing the data structure encoded in the distance (similarity) matrix in a 2-D space.
The changes in the distribution of the three classes ofIris data set allow an easy comparison of the differenteffects of the distances on the object relationships.
At first glance, the majority of the distances reveal a separation of the blue class from the others. On the other hand, this behavior is not so evident for the four distances derived from the LCM distance (MU, ML, MSG, and MSA), which give a different overview of the class distribution. With respect to the green and red classes, the distances give different degrees of separation; for example, these two classes overlap when CD and COR are used, while better visual separations are obtained by means of EUC, LAG, and CJT. Finally, each distance gives a different perception of the presence of outliers; for example, almost all distances detect one or more outliers in the blue class, while CJT gives a compact clustering of this class.
For each data set in analysis, the pairwise dissimilarities were then calculated between all the possible pairs of objects (i.e. n(n − 1)/2, n being the total number of objects of the data set) using all the considered 18 distance functions, one at a time. At the end of this calculation, the pairwise distances were collected into a data matrix of dimension [n(n − 1)/2] × 18, where the rows represent the object pairs and the columns the considered distance measures. PCA was then applied to this data matrix in order to investigate the relationships among the different distance functions.
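The construction of the pair-distance matrix and its PCA can be sketched as follows; for brevity, only three of the 18 measures are included, and the function names are ours:

```python
import numpy as np

# An illustrative subset of the distance measures, as plain functions.
METRICS = {
    "EUC": lambda x, y: np.sqrt(np.sum((x - y) ** 2)),   # Euclidean
    "MAN": lambda x, y: np.sum(np.abs(x - y)),           # Manhattan
    "LAG": lambda x, y: np.max(np.abs(x - y)),           # Lagrange (Chebyshev)
}

def pair_distance_matrix(X, metrics=METRICS):
    """Rows = the n(n-1)/2 object pairs, columns = distance measures."""
    n = X.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return np.column_stack(
        [np.array([m(X[i], X[j]) for i, j in pairs]) for m in metrics.values()]
    )

def pca_loadings(D, n_comp=2):
    """PCA of the autoscaled pair-distance matrix via SVD; returns the
    loadings (one row per distance measure) and explained-variance ratios."""
    Z = (D - D.mean(axis=0)) / D.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    ev = S ** 2 / np.sum(S ** 2)
    return Vt[:n_comp].T, ev[:n_comp]
```

Plotting the rows of the loading matrix against each other gives plots analogous to Figures 7–14, where nearby points correspond to distance measures inducing similar pairwise-dissimilarity patterns.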
The loading plots of the first four PCs are reported in Figures 7–14; the first four PCs always explain a total variance higher than 90%. To make the plots more readable, the following symbols were adopted: white squares for the five Mahalanobis-type distances (MAH: classical Mahalanobis; ML: LCM function centered in x; MU: LCM function centered in y; MSA: symmetrical LCM function, arithmetic mean; MSG: symmetrical LCM function, geometric mean); black squares for the two correlation-based bounded distances (COR: correlation distance; CD: cosine distance); white circles for the four other unbounded distances (EUC: Euclidean; MAN: Manhattan; LAG: Lagrange; BHA: Bhattacharyya); black circles for the other six bounded distances (CAN: Canberra; LW: Lance–Williams; WE: Wave-Edge; CLA: Clark; SOE: Soergel; JT: Jaccard–Tanimoto); an asterisk for the meta-distance (CJT: contracted Jaccard–Tanimoto).
Figure 7 PCA of the pairwise distances between the objects of Iris data set obtained by the different distance functions. (a)PC1–PC2 loading plot. (b) PC3–PC4 loading plot.
Figure 8 PCA of the pairwise distances between the objects of Wines data set obtained by the different distance functions. (a)PC1–PC2 loading plot. (b) PC3–PC4 loading plot.
Figure 9 PCA of the pairwise distances between the objects of Perpot data set obtained by the different distance functions. (a)PC1–PC2 loading plot. (b) PC3–PC4 loading plot.
First, it can be noted that in most cases the five Mahalanobis-type distances behave differently from the other distances. In particular, this separation is apparent for the data sets Iris (Figure 7), Itaoils (Figure 12), Sulfa (Figure 10), Wines (Figure 8), and Blood (Figure 13). Moreover, the third or the fourth component highlights the opposite behavior of the two asymmetrical LCM functions MU and ML, while the two different symmetrization procedures of the LCM distance, i.e. by arithmetic mean (MSA) and geometric mean (MSG), are not significantly distinguishable in any of the PCs for any of the data sets.
Euclidean (EUC) and Manhattan (MAN) distances are often very similar and, in several cases, not far from the Lagrange (LAG) distance. For the data sets Blood (Figure 13), Diabetes (Figure 14), and Thiophene (Figure 11), they are closer to the group of Mahalanobis-type distances than to the bounded distances. In particular, the Lagrange distance (LAG) appears similar to the Mahalanobis distance (MAH) in the data sets Diabetes and Thiophene and is, in general, the most similar to the group of Mahalanobis-type distances. The Bhattacharyya distance (BHA) seems to differ from the other unbounded distances (see, e.g. Diabetes and Blood) and is often similar to the Jaccard–Tanimoto distance (JT).
Figure 10 PCA of the pairwise distances between the objects of Sulfa data set obtained by the different distance functions. (a)PC1–PC2 loading plot. (b) PC3–PC4 loading plot.
Figure 11 PCA of the pairwise distances between the objects of Thiophene data set obtained by the different distance functions.(a) PC1–PC2 loading plot. (b) PC3–PC4 loading plot.
In some cases, the correlation distance (COR) appears separated from the other distances; to a lesser extent, the same behavior is also shown by the cosine distance (CD). This can be clearly noted in the Perpot (Figure 9) and Blood (Figure 13) data sets.
Besides correlation and cosine distances, the othersix bounded distances, namely Lance–Williams (LW),Canberra (CAN), Wave-Edge (WE), Soergel (SOE),Jaccard–Tanimoto (JT), and Clark (CLA), appear veryoften in the same region of the PC space, especially inthe first two components. These distances can be furtherpartitioned into two subgroups, the former constitutedby Wave-Edge (WE), Canberra (CAN), and Clark
(CLA) distances and the latter by Lance–Williams (LW),Soergel (SOE), and Jaccard–Tanimoto (JT) distances.For example, evidence for these two subgroups canbe found in the first two PCs of the data sets Blood(Figure 13a), Diabetes (Figure 14a), and Thiophene(Figure 11a) and in the third and fourth PCs of Iris(Figure 7b), Thiophene (Figure 11b), Sulfa (Figure 10),and Itaoils (Figure 12b) data sets.
Moreover, it is remarkable that the Contracted Jaccard–Tanimoto (CJT) distance is usually not very far from the Jaccard–Tanimoto distance (JT), as expected; however, it remains clearly distinguishable from JT in all the eight data sets. These relationships can be easily noted
Figure 12 PCA of the pairwise distances between the objects of Itaoils data set obtained by the different distance functions. (a)PC1–PC2 loading plot. (b) PC3–PC4 loading plot.
Figure 13 PCA of the pairwise distances between the objects of Blood data set obtained by the different distance functions. (a)PC1–PC2 loading plot. (b) PC3–PC4 loading plot.
in the first two PCs for the data sets Blood (Figure 13a),Thiophene (Figure 11a), and Diabetes (Figure 14a) andin the third and fourth PCs for Thiophene (Figure 11b),Sulfa (Figure 10b), and Perpot (Figure 9b) data sets.
7.2 Effects of Distance Measures on Similarity-basedClassification
Classification methods are fundamental multivariate techniques aimed at finding mathematical models able to recognize the class membership of objects on the basis of a set of measurements. The k-NN classification rule is conceptually quite simple: an object is classified according to the class memberships of its k closest objects, i.e. according to the majority of its k nearest neighbors in the data space. Thus, an object is classified on the basis of its similarity to other objects. From a computational viewpoint, all that is necessary is to calculate and analyze a distance matrix. The distance of each object from all the other objects is computed, and the objects are then sorted according to this distance. In order to quantitatively evaluate the effects of the different geometries induced on the data by each distance function, the k-NN analysis
Figure 14 PCA of the pairwise distances between the objects of Diabetes data set obtained by the different distance functions.(a) PC1–PC2 loading plot. (b) PC3–PC4 loading plot.
Table 16 Nonerror rate of the 18 distances for each data set and the average nonerror rate (NER) from k-NN classification

Distance          Symbol  Iris   Wines  Perpot Sulfa  Thiophene Itaoils Blood  Diabetes NER
Manhattan         MAN     0.953  0.981  0.990  0.823  0.833     0.949   0.637  0.692    0.857
Euclidean         EUC     0.967  0.977  0.990  0.774  0.833     0.947   0.625  0.707    0.852
Soergel           SOE     0.953  0.967  0.980  0.788  0.792     0.947   0.637  0.708    0.847
Lance–Williams    LW      0.953  0.967  0.980  0.788  0.792     0.947   0.637  0.707    0.846
Contracted JT     CJT     0.967  0.981  0.990  0.760  0.792     0.946   0.626  0.708    0.846
Jaccard–Tanimoto  JT      0.960  0.972  0.980  0.724  0.792     0.947   0.628  0.718    0.840
Lagrange          LAG     0.967  0.955  0.970  0.774  0.792     0.943   0.609  0.697    0.838
Wave-Edge         WE      0.953  0.972  0.960  0.788  0.750     0.929   0.633  0.663    0.831
Bhattacharyya     BHA     0.953  0.977  1.000  0.683  0.792     0.932   0.637  0.673    0.831
Canberra          CAN     0.947  0.977  0.960  0.752  0.750     0.929   0.637  0.672    0.828
Mahalanobis       MAH     0.913  0.972  0.990  0.710  0.792     0.920   0.626  0.700    0.828
Clark             CLA     0.953  0.986  0.970  0.732  0.750     0.911   0.632  0.662    0.825
LCM-symm. geom.   MSG     0.920  0.917  0.980  0.718  0.792     0.913   0.635  0.684    0.820
LCM-symm. arith.  MSA     0.920  0.897  0.980  0.690  0.792     0.909   0.637  0.681    0.813
LCM-lower mat.    ML      0.933  0.888  0.940  0.766  0.708     0.869   0.633  0.645    0.798
Cosine            CD      0.827  0.972  0.900  0.540  0.792     0.954   0.603  0.650    0.780
LCM-upper mat.    MU      0.853  0.618  0.990  0.516  0.750     0.865   0.613  0.629    0.729
Correlation       COR     0.853  0.980  0.500  0.518  0.708     0.947   0.598  0.629    0.717
The best results for each data set are highlighted in gray.
was performed on the eight data sets (Table 15). No sophisticated validation procedures were adopted, only the implicit leave-one-out technique typical of the k-NN approach. Thus, a rough estimate of the potential classification behavior of each distance was obtained for each data set.
The usual way of selecting k is by testing a set of kvalues (e.g. from 1 to 10); then, the k giving the lowestclassification error can be selected as the optimal one.
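The procedure can be sketched as a leave-one-out k-NN loop with a pluggable distance function (our own sketch, not the authors' code; here the nonerror rate is taken simply as the fraction of correctly classified objects, whereas the chapter's NER may be class-averaged):

```python
import numpy as np
from collections import Counter

def loo_knn_ner(X, y, k=1, metric=lambda a, b: np.linalg.norm(a - b)):
    """Leave-one-out k-NN: each object is classified by the majority class
    of its k nearest neighbours under the chosen distance function."""
    n = len(y)
    correct = 0
    for i in range(n):
        # distance of object i from all others; exclude i itself
        d = np.array([metric(X[i], X[j]) if j != i else np.inf
                      for j in range(n)])
        neighbours = np.argsort(d)[:k]
        pred = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        correct += int(pred == y[i])
    return correct / n
```

Selecting k then amounts to evaluating `loo_knn_ner` for k = 1, …, 10 and retaining the k with the highest rate, as described above.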
The nonerror rate (NER) was calculated for each data set by means of each of the 18 distance measures, together
with the average NER (Table 16). The rank of eachdistance for each data set was calculated together withthe average rank (AR) (Table 17).
By analyzing the results reported in Tables 16 and 17, it can be easily observed that the use of different distance measures has an effect on the classification performance, owing to the way each distance represents the similarity/diversity relationships among the objects. Five distances give the best results considering both the average NER and the AR; these are the Manhattan (MAN), Euclidean (EUC), Soergel (SOE), Contracted
Table 17 Average rank (AR) of the 18 distances from k-NN classification

Distance          Symbol  AR
Manhattan         MAN     3.8
Euclidean         EUC     5.2
Contracted JT     CJT     5.9
Soergel           SOE     6.3
Lance–Williams    LW      6.4
Jaccard–Tanimoto  JT      7.3
Bhattacharyya     BHA     7.7
Lagrange          LAG     9.2
Mahalanobis       MAH     9.9
Canberra          CAN     10.1
Wave-Edge         WE      10.1
Clark             CLA     10.4
LCM-symm. geom.   MSG     10.9
LCM-symm. arith.  MSA     11.2
Cosine            CD      12.6
LCM-lower mat.    ML      13.9
Correlation       COR     14.2
LCM-upper mat.    MU      15.1
Figure 15 Projections of the 178 Wines objects by means of multidimensional scaling based on Euclidean (EUC, Figure 15a),Clark (CLA, Figure 15b), Jaccard–Tanimoto (JT, Figure 15c), and Contracted Jaccard–Tanimoto (CJT, Figure 15d) distances. Thedifferent colors represent the classes.
Table 18 Statistical parameters for 44 similarity coefficients calculated by the simulated data set

#    Symbol  mean    std     cv      perc(5)  perc(95)
B1   SM      0.5420  0.2869  0.5293  0.0625   0.9629
B2   RT      0.4268  0.2848  0.6672  0.0323   0.9284
B3   JT      0.4050  0.3112  0.7685  0.0102   0.9433
B4   GLE     0.5066  0.3204  0.6325  0.0201   0.9708
B5   RR      0.3329  0.2939  0.8829  0.0059   0.9014
B6   FOR     0.3756  0.3234  0.8609  0.0015   0.9429
B7   SIM     0.6902  0.2984  0.4323  0.0865   0.9971
B8   BB      0.4427  0.3185  0.7194  0.0119   0.9569
B9   DK      0.5302  0.3075  0.5800  0.0391   0.9710
B10  BUB     0.5052  0.2980  0.5898  0.0375   0.9569
B11  KUL     0.5665  0.2875  0.5075  0.0606   0.9712
B12  SS1     0.3092  0.2880  0.9312  0.0051   0.8927
B13  SS2     0.6532  0.2716  0.4158  0.1176   0.9811
B14  JA      0.5658  0.3187  0.5633  0.0299   0.9804
B15  FAI     0.4374  0.2655  0.6069  0.0454   0.9170
B16  MOU     0.0085  0.0502  5.9205  0.0001   0.0232
B17  MIC     0.5099  0.2155  0.4227  0.1096   0.8912
B18  RG      0.4558  0.2411  0.5290  0.0611   0.8669
B19  HD      0.3619  0.2353  0.6503  0.0328   0.8031
B20  YU1     0.5349  0.3700  0.6917  0.0025   0.9978
B21  YU2     0.5252  0.2837  0.5401  0.0479   0.9551
B22  FOS     0.3746  0.3234  0.8635  0.0013   0.9425
B23  DEN     0.3431  0.1336  0.3894  0.1270   0.5880
B24  CO1     0.9921  0.0803  0.0809  0.9970   0.9999
B25  CO2     0.9921  0.0802  0.0809  0.9966   0.9999
B26  DIS     0.5066  0.1563  0.3085  0.2244   0.7766
B27  GK      0.2961  0.2579  0.8710  0.0049   0.8076
B28  SS3     0.5213  0.2161  0.4146  0.1475   0.8748
B29  SS4     0.2465  0.2410  0.9778  0.0018   0.7527
B30  PHI     0.5175  0.2140  0.4136  0.1379   0.8727
B31  DI1     0.5473  0.3248  0.5934  0.0256   0.9868
B32  DI2     0.5856  0.3390  0.5788  0.0229   0.9946
B33  SOR     0.4051  0.3449  0.8516  0.0007   0.9737
B34  COH     0.5378  0.1788  0.3325  0.2461   0.8673
B35  PE1     0.5210  0.2308  0.4429  0.1157   0.9014
B36  PE2     0.5167  0.2191  0.4240  0.1214   0.8899
B37  MP      0.5164  0.2105  0.4076  0.1397   0.8708
B38  HL      0.2687  0.2102  0.7824  0.0270   0.7124
B39  CT1     0.8733  0.1354  0.1551  0.6022   0.9946
B40  CT2     0.1628  0.1525  0.9364  0.0093   0.4715
B41  CT3     0.7401  0.2208  0.2983  0.2807   0.9850
B42  CT4     0.7734  0.2160  0.2793  0.3197   0.9914
B43  CT5     0.5128  0.1400  0.2729  0.2857   0.7384
B44  AC      0.5316  0.2174  0.4090  0.1609   0.8766
Jaccard–Tanimoto (CJT), and Lance–Williams (LW) distances.
Looking at the ranks collected in Table 17, only in a few cases is the best rank (1.0) achieved by other distance measures: the Clark distance (CLA) for the Wines data set, Bhattacharyya (BHA) for the Perpot data set, the cosine distance (CD) for the Itaoils data set, and Jaccard–Tanimoto (JT) for the Diabetes data set.
The Contracted Jaccard–Tanimoto distance (CJT) is better than its parent Jaccard–Tanimoto distance (JT) in four data sets out of eight; in one case it is equal, and in three cases it is only slightly lower.
The four Mahalanobis-type distances derived from the LCM distance (MSA, MSG, MU, and ML) appear not to be useful in classification problems; on the other hand, the LCM distance was proposed for different purposes, i.e. to detect outliers and to analyze the applicability domain (AD) of a model.
The correlation and cosine distances appear rather weak for classification purposes on the studied data sets; analogous considerations can be made for the Canberra
Figure 16 Line plots of the symmetric binary coefficients calculated from the simulated data set (ordered similarity values, 0–1, over the 100 000 cases; curves for SM, RT, RG, HL, GK, AC, SS2, SS3, SS4, CT1, CT2, and CT5).
Figure 17 Line plots of the asymmetric binary coefficients calculated from the simulated data set (ordered similarity values, 0–1, over the 100 000 cases; curves for Kul, Sim, Gle, BB, Fai, JT, Ja, DK, BUB, Sor, RR, Di1, Di2, For, Fos, SS1, Mou, CT3, and CT4).
(CAN), Wave-Edge (WE), and Clark (CLA) distances, although the Clark distance provided the best result in one case.
A detailed example of how the choice of the distance function influences the geometry of the object space is given for the Wines data set (Figure 15). The comparison was performed by MDS. As commented earlier, the Clark distance (CLA) gives the best results (NER = 0.986), and this is visually confirmed by the good clustering of the objects belonging to the three different classes, which is not so well obtained in the case of the Euclidean (EUC, NER = 0.977), Jaccard–Tanimoto (JT, NER = 0.972), and Contracted Jaccard–Tanimoto (CJT, NER = 0.981) distances.
8 COMPARISON OF BINARY SIMILARITY COEFFICIENTS
An extended comparison among the 44 similarity measures listed in Table 10 was performed using a simulated data set.(10) The simulated data set of 100 000 cases was generated by randomly drawing quadruples of integer numbers (a, b, c, d) under the constraint a + b + c + d = 1024. For each case, the 44 similarity coefficients were calculated and organized into a matrix of 100 000 rows and 44 columns. Each case can be thought of as the comparison of a binary vector of length 1024 bits with a reference vector of the same length.
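A simulation of this kind can be sketched in a few lines. The sampling scheme below (three sorted uniform cut points over the 1024 bit positions) is one simple way to draw quadruples satisfying the constraint; the original work may have used a different scheme. The simple matching (SM) and Jaccard–Tanimoto (JT) formulas are shown as two examples of the 44 coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_bits = 100_000, 1024

# Draw quadruples (a, b, c, d) with a + b + c + d = 1024 by sorting
# three uniform cut points over the 1024 bit positions.
cuts = np.sort(rng.integers(0, n_bits + 1, size=(n_cases, 3)), axis=1)
a = cuts[:, 0]
b = cuts[:, 1] - cuts[:, 0]
c = cuts[:, 2] - cuts[:, 1]
d = n_bits - cuts[:, 2]
assert np.all(a + b + c + d == n_bits)

# Two of the 44 coefficients, as examples:
sm = (a + d) / n_bits                 # simple matching (SM), symmetric
jt = a / np.maximum(a + b + c, 1)     # Jaccard-Tanimoto (JT), asymmetric
                                      # (guard against the degenerate a=b=c=0 case)

# Stacking one column per coefficient gives the 100 000-row matrix
# described in the text (here only 2 of the 44 columns).
similarity_matrix = np.column_stack([sm, jt])
print(similarity_matrix.shape)
```

Broad sampling of the (a, b, c, d) space is what allows most coefficients to span nearly the whole [0, 1] similarity range, as reflected by the percentiles in Table 18.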
Figure 18 Line plots of the correlation-based binary coefficients calculated from the simulated data set (ordered similarity values, 0–1, over the 100 000 cases; curves for Co1, Co2, Coh, Yu1, Yu2, Pe1, Pe2, Phi, Mic, MP, dis, and Den).
Figure 19 Multidimensional scaling of the binary similarity coefficients (Dim 1 versus Dim 2; symmetric, asymmetric, intermediate, and correlation-based functions marked separately). The Co1, Co2, and Mou coefficients were excluded from the analysis as they are strong outliers (see text).
The 100 000 similarity values that had been generated for each coefficient were analyzed to calculate the following descriptive statistics: mean, standard deviation (std), coefficient of variation (cv), and 5 and 95 percentiles [perc(5) and perc(95)]. These values are listed in Table 18. The minimum and maximum values of all the coefficients are 0 and 1, respectively.
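These statistics can be computed column by column from the similarity matrix; a minimal sketch (the `describe` helper and the toy input are illustrative, not from the article):

```python
import numpy as np

def describe(values):
    """Mean, std, coefficient of variation, and 5th/95th percentiles
    for one column of similarity values."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)    # sample standard deviation
    cv = std / mean             # coefficient of variation
    p5, p95 = np.percentile(values, [5, 95])
    return {"mean": mean, "std": std, "cv": cv,
            "perc(5)": p5, "perc(95)": p95}

# Example on a toy column of similarity values:
stats = describe([0.1, 0.4, 0.5, 0.6, 0.9])
print(stats)
```

Applying `describe` to each of the 44 columns of the 100 000-row matrix reproduces a row of Table 18 per coefficient.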
Inspection of Table 18 suggests that most of the coefficients have a mean value around 0.5 and that they span the similarity range in a satisfactory way. There
are three very anomalous coefficients: CO1 (B24) and CO2 (B25) yield very high values and MOU (B16) yields very low values. These outlier coefficients were probably originally proposed to deal with short vectors, where the parameters b, c, and d may have less influence than a. Less extreme behavior is exhibited by CT1 (B39), CT4 (B42), CT3 (B41), SIM (B7), SS2 (B13), DI2 (B32), and KUL (B11) (which all have mean values >0.55) and CT2 (B40), SS4 (B29), HL (B38), and GK (B27) (which all have mean values <0.30). Turning to the standard deviations [and excluding CO1 (B24), CO2 (B25), and MOU (B16)], the coefficients showing the maximum variability are YU1 (B20), SOR (B33), DI2 (B32), DI1 (B31), FOS (B22), FOR (B6), GLE (B4), JA (B14), BB (B8), JT (B3), and DK (B9) (all with standard deviations >0.30), whereas the minimum variability is provided by DEN (B23), CT1 (B39), CT5 (B43), CT2 (B40), DIS (B26), and COH (B34) (all with standard deviations lower than 0.20).
The ordered sequences of similarity values (in ascending order) were plotted for each coefficient to explore the functional shape. In order to simplify the analysis and discussion that follows, the plots are presented in three different figures: symmetric functions (Figure 16), asymmetric functions (Figure 17), and correlation-based functions (Figure 18). Inspection of these figures shows that the shapes of the functions can be approximately categorized as logarithmic, exponential, sigmoidal, or quasi-linear in character.
MDS was performed on the 41 × 41 matrix of the pairwise Pearson correlation coefficients calculated from the simulated data. This analysis omits the Mountford and Cole coefficients [i.e. CO1 (B24), CO2 (B25), and MOU (B16)], as they are significant outliers (Figures 17 and 18). The final configuration of the binary similarity coefficients in a two-dimensional MDS plot is shown in Figure 19. At first glance, the similarity coefficients appear well clustered according to their symmetry properties, with the symmetric functions (green squares, at the bottom left), the asymmetric functions (blue triangles, on the right side), and the correlation-based functions (red circles, at the top left) well separated from each other. In this respect, it is interesting to note that BUB (B10) and FAI (B15), which are intermediate in character between symmetric and asymmetric functions, are appropriately located between the symmetric and asymmetric clusters. Many of the coefficients are very near to each other in the plot, indicating close similarity relationships, e.g. the group comprising SM (B1), RT (B2), SS2 (B13), and AC (B44), which have a rank correlation equal to one. In much the same way, the group comprising JT (B3), JA (B14), GLE (B4), SS1 (B12), FOR (B6), FOS (B22), and DK (B9) has rank correlations larger than 0.99. Some coefficients, however,
are quite isolated in the MDS plot. This is the case for the pairs CT1 (B39) and CT2 (B40), SIM (B7) and DI2 (B32), and CT3 (B41) and CT4 (B42); RR (B5) and CT5 (B43) also seem to be quite separated from the other coefficients.
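This MDS step can be reproduced in outline: compute pairwise Pearson correlations between the coefficient columns, convert them to dissimilarities, and embed in two dimensions. The sketch below assumes scikit-learn is available and uses 1 − r as the dissimilarity (one common choice; the exact transformation used in the article may differ), with a small random stand-in for the 100 000 × 41 similarity matrix:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
# Stand-in for the 100 000 x 41 matrix of similarity values
# (here 2000 cases and 6 hypothetical coefficient columns).
X = rng.random((2000, 6))

r = np.corrcoef(X, rowvar=False)   # 6 x 6 Pearson correlation matrix
dissim = 1.0 - r                   # correlation -> dissimilarity
np.fill_diagonal(dissim, 0.0)      # exact zeros on the diagonal

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim) # one 2-D point per coefficient
print(coords.shape)
```

Coefficients whose columns are perfectly correlated get zero mutual dissimilarity and therefore collapse onto (nearly) the same point in the 2-D configuration, which is why tightly related groups such as SM, RT, SS2, and AC plot on top of each other in Figure 19.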
ACKNOWLEDGMENTS
The authors warmly thank Michel Marie Deza of the École Normale Supérieure (Paris, France) for his help with the mathematical properties of distances and Luis Peinador Sarabia of the University of Burgos (Spain) for his suggestions.
ABBREVIATIONS AND ACRONYMS
SOM self-organizing maps
MDS multidimensional scaling
MST minimum spanning tree
k-NN k-nearest neighbor
DA discriminant analysis
AD applicability domain
LCM locally centered Mahalanobis
ATDM atemporal target diffusion model
CMD canonical measure of distance
NER nonerror rate
AR average rank
RELATED ARTICLES
Chemometrics (Volume 11)
Chemometrics • Clustering and Classification of Analytical Data • Soft Modeling of Analytical Data

Pharmaceuticals and Drugs (Volume 8)
Quantitative Structure-Activity Relationships and Computational Methods in Drug Discovery
REFERENCES
1. V. Batagelj, M. Bren, ‘Comparing Resemblance Measures’, J. Classif., 12, 73–90 (1995).
2. M.M. Deza, E. Deza, Encyclopedia of Distances, Springer, Dordrecht, 2009.
3. P. Legendre, L. Legendre, Numerical Ecology, 2nd edition, Elsevier, Amsterdam, 1998.
4. P.H.A. Sneath, R.R. Sokal, Numerical Taxonomy, Freeman, San Francisco, CA, 1973.
5. S.-H. Cha, ‘Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions’, Int. J. Math. Models Methods Appl. Sci., 1, 300–307 (2007).
6. C.M. Cuadras, ‘Distancias Estadísticas (in Spanish)’, Estadística Española, 30, 295–378 (1989).
7. A.G. Maldonado, J.P. Doucet, M. Petitjean, B.T. Fan, ‘Molecular Similarity and Diversity in Chemoinformatics: From Theory to Applications’, Mol. Diversity, 10, 39–79 (2006).
8. Y.C. Martin, J.L. Kofron, L.M. Traphagen, ‘Do Structurally Similar Molecules Have Similar Biological Activity?’, J. Med. Chem., 45, 4350–4358 (2002).
9. P. Willett, Similarity and Clustering in Chemical Information Systems, Research Studies Press, Letchworth, 1987.
10. R. Todeschini, V. Consonni, H. Xiang, J. Holliday, M. Buscema, P. Willett, ‘Similarity Coefficients for Binary Chemoinformatics Data: Overview and Extended Comparison Using Simulated and Real Datasets’, J. Chem. Inf. Model., 52, 2884–2901 (2012).
11. W.J. Krzanowski, Principles of Multivariate Analysis, Oxford Science Publications, Oxford, 1988.
12. R. Todeschini, D. Ballabio, V. Consonni, F. Sahigara, P. Filzmoser, ‘Locally-Centred Mahalanobis Distance: A New Distance Measure with Salient Features Towards Outlier Detection’, Anal. Chim. Acta, 787, 1–9 (2013).
13. D.E. Rumelhart, P. Smolensky, J.L. McClelland, G.E. Hinton, Schemata and Sequential Thought Processes in PDP Models, MIT Press, Cambridge, 1986.
14. J.C. Gower, ‘Generalized Procrustes Analysis’, Psychometrika, 40, 31–51 (1975).
15. R. Todeschini, V. Consonni, A. Manganaro, D. Ballabio, A. Mauri, ‘Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between Sets of Data. Part 1. Theory and Simple Chemometric Applications’, Anal. Chim. Acta, 648, 45–51 (2009).
16. R.A. Fisher, ‘The Use of Multiple Measurements in Taxonomic Problems’, Ann. Eugen., 7, 179–188 (1936).
17. M. Forina, C. Armanino, S. Lanteri, E. Tiscornia, ‘Classification of Olive Oils from their Fatty Acid Composition’, in Food Research and Data Analysis, Applied Science Publishers, London, 189–214, 1983.
18. M. Forina, Artificial dataset by M. Forina (University of Genoa) (2000).
19. Y. Miyashita, Y. Takahashi, C. Takayama, T. Ohkubo, K. Fumatsu, S. Sasaki, ‘Computer-Assisted Structure/Taste Studies on Sulfamates by Pattern Recognition Methods’, Anal. Chim. Acta, 184, 143–149 (1986).
20. P.P. Mager, Design Statistics in Pharmacochemistry, Research Studies Press, Letchworth, 1991.
21. M. Forina, C. Armanino, M. Castino, M. Ubigli, ‘Multivariate Data Analysis as Discriminating Method of the Origin of Wines’, Vitis, 25, 189–201 (1986).
22. K.A. Baggerly, J.S. Morris, S.R. Edmonson, K.R. Coombes, ‘Signal in Noise: Evaluating Reported Reproducibility of Serum Proteomic Tests for Ovarian Cancer’, J. Natl. Cancer Inst., 97, 307–309 (2005).
23. J.W. Smith, J.E. Everhart, W.C. Dickson, W.C. Knowler, R.S. Johannes, ‘Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus’, Proc. Symp. Comput. Appl. Med. Care, 9, 261–265 (1988).