
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 5, MAY 2014 1185

Mining Statistically Significant Co-location and Segregation Patterns

Sajib Barua and Jörg Sander

Abstract—In spatial domains, interaction between features gives rise to two types of interaction patterns: co-location and segregation patterns. Existing approaches to finding co-location patterns have several shortcomings: (1) They depend on user-specified thresholds for prevalence measures; (2) they do not take spatial auto-correlation into account; and (3) they may report co-locations even if the features are randomly distributed. Segregation patterns have yet to receive much attention. In this paper, we propose a method for finding both types of interaction patterns, based on a statistical test. We introduce a new definition of co-location and segregation patterns, we propose a model for the null distribution of features so that spatial auto-correlation is taken into account, and we design an algorithm for finding both co-location and segregation patterns. We also develop two strategies to reduce the computational cost compared to a naïve approach based on simulations of the data distribution, and we propose an approach to reduce the runtime of our algorithm even further by using an approximation of the neighborhood of features. We evaluate our method empirically using synthetic and real data sets and demonstrate its advantages over a state-of-the-art co-location mining algorithm.

Index Terms—Spatial data, co-location, segregation, spatial interaction, statistically significant pattern

1 INTRODUCTION

In spatial domains, interaction between features gives rise to two types of interaction patterns. A positive interaction (aggregation) brings a subset of features close to each other, whereas a negative interaction (inhibition) results in subsets of features segregating from each other. Co-location patterns, intended to represent positive interactions, have been defined as subsets of Boolean spatial features whose instances are often seen to be located in close spatial proximity [1]. Segregation patterns, representing negative interactions, can be defined as subsets of Boolean spatial features whose instances are infrequently seen to be located in close spatial proximity (i.e., whose co-locations are “unusually” rare). Interaction pattern mining can lead to important domain related insights in areas such as ecology, biology, epidemiology, earth science, and transportation. For instance, the Nile crocodile and the Egyptian plover (a bird that has a symbiotic relationship with the Nile crocodile) are often seen together, giving rise to a co-location pattern {Nile crocodile, Egyptian plover}. In urban areas, we also see co-location patterns such as {shopping mall, restaurant}. Examples of segregation patterns are common in ecology, where they arise from processes such as the competition between plants or the territorial behaviour of animals. For instance, in a forest, some tree

• The authors are with the Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada. E-mail: {sajib, jsander}@ualberta.ca.

Manuscript received 19 June 2012; revised 14 May 2013; accepted 23 May 2013. Date of publication 3 June 2013; date of current version 7 May 2014. Recommended for acceptance by C. Böhm. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier 10.1109/TKDE.2013.88

species are less likely found closer than a particular distance from each other due to their competition for resources.

Existing co-location mining algorithms are motivated by association rule mining (ARM) [2]. Most of the current algorithms [1], [3]–[6] adopt an approach similar to the Apriori algorithm proposed for ARM in [2], by introducing some notion of transaction over the space, and a suitable prevalence measure.

The current state-of-the-art approach is the event centric model [3], where a transaction is generated from a proximity neighborhood of feature instances. Feature instances present in such a neighborhood are neighbors of each other, forming a clique. The proposed prevalence measure is called the Participation Index (PI).

The works in [7] and [8] look for “complex patterns” that occur due to a mixed type of interaction (a combination of positive and negative), using a proposed prevalence measure called the Maximum Participation Index (maxPI).

In existing co-location mining algorithms [1], [3], [5], [6], a co-location pattern is reported as prevalent if its PI-value is greater than a user-specified threshold. The complex pattern mining algorithm proposed in [8] also reports a pattern as prevalent if its maxPI-value is greater than a user-defined threshold. Finding patterns defined in this way is reasonably efficient, since the PI is anti-monotonic and the maxPI is weakly anti-monotonic. However, such an approach may not be meaningful from an application point of view.

We argue that the prevalence measure threshold should not be global and pre-defined, but should be decided based on the distribution and the total number of instances of each individual feature involved in an interaction. In spatial data sets, the value of a prevalence measure like the PI, whether high or low, is not necessarily by itself indicative

1041-4347 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


of a positive or negative interaction between features. It is not uncommon to see subsets of features with a very high prevalence measure value due to randomness, presence of spatial autocorrelation, and abundance of feature instances alone, i.e., without true interaction between the involved features. It is also possible that the prevalence measure value of a group of positively interacting features is relatively low if one of the participating features has a low participation ratio. There are similar issues with negative interactions. Not every negative interaction has a low participation index in absolute terms, and not every pattern with a low prevalence measure value necessarily represents a segregation pattern; e.g., non-interacting features with few instances may also have a very low PI-value. Clearly, in such cases, the existing co-location mining algorithms will report meaningless “prevalent” patterns or miss meaningful patterns; they may even report a subset of features as a prevalent co-location (i.e., aggregation) pattern when it is truly a segregation pattern.

Consider the following (sketches of) example scenarios (including a real data set), in order to see the need for a different type of approach that takes the distribution and the total number of feature instances into account. Each of these examples is illustrated with a data set and figure in Appendix A, which is available in the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.88.

Example 1. Two features ◦ and � (e.g., animal species) with a true spatial association, so that ◦s are likely seen close to �s. There are only a few instances of ◦, but � is abundant, so that all ◦s are found close to some �s and form co-locations. Many �s, however, will not have ◦s in their neighborhood. If most of the �s are without ◦s, the participation index of {◦,�} will be rather low (e.g., PI=0.2, see Appendix A, available online), and a co-location algorithm using typical thresholds will not report the pattern, although ◦ and � are spatially related.

Example 2. Two features ◦ and � that are independent of each other, but abundant in the study area and randomly distributed. We might see a high enough PI-value for {◦,�} (e.g., PI=0.42, see Appendix A, available online), so that {◦,�} would be reported as a prevalent co-location pattern, using a typical PI threshold used in practice.

Example 3. Two auto-correlated features ◦ and � (i.e., instances of a feature have a tendency to form clusters) that are independent of each other. If a cluster of ◦ and a cluster of � happen to overlap by chance, a good number of instances of {◦,�} will be generated, resulting in a high PI-value (e.g., PI=0.43, see Appendix A, available online). Using a typical PI threshold, {◦,�} would be reported, even though there is no real spatial association between them.

Example 4. Two features ◦ and �, where ◦ is spatially auto-correlated and � is randomly distributed. If most of the clusters of ◦ have at least one instance of � nearby, and if many �s are found close to some clusters of ◦, the participation index of {◦,�} could be higher than a given threshold (e.g., PI=0.5, see Appendix A, available online). Again, {◦,�} would be reported, even though there is no spatial association.

Example 5. An inhibition process, where feature ◦ and feature � exhibit a spatial inhibition at a distance Rd, which means that in most cases, � will not appear within an Rd-neighborhood of ◦ and vice versa. Even so, the PI-value of {◦,�} can be quite high. Appendix A, available online, shows a scenario where the PI-value is equal to 0.55, and {◦,�} would be wrongly reported as a co-location pattern when in fact it is a segregation pattern (the expected PI-value in the given scenario, assuming there is no interaction between ◦ and �, is 0.71).

Example 6. Two features ◦ and � that are distributed independently of each other, but both of which have a low number of instances. In such a case, the PI-value can be very low (e.g., PI=0.2, see Appendix A, available online), which, however, would be expected, and {◦,�} should not be reported even as a segregation pattern.

Example 7 (A real data set). The positions of two types (“on” and “off”) of retinal neurons, known as cholinergic amacrine cells, have been found to be spatially independent [9]. Yet, when analysed naïvely, the pattern {on, off} would be wrongly reported as a prevalent co-location, since its PI-values for different distances are very high (0.5 up to 1.0). See Appendix A, available online, for a more detailed description.

To overcome the limitations of the existing approaches when using global prevalence thresholds, we propose to define the notion of a spatial interaction (co-location or segregation) based on a statistical test, develop appropriate null models for such tests, and propose computational methods to find statistically significant co-location and segregation patterns.

The contributions of the paper are the following:

1) We propose a mining algorithm that reports groups of features as co-location or as segregation patterns if the participating features have a positive or negative interaction, respectively, among themselves. To determine the type of interaction, we test the statistical significance of the PI-value of a pattern instead of comparing its frequency against a simple threshold. To this end, we develop appropriate null models that also take the possible spatial auto-correlation of individual features into account. The estimation of the null distribution is obtained through randomization tests, which is common in spatial statistics, since no closed form expressions that model the joint distribution of more than two features exist in the literature.1

2) We introduce two strategies to improve the runtime of our proposed method. Due to the large number of simulations conducted in randomization tests, the statistical significance tests can become computationally expensive. We improve the runtime by introducing a pruning strategy to identify candidate

1. Analytical models exist only for pairs of features, and we will use them in some cases to validate our approach on pairs of features.


patterns for which the prevalence measure computation is unnecessary. Taking spatial auto-correlation of features into account, we also show that in a simulation, we do not need to generate all instances of an auto-correlated feature and can reduce the runtime of the data generation phase.

3) To improve the runtime further, we propose a different test statistic, which can be viewed as an approximation of the PI-value, using only a subset of the total instances of a pattern. We propose a grid based sampling approach to identify these pattern instances efficiently, and we compare the effectiveness and the trade-off between runtime and accuracy.

4) We demonstrate the effectiveness of our approach to finding both co-location and segregation patterns using a variety of real and synthetic data sets.

The rest of the paper is organized as follows. In Section 2, we define the notion of a statistically significant interaction pattern and propose our method SSCSP for mining such patterns. Section 3 shows how to improve the runtime of the baseline algorithm. We conduct experiments to validate our model in Section 4. Section 5 describes related work. Section 6 concludes the paper.

2 STATISTICALLY SIGNIFICANT PATTERNS

The main idea of our approach is to test the null hypothesis H0 of no spatial dependency against an alternative hypothesis H1 of spatial dependency (positive or negative) between spatial features. More precisely, we estimate the probability (p-value) of obtaining a PI-value at least as extreme as the observed PI-value if the features were spatially independent of each other. If the p-value is low, the observed PI-value is rare under the null model, indicating a co-location or segregation among the features.

We model the distribution of each individual feature and estimate an empirical distribution of the PI-value under the null hypothesis, using randomization tests. In our null model, we also take spatial auto-correlation into account, which has not been considered by the current literature on co-location mining.

Note that such an approach works with any type of prevalence measure that captures spatial interaction.

2.1 Basic Concepts

Here we briefly discuss some basic concepts. A spatial point process is a “stochastic mechanism which generates a countable set of events xi in the plane” [10]. A point pattern is a realization of such a process that consists of a collection of events or objects occurring in a study region. An event set is made up of locations defined by some set of coordinates and “marks” that are attached to the events. Marks are often categorical, such as class, sex, species, or disease, but can also be continuous, for instance, temporal information. The simplest way of treating a spatial point pattern is assuming that the pattern is random and is a realization of complete spatial randomness (CSR), which is a homogeneous Poisson process. CSR has two properties:

1) Equal probability: Any event has equal probabilityof being in any position of the study area.

2) Independence: The position of an event is indepen-dent of the position of any other event.

The second property tells us that no interaction exists between events in the given point pattern [10]. In many real scenarios, spatial data sets do not follow CSR due to interactions between events. This can be due to some events causing the occurrence of other events at nearby locations. The departure from CSR results in either (i) clustering (aggregation) or (ii) regularity (segregation) for an event set [10].

Estimation of the theoretical distribution of a spatial stochastic point process is difficult. In most cases a Monte Carlo simulation is the only way to estimate the mean and the distribution of a test statistic. Selection of the test statistic depends on the type of mark. Classical techniques to determine properties of the distribution of single features and pairs of features (e.g., clustering tendency) use distance based methods such as pair-wise distance, nearest neighbor distance, and empty space distance to measure inter-point dependency [11]. Second order analyses such as Ripley’s K-function [12] or the Pair Correlation Function (PCF) [11] are other alternatives and also popular to detect clustering of events. The K-function is defined by Ripley for stationary point processes. For event density λ (number of events per unit area), λK(d) is the expected number of other events within a distance d of a randomly chosen event of the process [12]. Formally, K(d) = (1/λ) E[n(X ∩ b(u, d) \ {u}) | u ∈ X], where u is a point of X and b(u, d) is a disc of radius d centered on u. In a homogeneous Poisson process, the expected number of points falling in b(u, d) is λπd², thus Kpois(d) is equal to πd². A K-function value greater than πd² for a point pattern suggests clustering, while a value less than πd² suggests regularity. The PCF is another way of interpreting the K-function, formally defined as g(d) = K′(d)/(2πd) [11], where K′(d) is the derivative of K. These measures are designed for at most two types of events, i.e., bivariate point processes. In our work, we propose a method that can find co-location as well as segregation patterns for more than two types of events.
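To make the K-function concrete, the following minimal Python sketch (ours, not from the paper) estimates K(d) by brute force, ignoring the edge corrections that spatial statistics packages normally apply; under CSR the estimate should therefore come out close to, but slightly below, πd².

```python
import numpy as np

def k_function(points, d, area):
    """Naive estimate of Ripley's K at distance d (edge effects ignored).
    lambda*K(d) is the expected number of other events within distance d
    of a typical event, so K(d) = (mean pair count per point) / lambda."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    lam = n / area                              # event density (events per unit area)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    pairs_within = (dist <= d).sum() - n        # exclude the n self-pairs (distance 0)
    return (pairs_within / n) / lam

# Under CSR in the unit square, K(0.05) should be close to pi*0.05^2 ~ 0.00785
# (slightly below, because we do not correct for the window boundary).
rng = np.random.default_rng(0)
csr = rng.uniform(0.0, 1.0, size=(2000, 2))
print(k_function(csr, 0.05, area=1.0))
```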

2.2 Null Model Design

Our null hypothesis must model the assumption that different features are distributed in the space independently of each other. A spatial feature could be either spatially auto-correlated or not. A feature which is spatially auto-correlated in the given data is modeled as a cluster process [11]. To determine if a feature is spatially auto-correlated or not, we compute the PCF value g(d). Values of g(d) > 1 suggest clustering or attraction at distance d. A feature has a regular distribution (inhibition) if g(d) < 1, and a feature shows CSR if g(d) = 1. Hence, for g(d) ≤ 1, the feature is considered to be not spatially auto-correlated.

To model an aggregated point pattern, Neyman and Scott [13] introduce the Poisson cluster process using the following three postulates:

1) First, parent events are generated from a Poisson process. The intensity of the Poisson process can be either a constant (homogeneous Poisson process) or a function (inhomogeneous Poisson process).


2) Each parent gives rise to a finite set of offspring events according to some probability distribution.

3) The offspring events are independently and identically distributed in a predefined neighborhood of their parent event.

The offspring sets represent the final cluster process. In such a model, auto-correlation can be measured in terms of the intensity of the parent process and the intensity of the offspring process. In Matérn’s cluster process [11], another such model, cluster centers are also generated from a Poisson process with intensity κ. Then each cluster center c is replaced by a random number of offspring points, where the number of points is drawn from a Poisson distribution with mean μ; the points themselves are uniformly and independently distributed inside a disc of radius r centered at c.
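The Matérn cluster process just described is straightforward to simulate. The sketch below is illustrative only (parameter names kappa, mu, and r follow the text; the window is a rectangle and offspring near the boundary are allowed to fall outside it, i.e., no edge handling):

```python
import numpy as np

def matern_cluster(kappa, mu, r, width=1.0, height=1.0, rng=None):
    """Sample a Matern cluster process: cluster centers from a Poisson
    process with intensity kappa; each center gets a Poisson(mu) number
    of offspring, placed uniformly in a disc of radius r around it."""
    if rng is None:
        rng = np.random.default_rng()
    area = width * height
    n_centers = rng.poisson(kappa * area)
    centers = rng.uniform([0.0, 0.0], [width, height], size=(n_centers, 2))
    points = []
    for c in centers:
        n_off = rng.poisson(mu)
        # sqrt of a uniform radius keeps the density uniform over the disc
        radii = r * np.sqrt(rng.uniform(size=n_off))
        angles = rng.uniform(0.0, 2.0 * np.pi, size=n_off)
        points.append(c + np.column_stack([radii * np.cos(angles),
                                           radii * np.sin(angles)]))
    return np.vstack(points) if points else np.empty((0, 2))

# Roughly kappa*mu = 200 clustered points in the unit square, on average.
pts = matern_cluster(kappa=10, mu=20, r=0.03, rng=np.random.default_rng(1))
```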

A spatial distribution of a feature can be described in terms of summary statistics, i.e., a set of parameters. If a feature is detected to be spatially auto-correlated, the parameters can be estimated using a model fitting technique. The method of Minimum Contrast [14] fits a point process model to a given point data set. This technique first computes a summary statistic from the point data. A theoretically expected value of the model to fit is either derived or estimated from simulation. Then the model is fitted to the given data set by finding optimal parameter values of the model that give the closest match between the theoretical curve and the empirical curve. For the Matérn cluster process [11], the summary statistics are κ, μ, and r.

The data sets we simulate according to our null hypothesis maintain the following properties: (1) the same number of instances of each feature as in the observed data, and (2) a similar spatial distribution of each feature, by maintaining the same distributional properties (summary statistics) estimated from the observed data. For instance, if a feature is spatially auto-correlated in the given data set, the feature will also be clustered to the same degree in the data sets generated under the null hypothesis, with the clusters randomly distributed over the study area. For an auto-correlated feature, we estimate the parameters of a Matérn cluster process. If a feature is randomly distributed, we estimate its Poisson intensity by fitting a Poisson point process to the given data.

2.3 Definition of Co-location and Segregation

We state two definitions from the literature [1], [3], since we use the PI as a spatial interaction measure:

Definition 1. An interaction pattern is a subset of k different features f1, f2, . . . , fk having a spatial interaction within a given distance Rd. Rd is called the interaction distance. A group of features is said to have a spatial interaction if the features of each possible pair are neighbors of each other. Two feature instances are neighbors of each other if their Euclidean distance is not more than the interaction distance Rd. Let C = {f1, f2, . . . , fk} be an interaction pattern. In an instance of C, one instance from each of the k features is present, and all these feature instances are neighbors of each other.

Definition 2. The Participation Ratio of feature fi in pattern C, pr(C, fi), is the fraction of instances of fi participating in any instance of C. Formally, pr(C, fi) = |πfi(all instances of C)| / |instances of fi|. Here π is the relational projection with duplicate elimination.

For instance, let C = {P, Q, R} be an interaction pattern, and let P, Q, and R have nP, nQ, and nR instances, respectively. If nP^C, nQ^C, and nR^C distinct instances of P, Q, and R, respectively, participate in pattern C, the participation ratios of P, Q, and R are nP^C/nP, nQ^C/nQ, and nR^C/nR, respectively.

Definition 3. The Participation Index (PI) of an interaction pattern C is defined as PI(C) = min_k {pr(C, fk)}.

For example, let C = {P, Q, R} be an interaction pattern where the participation ratios of P, Q, and R are 2/4, 2/7, and 1/8, respectively. The PI-value of C is 1/8.
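Definitions 2 and 3 can be illustrated with a small brute-force computation. The sketch below is ours, not the paper's algorithm (which identifies pattern instances far more efficiently via proximity neighborhoods): it enumerates every candidate tuple with one instance per feature, keeps those forming a clique under the interaction distance, and takes the minimum participation ratio.

```python
from itertools import product

def participation_index(instances, pattern, r_d):
    """PI of `pattern` (a tuple of feature names) for `instances`, a dict
    mapping feature -> list of (x, y) points. A pattern instance is a
    clique: one point per feature, all pairs within distance r_d.
    Brute force: exponential in the pattern size, illustrative only."""
    def close(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= r_d ** 2

    feats = list(pattern)
    participants = {f: set() for f in feats}   # distinct participating instances
    for combo in product(*(list(enumerate(instances[f])) for f in feats)):
        idxs, pts = zip(*combo)
        if all(close(pts[i], pts[j])
               for i in range(len(pts)) for j in range(i + 1, len(pts))):
            for f, idx in zip(feats, idxs):
                participants[f].add(idx)
    # PI = minimum participation ratio over the pattern's features
    return min(len(participants[f]) / len(instances[f]) for f in feats)

data = {"P": [(0, 0), (5, 5)], "Q": [(0, 1), (9, 9)]}
print(participation_index(data, ("P", "Q"), r_d=2.0))  # only (0,0)-(0,1) is a clique -> 0.5
```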

Lemma 1. The participation ratio and the participation index are monotonically non-increasing with increasing pattern size; that is, if C′ ⊂ C and f ∈ C′, then pr(C′, f) ≥ pr(C, f) and PI(C′) ≥ PI(C) [3].

Using the PI as a measure of spatial interaction, we can define a statistically significant interaction pattern C as follows:

Definition 4. An interaction pattern C = {f1, f2, . . . , fk} is a statistically significant co-location pattern at level α if the probability (p-value) of seeing, in a data set conforming to our null hypothesis, a PI-value of C larger than or equal to the observed PI-value is not greater than α.

Definition 5. An interaction pattern C = {f1, f2, . . . , fk} is a statistically significant segregation pattern at level α if the probability (p-value) of seeing, in a data set conforming to our null hypothesis, a PI-value of C smaller than or equal to the observed PI-value is not greater than α.

2.3.1 Statistical Significance Test

Let PIobs(C) denote the participation index of C in the observed data, and let PI0(C) denote the participation index of C in a data set generated under our null hypothesis. Then we estimate, using the distribution of PI-values under the null model, two probabilities: ppos = Pr(PI0(C) ≥ PIobs(C)), the probability that PI0(C) is at least PIobs(C), and pneg = Pr(PI0(C) ≤ PIobs(C)), the probability that PI0(C) is at most PIobs(C). If ppos ≤ α or pneg ≤ α, the null hypothesis is rejected and the PIobs(C)-value is significant at level α.

α is the probability of committing a type I error, that is, rejecting the null hypothesis when it is true, i.e., the probability of accepting a spurious co-location or segregation pattern. If a typical value of α = 0.05 is used, there is a 5% chance that a spurious co-location or segregation is reported.

To compute ppos and pneg, we perform randomization tests, generating a large number of simulated data sets that conform to the null hypothesis. Then we compute the PI-value of a pattern C, PI0(C), in each simulation run and compute ppos and pneg, respectively, as:

ppos = (R≥PIobs + 1) / (R + 1),    (1)

pneg = (R≤PIobs + 1) / (R + 1).    (2)

Here R≥PIobs of eq. (1) represents the number of simulations where the computed PI0(C) is not smaller than the PIobs-value. R≤PIobs of eq. (2) is the number of simulations where


Fig. 1. (a) Instances of all clusters. (b) Generated instances.

the computed PI0(C) is not greater than the PIobs-value. R represents the total number of simulations. In both the numerator and the denominator, one is added to account for the observed data.

To get a good critical region for the test statistic, Besag and Diggle [15] suggest choosing the number of simulations R such that α(R + 1) = 5. Accordingly, 499 simulations are required for α = 0.01, and 99 simulations for α = 0.05.
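Once the simulated PI-values are available, the p-values of eqs. (1) and (2) are a few lines of code. A minimal sketch (function and variable names are illustrative; the simulated PI-values here are random stand-ins, not the output of an actual null-model simulation):

```python
import numpy as np

def randomization_p_values(pi_obs, pi_null):
    """Eqs. (1)-(2): ppos and pneg from R simulated PI-values under H0.
    The +1 in numerator and denominator counts the observed data set."""
    pi_null = np.asarray(pi_null, dtype=float)
    R = len(pi_null)
    p_pos = ((pi_null >= pi_obs).sum() + 1) / (R + 1)   # eq. (1)
    p_neg = ((pi_null <= pi_obs).sum() + 1) / (R + 1)   # eq. (2)
    return p_pos, p_neg

# 99 simulations suffice for alpha = 0.05, following alpha*(R + 1) = 5.
rng = np.random.default_rng(0)
pi_null = rng.uniform(0.3, 0.6, size=99)   # stand-in simulated PI-values
p_pos, p_neg = randomization_p_values(0.62, pi_null)
print(p_pos)  # 0.01: no simulated PI reaches 0.62 -> significant co-location
```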

3 ALGORITHM

3.1 SSCSP: A Mining Algorithm Based on All Instances of an Interaction Pattern

For our approach, we have to determine the PI-value of each interaction pattern C in each simulation run of the randomization tests. This requires identifying all instances of C, which naively amounts to checking the neighborhoods of each participating feature in C. In the following, we describe both the data generation step and the p-value computation, including strategies for reducing the overall computational cost of this approach.

Data generation for the simulation runs: In a simulation, we generate instances of each feature. For an auto-correlated feature, we only generate instances of those clusters which are close enough to other features (auto-correlated or not) to be potentially involved in interactions. We avoid generating instances in a cluster cf∗ (radius Rf∗) of a feature f∗ if the center of cf∗ is farther away than Rf∗ + Rd from every instance of the other features. For an auto-correlated feature fi (fi ≠ f∗), we can determine this without looking at the instances of fi by checking only that the center of cf∗ is farther away than Rf∗ + Rfi + Rd from the centers of all clusters of fi.

Fig. 1 shows an example with two auto-correlated features. Fig. 1(b) shows the subset of instances generated to compute the same PI-value that would be computed from all instances as in Fig. 1(a).

p-value computation: First, we compute the PI-value, PIobs(C), of each possible interaction pattern C in the observed data. To calculate the p-values ppos and pneg, we maintain two counters for the PI-value of C: R≥PIobs and R≤PIobs. In a single simulation run Ri, R≥PIobs is incremented by one if PI0^Ri(C) ≥ PIobs(C), where PI0^Ri(C) is the PI-value of C in Ri. The other counter, R≤PIobs, is incremented if PI0^Ri(C) ≤ PIobs(C). An interaction pattern C will be reported as a statistically significant (at level α) co-location or segregation pattern if ppos ≤ α or pneg ≤ α, respectively, after all simulations.

We can reduce the total number of PI-value computations in a simulation run using the following property. The PI-value of a pattern C in a simulation run Ri, PI0^Ri(C), must be smaller than PIobs(C) if a subset C′ ⊂ C exists for which PI0^Ri(C′) < PIobs(C).

Lemma 2. For an interaction pattern C and a simulation Ri, if there is a subset C′ ⊂ C such that PI0^Ri(C′) < PIobs(C), then PI0^Ri(C) < PIobs(C).

Proof. Assume PI0^Ri(C′) < PIobs(C). According to Lemma 1, PI0^Ri(C′) ≥ PI0^Ri(C); thus it follows that PI0^Ri(C) < PIobs(C).

We can apply Lemma 2 to prune the actual computation of PI0^Ri(C) and just increment the counter R≤PIobs whenever we know of a subset C′ ⊂ C for which PI0^Ri(C′) < PIobs(C). To do this efficiently, the PI-values of C's subsets that we want to check have to be readily available. If we check the PI-values of patterns in order of increasing pattern size, we could, in principle, store the PI-values of shorter patterns so that they are available when checking patterns of larger sizes. However, this approach could require a large space overhead, and actually checking too many subsets may overall not reduce the computational cost. Another issue with such an approach would be that the PI-values of some subsets would not be computed, because of the same pruning strategy. Therefore, we propose to use only the subsets of size 2 for checking, since their PI-values are all computed initially. Although storing the (n choose 2) PI-values requires some space, we can reuse the same space for each simulation run during the randomization tests. While checking the subsets of size 2, if one is found for which the lemma applies, we stop checking the remaining subsets of size 2. If, after checking all subsets of size 2, the lemma could not be applied, we compute PI0^Ri(C). Algorithms 1 and 2 in Appendix B, available online, show the pseudo-code for the complete procedure.
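A minimal sketch of this size-2 pruning within one simulation run, assuming the pair PI-values of the run have already been computed and stored; the names `pair_pi_sim` and `compute_pi_sim` are illustrative, not from the paper:

```python
from itertools import combinations

def pi_counters_for_run(C, pi_obs, pair_pi_sim, compute_pi_sim):
    """Decide how pattern C contributes to the counters in one run,
    pruning the expensive PI computation via Lemma 2 where possible.

    C              -- tuple of feature types, e.g. ('A', 'B', 'C')
    pi_obs         -- observed PI-value of C
    pair_pi_sim    -- dict: frozenset of a size-2 subset -> its PI-value
                      in this simulation run (all computed up front)
    compute_pi_sim -- hypothetical callback computing PI of C in this run
    Returns (increment R>=PIobs, increment R<=PIobs) as booleans.
    """
    # Lemma 2: if any 2-subset C' has PI_0(C') < PI_obs(C), then
    # PI_0(C) < PI_obs(C), so increment R<= without computing PI_0(C).
    for pair in combinations(C, 2):
        if pair_pi_sim[frozenset(pair)] < pi_obs:
            return (False, True)
    pi0 = compute_pi_sim(C)
    return (pi0 >= pi_obs, pi0 <= pi_obs)
```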

3.2 A Sampling Based Approach

In SSCSP, the PI-value of a pattern C is computed, as a test statistic, using all the instances of C. Here we propose a prevalence measure PI∗ as a test statistic that is computed efficiently from a subset of the instances of C. The new prevalence measure PI∗ can be seen as an approximation of the original PI-value, which leads, in most cases, to the same statistical inferences.

If a true co-location or segregation relationship exists among a group of features C, this should be reflected even in a subset of the total instances of C, and a statistical test should be able to capture this dependency from such a subset. Instead of looking at the full neighborhood So of a feature instance I, we consider only a sub-region S of So. By considering a larger sub-region, which covers more area of So, the computed PI∗-value will be more similar to the original PI-value.

In our randomization tests, we obtain the distribution of the PI∗-values under the null model. We argue that if two features A and B are truly associated or segregated,



Fig. 2. Dashed bordered region is a sampled neighborhood for a feature instance present anywhere in cell X. (a) l = Rd. (b) l = Rd/2. (c) l = Rd/3.

the observed PI∗-value in a given data set should also be statistically significant when compared to the distribution of the PI∗-values under the null model. If A and B are independent of each other, this should also be reflected by the observed PI∗-value being close to the expected PI∗-value under the null model.

Our experimental results show that this approach works extremely well, in general.

A neighborhood sampling approach using a grid based space partitioning: To select sub-regions of actual neighborhoods, a grid is placed over the whole study area. Each grid cell is a square with a diagonal length l equal to Rd/w, where Rd is the interaction neighborhood radius and w ≥ 1 is an integer.

If l = Rd, the sampled neighborhood of a feature instance I consists of the single cell X that contains I. If l = Rd/2, the sampled neighborhood consists of the cell X that contains I, plus the 8 cells surrounding X. In general, if l = Rd/w, the sampled neighborhood of I consists of (2w − 1)² cells including X. We denote the corresponding neighborhood by S(2w−1)². Fig. 2 illustrates the sampled neighborhoods for w equal to 1, 2, and 3, i.e., S1, S9, and S25. Note that any other feature instance located in a sampled neighborhood of I is necessarily co-located with I by construction.
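The construction of a sampled neighborhood can be sketched as follows; this is an illustrative reading of the grid scheme (clipping of cells to the study area boundary is omitted):

```python
import math

def sampled_neighborhood_cells(x, y, Rd, w):
    """Grid cells forming the sampled neighborhood S_(2w-1)^2 of a point.

    Each cell is a square with diagonal Rd/w, i.e. side Rd/(w*sqrt(2)).
    The sampled neighborhood is the (2w-1) x (2w-1) block of cells
    centered on the cell that contains the point (x, y).
    """
    side = Rd / (w * math.sqrt(2))            # cell side length
    ci, cj = int(x // side), int(y // side)   # cell containing the point
    return [(i, j)
            for i in range(ci - (w - 1), ci + w)
            for j in range(cj - (w - 1), cj + w)]
```

For w = 1, 2, and 3 this yields the 1, 9, and 25 cells of S1, S9, and S25 in Fig. 2.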

The actual neighborhood of a feature instance fi is a circular region So centered at fi whose area is πRd². The area of a sampled neighborhood S1 (Fig. 2(a)) is Rd²/2. Hence S1 covers 1/(2π) of So for any feature instance I appearing in S1. When w = 2, the area of the sampled neighborhood S9 (Fig. 2(b)) is 9Rd²/8, and it covers 9/(8π) of So for any I in X. The sampled neighborhood S9 is 2.25 times larger than S1. When w = 3, the area of S25 (Fig. 2(c)) becomes 2.78 times larger than S1. In general, the area of S(2w−1)² is Rd²(2 − 1/w)²/2, and it covers (2 − 1/w)²/(2π) of the actual neighborhood So.
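The coverage fraction derived above can be checked with a small worked computation (a sketch for verification, not part of the paper's algorithms):

```python
import math

def coverage_fraction(w):
    """Fraction of the circular neighborhood So (area pi * Rd^2) covered
    by S_(2w-1)^2, whose area is (2 - 1/w)^2 * Rd^2 / 2."""
    return (2 - 1 / w) ** 2 / (2 * math.pi)

# coverage grows slowly with w and is bounded by the limit 2/pi ~ 0.64,
# while the number of cells to check, (2w - 1)^2, grows quadratically
```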

The area of S(2w−1)² slowly increases with increasing w but is bounded, while the number of cells that have to be checked increases fast with increasing w. When w → ∞, a sampled neighborhood covers 0.64 of So (lim_{w→∞} (2 − 1/w)²/(2π) = 2/π ≈ 0.64), while the sampled neighborhood with increasing w comprises (2w − 1)² → ∞ cells, all of which have to be checked for patterns of size 2. However, the number of cell lookups in a sampled neighborhood decreases with increasing pattern size. To find a k-size interaction pattern instance, we check the overlapping region of the neighborhoods of the k − 1 participating feature instances for the presence of an instance

Fig. 3. Finding an interaction pattern instance (a) of size 3. (b) of size 4.

of the k-th feature. For instance, to find {A1, B2, C2} in a grid where l = Rd/2 (Fig. 3(a)), we check the overlapping region of the sampled neighborhoods of A1 and B2. In Fig. 3(a), the actual neighborhood of a feature instance is shown by a circle, whereas the sampled neighborhoods of A1 and B2 are shown by dashed and dotted squares, respectively. Here the overlapping region of the two sub-regions includes 6 cells (2, 3, 6, 7, 10, and 11), and we can restrict the search for an instance of C to these 6 cells. Similarly, to find {A1, B2, C2, D1}, the overlapping region of the sampled neighborhoods of A1, B2, and C2 must be checked when looking for an instance of D. In the example shown in Fig. 3(b), the overlapping region includes only 4 cells (2, 3, 6, and 7), indicated by a dotted line.
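A self-contained sketch of this overlap-based search; the helper names are illustrative, not from the paper's pseudo-code:

```python
import math

def cell_of(x, y, side):
    """Index of the grid cell containing point (x, y)."""
    return (int(x // side), int(y // side))

def block(center, w):
    """(2w-1) x (2w-1) block of cells around a center cell."""
    ci, cj = center
    return {(i, j)
            for i in range(ci - (w - 1), ci + w)
            for j in range(cj - (w - 1), cj + w)}

def candidate_cells(matched_instances, Rd, w):
    """Cells that can hold the k-th feature of a size-k pattern instance:
    the intersection of the sampled neighborhoods of the k-1 instances
    matched so far."""
    side = Rd / (w * math.sqrt(2))  # cell side; cell diagonal is Rd/w
    cells = None
    for x, y in matched_instances:
        nb = block(cell_of(x, y, side), w)
        cells = nb if cells is None else cells & nb
    return cells if cells is not None else set()
```

With an illustrative grid of cell side 1 (i.e., Rd = 2√2 and w = 2), two instances in horizontally adjacent cells share 6 candidate cells, the same count as in the Fig. 3(a) example.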

Clearly, there is a trade-off between the quality of the PI∗-value as a test statistic to determine spatial associations and the resolution of the grid that induces the sampled neighborhoods based on which the PI∗-values are determined. Note, however, that achieving the best accuracy of the neighborhood approximation is not necessary. Since we are making the same kind of error when computing PI∗-values both for the observed data set and in all the simulations, what matters is whether the distribution of the PI∗-values will lead to the same statistical inference about which patterns are statistically significant. Fig. 4 shows 3 empirical distributions of approximate (for w = 3 and w = 4) and actual PI-values (computed from all the instances) of a pattern {A, B} estimated under the null model (i.e., for two independent features A and B). In Fig. 23 of Appendix B, available online, we show two additional empirical distribution curves of PI∗-values estimated

Fig. 4. Distribution of PI∗-values and PI-values computed under the null model. (a) l = Rd/3. (b) l = Rd/4. (c) Actual PI-values.



using grids with coarser cell resolutions. We can see that, using a finer cell resolution for the grid, the distribution of PI∗-values of {A, B} becomes more similar to the distribution of actual PI-values. In our experiments, we will demonstrate that values for w equal to 4, 3, or even as low as 2 work well in all cases, except for patterns that involve features with an extremely low number of instances.

The accuracy of our sampling approach in making a statistical inference on a pattern's significance depends on the area of the sampling sub-region and on the number of interacting instances of the participating features of a pattern. For a participating feature fi of a pattern C, let I_fi^C be an instance which is found interacting with instances of the other participating features of C in the non-approximated interaction neighborhood So. The probability P_fi^S of identifying I_fi^C as interacting with the other participating features of C also in the sampled sub-region S increases with the area of S and with the individual intensity of the other participating feature instances in So. The smaller the value of P_fi^S is, the more instances of fi interacting with other features of C will be required for a good estimation of P_fi^S from the data. When w equals 1, 2, 3, or 4, a sampled sub-region covers 0.15, 0.36, 0.44, and 0.49, respectively, of the non-approximated neighborhood So. With w = 1, we have a very coarse approximation of the actual neighborhood, which may only work well enough if the number of interacting instances of fi in C is sufficiently large. A more detailed reasoning is given in Appendix B, available online.

Algorithm 3 of Appendix B, available online, is the pseudo-code of our sampling method.

3.3 Complexity

In the worst case of the all-instance-based algorithm, there is no pruning in each simulation, and we compute the PI-value of each possible interaction pattern C. Before computing the PI-value of C, we look up the stored PI-values of its subsets of size 2. Hence the cost for C is the sum of the lookup cost and the cost for computing its PI-value. Computing a PI-value requires the inter-distance computation among features and comparing the computed distances with Rd. In the grid based sampling approach, the lookup cost of finding interacting feature instances of fi is smaller than the cost in the all-instance-based approach, and we do not compute distances, since it is guaranteed that all feature instances fj in a sampled neighborhood are co-located. The complexity of our algorithm, however, is O(2^n) in the worst case. A detailed complexity analysis is given in Appendix B, available online.

While the worst case is expensive, in many important real applications (e.g., in ecology), the largest pattern size that typically exists in the data is much smaller than the total number of features n, since a finite interaction neighborhood can typically not accommodate instances of n different features when n is large. In such applications, the actual cost in practice is much lower than the worst case. While checking neighborhoods of feature instances, we can determine the size of the largest interaction pattern, and then restrict our search to patterns up to this size (instead of all sizes).

Fig. 5. (a) Data set where ◦ and � are negatively associated. (b) Estimation of Ripley's K-function K◦,�(r).

4 EXPERIMENTAL EVALUATION

For the experiments, we compare the all-instance-based approach with our sampling approach for four grid cell resolutions, given by w = 1, 2, 3, 4, as well as with a standard co-location mining approach.

4.1 Synthetic Data Sets

4.1.1 Inhibition

Here we show that a set of negatively associated features can be wrongly reported as a prevalent co-location pattern by existing co-location mining algorithms using typical threshold values. We also show that our algorithm does not report such a pattern as a co-location pattern, but rather reports it as a segregation pattern.

Experimental setup 1: We generate a data set with 40 instances of each of two features ◦ and � that inhibit each other. The pair-wise inhibition of ◦ and � is modeled using a multi-type Strauss process [11]. A description of this process and details on our data generation setup are provided in Appendix C, available online. The study area is a unit square and the interaction distance (Rd) is set to 0.1. Fig. 5(a) shows the data set. Note that, even when imposing an inhibition between ◦ and �, there are still some instances of the pattern {◦,�}.

Result: The actual PIobs({◦,�})-value is 0.55. The ppos-value of PIobs = 0.55 according to equation 1 is (99 + 1)/(99 + 1) = 1, which means that seeing a PI-value of at least 0.55 under the null model is practically certain. Hence our method will not report {◦,�} as a significant co-location pattern. Our grid based sampling approach also does not report {◦,�} as a co-location pattern, as the ppos-value is always greater than α = 0.05. The pneg-value of PIobs = 0.55 according to equation 2 is 1/(99 + 1) = 0.01 < α, which means that seeing a PI-value of 0.55 or less under the null model is quite unlikely. In our sampling approach, we find that the pneg-value is always less than α = 0.05. Hence {◦,�} is reported as a segregation pattern. The complete results are shown in Table 8 of Appendix C, available online. The reported segregation relationship can be validated by an estimate of Ripley's cross-K function. In Fig. 5(b), we see that the estimate of K◦,�(r) using Ripley's isotropic edge correction (solid line) is always below the theoretical curve (dashed line), which means that the average number of � found in a neighborhood of radius r of a ◦ is always less than the expected value (πr²), indicating a negative association. The precision and recall of our methods are both equal to 1, while the standard method should not report segregation



Fig. 6. (a) Data set with inhibition among ◦, �, and +. (b) Estimation of the 3rd order summary statistic T(r).

patterns. However, it reports {◦,�} as a prevalent co-location if a rather typical value of 0.55 or less is set as the PI threshold, which is highly misleading.

Experimental setup 2: Using Geyer's triplet process [16], an extension of the Strauss process, we generate an inhibition data set (Fig. 6(a)) of 3 features ◦, �, and +. Each feature has 50 instances, and an inhibition relationship is imposed among all 3 features. A detailed description of this experimental setup is provided in Appendix C, available online. The study area is a unit square and the interaction distance (Rd) is set to 0.1.

Result: The PIobs({◦,�,+})-value is 0.42. The ppos-value is 0.99, which is greater than α = 0.05, and hence our method does not report {◦,�,+} as a significant co-location pattern. Our grid based sampling approach also does not report {◦,�,+} as a co-location pattern, as the ppos-value is always greater than α = 0.05. The pneg-value is 0.03. The pneg-values using our sampling approach are also smaller than α = 0.05. Hence, in all our approaches, {◦,�,+} is reported as a segregation pattern. The complete results are shown in Table 7 of Appendix C, available online. The reported segregation relationship can also be validated by estimating the third order summary statistic T(r) [17]. In Fig. 6(b), we see that the estimate of T◦,�,+(r) with border correction (solid line) is always below the theoretical curve (dashed line), which means that in an r-neighborhood of a typical point o, the average number of r-close triples including o is always smaller than the expected value, indicating a segregation among the features ◦, �, and +. Again, the precision and recall of all our methods are 1, while the standard method should not report the segregation pattern. However, it will wrongly report the pattern {◦,�,+} as a prevalent co-location if a value of 0.42 or less is used as the PI threshold, which, again, is not uncommon.

4.1.2 Auto-correlation

In this experiment, we show that even though the participating features of a pattern are independent of each other, their spatial auto-correlation properties can generate a PI-value higher than a typical threshold. Our algorithms do not report such patterns as true co-locations.

Experimental setup: We generate a synthetic data set (shown in Fig. 7(a)) with 2 different features ◦ and �. Feature � has 120 instances which are independently and uniformly distributed. Feature ◦ has 100 instances which are spatially auto-correlated. The spatial distribution of ◦ follows the model of Matérn's cluster process [11]. The study

TABLE 1
Relationships Implanted in the Synthetic Data

area is a unit square and the spatial interaction neighborhood radius (Rd) is 0.1. The summary statistics of ◦ are κ = 40, μ = 5, and r = 0.05.

Result: The PIobs({◦,�})-value is 0.46. The ppos-value is 0.75 and the pneg-value is 0.31, which are greater than α, and hence the pattern {◦,�} is not reported as a co-location or segregation pattern. Our grid based sampling approach also does not report {◦,�} as a co-location or a segregation pattern. Table 8 in Appendix C, available online, shows the complete results from our different approaches. The standard co-location approaches, on the other hand, will mistakenly report the pattern {◦,�} as prevalent since its PI-value of 0.46 is higher than typical thresholds.

4.1.3 Mixed Spatial Interactions

Here, we generated a synthetic data set with 5 different feature types ◦, �, +, ×, and � (Fig. 7(b)). Among these features, we impose different spatial relationships such as positive association, auto-correlation, inhibition, and randomness. We show that our algorithms detect co-location and segregation patterns occurring due to positive and negative associations, and that we do not report "false" patterns even if they have high PI-values.

Experimental setup: Features ◦, �, and × have 40 instances each. Feature + has 118 instances, and feature � has 30 instances. Our study area is a unit square and the interaction neighborhood radius (Rd) is set to 0.1. A detailed description of the data generation models is provided in Appendix C, available online. Table 1 shows the spatial relationships that are implanted in the synthetic data.

Result: Our algorithms detect patterns generated due to the spatial associations that are implanted in the synthetic data. Tables 9, 10, and 11 in Appendix C, available online, show the complete results for the computed PI-values, the PI∗-values for different grid sizes, the ppos-values, and the pneg-values of all possible subsets. The results for each possible subset of features are discussed in detail, by increasing pattern size, in Appendix C, available online. The SSCSP algorithm reports all 4 implanted

Fig. 7. (a) Data set with 2 features. (b) Data set with 5 features.



TABLE 2
Existing Methods vs. Our Method

co-location and all 3 segregation patterns. It also reports the 4 additional co-location patterns {◦,+,�}, {◦,×,�}, {+,×,�}, and {◦,+,×,�}, which correspond to the 4 implanted patterns but include the additional, independently distributed feature �. Due to the positive association among ◦, +, and ×, the number of pattern instances found in the observation is significantly higher than the number found under the independence assumption. These additional four patterns are in a sense redundant, since they reflect the implanted, strong association between three of the involved features. Such patterns can, in principle, be pruned by using an independence assumption conditioned on already found sub-patterns. However, this is beyond the scope of this paper. As can be seen in Tables 9, 10, and 11, the sampling based approach finds exactly the same patterns as the all-instances-based approach. The result of a standard co-location algorithm depends on the chosen threshold. For this experiment, we report the performance for three different thresholds: 0.2, 0.4, and 0.5. Table 2 shows precision, recall, and F-measure (harmonic mean of precision and recall) for the standard co-location algorithm using these thresholds, as well as for our method.

For instance, if the PI threshold for the standard co-location mining algorithm is set to 0.4, 19 patterns will be reported as prevalent; among those, 4 patterns are true co-locations, 4 patterns are the same redundant patterns that our algorithm reports as well, 3 patterns are the wrongly reported segregation patterns, and the rest are meaningless patterns. In this case, precision, recall, and F-measure, not counting the segregation patterns (since the co-location mining algorithm should, according to its semantics, not find those), are 4/19 = 0.21, 4/4 = 1, and 0.35, respectively.²
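These figures follow from the standard definitions; a worked check (the set contents are illustrative placeholders for pattern identifiers):

```python
def precision_recall_f1(reported, true_patterns):
    """Precision, recall, and F-measure (harmonic mean of precision and
    recall) of a set of reported patterns against the true ones."""
    tp = len(reported & true_patterns)        # true positives
    precision = tp / len(reported)
    recall = tp / len(true_patterns)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With 19 reported patterns of which 4 are the true co-locations, this yields precision 4/19 ≈ 0.21, recall 1, and F-measure ≈ 0.35.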

4.1.4 Runtime Comparison

For an auto-correlated feature, we do not generate all of its instances, and we can also prune candidate patterns which cannot contribute to the p-value computation under certain circumstances (see Sect. 3). In a naïve approach, we do not apply any of these techniques.

All experiments are conducted on an Intel Core i3 machine with a CPU speed of 2.10 GHz. The main memory size is 2 GB, and the OS is Windows 7. For the runtime comparison, we generate a data set with 4 different features ◦, �, +, and ×. Features ◦, �, and + are auto-correlated features. They also show an inhibition relationship with feature ×. The study area is a square with an area of 100 sq.

2. If we are "generous" and count the wrongly reported segregation patterns as correct, the precision, recall, and F-measure would be 7/19 = 0.37, 7/7 = 1, and 0.54, respectively, which is still substantially worse than our method.

Fig. 8. Runtime comparison.

units and the interaction distance Rd is set to 0.1. We impose a positive association among ◦, �, and +, where each instance of a feature is in co-location with an instance of the other two types of features. Feature ◦, feature �, and feature + have 400 instances each, and feature × has 20 instances. The naïve approach, the all-instances-based SSCSP, as well as all the sampling approaches find the same significant co-location patterns ({◦,�}, {◦,+}, {�,+}, and {◦,�,+}) and the same significant segregation patterns ({◦,×}, {�,×}, and {+,×}). We conduct four more, similar experiments, and in each experiment we keep the total cluster number of each auto-correlated feature the same but increase the total number of instances per cluster by a factor k for all clusters. For all these experiments, the same co-location and segregation patterns are reported as significant by all different approaches. Fig. 8 shows the runtime of a naïve approach, the all-instances-based SSCSP algorithm, and the grid based sampling approach with 4 different cell resolutions. Fig. 9 shows that, with the increase of the number of instances, we obtain an increasing speedup growing from 1.9 to 5.31 for the all-instances-based SSCSP algorithm. We obtain a further increasing speedup using grid based sampling. With an increasing number of feature instances, the speedup increases from 4.7 to 12.8 when l = Rd, from 4 to 11.9 when l = Rd/2, from 3.12 to 10.9 when l = Rd/3, and from 2.3 to 9.3 when l = Rd/4. A cell resolution of l = Rd gives the best speedup but may not be a safe choice for mining a true co-location or segregation pattern which has very few instances. Our experiments overall suggest that a cell resolution of either l = Rd/2 or l = Rd/3 is a good choice since, in all conducted experiments with synthetic and real data sets (to be discussed next), it only missed one true co-location pattern, in a case where one of the involved features has a very low number of instances.

For an auto-correlated feature, if the number of clusters increases, the chance of a cluster being close to other features will be higher. Hence the data generation step might have to generate more instances of each auto-correlated feature in such cases. In another 5 experiments, we increased the number of instances of feature × and the number of clusters of each auto-correlated feature (◦, �, and +) by the same factor but kept the number of instances per cluster the same. Fig. 10 shows the runtime, and Fig. 11 shows the speedup obtained by the different approaches in these 5 experiments. We see that with an increasing number of clusters, the speedup first increases but eventually goes down. This happens when more and more instances



Fig. 9. Speedup.

actually have to be generated, eventually leaving only the speedup due to candidate pruning.

These experiments also demonstrate that the runtimes of our approaches are acceptable for many real world applications where the quality of the results matters more than speed.

Comparison with the existing algorithms: The computational time of the existing algorithms depends on the selection of the PIthre-value. A low PIthre-value is computationally more expensive than a high PIthre-value, since it allows less pruning and thus results in more candidate patterns being reported as prevalent. Hence there is no completely fair way to compare our algorithm with the existing algorithms. To show a range of possible results, we use 0.2 and 0.5 as PIthre-values to measure the computational time of the algorithm in [5] and compare it with our algorithm (shown in Table 12 of Appendix C, available online), using the data of the experiment of Section 4.1.3. Due to the randomization tests required for the significance tests, our algorithm is clearly slower than the existing algorithm. However, with PIthre = 0.2, the join-less algorithm reports all subsets as prevalent, which is meaningless, and with PIthre = 0.5 the algorithm reports the four true co-location patterns, but also eleven additional patterns: five of which are meaningless, four of which are redundant patterns, and two of which are in fact segregation patterns and not co-location patterns, which is a particularly severe mistake.

4.2 Real Data Sets

4.2.1 Ants Data

The nesting behavior of two species of ants (Cataglyphis bicolor and Messor wasman) is investigated to check if they

Fig. 10. Runtime comparison.

Fig. 11. Speedup.

have any dependency on biological grounds. The Messor ants live on seeds, while the Cataglyphis ants collect dead insects for food, which are for the most part dead Messor ants. Zodarium frenatum, a hunting spider, kills Messor ants. The question is if there is any connection we can determine between these two ant species based on their nest locations. The full data set gives the spatial locations of nests recorded by Professor R.D. Harkness [18]. It comprises 97 nests (68 Messor and 29 Cataglyphis) inside an irregular convex polygon (Fig. 12(a)).

We run our algorithm on the ants data and compute the PI-value based on all instances and based on the grid based sampling approach. Each of 24 Cataglyphis ant nests is close to at least one Messor ant's nest (not more than 50 units away), and the participation ratio of the Cataglyphis ants is 24/29 = 0.83. For the Messor ants, the participation ratio is 30/68 = 0.44. Thus the actual PIobs-value of the interaction pattern {Cataglyphis, Messor} is 0.44. In the randomization test, we generate 99 simulation runs and find that in 18 simulation runs the PI0^Ri-value is greater than or equal to the PIobs-value. The ppos-value is equal to (18 + 1)/(99 + 1) = 0.19, which is greater than 0.05 and thus not statistically significant. Hence we cannot conclude that there is a positive dependency between these two types of ants. The pneg-value is calculated as (81 + 1)/(99 + 1) = 0.82, which is greater than 0.05, and thus the interaction pattern cannot be a statistically significant segregation pattern. Table 13 in Appendix D, available online, shows the computed PI-values, ppos-values, and pneg-values using the grid based sampling method with different cell resolutions. Again, for all grid sizes, the result is the same, i.e., the spatial interaction of Cataglyphis and Messor is neither a statistically significant

Fig. 12. (a) Ants data: ◦ = Cataglyphis ants and � = Messor ants. (b) Toronto address data (4 features).



Fig. 13. Distribution of (a) newly emergent (mark 1), (b) 1 year old (mark 2), and (c) 2 years old (mark 3) canes.

co-location nor a statistically significant segregation. In fact, clear evidence of a spatial association between these two species is also not found in [18]. Existing co-location mining algorithms would report {Cataglyphis, Messor} as a prevalent co-location if a value of 0.44 or less is set as the PI threshold.
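The PI computation used above follows the minimum-participation-ratio definition; a minimal sketch:

```python
def participation_index(participating_counts, totals):
    """PI-value of a pattern: the minimum participation ratio over its
    features, where a feature's participation ratio is the fraction of
    its instances that take part in some instance of the pattern."""
    return min(n / t for n, t in zip(participating_counts, totals))
```

For the ants data, `participation_index([24, 30], [29, 68])` gives min(0.83, 0.44) ≈ 0.44, the PIobs-value reported above.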

4.2.2 Bramble Canes Data

Hutchings recorded and analyzed the cane distribution of Rubus fruticosus (blackberry), a bush commonly known as bramble. The bramble canes data (published in [10]) records the locations (x, y) and ages of bramble canes in a 9 m square field plot. The canes were classified according to age as either winter buds breaking the soil surface, unbranched and non-flowering first year stems, or branched and flower-bearing second year stems [19]. These three classes are encoded as marks 1, 2, and 3, respectively, in the data set. There are 359 canes with mark 1, 385 with mark 2, and 79 with mark 3. Hutchings' investigation finds an aggregated pattern in all cohorts of canes [19]. This indicates the presence of auto-correlation for each mark. Diggle also analyses the bivariate pattern formed by canes with marks 1 and 2 and finds a positive dependency between these two types ([10], Section 6.3.2).

For our experiments, we re-scaled the 9 m square plot area to the unit square and set the co-location radius to 0.1. The individual distribution of each type is shown in Fig. 13. The PI-values, ppos-values, and pneg-values are shown in Table 14 of Appendix D, available online. In the result, all possible subsets are reported as significant co-location patterns. This also conforms with Diggle's investigation,

Fig. 14. Ripley’s (a) K1,2, (b) K1,3, and (c) K2,3 curves.

TABLE 3
Pair-Wise Spatial Associations. Auto: Auto-Correlation, Ind: Independence, +: Positive, −: Negative

where a pair-wise positive dependency among different types of canes is also reported. The aggregation tendency of the three types of canes that is reported (as {1, 2, 3}) by our approach can also be predicted from the estimated Ripley's cross-K function curves (Fig. 14) of all possible pairs. In all 3 cross-K function curves of Fig. 14, we see that the K-value estimated from the data at the co-location distance (Rd = 0.1) is always greater than the theoretical K-value (estimated from a Poisson distribution), indicating a pair-wise aggregation tendency. A positive association among all pairs and the similar spatial distribution of each type of cane result in a positive association among all three types of canes.

4.2.3 Lansing Woods Data
D. J. Gerrard prepared this famous multi-type point data set from a plot of 19.6 acres in Lansing Woods, Clinton County, Michigan, USA. The data set records the locations of 2251 trees of 6 different species (135 black oaks, 703 hickories, 514 maples, 105 red oaks, 346 white oaks, and 448 miscellaneous trees). For our experiments, we set the interaction distance to 92.4 feet and re-scaled the original plot size (924 × 924 feet) to the unit square in order to mine significant positive and negative interactions. The individual distribution of each tree species is shown in Fig. 15. We estimate the pair correlation function value g(d) for each tree species. At distance d = 92.4 feet, we find each tree species to be spatially auto-correlated, as g(d) > 1.
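The g(d) > 1 criterion can be made concrete with a crude binned estimator: count point pairs whose distance falls in a thin annulus around d and divide by the count expected under complete spatial randomness. This is a generic sketch with our own naming and toy data, not the estimator of [11], and it omits edge correction:

```python
import math

def pair_correlation(points, d, h, area=1.0):
    """Crude binned estimate of the pair correlation function g(d):
    ordered pairs with distance in (d - h, d + h], divided by the
    count expected under complete spatial randomness (no edge correction)."""
    n = len(points)
    lo, hi = max(d - h, 0.0), d + h
    observed = sum(1 for i, (ax, ay) in enumerate(points)
                   for j, (bx, by) in enumerate(points)
                   if i != j and lo < math.hypot(ax - bx, ay - by) <= hi)
    expected = n * (n - 1) * math.pi * (hi ** 2 - lo ** 2) / area
    return observed / expected

# Toy data: 5 tight clusters of 5 points each -> strong auto-correlation.
clusters = [(0.1, 0.1), (0.3, 0.7), (0.5, 0.3), (0.7, 0.8), (0.9, 0.2)]
offsets = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (0.01, 0.01), (0.02, 0.0)]
pts = [(cx + ox, cy + oy) for cx, cy in clusters for ox, oy in offsets]

g = pair_correlation(pts, 0.02, 0.02)
print(g > 1.0)  # True: g(d) > 1 at small d indicates clustering (auto-correlation)
```

For the clustered toy data, all 100 ordered intra-cluster pairs fall inside the annulus while only about 3 would be expected under randomness, so g(d) comes out far above 1.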

In Appendix D, available online, Table 15 shows the PI-values and ppos-values of the significant co-location patterns found by our algorithms, and Table 16 shows the PI-values and pneg-values of the segregation patterns found by our algorithms. Diggle [10] and Perry et al. [20] analyze the Lansing Woods data to find bivariate patterns only. Some of the 2-size patterns that are reported by our method are also found in their work. In their result, hickory and maple are reported to deviate from randomness and exhibit segregation. Our findings can also be validated by estimating Ripley's cross-K function at the interaction distance of 92.4 feet for all pairs. From the estimated Ripley's cross-K function values, we find the pair-wise spatial relationships shown in Table 3. To report the co-locations of Table 15 using the existing co-location algorithms, the PI threshold cannot be greater than 0.702. Such a threshold would, however, also select the interaction patterns {Black oak, Maple} and {Hickory, Maple} of Table 16 as co-location patterns, even though they are actually segregation patterns.

In Appendix D, available online, Table 17 shows some of the interaction patterns that have high PI-values (actual and approximated) but are not reported as


Fig. 15. Spatial distribution of each tree species.

significant by our method. Existing co-location mining algorithms will report them as prevalent co-location patterns, as their actual PI-values are 1. Ripley's K-function values for {Hickory, Red oak}, {Hickory, White oak}, and {Red oak, White oak} at distance 92.4 feet also indicate pair-wise independence among the features participating in these patterns. The investigations of Diggle and Perry et al. and the results of the cross-K function are used as the ground truth. As these approaches analyze only bivariate spatial relationships, we compute precision, recall, and F-measure of all methods using only patterns of size 2. Table 4 shows the precision, recall, and F-measure of our method and the existing co-location algorithms for 3 different PI threshold values, 0.2, 0.4, and 0.5, for patterns of size 2. There are 2 true co-location patterns of size 2 and 4 true segregation patterns of size 2. For instance, with a PI threshold value of 0.4, the existing co-location algorithms report 15 pairs as prevalent co-locations. All the true co-location patterns are included among those reported patterns. Counting segregation patterns as mistakes, the precision, recall, and F-measure are 2/15 = 0.13, 2/2 = 1, and 0.23, respectively.³
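For concreteness, the precision/recall arithmetic above can be reproduced with a small helper (the function name is ours; the counts come from the text):

```python
def precision_recall_f(true_positives, num_reported, num_relevant):
    """Precision, recall, and F-measure computed from raw counts."""
    precision = true_positives / num_reported
    recall = true_positives / num_relevant
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# PI threshold 0.4: 15 reported pairs, of which 2 are the true co-locations;
# there are 2 true co-location patterns of size 2 in total.
p, r, f = precision_recall_f(2, 15, 2)
print(round(p, 2), r, round(f, 2))  # 0.13 1.0 0.24 (F = 4/17, reported as 0.23 in the text)
```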

4.2.4 Toronto Address Repository Data
Toronto Open Data provides a data set with over 500,000 addresses within the City of Toronto, enclosed in a polygonal area. Each address point has a series of attributes, including a feature class with 65 features and coordinates. After removing entries with missing data and removing features with very high frequency (e.g., high density residential), we consider 10 features for our experiment: low density residential (66 instances), nursing home (31 instances), public primary school (510 instances), separate primary school (166 instances), college (32 instances), university (91 instances), fire station (63 instances), police station (19 instances), other emergency service (21 instances), and fire/ambulance station (16 instances). Due to space limitations, only some of the feature distributions are shown in Fig. 12(b). To determine if a feature shows clustering (spatial auto-correlation), regularity (inhibition), or randomness (Poisson), we compute the pair correlation function g(d) [11]. Police stations, fire stations, fire/ambulance

3. If we are “generous” again, and count the wrongly reported segregation patterns as correct, the precision, recall, and F-measure would be 6/15 = 0.4, 6/6 = 1, and 0.57, respectively, which is still substantially worse than our method.

TABLE 4. Existing Methods vs. Our Method

stations, and separate primary schools show regular distributions, since g(d) < 1 at smaller d values. The remaining features are auto-correlated, since their g(d) > 1 for smaller values of d. The interaction neighborhood radius is set to 500.

In Table 5, we show statistically significant 2-size, 3-size, and 4-size co-locations and their PIobs-values computed by the All-instances-based SSCSP algorithm. Note that the PIobs-values are so low that existing co-location mining algorithms would return almost every feature combination as a co-location if their global threshold were set so low that the reported statistically significant co-locations could be returned. Our grid based sampling approach also finds all the statistically significant co-locations for all grid sizes, with the exception of the co-location {Low density resid., Univ., Fire station, Police station}, which is missed when a grid with only l = Rd (i.e., w = 1) is used for sampling. In Appendix D, available online, Table 18 shows the actual and approximate PIobs-values and ppos-values of all the reported co-locations.
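The core mechanism of a grid with cell side l = Rd can be sketched generically: hashing instances into square cells lets all neighbors within Rd be found by inspecting only the 3 × 3 block of surrounding cells. This is an illustration of grid bucketing under our own naming, not the exact SSCSP sampling procedure:

```python
import math
from collections import defaultdict

def build_grid(points, cell_size):
    """Hash points into square cells of side cell_size."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return grid

def neighbors_within(grid, cell_size, x, y, rd):
    """All points within distance rd of (x, y): with cell_size >= rd,
    only the 3x3 block of cells around the query cell must be checked."""
    cx, cy = int(x // cell_size), int(y // cell_size)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for px, py in grid.get((cx + dx, cy + dy), []):
                if math.hypot(px - x, py - y) <= rd:
                    out.append((px, py))
    return out

pts = [(0.11, 0.10), (0.18, 0.12), (0.90, 0.90)]
grid = build_grid(pts, 0.1)
print(neighbors_within(grid, 0.1, 0.12, 0.11, 0.1))  # the two nearby points only
```

The same bucketing idea underlies sampling one or a few instances per cell: coarser cells (larger w) mean fewer sampled instances but, as noted above, a higher risk of missing a pattern with very few instances.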

5 RELATED WORK

In spatial statistics, the spatial co-location or segregation pattern mining problem is viewed somewhat differently than in the data mining community. There, co-location or segregation pattern mining is similar to the problem of finding associations or interactions in multi-type spatial point processes. Several measures are used to compute spatial interaction, such as Ripley's K-function [12], distance based measures (e.g., the F function and G function) [11], and the co-variogram function [21]. With a large collection of Boolean spatial features, computation of the above measures becomes expensive, as the number of candidate subsets increases exponentially in the number of different features. Mane et al. [22] use the bivariate K-function [11] as a spatial statistical measure with a data mining tool to find clusters of female chimpanzees' locations and to investigate the dynamics of the spatial interaction of a female chimpanzee with the male chimpanzees in the community. There, each female chimpanzee represents a unique mark. Two clustering methods (SPACE-1 and SPACE-2) are proposed, which use Ripley's cross-K function to find clusters among differently marked point processes.

In the data mining community, co-location pattern mining approaches are mainly based on spatial relationships such as “close to”, proposed by Koperski and Han in [23], which presents a method to mine spatial association rules indicating a strong relationship among a set of spatial and some non-spatial predicates. A spatial association rule of the form X → Y states that in a spatial database, if a set of features X is present, another set of features Y is more


TABLE 5. Found Statistically Significant Co-Locations. A feature present in a co-location is shown by a ✓.

likely to be present. Rules are built in an apriori-like fashion, and a rule is defined as strong if it has enough support and confidence [2]. Morimoto [4] proposes a method to find groups of different service types originating from nearby locations and reports a group if its occurrence frequency is above a given threshold. Shekhar et al. [3] discuss three models (the reference feature centric model, the window centric model, and the event centric model) that can be used to materialize “transactions” in a continuous spatial domain so that a frequent itemset mining approach can be used. Using the notion of the event centric model, a mining algorithm is developed which utilizes the anti-monotonic property of a proposed prevalence measure, called the participation index (PI), to find all possible co-location patterns.
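For a two-feature pattern, the participation index just described can be sketched as follows: each feature's participation ratio is the fraction of its instances that have a neighbor of the other feature within the interaction distance, and the PI is the minimum of these ratios. This is a simplified, quadratic-time illustration with our own naming, not an implementation from [3]:

```python
import math

def participation_index(inst_a, inst_b, rd):
    """PI({A, B}) = min over the two features of the participation ratio:
    the fraction of a feature's instances that have a neighbor of the
    other feature within distance rd."""
    def ratio(src, other):
        hits = sum(1 for sx, sy in src
                   if any(math.hypot(sx - ox, sy - oy) <= rd for ox, oy in other))
        return hits / len(src)
    return min(ratio(inst_a, inst_b), ratio(inst_b, inst_a))

a = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]
b = [(0.12, 0.1), (0.52, 0.5)]
# 2 of 3 A instances and 2 of 2 B instances participate -> PI = min(2/3, 1) = 2/3
print(participation_index(a, b, 0.05))
```

The anti-monotonic property exploited for pruning means the PI of a superset pattern can never exceed the PI of any of its subsets.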

Much of the follow-up work on the co-location mining approach of [3] has focused on improving the runtime. As an extension of the work in [3], [1] proposes a multi-resolution pruning technique to filter false candidates. To improve the runtime of the algorithm in [1], Yoo and Shekhar propose in [5] and [6] two instance look-up schemes where a transaction is materialized in two different ways: in [5], a transaction is materialized from a star neighborhood, and in [6], a transaction is materialized from a clique neighborhood. Xiao et al. [24] improve the runtime of the frequent itemset based methods [1], [4], and [6] by starting from the most dense region of objects and then proceeding to less dense regions. From a dense region, the method counts the number of instances of a feature of a candidate co-location. Assuming that all the remaining instances are in co-locations, the method then estimates an upper bound of the PI-value, and if it is below the threshold, the candidate co-location is pruned.

All of the above mentioned co-location pattern discovery methods use a predefined threshold to report a prevalent co-location. Therefore, if thresholds are not selected properly, meaningless co-location patterns could be reported in the presence of spatial auto-correlation and feature abundance, or meaningful co-location patterns could be missed when the threshold is too high. In [25], we introduce a new definition of co-location based on a statistical significance test and propose a mining algorithm (SSCP). The SSCP algorithm relies on randomization tests to estimate the distribution of a test statistic under a null hypothesis. To reduce the computational cost of the simulations conducted during the randomization tests, the SSCP algorithm adopts two strategies, one in the data generation step and the other in the prevalence measure computation step. In this article, we extend this work and propose a sampling approach to improve the runtime of the SSCP algorithm further. For the significance test, we show that our sampling approach does not need all the existing instances of a co-location, but rather a sample of the actual instances. Our sampling approach uses a grid based technique to generate such a sample efficiently and can reduce the computational cost of the SSCP approach further.
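The randomization test underlying this family of methods can be illustrated with a generic empirical p-value computation; the helper name and the stand-in null statistics below are ours (in the actual algorithm, the statistic is the pattern's PI-value and the null data sets come from the simulated null model):

```python
def empirical_p_value(observed, null_stats, alternative="greater"):
    """p = (1 + #null statistics at least as extreme as observed) / (1 + #simulations).
    'greater' targets co-locations (PI unusually high under the null);
    'less' targets segregation patterns (PI unusually low under the null)."""
    if alternative == "greater":
        extreme = sum(1 for s in null_stats if s >= observed)
    else:
        extreme = sum(1 for s in null_stats if s <= observed)
    return (1 + extreme) / (1 + len(null_stats))

# Hypothetical PI values from 99 simulated null data sets (stand-ins):
null_pis = [0.01 * i for i in range(99)]   # 0.00, 0.01, ..., 0.98
p_pos = empirical_p_value(0.97, null_pis)  # only 2 null values are >= 0.97
print(p_pos)                               # 0.03 -> significant at alpha = 0.05
```

The runtime reductions discussed above attack exactly the expensive part of this loop: generating the simulated null data sets and recomputing the PI-values on each of them.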

Besides finding patterns occurring due to a positive association among spatial features, researchers often look for patterns that can also occur due to the effect of an inhibition or a negative association among spatial features. In association rule mining, several works look for patterns occurring due to correlations among items in market basket data. Among those, the work of Brin et al. [26] can be mentioned, in which χ² is used as a test statistic to measure the significance of an association of a group of items. To the best of our knowledge, few works in the spatial domain look for patterns occurring due to a negative interaction. Munro et al. [7] first discuss more complex spatial relationships and the spatial patterns that occur due to such relationships. A combination of positive and negative correlation behavior among a group of features gives rise to a complex type of co-location pattern. Arunasalam et al. [8] develop a method to mine positive, negative, and complex (mixed) co-location patterns. For mining such patterns, their method uses a user specified threshold on a prevalence measure called the maximum participation index (maxPI), which was first introduced in [27] to detect co-location patterns where at least one feature has a high participation ratio. The maxPI of a co-location C is the maximal participation ratio of all the features of C. In mining such complex patterns, the total number of candidate patterns of size 1 is doubled for a given number of features. For instance, for a given set of features {A, B, C, D}, the candidate 1-item set will be {A, B, C, D, −A, −B, −C, −D}, as both the presence and the absence of a feature are now considered in constructing all candidate patterns. The exponential growth of the candidate space with an increasing number of features makes the pattern mining computationally expensive. Arunasalam et al. propose a pruning technique which works only when a threshold value of 0.5 or greater is selected.
In their method, selection of the right threshold is important for capturing a pattern occurring due to a true correlation behavior. This method lacks a validation of the significance of a pattern when the pattern size is greater than two. Spatial auto-correlation is also not considered in their approach. In this article, we show that our model can not only find patterns occurring due to a positive association but can also mine patterns generated by an inhibition or a negative association relationship. For mining true patterns, our model does not depend on a prevalence measure threshold. It can mine both co-location and inhibitive patterns using only one type of prevalence measure. Spatial auto-correlation, which is common in the spatial domain, is also taken into account in our mining approach.

6 CONCLUSION

In this paper, we propose a new definition of co-location and segregation patterns and a method to detect them. Existing approaches in the literature find prevalent patterns based on a predefined threshold value, which can lead to missing meaningful patterns or reporting meaningless patterns. Our method instead uses a statistical test. Such a statistical test is computationally expensive, and we introduce two approaches to improve the runtime. In our first approach, we reduce the runtime by generating a reduced number of instances for an auto-correlated feature in the simulated data generation step and by pruning unnecessary candidate patterns in the PI-value computation step. In the second approach, we show that a PI-value of a pattern computed from a subset of the total instances is, in general, sufficient to test the significance of a pattern. We introduce a grid based sampling approach to identify the instances of a pattern for the significance test at a reduced computational cost. As a result, the speedup is further improved compared to our first approach, while the accuracy of the significance test is maintained.

We evaluate our algorithms using synthetic and real data sets. Our experimental results show that our sampling approach never misses any true patterns when the number of feature instances is not extremely low. However, for a pattern with very few instances of a participating feature, we recommend using a finer grid instead of a coarser grid. Both the all-instances-based and the grid based sampling algorithms find all the true patterns in the synthetic data sets. Using real data sets, we show that our algorithms do not miss any pattern of size 2 found in related work in ecology. The pattern finding approaches proposed in ecology cannot detect patterns of size greater than 2; we show that our methods also find meaningful patterns of larger sizes.

In the future, we would like to investigate how the number of instances and the spatial distribution of the participating features should affect the selection of the right grid resolution for our sampling approach. It is possible to design a “mixed” approach in which grid cells of different sizes, and even the full circular neighborhood, are used for different features, depending on the number of their instances. Future work also includes studying and comparing alternative prevalence measures in our framework, which could also allow additional pruning techniques. We also plan to investigate other sampling approaches, such as sampling based on randomly selected regions, Latin hypercube sampling [28], and the space filling curve technique [29], which could reduce the computational cost further.

REFERENCES

[1] Y. Huang, S. Shekhar, and H. Xiong, “Discovering co-location patterns from spatial data sets: A general approach,” IEEE Trans. Knowl. Data Eng., vol. 16, no. 12, pp. 1472–1485, Dec. 2004.

[2] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proc. 20th Int. Conf. VLDB, Santiago, Chile, 1994, pp. 487–499.

[3] S. Shekhar and Y. Huang, “Discovering spatial co-location patterns: A summary of results,” in Proc. 7th Int. SSTD, Redondo Beach, CA, USA, 2001, pp. 236–256.

[4] Y. Morimoto, “Mining frequent neighboring class sets in spatial databases,” in Proc. 7th ACM SIGKDD Int. Conf. KDD, New York, NY, USA, 2001, pp. 353–358.

[5] J. S. Yoo and S. Shekhar, “A joinless approach for mining spatial colocation patterns,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1323–1337, Oct. 2006.

[6] J. S. Yoo and S. Shekhar, “A partial join approach for mining co-location patterns,” in Proc. 12th ACM Int. Workshop GIS, Washington, DC, USA, 2004, pp. 241–249.

[7] R. Munro, S. Chawla, and P. Sun, “Complex spatial relationships,” in Proc. 3rd IEEE ICDM, 2003, pp. 227–234.

[8] B. Arunasalam, S. Chawla, and P. Sun, “Striking two birds with one stone: Simultaneous mining of positive and negative spatial patterns,” in Proc. 5th SIAM ICDM, 2005, pp. 173–182.

[9] P. J. Diggle, “Displaced amacrine cells in the retina of a rabbit: Analysis of a bivariate spatial point pattern,” J. Neurosci. Meth., vol. 18, no. 1–2, pp. 115–125, 1986.

[10] P. J. Diggle, Statistical Analysis of Spatial Point Patterns, 2nd ed. London, U.K.: Arnold, 2003.

[11] J. Illian, A. Penttinen, H. Stoyan, and D. Stoyan, Statistical Analysis and Modelling of Spatial Point Patterns. Hoboken, NJ, USA: Wiley, 2008.

[12] B. D. Ripley, “The second-order analysis of stationary point processes,” J. Appl. Probab., vol. 13, no. 2, pp. 255–266, 1976.

[13] J. Neyman and E. L. Scott, “Statistical approach to problems of cosmology,” J. Roy. Statist. Soc. Ser. B, vol. 20, no. 1, pp. 1–43, 1958.

[14] P. J. Diggle and R. J. Gratton, “Monte Carlo methods of inference for implicit statistical models,” J. Roy. Statist. Soc. Ser. B, vol. 46, no. 2, pp. 193–227, 1984.

[15] J. Besag and P. J. Diggle, “Simple Monte Carlo tests for spatial patterns,” Appl. Statist., vol. 26, no. 3, pp. 327–333, 1977.

[16] C. J. Geyer, “Likelihood inference for spatial point processes,” in Stochastic Geometry: Likelihood and Computation, O. E. Barndorff-Nielsen, W. S. Kendall, and M. V. Lieshout, Eds. Boca Raton, FL, USA: Chapman and Hall/CRC, 1999, no. 80, ch. 3, pp. 79–140.

[17] K. Schladitz and A. Baddeley, “A third order point process characteristic,” Scandinavian J. Statist., vol. 27, no. 4, pp. 657–671, 2000.

[18] R. D. Harkness and V. Isham, “A bivariate spatial point pattern of ants’ nests,” J. Roy. Statist. Soc. Ser. C (Appl. Statist.), vol. 32, no. 3, pp. 293–303, 1983.

[19] M. J. Hutchings, “Standing crop and pattern in pure stands of Mercurialis perennis and Rubus fruticosus in mixed deciduous woodland,” Oikos, vol. 31, no. 3, pp. 351–357, 1979.

[20] G. L. W. Perry, B. P. Miller, and N. J. Enright, “A comparison of methods for the statistical analysis of spatial point patterns in plant ecology,” Plant Ecol., vol. 187, no. 1, pp. 59–82, 2006.

[21] N. A. C. Cressie, Statistics for Spatial Data. New York, NY, USA: Wiley, 1993.

[22] S. Mane, C. Murray, S. Shekhar, J. Srivastava, and A. Pusey, “Spatial clustering of chimpanzee locations for neighborhood identification,” in Proc. 5th IEEE ICDM, Washington, DC, USA, 2005, pp. 737–740.

[23] K. Koperski and J. Han, “Discovery of spatial association rules in geographic information databases,” in Proc. 4th Int. SSD, Portland, ME, USA, 1995, pp. 47–66.

[24] X. Xiao, X. Xie, Q. Luo, and W.-Y. Ma, “Density based co-location pattern discovery,” in Proc. 16th ACM Int. Symp. Adv. GIS, Irvine, CA, USA, 2008, pp. 250–259.

[25] S. Barua and J. Sander, “SSCP: Mining statistically significant co-location patterns,” in Proc. 12th Int. SSTD, Minneapolis, MN, USA, 2011, pp. 2–20.

[26] S. Brin, R. Motwani, and C. Silverstein, “Beyond market baskets: Generalizing association rules to correlations,” in Proc. SIGMOD, New York, NY, USA, 1997, pp. 265–276.


[27] Y. Huang, J. Pei, and H. Xiong, “Mining co-location patterns with rare events from spatial data sets,” GeoInformatica, vol. 10, no. 3, pp. 239–260, 2006.

[28] M. D. McKay, R. J. Beckman, and W. J. Conover, “A comparison of three methods for selecting values of input variables in the analysis of output from a computer code,” Technometrics, vol. 42, no. 1, pp. 55–61, 2000.

[29] H. Sagan, Space-Filling Curves. New York, NY, USA: Springer-Verlag, May 1995.

[30] E. Wienawa-Narkiewicz, “Light and electron microscopic studies of retinal organisation,” Doctoral dissertation, Australian Nat. Univ., Canberra, ACT, Australia, 1983.

[31] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” J. Amer. Statist. Assoc., vol. 58, no. 301, pp. 13–30, 1963.

[32] A. J. Baddeley, “Spatial point processes and their applications,” in Lecture Notes in Mathematics: Stochastic Geometry. Berlin, Germany: Springer-Verlag, 2007.

Sajib Barua received the B.Sc. degree in computer science and engineering from the Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh, and the M.Sc. degree in computer science from the University of Manitoba, Winnipeg, MB, Canada. He is currently pursuing the Ph.D. degree with the Department of Computing Science, University of Alberta, Edmonton, AB, Canada, under the supervision of Dr. J. Sander. His current research interests include data mining, spatial databases, and performance computing.

Jörg Sander received the Ph.D. degree in computer science from the University of Munich (LMU), Germany, in 1998. He was a Post-Doctoral Fellow for 1 year at the University of British Columbia, Vancouver, BC, Canada, before joining the University of Alberta, Edmonton, AB, Canada, in 2001, where he is currently a Tenured Associate Professor. His current research interests include knowledge discovery in databases, especially clustering and data mining, with particular interest in spatial, spatio-temporal, and biological applications, as well as techniques and index structures that improve the scalability of basic operations such as similarity search in large databases.
