Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
DISCOVERING PATTERNS OF INSURGENCY VIA SPATI<)-TEMPORALDATA MINING
James P. Rogers, James A. Shme, Shashl Shekhar, Mete Cehk
U.S. Army ERDC, TopographIC Engmeenng Center, VA, USA{james.p.rogers.II,james.a.shine} @erdc.usace.army.mil
Department of Computer Science, University of Minnesota, MN, USA{shekhar,mcelik }@cs.umn.edu
Abstract
The need to discover patterns in spatio-temporal (ST)data has driven much recent research in ST cooccurrencepatterns. Early work focused on discovering spatial patterns such as co-location without examining the development of patterns over time or the temporal aspect of STdatasets. This paper describes a novel set of cooccurrencepatterns called mixed-drove co-occurrence patterns (MDCOPs). They represent subsets of two or more differentST object-types whose instances are close to each otherboth spatially and temporally. However, mining MDCOPsis com utationall ve ex ensive due to com lex interestmeasures, larger archived and historical datasets, and ex-ponential growth in candidate patterns with the number ofobject-types. We propose a monotonic composite interestmeasure or discoverin MDCOPs and two novel MDCOPmining algorithms. Analytical results show that the pro-posed algorithms are correct and complete. Experimentalresults also show that the proposed methods are computa-tionall more e cient than naive alternatives.
1. Introduction
The Army manages huge amounts of spatio-temporal(ST) data from a multItude of databases, and the volumeof such data contmues to expand as database archIves growand ST sensors increase in number and resolution Thisdata is stored in attributes, values and tables, with impor-tant relationships and patterns that are not known. Hu-mans have limited capacities to find these patterns withanalySIS and knowledge dIscovery, makmg automated andsemI-automated pattern analySIS essentIal. RelatIonshIpsbetween different ST events are vital to increasing knowl-edge and understanding of military challenges such as dis-covering and predicting insurgent attack patterns and tac-tIcs, and understandmg the nature of asymmetnc warfare.As a result, ST co-occurrence pattern mmmg has been the
I
subject of much recent research.
GIven a movmg object database, our mm IS to dIscovermixed-drove ST co-occurrence patterns (MDCOPs) representing subsets of different object-types whose instancesare located close together in geographic space for a significant fraction of time. Unlike the objectives of some otherST co-occurrence pattern identification approaches wherethe pattern is the primary interest, in MDCOPs both thepattern and the nature of the different object-types are ofinterest.
An example of an MDCOP IS m Amencan footballwhere two teams try to outscore each other by movmg aoot a to t e opponent s en 0 tee. anous com-
plex interactions occur within and across teams to achievethis goal. These interactions involve intentional and accidental MDCOPs, the identification of which may helpteams to stu y t elr opponent s tactIcs. Ject-types maybe defined by the roles of the offenSIve and defenSIveplayers, such as quarterback, running back, wide receiver,kicker, holder, linebacker, and cornerback. An MDCOPis a subset of these object-types (such as kicker, holderor {wideJeceiver, cornerback ~ ) that occur frequently. Oneexample MDCOP mvolves offenSIve WIde receIvers, defensive linebackers, and defensive cornerbacks, and is calleda Broken Blitz play. In this play, the objective of the offensive wide receivers is to outrun any linebackers anddefenSIve backs and get behmd them, catchmg an unde-fended pass and hopefully runmng untouched for a touch-down This interaction creates an MDCOP between widereceivers and cornerbacks. An example Broken Blitz playis given in Fig. 1. It shows the positions of four offensivewide receivers (WI, W2, W3, and WA), two defensivecornerbacks (C.l and C.2), two defenSIve bnebackers (L.Iand L.2), and a quarterback (Q.l) m four tIme slots. Thesolid lines between the players show the neighboring play-ers. The wide receivers W.l and WA cross over each otherand the wide receivers W.2 and W.3 run directly to the endzone of the field. ImtIally, the WIde receIvers W.l and WAare co-located WIth cornerbacks C.l and C.2 and the WIde
Report Documentation Page Form ApprovedOMB No. 0704-0188
Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering andmaintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, ArlingtonVA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if itdoes not display a currently valid OMB control number.
1. REPORT DATE DEC 2008
2. REPORT TYPE N/A
3. DATES COVERED -
4. TITLE AND SUBTITLE Discovering Patterns Of Insurgency Via Spatio-Temporal Data Mining
5a. CONTRACT NUMBER
5b. GRANT NUMBER
5c. PROGRAM ELEMENT NUMBER
6. AUTHOR(S) 5d. PROJECT NUMBER
5e. TASK NUMBER
5f. WORK UNIT NUMBER
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) U.S. Army ERDC, Topographic Engineering Center, VA
8. PERFORMING ORGANIZATIONREPORT NUMBER
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)
11. SPONSOR/MONITOR’S REPORT NUMBER(S)
12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release, distribution unlimited
13. SUPPLEMENTARY NOTES See also ADM002187. Proceedings of the Army Science Conference (26th) Held in Orlando, Florida on 1-4December 2008
14. ABSTRACT
15. SUBJECT TERMS
16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT
UU
18. NUMBEROF PAGES
8
19a. NAME OFRESPONSIBLE PERSON
a. REPORT unclassified
b. ABSTRACT unclassified
c. THIS PAGE unclassified
Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18
CQiTY-lTIJ [QJJ[QJJ
~DW'J.)
cY[P<Q> <0> C@0Z[4)
@> <0> @>
0.1 @> 0.1 a:~<TI> Cl[J)
C0J~ <TI> @)CWD(\'Lv
@~ ~ ~
01
~.''''b4. WcJ,
'1[3
(a) time slot t=O (b) time slot t=l (c) time slot t=2 (d) time slot t=3 (e) coach sketch
Figure 1. An example Broken Blitz play In American football
receivers W.2 and W.3 are co-located with linebackers L.Iand L.2 at time slot t=O (Fig. la). In time slot t=l, the fourwide receivers begin to run, while the linebackers run to-wards the quarterback and the cornerbacks remam m theIrongmal posItIon, posSIbly due to a fake handon from thequarterbaek to the fUnning baek (Fig. Ib). In time slot t 2,the wide receivers W.I and WA cross over each other andtry to drift further away from their respective cornerbacks(Fig. Ic). When the quarterback shows signs of throwingthe football, both cornerbacks and linebackers run to theirrespective wide receivers (Fig. Id). The overall sketch ofthe game tactics can be seen in Fig. Ie. In this example, wide receivers and cornerbacks form an MDCOP sincethey are persistent over time and they occur in 2 out of 4time slots. However, wide receivers and linebackers do notform an MDCOP due to the lack of temporal persistence.
Other applications for which discovering co-occurringpatterns of specific combinations of object-types is im-portant include battlefield planning and strategy, ecology(tracking species and pollutant movements), homeland defense (looking for significant "events"), and transportation(road and network planning) [8, 11].
However, dlscovenng MDCOPs poses several non-trivial challenges. First, current interest measures (i.e. thespatial prevalence measure) are not sufficient to quantifysuch patterns, so new composite interest measures mustbe created and formahzed [9]. Second, the set of can-dldate patterns grows exponentIally WIth the number ofobject-types. Finally, since spatio-temporal datasets arehuge, computationally efficient algorithms must be devel-oped [16].
This paper focuses on MDCOPs (typed collections ofmovmg objects) by extendmg mterest measures for spatIalco-locatIon patterns gIVen a user-defined partICIpatIOn mdex threshold [9]. The following issues are beyond thescope of this paper: (i) determining thresholds for MD-COP interest measures; (ii) similarity measures for track-mg movmg objects ; (111) mdexmg and query processmgIssues related to mmmg objects; and (IV) dlscovenng mul-
2:
tisets (e.g.{A, A, B}).
2. Related Work
Data analysis can be broadly categorized into statisti-cal approaches and data mining approaches. In statistical approaches, there are bodies of work in both spatialand temporal analysis. Spatial point patterns are often described by metrics such as the intensity function and Ripley's K [14, 15]. Other measures such as complete spatial randomness (CSR) and spatial covariance functions areused to describe the spatial relationships of adjacent areasand continuous variables as random fields [5]. Temporalpatterns have been extensively studied in models such asmoving averages, first and second order autoregression, integration, seasonality, and cointegration [18], [6]. Therehas also been some recent research in combining spatialand temporal analysis, such as Brix and Diggle's extendedintensity function and the extended K(r, t) function [1,13].Most attempts to combine the fields suffer from limitationssuch as the inability to model space-time interactions, assuming separability and independence between space andtime [15]. Statistical research specifically focused on STco-occurrence patterns and their possible interactions hasbeen limited
Previous data mining studies for mining ST co-occurrence patterns can be classified into two categories:mining of uniform groups of moving objects, and miningof mixed groups of moving objects.
To mine uniform groups of moving objects, the prob-lems of discovering flock patterns [12, 7] and moving clus-ters [10] are defined. A flock pattern is a moving groupof the same kmd of objects, such as a sheep flock or abIrd flock. Gudmundsson et al. proposed algonthms fordetection of the flock pattern in ST datasets [7]. Kalniset al. defined the problem of discovering moving clus-ters and proposed clustering-based methods to mine suchpatterns [raJ. In thIS approach, If there IS a large enoughnumber of common objects between clusters m consecutIve
time slots, such clusters are called moving clusters. Thesemethods do not take object-types into account, and thusare not effechve for mmmg MDCOPs [4]. To mme mIxedgroups of movmg objects, the problems of discovenng collocation episodes [3] and topological patterns [17] are important. Both generalize co-location patterns [9] to the STdomain. A collocation episode is a sequence of co-locationpatterns wIth some common object-types across consecutive time slots. However, if there is no common object-typein consecutive time slots, the proposed approach will notidentify any pattern. For example, if the window lengthis 2, the collocation episodes algorithm will not be ableto find any pattern from the dataset given in Fig. 1. Thealgorithm tries to find co-location patterns that are persistent in 2 consecutive time slots, but there is no such patternin the dataset because wide receivers and cornerbacks areforming co-locations in time slots t=O and t=3 and widereceivers and linebackers are forming co-locations in timeslots t=O. Thus, there may not be any co-location patternsand collocation episodes identified in the dataset.
A topologIcal pattern [17] IS a subset of object-typeswhose mstances are close m space and hme. An mterestmeasure for a topological pattern {A,B} (e.g. participationindex or support) is a ST join of instances of A and instances of B [9]. This statistic may be high even if manyinstances of A and many instances of B are not spatially to-gether for a moment m hme. The semanhcs of topologIcalpatterns are not well-defined for moving objects. For ex-ample, this approach can not determine the fraction of timethat a pattern occurs. This approach may not be able to tellin which time slots a pattern occurs, since there is no timeslot nohon. In the dataset gIven m FIg. 1, thIS approach wIlldiscover the two patterns of {W,C}, {W,L}, {W,C}, and{W,C,S}. Both patterns have the same support, but pattern{W, C} occurs in 2 time slots out of 4 (a persistent pattern)and patterns {W,L}, {W,C}, and {W,C,S} occur in 1 timeslot out of 4 (a tranSIent pattern) smce tracks of objects arerepresented as ST instances. The persistent pattern {W, C ~
occurs in time slots t=O and t=3 and its instances {WI, Cl}and {W4, C2} occur in time slot t=O and {WI, C2} and{W4, Cl} in time slot t=l. The transient pattern {W, L}occurs in time slot t=O and its instances {W2, Ll}, {W3,Ll}, {W2, L2}, and {W3, L2} occur in time slot t=O.
In contrast, our proposed interest measure and algo-nthms wIll effiCIently mme mIxed groups of objects (e.gMDCOPs) which are close in space and persistent in time.Unlike the techniques just described, our approach discovers persistent patterns that co-occur in most but not allST intervals; consecutive co-occurrences are not mandatory. For example, our approach will find the MDCOP{wideJeceiver, cornerback} pattern in Fig. 1, if the fraction of time slots where the pattern occurs over the totalnumber of time slots is no less than the threshold 0.5, sincemstances are co-located m 2 hme slots out of 4. It wIll reject the patterns {w,q, {w,q, and {W,C,S ~ in Fig. 1 at
3
the same threshold, since they are co-located on only 1 timeslot out of 4.
3. Basic Concepts & Problem Definition
3.1 Spatial Prevalence Measure
Spatial co-location mining algorithms are used to diScover sets of mixed object-types that are frequently locatedtogether in a spatial framework for a given set of spatialobject-types, their instances, and a spatial neighbor relationship R [9]. For example, in Fig. 2, in time slot t=O,{A.l, C.l} is an instance of a co-location if the distancebetween the objects is no more than a given neighborhooddistance threshold. In Fig. 2, the solid lines show the distance between the objects that satisfies the neighborhooddistance threshold. The participation index is used to determine the strength of the co-location pattern. If the indexis greater than or equal to a threshold [9], a co-locationis called spatial prevalent. The participation index is de-fined as the minimum of the participation ratios (the frac-tion of the number of instances of object-types forming colocation instances to the total number of instances). Forexample, in Fig. 2, {A, B} is a co-location in time slott=O, and its instances are {A.l, B.l}, {A.2, B.l}, {A.3,B.2}, and {A.3, B.3}. In the dataset, object-type A has 4instances and three of them (A.l, A.2, and A.3) are con-tributing to the co-location {A, B L so the participation ra-tio of A is 3/4. The participation ratio of B is 3/5 since 3out of 5 instances are contributing to the co-location {A,B}. The participation index of the co-location {A, B} isthe minimun of 4/5 and 3/5, or 3/5. It has been shown thatthe participation index is anti-monotone in the size of colocations [9]. In other words, participation_index(Pj ) :::;
participation_index(P;) if Pi is a subset of Pi. In addi-tion, the participation index has a spatial statistical inter-pretation as an upper bound on the cross - K function [5].
3.2 Modeling MDCOPs
Given a set of spatio-temporal mixed object-types anda set of their instances with a neighborhood relation R, anMDCOP is a subset of spatio-temporal mixed object-typeswhose mstances are neIghbors m space and hme.
Definition 3.1 Given a spatia-temporal pattern and a setT F of time slots, such that T F = [To, ... ,Tn-I], the timeprevalence orpersistence measure ofthe pattern is thefraction of time slots where the pattern occurs over the totalnumber of time slots.
For example, in Fig. 2, the total number of time slotsis 4 and pattern {A, B} occurs in all 4 time slots, so itstime prevalence index is 4/4. Pattern {A, C} occurs in 3hme slots, namely, hme slots t-O, t-l, and t-2, and ItS hmeprevalence index is 3/4 (Fig. 2b).
~
~
(al An input spatio-temporal dataset
Mixed-Drove Spatia-Temporal Spatial prevalence index values Time prevalenceCo-occurrence Patterns time slot a time slot I time slot 2 time slot 3 index values
AB 3/5 3/5 3/5 3/5 4/4AC 2/4 2/4 2/4 a 3/4BC a 3/5 3/5 3/5 3/4
ABC a 2/5 2/5 a 2/4
(bl A set of output mixed-drove spatio-temporal co-occurrence patterns
Figure 2. An input spatia-temporal dataset and a set of output MDCOPs
Definition 3.2 Given a spatio-temporal dataset of mixedobject-types ST, and a spatial prevalence threshold Bp , themixed-drove prevalence measure of pattern Pi is a com-position oj the spatial prevalence and the time prevalencemeasures as shown below.
where Prob stands for probability of overall prevalencetime slots and s_prev stands for spatial prevalence, e.g.,the participation index, described in Section 3.1.
Definition 3.3 Given a spatio-temporal dataset of mixedobject-types ST and a threshold pair (Bp , Btirne ), MDCOPPi is a mixed-drove prevalent pattern if its mixed-droveprevalence measure satisfies the following.
prevalent m time slot t-2, It IS not time prevalent smce ItStime prevalence index is no more than the given time preva-lence threshold 0.5.
3.3 Problem statement
Given:- A set P of Boolean spatio-temporal object-types over
a common spatio-temporal framework STF.- A neighbor relation R over locations.- A spatial prevalence threshold, Bp .
- A time prevalence threshold, &tirne .' _Find: {Pi IPi is a subset of P and Pi is a prevalent MDCOPas in Definition 3.3}.Objective: Minimize computation cost.Constraints: To find a correct and complete set of MDCOPS:
ProbtmETF [s_prev(Pi , time_slot t rn ) ~ Bp ] ~ Btirne , Threshold values selected for MDCOP interest mea-(772")------cscc:ucccreC-sc-T«tTh-=-e-=scc:p-=-a"'tI-=-a'l-=-an-=-dTLtI~m-=-e----:pcccre-:vCC:a:-Tle-=-nc-c::-Ce~mCC:e-=-aCC:su-=-r-:e-=-s') Th-=-av=-Ce~ImC:-::-_
where Frob stands for probabIlIty of overall prevalence portant ImplIcatIOns on the mmmg processes and results.time slots, s_prev stands for spatial prevalence, Bp is the Selection of a small interest measure threshold (close to 0)spatial prevalence threshold, and Btirne is the time preva- increases the algorithm's computational complexity and thelence threshold. number of generated prevalent patterns. ThIS may cause
For example, in Fig. 2, {A, B ~ is an MDCOP because it generation of insignificant patterns. Selection of a largeis spatial prevalent in time slots t=O, t=1, t=2, and t=3 since interest measure threshold (close to 1) decreases the com-its participation indices are no less than the given thresh- putational complexity of the algorithms and the number ofold 0.4 in these time slots, and is time prevalent since its prevalent patterns. This may cause pruning of some of thetime prevalence mdex of I IS above the threshold 0.5. In sIgmficant patterns. Nevertheless the selection of mterestcontrast, {B, D~ is not an MDCOP. Although it is spatial measure threshold values is dependent on the application
Prune non-prevalent patterns~
T•Collect MDCOPs I j j/1 \~ I MDCOPlist IPrune time non : Ij 4 pruned pruned
"' revalent oatterns
\ \ \4 I Prune time non~ I ~4 pruned pruned revalent natterns 3 Prune time non : I
"' pruned revalent oatterns~
3 \ 1\ \ 3 pruned r~~~le~\Il~~;;~':~s I 2 - ,-. .- '~~,~2 2 I Prune time non~ I
t=O \ t=1 \ 1=2 \r/revalent oatterns
t=O t=1 t=2 t=3 t=O t=1 t=2 t=3\tlmeSl~S
time slots time slots Prune hme non-prevalent patterns
(a) Na'ive Approach (b) MDCOP-Miner (c) FastMDCOP-Miner
Figure 3. Comparison of NaIve Approach, MDCOP-Miner, and FastMDCOP-Miner
and/or purpose of the analysis.
4. Mining MDCOPs
In this section, we discuss a naive approach and thenpropose two novel MDCOP mining algorithms - MDCOPMiner and FastMDCOP-Miner - to mine MDCOPs.
4.1 NaIve approach
A naive approach can use a spatial co-location mining algorithm for each time slot to find spatial prevalentco-locations and then apply a post-processing step to dis-cover MDCOPs by checking their time prevalence. Tomine co-locations, Huang, Shekhar and Xiong proposed ajoin-based approach, Yoo, Shekhar and Celik proposed apartial join-based approach and a join-less approach, andZhang et ai. proposed a multi-way spatial join-based approach [2, 9, 19,20]. This study will be based on the joinbased co-location algorithm proposed by Huang et aI., butit is also possible to use other approaches. The naive approach wIll generate size k + 1 candidate co-locatIOns foreach tIme slot usmg spatIal prevalent size k subclasses untIlthere are no more candidates. After finding all size spatialprevalent co-locatIons m each tIme slot, a post-processmgstep can be used to drscover MDCOPs by prumng out tImenon-prevalent co-locations. Even though this approach willprune out spatial non-prevalent co-locations early, it willnot prune out time non-prevalent MDCOPs before the post-processmg step (FIg. 3a). ThIS leads to unnecessary com-putatIonal cost.
4.2 MDCOP-Miner
To eliminate the drawbacks of the Naive approach, wepropose an MDCOP mmmg algonthm (MDCOP-Mmer)whIch mcorporates a tIme-prevalence based filtenng step
5
in each iteration. The algorithm will first discover allSIze k spatIal prevalent MDCOPs, and then WIll apply atIme-prevalence based filter to dIscover MDCOPS. Fmally,the algorithm will generate size k + 1 candidate MDCOPsusmg SIze k MDCOPs (FIg. 3b). The partICIpatIOn mdexis used as a spatial prevalence interest measure to check ifthe pattern is spatial prevalent at a time slot [9]. The timeprevalence from Definition 3.1 is used as a time prevalencemterest measure. Algonthm I gIves the pseudo code of theboth the MDCOP-Miner algorithm and the FastMDCOPMiner discussed in the next section. The choice of thealgorithm is provided by the user. The inputs are algorithmchoice alg_choice with value MDCOP-Miner, a set of dis-tinct spatial object-types E, a spatio-temporal dataset ST,a spatial neighborhood relationship R, and thresholds ofinterest measures, i.e. spatial prevalence and time prevalence; the output is a set of MDCOPs. In the algorithm,steps 1 include initialization of the parameters, steps 2through 14 give an iterative process to mine MDCOPs, andstep 15 gives a union of the results. Steps 2 through 14 continue until there are no candidate MDCOPs to be generated.
4.3 FastMDCOP-Miner
In this section, we discuss the FastMDCOP-Mineralgorithm, which further improves on the computationalefficiency of the MDCOP-Miner discussed in Section 4.2.As can be seen in Fig. 3b and in Algorithm 1, MDCOP-Miner waits to prune time non-prevalent patterns untilall size k spatial prevalent patterns are generated for alltIme slots and then prunes tIme non-prevalent patterns todiscover MDCOPs. However, we can further optimizethe MDCOP-Miner by pruning time-non prevalent patterns at an earlier stage. We move "prune non-prevalentpatterns" between the tIme slots shown m FIg. 3c wherethe candIdate SIze 2 pattern generatIon IS Illustrated. The
Algorithm I:MDCOP-Mmer and FastMDCOP-Mmer
Iiijli.ilS:alg_choice: MDCOP-Miner or FastMDCOP-MinerE: a set of dlstlnct spatlal obJect-typesST: a spatio-temporal dataset
<objecLtype, objecLid, x, y, time slot>R: spatial neighborhood relationshipTF: a time slot frame {to, ... ,tn-dOp: a spatial prevalence threshold~in~ a time prevalence threshold
Output : MDCOPs whose spatial prevalenceindices are no less than Op, for time prevalence
ces are no ess t an time.
Variables:k: co-occurrence sizet: time slots (O, ... ,n-l)T k : set of instances of size k co-occurrencesCk : set of candidate size k co-occurrencesSPk: set of spatial prevalent size k
co-occurrences1'Pk: set of time prevalent size k
co OCCULrencesM DPk: set of mixed-drove size k
CO-OCClJrrences
Algorithm:1) initialization k-l,Ck-E,MDPk(O)-ST2) while (not empty M DPkl {3) C k+1(0) = gen_candidate_co - occ(Ck, M DPk)4) for each time_slot t in (O, ... ,n -1) {5) Tk+1 (t) = gen_co_oce-inst(Ck+1(t), Tk(t), R)6) SPk+l(t) = jind_spatiaLprev_co_occ(Tk+l(t),Ck+l(t),Op)7) If (alg_choice =="FastMDCOP-Miner") {8) T Pk+1 (t) = jind_time-index(SPk+1 (t))9) M DPk+l (t) = jind_time_prev_co_occ(TPk+l (t), Otime)
11) If alg_choice=="MDCOP-Miner" {12) TPk+1 = jind_time_index(SPk+l)13) MDPk+1 = jind_time-prev_co_occ(TPk+I,Otime) }14) k=k+l}15) return union (MDP2, ... ,MDPk+l)
pseudo-code of the FastMDCOP-Miner is also given inAlgonthm 1. When the FastMDCOP-Mmer is chosen, thealgorithm will activate steps 8, 9, and 10 and deactivatesteps 12 and 13. ThiS will allow the algonthm to checkthe time prevalence of a pattern after every time slot isprocessed. The functions of the algorithm are as describedin Section 4.2. In step 8, FastMDCOP-Miner checkswhether the time prevalence mdices of Size k patterns(SiZe 2 patterns m Fig. 3c) satisfy the time prevalencethreshold before generating size k patterns for the nexttime slot. Early discovered time non-prevalent patterns arepruned in Step 9 and time prevalent patterns are used ascandidate co-occurrences (Step 10) in the next time slot.For example, assume that there are 10 time slots and thetime prevalence threshold is 0.5. In this case, a size kpattern should be present for at least 5 time slots to satisfythe threshold. If the time prevalence index of a patternis 0 for the first (or any) 6 time slots, there is no need togenerate it and check its prevalence for the rest of the timeslots, since it will now be impossible for it to satisfy thegiven threshold regardless of the remammg results.
6
5. Experimental Evaluation
Miner, MDCOP-Miner and naIve approach to answer thefollowmg questIOns:
- What is the effect of the number of timeslots?- What is the effect of the number of object-types?- What is the effect of the spatial prevalence threshold?- What is the effect of the time prevalence threshold?Dataset: The real dataset contains the location and time
information of moving objects (Figure 4). It includes 15time snapshots and 22 distinct vehicle types and their in-stances. The minimum instance number is 2, the maximuminstance number is 78, and the average number of instancesis 19.
Figure 4. Real Dataset
1. Effect of Number of Time Slots: We evaluated theeffect of the number of time slots on the execution timeof the MDCOP algonthms usmg the real dataset. ThepartiCipation mdex, time prevalence mdex, and distancewere set at 0.2, 0.8, and 150m respectively. Experimentswere run for a minimum of 1 time slot and a maximum of14 time slots. Results showed that the FastMDCOP-Mmerreqmres less execution time than the MDCOP-Mmer andnaIve approaches, since it prunes out time non-prevalentMDCOPs as early as pOSSible (Fig. Sa). As the number oftime slots increases, the ratio of the increase in executiontime is smaller for FastMDCOP-Miner than for the otherapproaches. Fig. 5b shows the number of generated Size2 and size 3 instances for algorithms. The FastMDCOP-Mmer generates fewer patterns. The MDCOP-Mmer andnaIve approaches generate the same number of size 2instances.
2. Effect of Number of Object-types: We evaluated theeffect of the number of object-types on the execution time
----+- FaSlMDCOP-Miner7 ---E- MDCOP-Miner
-----'f- NaIve Approach
234 5 6 7891011121314
Numberoftimeslots
(a) Execution time
7000 -6000 -EillillillJ
sooo ==4000 =3000
ZOOO
1000
1 4 6 8 10 12 14
NLimberoflime~ilots
(b) Generated instances
same number of spatial prevalent patterns as the time prevalence index increases. In that case, the cost of the postprocessing step will reflect the trend of the naIve approach.Experimental results show that the FastMDCOP-Miner ismore computationally efficient than the other approaches(Fig. 7a-b). The execution times ofthe FastMDCOP-Minerand MDCOP-Miner decrease as the time prevalence threshold increases. It is also observed that the naIve approachIS computatIonally more expenSIve as the tIme prevalencethreshold decreases because of the increase in the numberof MDCOPs to be discovered.
] 10
R 8
Time prevalence index threshold
14000
12000
101110
,[ 60000
4()OO
2000
Time prevalence index threshold
aL-~~~~~~----'0.2 OJ 0.4 0.5 0.6 0.7 0.8 0.9 1
12 ,-----<-~F;-aSlrvc;::IDc;;Co:O-P-c-;:!....li~ner~~'-----B- MDCOP-Miner-----+----- NalveAppfO<lch
.~
Figure 5. Effect of number of time slots
creases (Fig. 6a-b). It is observed that the increase in
index, time prevalence index, number of time slots anddistance were set at 0.2, 0.8, 15, and 150m respectively.Results showed that the FastMDCOP-Miner outperformsthe other approaches as the number of object-types in-
executIOn tIme for the naIve approach IS bIgger than thatof the MDCOP-Miner and the FastMDCOP-Miner as the
(a) Execution time (b) Generated instances
number of object-types mcreases for datasets.Figure 7. Effect of the time prevalencethreshold
----+- FaSlMDCOP-Miner7 ---E- MDCOP-Miner
-----'f- NaIve Approach
.~
o'----~~~~~~~4 6 8 10 12 14 16 18 20
Number of object-types
8000
7000
6000
~ 5000
[ 4000
03000
2000
1000
012 16 20
Number of object-types
4. Effect of the Spatial Prevalence Threshold:We evaluated the effect of the spatial prevalence threshold on theexecution times of the MDCOP algorithms. The fixed parameters were time prevalence index, number of time slots,and distance, with values of 0.2, 15, and 100m respectively. Fig. 8a shows the execution times of the algorithms and Fig. 8b shows the number of generated size 2
(a) Execution time (b) Generated instancesand 3 instances for the algorithms. FastMDCOP-Miner andMDCOP-Miner do not generate more than size 3 instancesfor a spatIal prevalence threshold of greater than 0.2. The
Figure 6. Effect of number of object-types
The execution times of the algorithms increases as thenumber of the object-types increases due to the increasein the number of join operations. The MDCOP-Miner
FastMDCOP-Mmer outperforms the MDCOP-Mmer andnaIve approaches as the spatial prevalence threshold increases (Fig. 8a-b). The cost of the naIve approach ishigher than that of the FastMDCOP-Miner and MDCOPMiner for low values of the spatial prevalence threshold.
and naIve approaches generate the same number of SIze 2instances. In contrast, the FastMDCOP-Miner generates 6. Conclusions and Future Workfewer size 2 instances (Fig. 6b).
3. Effect of' the Time Prevalence Threshold: We eval-uated the effect of the tIme prevalence threshold on the
We defined mixed-drove spatio-temporal co-occurrencepatterns (MDCOPs) and the MDCOP mmmg problem andproposed a new monotomc compOSIte mterest measure
executIOn tImes of the MDCOP mmmg algonthms for thereal dataset. The fixed parameters were participation index,
whIch IS the compOSItIon of dIstmct object-types, spatIalprevalence measures, and time prevalence measures. We
number of time slots, and distance, and their values were presented two novel and computationally efficient algo-0.2, 15, and 150m respectively. For the naIve approach, the rithms for mining these patterns: the MDCOP-Miner, andeffectIve cost m executIon tIme to generate spatIal preva- the FastMDCOP-Mmer. We compared our algonthms WIthlent co-locatIons wIll be constant smce It generates the a naIve approach and showed theIr supenonty m expen-
7
sets. n rocee ings 0 t e t annua mternational workshop on Geographic information systems (ACM-
version published in 6th IEEE International Conference onData Mining 2006, 2008.
[5] N. A. C. Cressie. Statistics for Spatial Data. Wiley, 1993.[6] C. Granger. Time series analysis, cointegration, and ap-
p IcatlOns. n 0 e Tlze ecture, epartment 0 conomics, University of California, San Diego. Paper 200402., http://repositories.cdlib.org/ucsdecon/2004-02, 2004.
[7] J. Gudmundsson, M. v. Kreveld, and B. Speckmann. Effi-cient detection of motion patterns in spatio-temporal data
2000
Spatial prevalence index threshold
0,1 0.2 OJ 0.4 OJ
1000
.'000
4000
5000
0.50.4OJ
FastMDCOP-Miner ------MDCOP-Ivliner -----B
NalveApproach -----+----
0.2
SJl3tial prevalence index threshold
1'-----~-~-~-0.1
threshold
ments usmg a real dataset to examme the effects of the
Figure 8. Effect of the spatial prevalence
(a) Execution time (b) Generated instances GIS'04), pages 250-257, Washmgton DC, USA, 2004.[8] R. Guting and M. Schneider. Moving Object Databases.
Morgan Kaufmann, 2005.[9] Y. Huang, S. Shekhar, and H. XlOng. Uiscovenng co-
locatlOn patterns from spatial datasets: A general approach.IEEE Transactions on Knowledge and Data EngineeringOWE) (Earlier versIOn pubbshed m Symp. on Spatzal andTemporal Databases 2001),16(12):1472-1485,2004.
[10] P. Kalnis, N. Mamoulis, and S. Bakiras. On discoveringnumber of time slots, the number of object-types, and thevalues of the spatial and time prevalance thresholds. Thetwo new proposed algorithms are correct and complete infinding mixed-drove prevalent patterns.
For future work, we would like to explore the relationship between the proposed MDCOP interest measures andspatIO-tempora statIstIca measures 0 mteractIon n-other problem of mterest IS the charactenzatIOn of the prob-ability distribution of the proposed interest measure to helpchoose thresholds in the proposed measures. We plan toexplore other potential interest measures for MDCOPs byevaluating similarity measures for tracks of moving objects. We plan to investigate new monotonic compositeinterest measures and develop other new computationally
moving clusters in spatio-temporal data. In 9th InternationalSymp. on Spatial and Temporal Databases (SSTD) , Angrados Reis, Brazil, 2005.
[11] M. Koubarakis, T. Sellis, A. Frank, S. Grumbach, R. Guting,C. Jensen, N. Lorentzos, H. J. Schek, and M. Scholl. SpatioTemporal Databases: The Chorochronos Approach, LNCS2520, volume 9. S rin er Verla, 2003.
[12] P. Laube, M. v. Kreveld, and S. Imfeld. Finding remo -detectmg relative motion patterns m geospatml llfelmes. In11th International Symp. on Spatial Data Handling, pages201-214. Springer Berlin Heidelberg, 2004.
[13] J. Ma, D. Zeng, and H. Chen. Spatial-temporal cross-correlation analysis. In Proceedings of the 2006 IEEE International Conference on Intelligence and Security Informatics, pages 542-547, San Diego, CA, 2006.
[14] B. Ripley. Spatial Statistics. Wiley, 1981.
efficient algorithms for mining MDCOPs. We also hope toextend our algorithm to mine newly defined patterns in theliterature such as leadership, convergence and query processing [12].
7. Acknowledgements
[15] O. Scha6en6erger and C. Gotway. Statistical Methods jorSpatial Data Analysis. Chapman and Hall, 2005.
[16] SSTDM06. First international workshop on spatial andspatio-temporal data mining. In Conjunction with the 6thIEEE International Conference on Data Mining (ICDM2006), 2006.
[17] J. Wang, W. Hsu, and M. L. Lee. A framework for min-ing topological patterns in spatio-temporal databases. In
Some of the co-authors received support from USDODand NSF. The authors would lIke to thank KIm Koffoit and
ACM fourteenth Conjerence on injormation and Knowledge Management (CIKM'05), Bremen, Germany, 2005.
[18] W. W. S. Wei. Time Series Analysis: Univariate and Multi-
Pradeep Mohan for their comments.
References
[I] A.Enx and P. Diggle. SpatlO-temporal predICtion for log-gaUSSIan cox processes. Journal of the Royal StatistIcal So-ciety, 63(10):823-841,2001.
[2] S. Banerjee, B. P. Carlin, and A. E. Gelfrand. HierarchicalModeling and AnalysIs jor Spatial Data. CRC Press, 2003.
ao, amou IS, an eung. Iscovery 0
collocation episodes in spatiotemporal data. In Proceedingsoj the 6th iEEE internatIOnal Conjerence on Data Mining(lCDM 2006), pages 823-827, Hong Kong, Chma, 2006.
[4] M. Celik, S. Shekhar, J. P. Rogers, and J. A. Shine. Mixed-drove spatio-temporal co-occurrence pattern mining. IEEE1'ransaclOns on Knowledge and Data Engmeering (Earber
variate Methods. AddIson Wesley, 2005.[19] J. S. Yoo and S. Shekhar. A joinless approach for mining
spatial colocation patterns. IEEE Transactions on Knowledge and Data Engineering (TKDE) (partial results published in IEEE Int'l Conf on Data Mining 2005 and ACMInt'l Workshop on Geographic Inf Systems 2005, 18(10),2006.
[20] X. Zhang, N. Mamoulls, D. W. L. Cheung, and Y. Shou. Fastmining of spatial collocations. In 10th ACM SIGKDD In-ternational Con erence on Knowled e Discover and DataMining, pages 384-393, Seatle, WA, 2004.
8