Trajectory Data Mining and Management
Hsiao-Ping Tsai 蔡曉萍 @ CSIE, YuanZe Uni.
2009.12.04
Outline
Introduction to Data Mining
Background of Trajectory Data Mining
Part I: Group Movement Patterns Mining
Part II: Semantic Data Compression
Why Data Mining?
The explosive growth of data, toward petabyte scale
Commerce: Web, e-commerce, bank/credit transactions, …
Science: remote sensing, bioinformatics, …
Many others: news, digital cameras, books, magazines, …
We are drowning in data, but starving for knowledge!
What Is Data Mining?
Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge, e.g., rules, regularities, patterns, and constraints, from huge amounts of data
Confluence of Multiple Disciplines
Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, neural networks, graph theory, and other disciplines
Potential Applications
Data analysis and decision support
Market analysis and management
Risk analysis and management
Fraud detection and detection of unusual patterns (outliers)
Other applications
Text mining and Web mining
Stream data mining
Bioinformatics and bio-data analysis
…
Data Mining Functionalities (1/2)
Multidimensional concept description: characterization and discrimination
Generalize, summarize, and contrast data characteristics
Frequent patterns, association, correlation vs. causality
Diaper → Beer [support 0.5%, confidence 75%]
Discovering relations between data items
Classification and prediction
Construct models that describe and distinguish classes
Predict unknown or missing numerical values
Data Mining Functionalities (2/2)
Cluster analysis
Clustering: group data to form classes, maximizing intra-class similarity and minimizing inter-class similarity
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Useful in fraud detection and rare-event (exception) analysis
Trend and evolution analysis
Trend and deviation analysis, e.g., regression analysis
Sequential pattern mining, periodicity analysis, similarity-based analysis
Outline
Introduction to Data Mining
Background of Trajectory Data Mining
Part I: Group Movement Patterns Mining
Part II: Semantic Data Compression
Trajectory data are everywhere!
The world becomes more and more mobile
Prevalence of mobile devices, e.g., smart phones, car PNDs, NBs, PDAs, …
Satellite, sensor, RFID, and wireless technologies have fostered many applications
Tremendous amounts of trajectory data
Market prediction: 25-50% of cellphones in 2010 will have GPS
Related Research Projects (1/2)
GeoPKDD: Geographic Privacy-aware Knowledge Discovery and Delivery (Pisa Uni., Piraeus Uni., …)
MotionEye: Querying and Mining Large Datasets of Moving Objects (UIUC)
GeoLife: Building social networks using human location history (Microsoft Research)
Reality Mining (MIT Media Lab)
Data Mining in Spatio-Temporal Data Sets (Australia's ICT Research Centre of Excellence)
Trajectory Enabled Service Support Platform for Mobile Users' Behavior Pattern Mining (IBM China Research Lab)
U.S. Army Research Laboratory
Related Research Projects (2/2)
Mobile Data Management (李強教授 @ CSIE.NCKU)
Energy efficient strategies for object tracking in sensor networks: a data mining approach (曾新穆教授 @ CSIE.NCKU)
Object tracking and moving pattern mining (彭文志教授 @ CSIE.NCTU)
Mining Group Patterns of Mobile Users (黃三義教授 @ CSIE.NSYSU)
…
Wireless Sensor Networks (1/2)
Technique advances in wireless sensor networks (WSNs) are promising for various applications
Object tracking, military surveillance, dwelling security, …
These applications generate large amounts of location-related data, and many efforts are devoted to compiling the data to extract useful information
Past behavior analysis
Future behavior prediction and estimation
Wireless Sensor Networks (2/2)
A wireless sensor network (WSN) is composed of a large number of sensor nodes
Each node consists of sensing, processing, and communicating components
WSNs are data driven
Energy conservation is paramount among all design issues
Object tracking is viewed as a killer application of WSNs
A task of detecting a moving object's location and reporting the location data to the sink periodically
Tracking moving objects is considered most challenging
Part I: Group Movement Patterns Mining
Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen, "Exploring Group Moving Pattern for Tracking Objects Efficiently," accepted by IEEE Trans. on Knowledge and Data Engineering (TKDE), 2009
[Figure: two tracking schemes in a WSN with a sink and sensors si, sj, sk monitoring objects o0, o1, o2. (a) Monitoring each object individually. (b) Monitoring multiple objects with group data aggregation.]
Motivation
Many applications are more concerned with the group relationships and their aggregated movement patterns
Movements of creatures have some degree of regularity
Many creatures are socially aggregated and migrate together
The application-level semantics can be utilized to track objects in efficient ways
Data aggregation
In-network scheduling
Data compression
Assumptions
Objects each have a globally unique ID
A hierarchical WSN structure, where each sensor within a cluster has a locally unique ID, e.g., a, b, ..., p
The location of an object is modeled by the ID of a nearby sensor (or cluster)
The trajectory of a moving object is thus modeled as a series of observations and expressed by a location sequence
Problem Formulation
Similarity: Given the similarity measure function simp and a minimal threshold simmin, oi and oj are similar if their similarity score is above the threshold, i.e., simp(oi, oj) ≥ simmin
Group: A set of objects g is a group if every pair of its objects is similar, i.e., g ⊆ so(oi) for each oi ∈ g, where so(oi) denotes the set of objects that are similar to oi
The moving object clustering (MOC) problem: Given a set of moving objects O together with their associated location sequence dataset S and a minimal threshold simmin, the MOC problem is formulated as partitioning O into non-overlapping groups, denoted by G = {g1, g2, ..., gi}, such that the number of groups is minimized, i.e., |G| is minimal
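To make the formulation concrete, the following is a minimal greedy sketch of the MOC objective. It is not the paper's DGMPMine algorithm: the greedy pass keeps the group count small but does not guarantee minimality, and the toy `simp` passed in the example is an assumption for illustration only.

```python
# Hedged sketch: a greedy baseline for the MOC problem. `simp` is any
# similarity function and `sim_min` the minimal threshold from the text.
def greedy_moc(objects, simp, sim_min):
    """Partition objects into non-overlapping groups so that every pair
    within a group scores at least sim_min."""
    groups = []
    for o in objects:
        for g in groups:
            # o joins the first group it is similar to in its entirety
            if all(simp(o, member) >= sim_min for member in g):
                g.append(o)
                break
        else:
            groups.append([o])  # no compatible group: open a new one
    return groups

# toy similarity on scalar "objects", purely for demonstration
objs = [0.0, 0.1, 5.0, 5.1]
simp = lambda a, b: 1.0 - abs(a - b)
print(greedy_moc(objs, simp, 0.85))  # [[0.0, 0.1], [5.0, 5.1]]
```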
Challenges of the MOC Problem
How to discover the group relationships?
A centralized approach? Compiling all data at a single node is expensive!
Compare similarity on entire movement trajectories? Local characteristics might be blurred!
Other issues
Heterogeneous data from different tracking configurations
Trade-off between resolution and privacy preserving
A distributed mining approach is more desirable
The Proposed DGMPMine Algorithm
To resolve the MOC problem, we propose a distributed group movement pattern mining algorithm that
Provides transmission efficiency
Improves discriminability
Improves clustering quality
Provides flexibility
Preserves privacy
Definition of a Significant Movement Pattern
A subsequence that occurs more frequently carries more information about the movement of an object
The movement transition distribution characterizes the movements of an object
Definition of a movement pattern
A subsequence s of a sequence S is significant if its occurrence probability is higher than a minimal threshold, i.e., P(s) ≥ Pmin
A significant movement pattern is a significant subsequence s together with its transition distribution P(δ|s), with the constraint that P(δ|s) must differ from P(δ|suf(s)) by a ratio of at least r or at most 1/r
Learning of Significant Movement Patterns
Learning movement patterns in the trajectory data set by a Probabilistic Suffix Tree (PST)
A PST is an implementation of a variable-order Markov model (VMM) with the least storage requirement
The PST building algorithm learns from a location sequence data set and generates a compact tree with O(n) complexity in both computation and space
It stores the significant movement patterns together with their empirical probabilities and conditional empirical probabilities
Advantages: useful and efficient for prediction; controllable tree depth (size)
Example of a location sequence and the generated PST
The full Markov factorization of a sequence probability versus its PST approximation, where each conditional is taken at the longest context stored in the tree:
P_T("nokjfb") = P_T(n) · P_T(o|n) · P_T(k|no) · P_T(j|nok) · P_T(f|nokj) · P_T(b|nokjf)
             ≈ P_T(n) · P_T(o|n) · P_T(k|o) · P_T(j|k) · P_T(f|j) · P_T(b|okjf)
             = 0.05 × 1 × 1 × 1 × 1 × 0.33 ≈ 0.0165
Prediction complexity: O(L)
[Figure: the PST generated from the location sequences okjfba, okjfea, and nokjfea over a 4×4 sensor grid with IDs a-p, using Pmin = 0.01, Lmax = 4, and r = 1.25.]
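The back-off prediction that the factorization above relies on can be sketched as follows. This is a hedged, simplified VMM in the spirit of a PST: it keeps raw context counts up to Lmax and backs off to the longest observed context, omitting the Pmin/r pruning that makes a real PST compact. It uses the slide's three example sequences.

```python
from collections import defaultdict

def build_contexts(sequences, l_max):
    """Count next-symbol occurrences for every context up to length l_max."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, sym in enumerate(seq):
            for l in range(0, min(l_max, i) + 1):
                counts[seq[i - l:i]][sym] += 1
    return counts

def predict(counts, context, sym, l_max):
    """P(sym | context), backing off to the longest suffix seen in training."""
    for l in range(min(l_max, len(context)), -1, -1):
        ctx = context[len(context) - l:]
        if ctx in counts:
            total = sum(counts[ctx].values())
            return counts[ctx][sym] / total
    return 0.0

seqs = ["okjfba", "okjfea", "nokjfea"]
counts = build_contexts(seqs, l_max=4)
print(predict(counts, "nok", "j", 4))   # 1.0: 'j' always follows "nok"
print(predict(counts, "okjf", "b", 4))  # 1/3 (cf. b:0.33 in the slide's PST)
```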
Similarity Comparison
A novel pattern-based similarity measure is proposed to compare the similarity of objects
Measuring the similarity of two objects based on their movement patterns
Providing better scalability and resilience to outliers
Free from sequence alignment and variable-length handling
Considering not only the patterns shared by two objects but also their relative importance to individual objects
Providing better discriminability
The Novel Similarity Measure simp
simp computes the similarity of objects oi and oj from their PSTs Ti and Tj by accumulating, over the union of their significant patterns, the Euclidean distance between each pattern's predicted occurrence probabilities under Ti and Tj, scaled by a normalization factor
S: the union of the significant patterns of Ti and Tj
Lmax: the maximal order of the VMM (i.e., the maximal depth of a PST)
Σ: the alphabet of symbols (the IDs of a cluster of sensors)
P_Ti(s): the predicted occurrence probability of pattern s based on Ti
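A minimal sketch of this idea follows. It compares two objects through the predicted probabilities of the union of their significant patterns; the distance-to-similarity mapping and the normalization used here are assumptions for illustration, not the exact factors of the paper's simp.

```python
import math

def pattern_similarity(probs1, probs2):
    """probs1/probs2 map each significant pattern (a string) to its
    predicted occurrence probability under the object's PST."""
    union = set(probs1) | set(probs2)
    if not union:
        return 0.0
    # Euclidean distance over the union of significant patterns
    dist = math.sqrt(sum((probs1.get(s, 0.0) - probs2.get(s, 0.0)) ** 2
                         for s in union))
    # map distance into a similarity score; larger means more alike
    return 1.0 / (1.0 + dist)

t1 = {"ok": 0.05, "kj": 0.04}
t2 = {"ok": 0.05, "kj": 0.04}
t3 = {"dd": 0.30}
print(pattern_similarity(t1, t2))  # 1.0 for identical pattern profiles
print(pattern_similarity(t1, t3) < 1.0)
```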
Local Grouping Phase: The GMPMine Algorithm
Step 1. Learning movement patterns for each object
Step 2. Computing the pair-wise similarity scores to construct a similarity graph
Step 3. Partitioning the similarity graph into highly connected subgraphs
Step 4. Choosing representative movement patterns for each group
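Steps 2 and 3 can be sketched as below. Note the simplification: connected components stand in for the paper's highly connected subgraph (HCS) extraction, and the numeric "objects" and scoring function are toy assumptions.

```python
def similarity_graph(objects, score, sim_min):
    """Step 2: add an edge whenever the pair-wise score clears sim_min."""
    edges = {o: set() for o in objects}
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            if score(a, b) >= sim_min:
                edges[a].add(b)
                edges[b].add(a)
    return edges

def components(edges):
    """Step 3 (simplified): connected components via depth-first search."""
    seen, groups = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(edges[node] - comp)
        seen |= comp
        groups.append(comp)
    return groups

objs = [0.0, 0.1, 5.0, 5.1]                  # toy objects
score = lambda a, b: 1.0 - abs(a - b)        # toy similarity
print(components(similarity_graph(objs, score, 0.85)))
```

A real HCS partitioner would further split a component whose minimum cut is small; connected components are only the first approximation.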
[Figure: (a) moving trajectories of objects o4-o11 on a 5×5 sensor grid; (b) the location sequence data set collected at CHa.]
[Figure: (a) PST of o4. (b) PST of o5. (c) PST of o8. (d) PST of o9.]
Example of GMPMine
simp(o4, o5) = 1.618, simp(o4, o9) = 1.067, simp(o8, o9) = 1.832
[Figure: (a) the similarity graph over o4-o10; (b) the highly connected subgraphs.]
Inconsistency may exist among local grouping results
The trajectory of a group may span several clusters
Group relationships may vary at different locations
A CH may have incomplete statistics
…
A consensus function is required to combine multiple local grouping results to
Remove inconsistency
Improve clustering quality
Improve stability
Global Ensembling Phase
      Ga   Gb   Gc   Gd
o0    -1    0    2   -1
o1    -1    1    2   -1
o2    -1    1    2   -1
o3    -1    0    2   -1
o4     1    2    0    0
o5     2    2    2    0
o6     2   -1    0    0
o7     2   -1   -1    0
o8     0   -1   -1    1
o9     0    3    1    1
o10    0    3   -1    1
o11   -1   -1   -1    1
(A label of -1 indicates that the object is not assigned to any group in that local result.)
Global Ensembling Phase (contd.)
Normalized Mutual Information (NMI) is useful in measuring the shared information between two grouping results
Given K local grouping results, the objective is to find a solution that keeps most of the information of the local grouping results
P_a = |g_i^a| / |O|  (the fraction of objects belonging to group g_i^a of grouping G_a)
P_{a,b} = |g_i^a ∩ g_j^b| / |O|  (the fraction of objects shared by g_i^a and g_j^b)
H(G_a) = -Σ_i P_a log P_a  (the entropy of a grouping; the joint entropy is defined analogously from P_{a,b})
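The definitions above translate directly into a short NMI computation. This is a hedged sketch that follows the standard NMI formulation (mutual information normalized by the geometric mean of the two entropies); the handling of ungrouped objects in the paper may differ.

```python
import math

def nmi(labels_a, labels_b):
    """NMI between two grouping results given as per-object label lists."""
    n = len(labels_a)

    def entropy(labels):
        h = 0.0
        for g in set(labels):
            p = labels.count(g) / n          # P_a = |g| / |O|
            h -= p * math.log(p)
        return h

    mutual = 0.0
    for ga in set(labels_a):
        for gb in set(labels_b):
            p_ab = sum(1 for x, y in zip(labels_a, labels_b)
                       if x == ga and y == gb) / n   # joint probability
            if p_ab > 0:
                p_a = labels_a.count(ga) / n
                p_b = labels_b.count(gb) / n
                mutual += p_ab * math.log(p_ab / (p_a * p_b))
    denom = math.sqrt(entropy(labels_a) * entropy(labels_b))
    return mutual / denom if denom else 0.0

print(nmi([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 for identical groupings
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 for independent groupings
```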
The CE Algorithm
For a set of similarity thresholds D, we reformulate our objective as
G* = arg max over G' and δ ∈ D of Σ_{i=1}^{K} NMI(G_i, G')
The CE algorithm includes three steps:
1. Measuring the pair-wise similarity to construct a similarity matrix using the Jaccard coefficient
2. Generating the partitioning results for a set of thresholds based on the similarity matrix
3. Selecting the final ensembling result
Example of CE
[Figure: (a) the labels of the local grouping results Ga-Gd (the table above); (b) the Jaccard similarity matrix over o0-o11; (c) the similarity graph and the highly connected subgraphs for δ = 0.1.]
With D = {0.1i | 1 ≤ i ≤ 5}:
δ     Gδ                                      ΣNMI(Gδ, Gi)
0.1   {{0,1,2,3},{5,6,7},{8,9,10,11}}         2.322
0.2   {{0,1,2,3},{5,6,7},{8,9,10,11}}         2.322
0.3   {{0,1,2,3},{4,5,6,7},{8,9,10,11}}       2.636
0.4   {{0,1,2,3},{4,5,6,7},{8,9,10}}          2.401
0.5   {{0,1,2,3},{4,5,6,7},{8,9,10}}          2.401
The result for δ = 0.3 maximizes the total NMI and is selected
Part II: Semantic Data Compression
Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen, “Exploring Application Level Semantics for Data Compression,” accepted by IEEE Trans. on Knowledge and Data Engineering (TKDE), 2009
Introduction
Data transmission is one of the most energy-expensive operations in WSNs
A batch-and-send network: sensed data are buffered (e.g., in NAND flash memory) and sent in batches to
reduce network energy consumption
increase network throughput
Data compression is an important paradigm in WSNs
However, few works address application-dependent semantics in data, such as the correlations of a group of moving objects
How to manage the location data for a group of objects?
Compress data by general algorithms like Huffman?
Compress a group of trajectory sequences simultaneously?
Motivation
Redundancy in a group of location sequences comes from two aspects
Vertical redundancy: group relationships
Horizontal redundancy: statistics and predictability of symbols
What is Predictability of Symbols?
With group movement patterns shared between the sender and the receiver, the next location (symbol) can be predicted
Replacing predictable items with a common symbol helps reduce entropy!
Problem Formulation
Assume
A batch-based tracking network
Group movement patterns are shared between a sender and a receiver
The Group Data Compression (GDC) Problem
Given the group movement patterns of a group of objects, the GDC problem is formulated as a merge problem and a hit item replacement (HIR) problem to reduce the number of bits required to represent their location sequences
The merge problem is to combine multiple location sequences to reduce the overall sequence length
The HIR problem aims to minimize the entropy of a sequence such that the amount of data is reduced with or without loss of information
Our Approach
The proposed two-phase and two-dimensional (2P2D) algorithm
Sequence merge phase: utilizing the group relationships to merge the location data of a group of objects
Entropy reduction phase: utilizing the object movement patterns to reduce the entropy of the merged data horizontally
Compressibility is enhanced with or without information loss
The reduction of entropy is guaranteed
We propose the Merge algorithm that
avoids redundant reporting of locations by trimming multiple identical symbols into a single symbol
chooses a qualified symbol to represent multiple symbols when a tolerance of loss of accuracy is specified
The maximal distance between the reported location and the real location is below a specified error bound eb
When multiple qualified symbols exist, we choose the symbol that minimizes the average location error
Sequence Merge Phase
[Figure: the three location sequences S0, S1, and S2 are merged into a single sequence S'', with per-object tag lists taglst0-taglst2 marking the positions where each sequence deviates from the merged symbols. Merging reduces 60 symbols to 20 symbols.]
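The lossless trimming idea can be sketched as follows. This is a hedged simplification of the Merge algorithm: it emits the majority symbol per time slot, uses tag lists to mark deviations, and appends a deviating object's own symbol inline; the qualified-symbol selection under an error bound eb is omitted.

```python
def merge(sequences):
    """Merge equal-length location sequences of a group of objects.
    Returns the merged symbol stream and one 0/1 tag list per object,
    where 1 marks a position at which that object deviated."""
    merged, taglists = [], [[] for _ in sequences]
    for slot in zip(*sequences):
        common = max(set(slot), key=slot.count)   # majority symbol
        merged.append(common)
        for i, sym in enumerate(slot):
            if sym == common:
                taglists[i].append(0)
            else:
                taglists[i].append(1)
                merged.append(sym)                # keep the deviating symbol
    return merged, taglists

merged, tags = merge(["okb", "okb", "okf"])
print(merged)   # ['o', 'k', 'b', 'f']: 9 input symbols become 4
print(tags[2])  # [0, 0, 1]: the third object deviated in the last slot
```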
Entropy Reduction Phase
Group movement patterns carry the information about whether an item of a sequence is predictable
Since some items are predictable, extra redundancy exists
How to remove the redundancy and even increase the compressibility?
Entropy Reduction Phase
According to Shannon's theorem, the entropy is the bound on the best achievable lossless compression (the minimal average number of bits per symbol)
Definition of entropy:
e(S) = e(p0, p1, ..., p_{|Σ|-1}) = -Σ_{i=0}^{|Σ|-1} p_i log2 p_i
Increasing the skewness of the data reduces the entropy, e.g.,
e(1/16, 1/16, ..., 1/16) = 4
e(1/16, 3/32, 1/32, ..., 1/16) ≈ 3.97
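The two worked values above can be checked directly from the definition. One caveat: the slide elides part of the skewed distribution with "...", so the distribution below is one plausible reading (fourteen probabilities of 1/16 plus 3/32 and 1/32); it evaluates to about 3.98 bits, illustrating the same point that skew pushes the entropy below the uniform 4 bits.

```python
import math

def entropy(probs):
    """Shannon entropy e(S) = -sum(p_i * log2 p_i) in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 16] * 16                     # 16 equally likely symbols
skewed = [1 / 16] * 14 + [3 / 32, 1 / 32]   # assumed reading of the slide
print(entropy(uniform))   # 4.0
print(entropy(skewed) < 4.0)
```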
[Figure: two replacement examples on a merged sequence S with its tag list. (a) Replacing predictable items lowers the entropy from e = 2.883 to e = 2.752, and the optimal replacement achieves e = 2.718. (b) A poor replacement choice raises the entropy from e = 2.883 to e = 2.963.]
The Hit Item Replacement (HIR) Problem
A simple and intuitive method is to replace all predictable symbols to increase the skewness
However, this simple method cannot guarantee a reduction of the entropy
[Figure: a sequence S with its tag list; replacing all predictable items changes the entropy from e = 3.053 to e = 2.854 in this example, but such a reduction is not guaranteed in general.]
Definition of the Hit Item Replacement (HIR) problem:
Given a sequence and the information about whether each item is predictable, the HIR problem is to decide whether to replace each of the predictable items in the given sequence with a hit symbol so as to minimize the entropy of the sequence.
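The decision the HIR problem formalizes can be illustrated with a brute-force sketch: try every subset of predictable symbols, replace that subset's predictable items with the hit symbol '.', and keep a replacement only if the entropy actually drops. This exponential search is for illustration only; the Replace algorithm avoids it with closed-form rules.

```python
import math
from itertools import combinations

def entropy(seq):
    """Per-symbol Shannon entropy of a concrete sequence, in bits."""
    n = len(seq)
    return -sum((seq.count(c) / n) * math.log2(seq.count(c) / n)
                for c in set(seq))

def best_replacement(seq, predictable):
    """predictable[i] is True when position i is predicted by the shared
    movement patterns (an assumed input). Returns the lowest-entropy
    sequence reachable by replacing predictable items with '.'."""
    syms = sorted({s for s, p in zip(seq, predictable) if p})
    best, best_e = seq, entropy(seq)
    for r in range(1, len(syms) + 1):
        for subset in combinations(syms, r):
            trial = ''.join('.' if p and s in subset else s
                            for s, p in zip(seq, predictable))
            e = entropy(trial)
            if e < best_e:
                best, best_e = trial, e
    return best

# replacing both 'a' and 'c' merges two rare symbols into '.', cutting entropy
print(best_replacement("abcb", [True, False, True, False]))  # .b.b
# here replacement would only raise the entropy, so nothing is replaced
print(best_replacement("aab", [True, False, False]))         # aab
```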
Three Rules
1. Accumulation rule: for α ∈ Σs, if nα = nhit_α, replace all items of α.
2. Concentration rule: for α ∈ Σs, if nα ≤ n• or nhit_α ≥ nα − n•, replace all predictable items of α.
3. Multi-symbol rule: for a set of predictable symbols Σ' ⊆ Σs, if gain(Σ') ≥ 0, replace all predictable items of the symbols in Σ'.
where
nα: the number of items of α in S
nhit_α: the number of predictable items of α in S
n•: the number of items of the hit symbol '.' in S
Σs: the subset of Σ that contains all predictable symbols in S
Example of the Replace Algorithm
[Figure: the Replace algorithm applied to a sequence S with its tag list, steps (a)-(e). Starting from Σs = {a, f, j, k, o} with e = 3.053, predictable items are replaced step by step while Σs shrinks to {f, j, k, o}, {f, k}, {k}, and finally {}, and the entropy decreases from 3.053 to 2.969, 2.893, and 2.854. The tables of nα and nhit_α are updated after each replacement.]
Segmentation, Alignment, and Packaging
[Figure: (a) three sequences S0-S2 are divided by the events E1-E3 (at times t0-t4) into S-segments and G-segments; (b) the G-segments are merged; (c) symbols are replaced; (d) each segment is packaged with a Huffman code table as (object or group ID, timestamp, length, bitstream), e.g., (o0, t0, lA, bitstreamA) for an S-segment and (g0, t2, lE, bitstreamE) for a merged G-segment.]
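The final packaging step encodes each segment against a Huffman code table, which a short sketch can make concrete. This is the textbook heap-based Huffman construction, not the paper's exact packager; the header fields (ID, timestamp, length) are omitted.

```python
import heapq
from collections import Counter

def huffman_table(seq):
    """Build a Huffman code table (symbol -> bit string) for a sequence
    containing at least two distinct symbols."""
    heap = [[freq, [sym, ""]] for sym, freq in sorted(Counter(seq).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]   # extend codes in the lighter subtree
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]   # extend codes in the heavier subtree
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {sym: code for sym, code in heap[0][1:]}

seq = "aaabbc"
table = huffman_table(seq)
bitstream = "".join(table[c] for c in seq)
print(len(bitstream))  # 9 bits for 6 symbols
```

The skewer the symbol distribution after the entropy reduction phase, the shorter this bitstream becomes, which is exactly why the Replace step pays off at packaging time.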
~The End~
Any Questions?