04/10/23 Data Mining: Concepts and Techniques
1
Data Mining: Concepts and Techniques
— Chapter 11 —
Additional Theme: RFID Data Warehousing and Mining and High-Performance Computing
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Hector Gonzalez and Shengnan Cong
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
What is RFID?
Radio Frequency Identification (RFID): technology that allows a sensor (reader) to read, from a distance and without line of sight, a unique electronic product code (EPC) associated with a tag
[Figure: RFID tag and reader]
RFID System
Source: www.belgravium.com
Applications
Supply Chain Management: real-time inventory tracking
Retail: Active shelves monitor product availability
Access control: toll collection, credit cards, building access
Airline luggage management: implemented by British Airways to reduce lost/misplaced luggage (20 million bags a year)
Medical: Implant patients with a tag that contains their medical history
Pet identification: Implant RFID tag with pet owner information (www.pet-id.net)
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
RFID Warehouse Architecture
Challenges of RFID Data Sets
Data generated by RFID systems is enormous due to redundancy and a low level of abstraction
Walmart is expected to generate 7 terabytes of RFID data per day
Solution requirements:
A highly compact summary of the data
OLAP operations on a multi-dimensional view of the data
The summary should preserve the path structure of RFID data
It should be possible to efficiently drill down to individual tags when an interesting pattern is discovered
Why RFID-Warehousing? (1)
Lossless compression: significantly reduce the size of the RFID data set by removing redundancy and grouping objects that move and stay together
Data cleaning: reasoning based on more complete information (multi-reading, miss-reading, error-reading, bulky movement, ...)
Multi-dimensional summary: product, location, time, ...
Store manager: check item movements from the backroom to different shelves in his store
Region manager: collapse intra-store movements and look at distribution centers, warehouses, and stores
Why RFID-Warehousing? (2) Query Processing
Support for OLAP: roll-up, drill-down, slice, and dice
Path query: new to RFID warehouses, about the structure of paths
Which products that go through quality control have shorter paths?
Which locations are common to the paths of a set of defective auto parts?
Identify containers at a port that have deviated from their historic paths
Data mining: find trends, outliers, and frequent, sequential, and flow patterns
Example: A Supply Chain Store
A retailer with 3,000 stores, each selling 10,000 items a day
Each item moves 10 times on average before being sold
Movement recorded as (EPC, location, second)
Data volume: 300 million tuples per day (after redundancy removal)
OLAP query: what was the average time for outerwear items to move from warehouse to checkout counter in March 2006?
Costly to answer if it requires scanning 1 billion tuples for March
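The data-volume arithmetic above can be checked by simply multiplying out the slide's numbers:

```python
stores = 3_000            # stores in the chain
items_per_store = 10_000  # items sold per store per day
moves_per_item = 10       # average movements before an item is sold

# (EPC, location, second) tuples generated per day
tuples_per_day = stores * items_per_store * moves_per_item
print(tuples_per_day)  # 300000000 -- the 300 million tuples per day above
```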
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
Cleaning of RFID Data Records
Raw data (EPC, location, time): duplicate records arise from multiple readings of a product at the same location, e.g., (r1,l1,t1), (r1,l1,t2), ..., (r1,l1,t10)
Cleansed data (EPC, location, time_in, time_out): the minimal information to store; the raw data can then be removed, e.g., (r1,l1,t1,t10)
Warehousing can help fill in missing records and correct wrongly registered information
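A minimal sketch of this cleansing step (the record layout follows the slide; the function name is mine):

```python
def cleanse(raw_readings):
    """Collapse runs of duplicate (EPC, location, time) readings into
    compact (EPC, location, time_in, time_out) stay records."""
    readings = sorted(raw_readings, key=lambda r: (r[0], r[2]))  # by EPC, then time
    stays = []
    for epc, loc, t in readings:
        if stays and stays[-1][0] == epc and stays[-1][1] == loc:
            # Same tag still at the same location: extend the current stay
            stays[-1] = (epc, loc, stays[-1][2], t)
        else:
            stays.append((epc, loc, t, t))
    return stays

# The slide's example: ten readings of r1 at l1 become one record
raw = [("r1", "l1", t) for t in range(1, 11)]
print(cleanse(raw))  # [('r1', 'l1', 1, 10)]
```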
Key Compression Ideas (I)
Bulky object movements: objects often move and stay together through the supply chain
If 1,000 packs of soda stay together at the distribution center, register a single record (GID, distribution center, time_in, time_out)
GID is a generalized identifier that represents the 1,000 packs that stayed together at the distribution center
[Figure: Factory → Dist. Centers (10 pallets = 1,000 cases) → Stores (20 cases = 1,000 packs) → Shelves (10 packs of 12 sodas each)]
Key Compression Ideas (II)
Data generalization: analysis usually takes place at a much higher level of abstraction than the one present in raw RFID data, so aggregate object movements into fewer records
If interested in time at the day level, merge records at the minute level into records at the hour level
Merge and/or collapse path segments: uninteresting path segments can be ignored or merged
Multiple item movements within the same store may be uninteresting to a regional manager and thus can be merged
Path-Independent Generalization
[Figure: product concept hierarchy — EPC level (Shirt 1 ... Shirt n; the cleansed RFID database level) → SKU level (Shirt, Jacket, ...) → Type level (Outerwear, Shoes) → Category level (Clothing); the interesting level of analysis sits above the EPC level]
Path Generalization
[Figure: a path dist. center → truck → backroom → shelf → checkout. The Store view keeps backroom, shelf, and checkout, collapsing the first stages into "Transportation"; the Transportation view keeps dist. center and truck, collapsing the in-store stages into "Store"]
Why Not Using Traditional Data Cube?
Fact table: (EPC, location, time_in, time_out)
Aggregate: a measure at a single location, e.g., what is the average time that milk stays in the refrigerator in Illinois stores?
What is missing? Measures computed on items that travel through a series of locations, e.g., what is the average time that milk stays in the refrigerator in Champaign when coming from farm A and warehouse B?
Traditional cubes miss the path structure of the data
RFID-Cube Architecture
RFID-Cuboid Architecture (II)
Stay table (GIDs, location, time_in, time_out: measures): records information on items that stay together at a given location. If transitions were recorded instead, queries would be difficult to answer, requiring many intersections.
Map table (GID, <GID1,...,GIDn>): links together stages that belong to the same path; a high-level GID points to lower-level GIDs. This provides additional compression and query-processing efficiency. Saving complete EPC lists instead would incur high I/O costs to retrieve long lists and costly query processing.
Information table (EPC list, attribute_1,...,attribute_n): records path-independent attributes of the items, e.g., color, manufacturer, price
RFID-Cuboid Example
Cleansed RFID database (epc, loc, t_in, t_out):
r1  l1  t1   t10
r1  l2  t20  t30
r2  l1  t1   t10
r2  l2  t20  t30
r3  l1  t1   t10
r3  l4  t15  t20

Stay table (gids, loc, t_in, t_out):
g1    l1  t1   t10
g1.1  l2  t20  t30
g1.2  l4  t15  t20

Map table (gid, gids):
g1    g1.1, g1.2
g1.1  r1, r2
g1.2  r3
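The grouping behind the stay table can be sketched as follows. This is a simplification: real GID assignment is path-dependent (described later), while here items are merely grouped by identical stays; all names are mine:

```python
from collections import defaultdict

def build_stay_groups(cleansed):
    """Group cleansed (epc, loc, t_in, t_out) records so that items
    sharing an identical stay share one record, as in the stay table."""
    groups = defaultdict(list)
    for epc, loc, t_in, t_out in cleansed:
        groups[(loc, t_in, t_out)].append(epc)
    return dict(groups)

cleansed = [
    ("r1", "l1", "t1", "t10"), ("r2", "l1", "t1", "t10"), ("r3", "l1", "t1", "t10"),
    ("r1", "l2", "t20", "t30"), ("r2", "l2", "t20", "t30"),
    ("r3", "l4", "t15", "t20"),
]
groups = build_stay_groups(cleansed)
print(groups[("l1", "t1", "t10")])  # ['r1', 'r2', 'r3'] -- one record instead of three
```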
Benefits of the Stay Table (I)
Query: what is the average time that items stay at location l?
[Figure: locations l1, ..., ln feed into l, which feeds into l(n+1), ..., l(n+m)]
Transition grouping: retrieve all transitions with destination = l, retrieve all transitions with origin = l, intersect the results and compute the average time; I/O cost: n + m retrievals
Prefix tree: retrieve n records
Stay grouping: retrieve the stay record with location = l; I/O cost: 1
Benefits of the Stay Table (II)
Query: how many boxes of milk traveled through locations l1, l7, and l13?
With the cleansed database — records at the individual item level: (r1,l1,t1,t2), (r1,l2,t3,t4), ..., (rk,l1,t1,t2), (rk,l2,t3,t4). Strategy: retrieve the item sets for locations l1, l7, l13 and intersect them. I/O cost: one I/O per item in l1, l7, or l13. Observation: very costly.
With the stay table — records at the group level: (g1,l1,t1,t2), (g2,l2,t3,t4), .... Strategy: retrieve the GIDs for l1, l7, l13 and intersect them. I/O cost: one I/O per GID in l1, l7, and l13. Observation: retrieving records at the group level greatly reduces I/O costs.
Benefits of the Map Table
[Figure: a three-level location tree — root l1, internal nodes l2, l3, l4, and leaves l5–l10 holding disjoint EPC sets {r1,...,ri}, ..., {rm+1,...,rn}. Registering full EPC lists at every level costs 3n identifiers (n per level); the map table needs only 10 + n (one GID per node in the tree plus the n EPCs at the leaves)]
Path-Dependent Naming of GIDs
Assign to each GID a unique identifier that encodes the path traversed by the items it points to
A path-dependent name makes it easy to detect whether locations form a path
[Figure: a tree of locations l1–l6 with GIDs 0.0, 0.1, 0.0.0, 0.1.0, 0.1.1, 0.0.0.0, 0.1.0.1 — each child's GID extends its parent's]
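With path-dependent names, checking whether two stay records lie on the same path reduces to a prefix test on their GIDs. A sketch, using the dot-separated encoding from the figure:

```python
def same_path(gid_a: str, gid_b: str) -> bool:
    """Two GIDs lie on the same root-to-leaf path iff one is a
    component-wise prefix of the other (e.g. '0.0' and '0.0.0.0')."""
    a, b = gid_a.split("."), gid_b.split(".")
    k = min(len(a), len(b))
    return a[:k] == b[:k]

print(same_path("0.0.0", "0.0.0.0"))  # True: same branch of the tree
print(same_path("0.0.0", "0.1.0.1"))  # False: the paths diverge
```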
RFID-Cuboid Construction Algorithm
1. Build a prefix tree for the paths in the cleansed database
2. For each node, record a separate measure for each group of items that share the same leaf and information record
3. Assign GIDs to each node: GID = parent GID + unique id
4. Each node generates a stay record for each distinct measure
5. If multiple nodes share the same location, time, and measure, generate a single record with multiple GIDs
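Steps 1 and 3 can be sketched as a prefix-tree builder over paths of stays. The node layout and names here are illustrative, not the authors' implementation:

```python
def build_path_tree(paths):
    """Insert each path (a list of (loc, t_in, t_out) stages) into a
    prefix tree; each node gets GID = parent GID + '.' + child index
    and counts the items passing through it."""
    root = {"gid": "0", "children": {}, "count": 0}
    for path in paths:
        node = root
        for stage in path:
            if stage not in node["children"]:
                child_gid = f"{node['gid']}.{len(node['children'])}"
                node["children"][stage] = {"gid": child_gid,
                                           "children": {}, "count": 0}
            node = node["children"][stage]
            node["count"] += 1
    return root

paths = [
    [("l1", "t1", "t10"), ("l3", "t20", "t30")],
    [("l1", "t1", "t10"), ("l3", "t20", "t30")],
    [("l2", "t1", "t8")],
]
tree = build_path_tree(paths)
l1 = tree["children"][("l1", "t1", "t10")]
print(l1["gid"], l1["count"])  # 0.0 2 -- two items share the l1 prefix
```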
RFID-Cube Construction
[Path tree — each node shows GID, location, (t_in, t_out): count, with leaf EPC sets:
0.0      l1  (t1, t10): 3
0.1      l2  (t1, t8): 3
0.0.0    l3  (t20, t30): 3
0.1.0    l3  (t20, t30): 3
0.1.1    l4  (t10, t20): 2   {r8, r9}
0.0.0.0  l5  (t40, t60): 3   {r1, r2, r3}
0.1.0.0  l5  (t40, t60): 2   {r5, r6}
0.1.0.1  l6  (t35, t50): 1   {r7}]

Stay table (GIDs, loc, t_in, t_out, count) — nodes sharing location, time, and measure are merged into one record:
0.0                 l1  t1   t10  3
0.1                 l2  t1   t8   3
0.0.0, 0.1.0        l3  t20  t30  6
0.1.1               l4  t10  t20  2
0.0.0.0, 0.1.0.0    l5  t40  t60  5
0.1.0.1             l6  t35  t50  1
RFID-Cube Properties
The RFID-cuboid can be constructed in a single scan of the cleansed RFID database
The RFID-cuboid provides lossless compression at its level of abstraction
The size of the RFID-cuboid is smaller than the cleansed data: in our experiments we get 80% lossless compression at the level of abstraction of the raw data
Query Processing
Traditional OLAP operations: roll-up, drill-down, slice, and dice. These can be implemented efficiently with traditional optimization techniques, e.g., what is the average time spent by milk at the shelf?
Path selection (new operation): compute an aggregate measure on the tags that travel through a set of locations and match a selection criterion on path-independent dimensions, e.g., stay.location = 'shelf' and info.product = 'milk' over the join (stay ⋈_gid info)
A path query has the form q = < c_info, (c_1 stage_1, ..., c_k stage_k) >: a condition on the information table plus a condition per path stage
Query Processing (II)
Query: what is the average time spent from l3 to l5?
GIDs for l3: <0.0.0>, <0.1.0>; GIDs for l5: <0.0.0.0>, <0.1.0.1>
Prefix pairs: p1 = (<0.0.0>, <0.0.0.0>), p2 = (<0.1.0>, <0.1.0.1>)
Retrieve the stay records for each pair (including intermediate stages) and compute the measure
Savings: no EPC-list intersection — each EPC list may contain millions of different tags, and retrieving them is a significant I/O cost
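A sketch of the prefix-pair computation, using hypothetical stay-table entries in the spirit of the example (the dictionary layout and the time/count values are mine):

```python
def avg_transit_time(stay, from_gids, to_gids):
    """Average time between leaving a 'from' stage and entering a 'to'
    stage, restricted to prefix pairs (GIDs on the same path)."""
    total, items = 0, 0
    for g_from in from_gids:
        for g_to in to_gids:
            if g_to.startswith(g_from + "."):      # prefix pair: same path
                gap = stay[g_to]["t_in"] - stay[g_from]["t_out"]
                total += gap * stay[g_to]["count"]
                items += stay[g_to]["count"]
    return total / items if items else None

# Hypothetical stay records for the GIDs named above
stay = {
    "0.0.0":   {"t_out": 30},
    "0.1.0":   {"t_out": 30},
    "0.0.0.0": {"t_in": 40, "count": 3},
    "0.1.0.1": {"t_in": 35, "count": 1},
}
print(avg_transit_time(stay, ["0.0.0", "0.1.0"], ["0.0.0.0", "0.1.0.1"]))
# 8.75 = ((40-30)*3 + (35-30)*1) / 4
```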
From RFID-Cuboids to RFID-Warehouse
Materialize the lowest RFID-cuboid at the minimum level of abstraction of interest to users
Materialize frequently requested RFID-cuboids
Each materialization is computed from the smallest materialized RFID-cuboid at a lower level of abstraction
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
RFID-Cube Compression (I)
[Two charts: cube size (MBytes) for clean, nomap, and map — left, vs. input stay records (0.1–10 million); right, vs. path bulkiness]
Compression vs. cleansed data size (P=1000, B=(500,150,40,8,1), k=5): lossless compression; the cuboid is at the same level of abstraction as the cleansed RFID database
Compression vs. data bulkiness (P=1000, N=1,000,000, k=5): the map gives significant benefits for bulky data; for data where items move individually, tag lists are the better choice
RFID-Cube Compression (II)
[Chart: cube size (MBytes) for nomap and map vs. abstraction level 1–4]
Compression vs. abstraction level (P=1000, B=(500,150,40,8,1), k=5, N=1,000,000): the map provides significant savings over tag lists; at very high levels of abstraction the stay table is very small, and most of the space goes to recording RFID tags
RFID-Cube Construction Time
[Chart: construction time (seconds) vs. aggregation level 0–3, for 100K to 1600K tags]
Construction time (P=1000, B=(500,150,40,8,1), k=5, N=1,000,000): constructing from a lower-level cuboid saves 50% to 80%
Query Processing
[Two log-scale charts: I/O operations for clean, nomap, and map — left, vs. input stay records (1–5 million); right, vs. path bulkiness]
Time vs. DB size (P=1000, B=(500,150,40,8,1), k=5): the stay table gives a one-order-of-magnitude speedup; the stay and map tables together give two orders of magnitude
Time vs. bulkiness (P=1000, k=5): the speedup is most significant for bulky paths; for non-bulky paths, performance is no worse than using the clean table
Discussion
Our RFID cube model works well for bulky object movements
There are many applications where this assumption does not hold, and other models are needed
We have focused only on warehousing RFID data; a variety of other problems remain open: path classification and clustering, workflow analysis, trend analysis, sophisticated RFID data cleaning
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
Linking RFID Data Analysis with HPC
High-performance computing will play an important role in RFID data warehousing and data analysis
Most of the data cleaning process can be done in a parallel and distributed manner
Stay and map tables can be constructed in parallel
Parallel computation and consolidation of multi-layer and multi-path data cubes
Querying and mining can be processed in parallel
Parallel RFID Data Mining: Promising
Parallel computing has been successfully applied to rather sophisticated data mining algorithms
Parallelizing frequent-pattern mining (based on FP-growth): Shengnan Cong, Jiawei Han, Jay Hoeflinger, and David Padua, "A Sampling-Based Framework for Parallel Data Mining," PPoPP'05 (ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming)
Parallelizing sequential-pattern mining (based on PrefixSpan): Shengnan Cong, Jiawei Han, and David Padua, "Parallel Mining of Closed Sequential Patterns," KDD'05
Parallel RFID data analysis is highly promising
Mining Frequent Patterns
Breadth-first search vs. depth-first search
Depth-first mining algorithms have proved more efficient and are more convenient to parallelize
[Figure: the itemset lattice over {A, B, C, D, E}, from the empty set up to ABCDE, shown twice — once traversed breadth-first, once depth-first]
Parallel Frequent-Pattern Mining
Target platform: distributed-memory systems
Framework for parallelization:
Step 1: each processor scans its local portion of the dataset and accumulates the number of occurrences of each item; a reduction obtains the global occurrence counts
Step 2: partition the frequent items and assign a subset to each processor; each processor builds projections for its assigned items
Step 3: each processor mines its local projections independently
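The first two steps can be simulated sequentially; this sketch stands in for real MPI ranks with plain list partitions (function names and the round-robin assignment are mine):

```python
from collections import Counter

def step1_global_counts(partitions):
    """Each 'processor' counts items in its local partition; the
    reduction sums the local counters into global occurrence counts."""
    local = [Counter(item for rec in part for item in rec) for part in partitions]
    return sum(local, Counter())

def step2_assign_items(counts, min_support, n_procs):
    """Partition the frequent items round-robin across processors;
    each processor will build projections for its assigned items."""
    frequent = sorted(i for i, c in counts.items() if c >= min_support)
    return [frequent[p::n_procs] for p in range(n_procs)]

# Two 'processors', each holding two transactions
partitions = [[["a", "b"], ["a", "c"]], [["a", "b"], ["c"]]]
counts = step1_global_counts(partitions)    # a:3, b:2, c:2
print(step2_assign_items(counts, 2, 2))     # [['a', 'c'], ['b']]
```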
Parallel Frequent-Pattern Mining (2)
Load-balancing problem: some projections' mining time is too large relative to the overall mining time
Solution: the large projections must be partitioned further
Challenge: how do we identify the large projections?
(a) Datasets for frequent-itemset mining:
Dataset        Maximal/Overall
mushroom       7.66%
connect        12.4%
pumsb          14.7%
pumsb_star     47.6%
T30I0.2D1K     42.1%
T40I10D100K    4.15%
T50I5D500K     3.07%

(b) Datasets for sequential-pattern mining:
Dataset               Maximal/Overall
C10N0.1T8S8I8         21.4%
C50N10T8S20I2.5       13.8%
C100N5T2.5S10I1.25    14.2%
C200N10T2.5S10I1.25   10.1%
C100N20T2.5S10I1.25   11.6%

(c) Datasets for closed-sequential-pattern mining:
Dataset        Maximal/Overall
C100S50N10     15.3%
C100S100N5     5.02%
C200S25N9      4.53%
gazelle        25.9%
How to Identify the Large Projections?
To identify the large projections, we need an estimate of the relative mining time of each projection
Static estimation: study the correlation with dataset parameters (number of items, number of records, width of records, ...) and with characteristics of the projection (depth, bushiness, tree size, number of leaves, fan-out/fan-in, ...)
Result: no rule relating these parameters to projection mining time was found
Dynamic Estimation
Runtime sampling: use the relative mining time of a sample to estimate the relative mining time of the whole dataset; trade accuracy against overhead
Random sampling: randomly select a subset of records. Not accurate with small sample sizes (e.g., 1% random sampling on the pumsb dataset); it becomes accurate only when the sample size exceeds 30%, but the sampling overhead is then over 50%
Selective Sampling
Selective sampling: from each record, some items are removed
In frequent-itemset mining: discard the infrequent items, and discard a fraction t of the most frequent items
Example (support threshold = 2, t = 20%) — item counts are f:4, b:3, c:3, m:3, a:2, d:1, so d (infrequent) and f (top 20%) are removed:

dataset            selective sample
1: a, c, d, f, m   1: a, c, m
2: b, c, f, m      2: b, c, m
3: b, f            3: b
4: b, c            4: b, c
5: a, f, m         5: a, m
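A sketch of selective sampling for frequent-itemset mining, reproducing the example above (support threshold 2, t = 20%); the function name is mine:

```python
from collections import Counter

def selective_sample(dataset, min_support, t):
    """Drop infrequent items and a fraction t of the most frequent
    items from every record, preserving record order."""
    counts = Counter(item for record in dataset for item in record)
    frequent = sorted((i for i, c in counts.items() if c >= min_support),
                      key=lambda i: -counts[i])
    kept = set(frequent[int(len(frequent) * t):])  # drop the top-t fraction
    return [[i for i in record if i in kept] for record in dataset]

dataset = [list("acdfm"), list("bcfm"), list("bf"), list("bc"), list("afm")]
print(selective_sample(dataset, 2, 0.2))
# d is infrequent and f is in the top 20%, so both are removed:
# [['a', 'c', 'm'], ['b', 'c', 'm'], ['b'], ['b', 'c'], ['a', 'm']]
```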
Accuracy of Selective Sampling
Overhead of Selective Sampling
[Figures: sampling overhead on (a) the frequent-itemset mining datasets, (b) the sequential-pattern mining datasets, and (c) the closed-sequential-pattern mining datasets]
Experimental Setups
Two Linux clusters using up to 64 processors
Cluster A: 1 GHz Pentium III processors, 1 GB memory
Cluster B: 1.3 GHz Intel Itanium 2 processors, 2 GB memory
Implemented in C++ using MPI
Datasets from the IBM dataset generator
Speedups with One-Level Task Partitioning
Parallel frequent-itemset mining (Par-FP)
[Figure: log-scale speedup vs. number of processors (1–64), Par-FP vs. optimal, on mushroom, connect, pumsb, pumsb_star, T40I10D100K, T50I5D500K, and T30I0.2D1K]
Effectiveness of Selective Sampling
Multi-level task partitioning
[Figure: log-scale speedup vs. number of processors (1–64), comparing optimal, one-level, and multi-level partitioning on mushroom, connect, pumsb, pumsb_star, T30I0.2D1K, T40I10D100K, and T50I5D500K]
Speedups with One-Level Task Partitioning (Sequential Patterns)
Parallel sequential-pattern mining (Par-Span)
[Figure: log-scale speedup vs. number of processors (1–64), Par-Span vs. optimal, on C10N0.1T8S8I8, C200N10T2.5S10I1.25, C100N5T2.5S10I1.25, C100N20T2.5S10I1.25, and C50N10T8S20I2.5]
Speedups with One-Level Task Partitioning (Closed Sequential Pattern)
Parallel closed-sequential-pattern mining (Par-CSP)
[Figure: log-scale speedup vs. number of processors (1–64), Par-CSP vs. optimal, on C100S50N10, C100S100N5, C200S25N9, and gazelle]
Effectiveness of Selective Sampling
One-level task partitioning with 64 processors: the speedups are improved by more than 50% on average
[Bar chart: speedup per dataset, without sampling vs. with sampling]
Conclusions
A new RFID warehouse model allows efficient and flexible analysis of RFID data in multidimensional space, preserves the structure of the data, and compresses it by exploiting bulky movements, concept hierarchies, and path collapsing
High-performance computing will benefit RFID data warehousing and data mining tremendously
Efficient and highly parallel algorithms can be developed for RFID data analysis