04/10/23 Data Mining: Concepts and Techniques
1
Data Mining: Concepts and Techniques
— Chapter 11 —
Additional Theme: RFID Data Warehousing and Mining and High-Performance Computing
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Hector Gonzalez and Shengnan Cong
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
What is RFID?
Radio Frequency Identification (RFID): technology that allows a sensor (reader) to read, from a distance and without line of sight, a unique electronic product code (EPC) associated with a tag
[Figure: RFID tag and reader]
RFID System
Source: www.belgravium.com
Applications
Supply Chain Management: real-time inventory tracking
Retail: Active shelves monitor product availability
Access control: toll collection, credit cards, building access
Airline luggage management: implemented by British Airways to reduce lost/misplaced luggage (20 million bags a year)
Medical: Implant patients with a tag that contains their medical history
Pet identification: Implant RFID tag with pet owner information (www.pet-id.net)
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
RFID Warehouse Architecture
Challenges of RFID Data Sets
Data generated by RFID systems is enormous due to redundancy and a low level of abstraction
Walmart is expected to generate 7 terabytes of RFID data per day
Solution requirements:
A highly compact summary of the data
OLAP operations on a multi-dimensional view of the data
The summary should preserve the path structure of RFID data
It should be possible to efficiently drill down to individual tags when an interesting pattern is discovered
Why RFID-Warehousing? (1)
Lossless compression: significantly reduce the size of the RFID data set by removing redundancy and grouping objects that move and stay together
Data cleaning: reasoning based on more complete information (multi-reading, miss-reading, error-reading, bulky movement, ...)
Multi-dimensional summary: product, location, time, ...
Store manager: check item movements from the backroom to different shelves in his store
Region manager: collapse intra-store movements and look at distribution centers, warehouses, and stores
Why RFID-Warehousing? (2) Query Processing
Support for OLAP: roll-up, drill-down, slice, and dice
Path query: new to RFID warehouses, about the structure of paths
Which products that go through quality control have shorter paths?
Which locations are common to the paths of a set of defective auto parts?
Identify containers at a port that have deviated from their historic paths
Data mining: find trends, outliers, and frequent, sequential, and flow patterns
Example: A Supply Chain Store
A retailer with 3,000 stores, each selling 10,000 items a day
Each item moves 10 times on average before being sold
Movement recorded as (EPC, location, second)
Data volume: 300 million tuples per day (after redundancy removal)
OLAP query: what was the average time for outerwear items to move from warehouse to checkout counter in March 2006?
Costly to answer if it requires scanning 1 billion tuples for March
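The data-volume arithmetic above can be checked by simply multiplying out the slide's numbers:

```python
stores = 3_000            # stores in the chain
items_per_store = 10_000  # items sold per store per day
moves_per_item = 10       # average movements before an item is sold

# (EPC, location, second) tuples generated per day
tuples_per_day = stores * items_per_store * moves_per_item
print(tuples_per_day)  # 300000000 -- the 300 million tuples per day above
```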
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
Cleaning of RFID Data Records
Raw data (EPC, location, time): duplicate records arise from multiple readings of a product at the same location, e.g., (r1,l1,t1), (r1,l1,t2), ..., (r1,l1,t10)
Cleansed data (EPC, location, time_in, time_out): the minimal information to store; the raw data can then be removed, e.g., (r1,l1,t1,t10)
Warehousing can help fill in missing records and correct wrongly registered information
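A minimal sketch of this cleansing step (the record layout follows the slide; the function name is mine):

```python
def cleanse(raw_readings):
    """Collapse runs of duplicate (EPC, location, time) readings into
    compact (EPC, location, time_in, time_out) stay records."""
    readings = sorted(raw_readings, key=lambda r: (r[0], r[2]))  # by EPC, then time
    stays = []
    for epc, loc, t in readings:
        if stays and stays[-1][0] == epc and stays[-1][1] == loc:
            # Same tag still at the same location: extend the current stay
            stays[-1] = (epc, loc, stays[-1][2], t)
        else:
            stays.append((epc, loc, t, t))
    return stays

# The slide's example: ten readings of r1 at l1 become one record
raw = [("r1", "l1", t) for t in range(1, 11)]
print(cleanse(raw))  # [('r1', 'l1', 1, 10)]
```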
Key Compression Ideas (I)
Bulky object movements: objects often move and stay together through the supply chain
If 1,000 packs of soda stay together at the distribution center, register a single record (GID, distribution center, time_in, time_out)
GID is a generalized identifier that represents the 1,000 packs that stayed together at the distribution center
[Figure: Factory → Dist. Centers (10 pallets = 1,000 cases) → Stores (20 cases = 1,000 packs) → Shelves (10 packs of 12 sodas each)]
Key Compression Ideas (II)
Data generalization: analysis usually takes place at a much higher level of abstraction than the one present in raw RFID data, so aggregate object movements into fewer records
If interested in time at the day level, merge records at the minute level into records at the hour level
Merge and/or collapse path segments: uninteresting path segments can be ignored or merged
Multiple item movements within the same store may be uninteresting to a regional manager and thus can be merged
Path-Independent Generalization
[Figure: product concept hierarchy — EPC level (Shirt 1 ... Shirt n; the cleansed RFID database level) → SKU level (Shirt, Jacket, ...) → Type level (Outerwear, Shoes) → Category level (Clothing); the interesting level of analysis sits above the EPC level]
Path Generalization
[Figure: a path dist. center → truck → backroom → shelf → checkout. The Store view keeps backroom, shelf, and checkout, collapsing the first stages into "Transportation"; the Transportation view keeps dist. center and truck, collapsing the in-store stages into "Store"]
Why Not Using Traditional Data Cube?
Fact table: (EPC, location, time_in, time_out)
Aggregate: a measure at a single location, e.g., what is the average time that milk stays in the refrigerator in Illinois stores?
What is missing? Measures computed on items that travel through a series of locations, e.g., what is the average time that milk stays in the refrigerator in Champaign when coming from farm A and warehouse B?
Traditional cubes miss the path structure of the data
RFID-Cube Architecture
RFID-Cuboid Architecture (II)
Stay table (GIDs, location, time_in, time_out: measures): records information on items that stay together at a given location. If transitions were recorded instead, queries would be difficult to answer, requiring many intersections.
Map table (GID, <GID1,...,GIDn>): links together stages that belong to the same path; a high-level GID points to lower-level GIDs. This provides additional compression and query-processing efficiency. Saving complete EPC lists instead would incur high I/O costs to retrieve long lists and costly query processing.
Information table (EPC list, attribute_1,...,attribute_n): records path-independent attributes of the items, e.g., color, manufacturer, price
RFID-Cuboid Example
Cleansed RFID database (epc, loc, t_in, t_out):
r1  l1  t1   t10
r1  l2  t20  t30
r2  l1  t1   t10
r2  l2  t20  t30
r3  l1  t1   t10
r3  l4  t15  t20

Stay table (gids, loc, t_in, t_out):
g1    l1  t1   t10
g1.1  l2  t20  t30
g1.2  l4  t15  t20

Map table (gid, gids):
g1    g1.1, g1.2
g1.1  r1, r2
g1.2  r3
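The grouping behind the stay table can be sketched as follows. This is a simplification: real GID assignment is path-dependent (described later), while here items are merely grouped by identical stays; all names are mine:

```python
from collections import defaultdict

def build_stay_groups(cleansed):
    """Group cleansed (epc, loc, t_in, t_out) records so that items
    sharing an identical stay share one record, as in the stay table."""
    groups = defaultdict(list)
    for epc, loc, t_in, t_out in cleansed:
        groups[(loc, t_in, t_out)].append(epc)
    return dict(groups)

cleansed = [
    ("r1", "l1", "t1", "t10"), ("r2", "l1", "t1", "t10"), ("r3", "l1", "t1", "t10"),
    ("r1", "l2", "t20", "t30"), ("r2", "l2", "t20", "t30"),
    ("r3", "l4", "t15", "t20"),
]
groups = build_stay_groups(cleansed)
print(groups[("l1", "t1", "t10")])  # ['r1', 'r2', 'r3'] -- one record instead of three
```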
Benefits of the Stay Table (I)
Query: what is the average time that items stay at location l?
[Figure: locations l1, ..., ln feed into l, which feeds into l(n+1), ..., l(n+m)]
Transition grouping: retrieve all transitions with destination = l, retrieve all transitions with origin = l, intersect the results and compute the average time; I/O cost: n + m retrievals
Prefix tree: retrieve n records
Stay grouping: retrieve the stay record with location = l; I/O cost: 1
Benefits of the Stay Table (II)
Query: how many boxes of milk traveled through locations l1, l7, and l13?
With the cleansed database — records at the individual item level: (r1,l1,t1,t2), (r1,l2,t3,t4), ..., (rk,l1,t1,t2), (rk,l2,t3,t4). Strategy: retrieve the item sets for locations l1, l7, l13 and intersect them. I/O cost: one I/O per item in l1, l7, or l13. Observation: very costly.
With the stay table — records at the group level: (g1,l1,t1,t2), (g2,l2,t3,t4), .... Strategy: retrieve the GIDs for l1, l7, l13 and intersect them. I/O cost: one I/O per GID in l1, l7, and l13. Observation: retrieving records at the group level greatly reduces I/O costs.
Benefits of the Map Table
[Figure: a three-level location tree — root l1, internal nodes l2, l3, l4, and leaves l5–l10 holding disjoint EPC sets {r1,...,ri}, ..., {rm+1,...,rn}. Registering full EPC lists at every level costs 3n identifiers (n per level); the map table needs only 10 + n (one GID per node in the tree plus the n EPCs at the leaves)]
Path-Dependent Naming of GIDs
Assign to each GID a unique identifier that encodes the path traversed by the items it points to
A path-dependent name makes it easy to detect whether locations form a path
[Figure: a tree of locations l1–l6 with GIDs 0.0, 0.1, 0.0.0, 0.1.0, 0.1.1, 0.0.0.0, 0.1.0.1 — each child's GID extends its parent's]
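With path-dependent names, checking whether two stay records lie on the same path reduces to a prefix test on their GIDs. A sketch, using the dot-separated encoding from the figure:

```python
def same_path(gid_a: str, gid_b: str) -> bool:
    """Two GIDs lie on the same root-to-leaf path iff one is a
    component-wise prefix of the other (e.g. '0.0' and '0.0.0.0')."""
    a, b = gid_a.split("."), gid_b.split(".")
    k = min(len(a), len(b))
    return a[:k] == b[:k]

print(same_path("0.0.0", "0.0.0.0"))  # True: same branch of the tree
print(same_path("0.0.0", "0.1.0.1"))  # False: the paths diverge
```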
RFID-Cuboid Construction Algorithm
1. Build a prefix tree for the paths in the cleansed database
2. For each node, record a separate measure for each group of items that share the same leaf and information record
3. Assign GIDs to each node: GID = parent GID + unique id
4. Each node generates a stay record for each distinct measure
5. If multiple nodes share the same location, time, and measure, generate a single record with multiple GIDs
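Steps 1 and 3 can be sketched as a prefix-tree builder over paths of stays. The node layout and names here are illustrative, not the authors' implementation:

```python
def build_path_tree(paths):
    """Insert each path (a list of (loc, t_in, t_out) stages) into a
    prefix tree; each node gets GID = parent GID + '.' + child index
    and counts the items passing through it."""
    root = {"gid": "0", "children": {}, "count": 0}
    for path in paths:
        node = root
        for stage in path:
            if stage not in node["children"]:
                child_gid = f"{node['gid']}.{len(node['children'])}"
                node["children"][stage] = {"gid": child_gid,
                                           "children": {}, "count": 0}
            node = node["children"][stage]
            node["count"] += 1
    return root

paths = [
    [("l1", "t1", "t10"), ("l3", "t20", "t30")],
    [("l1", "t1", "t10"), ("l3", "t20", "t30")],
    [("l2", "t1", "t8")],
]
tree = build_path_tree(paths)
l1 = tree["children"][("l1", "t1", "t10")]
print(l1["gid"], l1["count"])  # 0.0 2 -- two items share the l1 prefix
```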
RFID-Cube Construction
[Path tree — each node shows GID, location, (t_in, t_out): count, with leaf EPC sets:
0.0      l1  (t1, t10): 3
0.1      l2  (t1, t8): 3
0.0.0    l3  (t20, t30): 3
0.1.0    l3  (t20, t30): 3
0.1.1    l4  (t10, t20): 2   {r8, r9}
0.0.0.0  l5  (t40, t60): 3   {r1, r2, r3}
0.1.0.0  l5  (t40, t60): 2   {r5, r6}
0.1.0.1  l6  (t35, t50): 1   {r7}]

Stay table (GIDs, loc, t_in, t_out, count) — nodes sharing location, time, and measure are merged into one record:
0.0                 l1  t1   t10  3
0.1                 l2  t1   t8   3
0.0.0, 0.1.0        l3  t20  t30  6
0.1.1               l4  t10  t20  2
0.0.0.0, 0.1.0.0    l5  t40  t60  5
0.1.0.1             l6  t35  t50  1
RFID-Cube Properties
The RFID-cuboid can be constructed in a single scan of the cleansed RFID database
The RFID-cuboid provides lossless compression at its level of abstraction
The size of the RFID-cuboid is smaller than the cleansed data: in our experiments we get 80% lossless compression at the level of abstraction of the raw data
Query Processing
Traditional OLAP operations: roll-up, drill-down, slice, and dice. These can be implemented efficiently with traditional optimization techniques, e.g., what is the average time spent by milk at the shelf?
Path selection (new operation): compute an aggregate measure on the tags that travel through a set of locations and match a selection criterion on path-independent dimensions, e.g., stay.location = 'shelf' and info.product = 'milk' over the join (stay ⋈_gid info)
A path query has the form q = < c_info, (c_1 stage_1, ..., c_k stage_k) >: a condition on the information table plus a condition per path stage
Query Processing (II)
Query: what is the average time spent from l3 to l5?
GIDs for l3: <0.0.0>, <0.1.0>; GIDs for l5: <0.0.0.0>, <0.1.0.1>
Prefix pairs: p1 = (<0.0.0>, <0.0.0.0>), p2 = (<0.1.0>, <0.1.0.1>)
Retrieve the stay records for each pair (including intermediate stages) and compute the measure
Savings: no EPC-list intersection — each EPC list may contain millions of different tags, and retrieving them is a significant I/O cost
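A sketch of the prefix-pair computation, using hypothetical stay-table entries in the spirit of the example (the dictionary layout and the time/count values are mine):

```python
def avg_transit_time(stay, from_gids, to_gids):
    """Average time between leaving a 'from' stage and entering a 'to'
    stage, restricted to prefix pairs (GIDs on the same path)."""
    total, items = 0, 0
    for g_from in from_gids:
        for g_to in to_gids:
            if g_to.startswith(g_from + "."):      # prefix pair: same path
                gap = stay[g_to]["t_in"] - stay[g_from]["t_out"]
                total += gap * stay[g_to]["count"]
                items += stay[g_to]["count"]
    return total / items if items else None

# Hypothetical stay records for the GIDs named above
stay = {
    "0.0.0":   {"t_out": 30},
    "0.1.0":   {"t_out": 30},
    "0.0.0.0": {"t_in": 40, "count": 3},
    "0.1.0.1": {"t_in": 35, "count": 1},
}
print(avg_transit_time(stay, ["0.0.0", "0.1.0"], ["0.0.0.0", "0.1.0.1"]))
# 8.75 = ((40-30)*3 + (35-30)*1) / 4
```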
From RFID-Cuboids to RFID-Warehouse
Materialize the lowest RFID-cuboid at the minimum level of abstraction of interest to users
Materialize frequently requested RFID-cuboids
Each materialization is computed from the smallest materialized RFID-cuboid at a lower level of abstraction
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
RFID-Cube Compression (I)
[Two charts: cube size (MBytes) for clean, nomap, and map — left, vs. input stay records (0.1–10 million); right, vs. path bulkiness]
Compression vs. cleansed data size (P=1000, B=(500,150,40,8,1), k=5): lossless compression; the cuboid is at the same level of abstraction as the cleansed RFID database
Compression vs. data bulkiness (P=1000, N=1,000,000, k=5): the map gives significant benefits for bulky data; for data where items move individually, tag lists are the better choice
RFID-Cube Compression (II)
[Chart: cube size (MBytes) for nomap and map vs. abstraction level 1–4]
Compression vs. abstraction level (P=1000, B=(500,150,40,8,1), k=5, N=1,000,000): the map provides significant savings over tag lists; at very high levels of abstraction the stay table is very small, and most of the space goes to recording RFID tags
RFID-Cube Construction Time
[Chart: construction time (seconds) vs. aggregation level 0–3, for 100K to 1600K tags]
Construction time (P=1000, B=(500,150,40,8,1), k=5, N=1,000,000): constructing from a lower-level cuboid saves 50% to 80%
Query Processing
[Two log-scale charts: I/O operations for clean, nomap, and map — left, vs. input stay records (1–5 million); right, vs. path bulkiness]
Time vs. DB size (P=1000, B=(500,150,40,8,1), k=5): the stay table gives a one-order-of-magnitude speedup; the stay and map tables together give two orders of magnitude
Time vs. bulkiness (P=1000, k=5): the speedup is most significant for bulky paths; for non-bulky paths, performance is no worse than using the clean table
Discussion
Our RFID cube model works well for bulky object movements
There are many applications where this assumption does not hold, and other models are needed
We have focused only on warehousing RFID data; a variety of other problems remain open: path classification and clustering, workflow analysis, trend analysis, sophisticated RFID data cleaning
Outline
Introduction to RFID Technology
Motivation: Why RFID-Warehousing?
RFID-Warehouse Architecture
Performance Study
Linking RFID Data Analysis with HPC
Conclusions
Linking RFID Data Analysis with HPC
High-performance computing will play an important role in RFID data warehousing and data analysis
Most of the data cleaning process can be done in a parallel and distributed manner
Stay and map tables can be constructed in parallel
Parallel computation and consolidation of multi-layer and multi-path data cubes
Querying and mining can be processed in parallel
Parallel RFID Data Mining: Promising
Parallel computing has been successfully applied to rather sophisticated data mining algorithms
Parallelizing frequent-pattern mining (based on FP-growth): Shengnan Cong, Jiawei Han, Jay Hoeflinger, and David Padua, "A Sampling-Based Framework for Parallel Data Mining," PPoPP'05 (ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming)
Parallelizing sequential-pattern mining (based on PrefixSpan): Shengnan Cong, Jiawei Han, and David Padua, "Parallel Mining of Closed Sequential Patterns," KDD'05
Parallel RFID data analysis is highly promising
Mining Frequent Patterns
Breadth-first search vs. depth-first search
Depth-first mining algorithms have proved more efficient and are more convenient to parallelize
[Figure: the itemset lattice over {A, B, C, D, E}, from the empty set up to ABCDE, shown twice — once traversed breadth-first, once depth-first]
Parallel Frequent-Pattern Mining
Target platform: distributed-memory systems
Framework for parallelization:
Step 1: each processor scans its local portion of the dataset and accumulates the number of occurrences of each item; a reduction obtains the global occurrence counts
Step 2: partition the frequent items and assign a subset to each processor; each processor builds projections for its assigned items
Step 3: each processor mines its local projections independently
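The first two steps can be simulated sequentially; this sketch stands in for real MPI ranks with plain list partitions (function names and the round-robin assignment are mine):

```python
from collections import Counter

def step1_global_counts(partitions):
    """Each 'processor' counts items in its local partition; the
    reduction sums the local counters into global occurrence counts."""
    local = [Counter(item for rec in part for item in rec) for part in partitions]
    return sum(local, Counter())

def step2_assign_items(counts, min_support, n_procs):
    """Partition the frequent items round-robin across processors;
    each processor will build projections for its assigned items."""
    frequent = sorted(i for i, c in counts.items() if c >= min_support)
    return [frequent[p::n_procs] for p in range(n_procs)]

# Two 'processors', each holding two transactions
partitions = [[["a", "b"], ["a", "c"]], [["a", "b"], ["c"]]]
counts = step1_global_counts(partitions)    # a:3, b:2, c:2
print(step2_assign_items(counts, 2, 2))     # [['a', 'c'], ['b']]
```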
Parallel Frequent-Pattern Mining (2)
Load-balancing problem: some projections' mining time is too large relative to the overall mining time
Solution: the large projections must be partitioned further
Challenge: how do we identify the large projections?
(a) Datasets for frequent-itemset mining:
Dataset        Maximal/Overall
mushroom       7.66%
connect        12.4%
pumsb          14.7%
pumsb_star     47.6%
T30I0.2D1K     42.1%
T40I10D100K    4.15%
T50I5D500K     3.07%

(b) Datasets for sequential-pattern mining:
Dataset               Maximal/Overall
C10N0.1T8S8I8         21.4%
C50N10T8S20I2.5       13.8%
C100N5T2.5S10I1.25    14.2%
C200N10T2.5S10I1.25   10.1%
C100N20T2.5S10I1.25   11.6%

(c) Datasets for closed-sequential-pattern mining:
Dataset        Maximal/Overall
C100S50N10     15.3%
C100S100N5     5.02%
C200S25N9      4.53%
gazelle        25.9%
How to Identify the Large Projections?
To identify the large projections, we need an estimate of the relative mining time of each projection
Static estimation: study the correlation with dataset parameters (number of items, number of records, width of records, ...) and with characteristics of the projection (depth, bushiness, tree size, number of leaves, fan-out/fan-in, ...)
Result: no rule relating these parameters to projection mining time was found
Dynamic Estimation
Runtime sampling: use the relative mining time of a sample to estimate the relative mining time of the whole dataset; trade accuracy against overhead
Random sampling: randomly select a subset of records. Not accurate with small sample sizes (e.g., 1% random sampling on the pumsb dataset); it becomes accurate only when the sample size exceeds 30%, but the sampling overhead is then over 50%
Selective Sampling
Selective sampling: from each record, some items are removed
In frequent-itemset mining: discard the infrequent items, and discard a fraction t of the most frequent items
Example (support threshold = 2, t = 20%) — item counts are f:4, b:3, c:3, m:3, a:2, d:1, so d (infrequent) and f (top 20%) are removed:

dataset            selective sample
1: a, c, d, f, m   1: a, c, m
2: b, c, f, m      2: b, c, m
3: b, f            3: b
4: b, c            4: b, c
5: a, f, m         5: a, m
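A sketch of selective sampling for frequent-itemset mining, reproducing the example above (support threshold 2, t = 20%); the function name is mine:

```python
from collections import Counter

def selective_sample(dataset, min_support, t):
    """Drop infrequent items and a fraction t of the most frequent
    items from every record, preserving record order."""
    counts = Counter(item for record in dataset for item in record)
    frequent = sorted((i for i, c in counts.items() if c >= min_support),
                      key=lambda i: -counts[i])
    kept = set(frequent[int(len(frequent) * t):])  # drop the top-t fraction
    return [[i for i in record if i in kept] for record in dataset]

dataset = [list("acdfm"), list("bcfm"), list("bf"), list("bc"), list("afm")]
print(selective_sample(dataset, 2, 0.2))
# d is infrequent and f is in the top 20%, so both are removed:
# [['a', 'c', 'm'], ['b', 'c', 'm'], ['b'], ['b', 'c'], ['a', 'm']]
```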
Accuracy of Selective Sampling
Overhead of Selective Sampling
[Figures: sampling overhead on (a) the frequent-itemset mining datasets, (b) the sequential-pattern mining datasets, and (c) the closed-sequential-pattern mining datasets]
Experimental Setups
Two Linux clusters using up to 64 processors
Cluster A: 1 GHz Pentium III processors, 1 GB memory
Cluster B: 1.3 GHz Intel Itanium 2 processors, 2 GB memory
Implemented in C++ using MPI
Datasets from the IBM dataset generator
Speedups with One-Level Task Partitioning
Parallel frequent-itemset mining (Par-FP)
[Figure: log-scale speedup vs. number of processors (1–64), Par-FP vs. optimal, on mushroom, connect, pumsb, pumsb_star, T40I10D100K, T50I5D500K, and T30I0.2D1K]
Effectiveness of Selective Sampling
Multi-level task partitioning
[Figure: log-scale speedup vs. number of processors (1–64), comparing optimal, one-level, and multi-level partitioning on mushroom, connect, pumsb, pumsb_star, T30I0.2D1K, T40I10D100K, and T50I5D500K]
Speedups with One-Level Task Partitioning (Sequential Patterns)
Parallel sequential-pattern mining (Par-Span)
[Figure: log-scale speedup vs. number of processors (1–64), Par-Span vs. optimal, on C10N0.1T8S8I8, C200N10T2.5S10I1.25, C100N5T2.5S10I1.25, C100N20T2.5S10I1.25, and C50N10T8S20I2.5]
Speedups with One-Level Task Partitioning (Closed Sequential Pattern)
Parallel closed-sequential-pattern mining (Par-CSP)
[Figure: log-scale speedup vs. number of processors (1–64), Par-CSP vs. optimal, on C100S50N10, C100S100N5, C200S25N9, and gazelle]
Effectiveness of Selective Sampling
One-level task partitioning with 64 processors: the speedups are improved by more than 50% on average
[Bar chart: speedup per dataset, without sampling vs. with sampling]
Conclusions
A new RFID warehouse model allows efficient and flexible analysis of RFID data in multidimensional space, preserves the structure of the data, and compresses it by exploiting bulky movements, concept hierarchies, and path collapsing
High-performance computing will benefit RFID data warehousing and data mining tremendously
Efficient and highly parallel algorithms can be developed for RFID data analysis