59
05/18/22 Data Mining: Concepts and Tec hniques 1 Data Mining: Concepts and Techniques — Chapter 11 — —Additional Theme: RFID Data Warehousing and Mining and High-Performance Computing— Jiawei Han and Micheline Kamber Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj ©2006 Jiawei Han and Micheline Kamber. All rights reserved. Acknowledgements: Hector Gonzalez and Shengnan Cong

Additional theme: RFID Data Warehousing and High-Performance

  • Upload
    tommy96

  • View
    479

  • Download
    0

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 1: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

1

Data Mining: Concepts and

Techniques — Chapter 11 —

—Additional Theme: RFID Data Warehousing and Mining and High-Performance Computing

Jiawei Han and Micheline Kamber

Department of Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj

©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Acknowledgements: Hector Gonzalez and Shengnan Cong

Page 2: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

2

Page 3: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

3

Outline

Introduction to RFID Technology

Motivation: Why RFID-Warehousing?

RFID-Warehouse Architecture

Performance Study

Linking RFID Data Analysis with HPC

Conclusions

Page 4: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

4

What is RFID?

Radio Frequency Identification (RFID) Technology that allows a sensor (reader)

to read, from a distance, and without line of sight, a unique electronic product code (EPC) associated with a tag

Tag Reader

Page 5: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

5

RFID System

Source: www.belgravium.com

Page 6: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

6

Applications

Supply Chain Management: real-time inventory tracking

Retail: Active shelves monitor product availability

Access control: toll collection, credit cards, building access

Airline luggage management: (British airways) Implemented to reduce lost/misplaced luggage (20 million bags a year)

Medical: Implant patients with a tag that contains their medical history

Pet identification: Implant RFID tag with pet owner information (www.pet-id.net)

Page 7: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

7

Outline

Introduction to RFID Technology

Motivation: Why RFID-Warehousing?

RFID-Warehouse Architecture

Performance Study

Linking RFID Data Analysis with HPC

Conclusions

Page 8: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

8

RFID Warehouse Architecture

Page 9: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

9

Challenges of RFID Data Sets

Data generated by RFID systems is enormous due to redundancy and low level of abstraction

Walmart is expected to generate 7 terabytes of RFID data per day

Solution Requirements

Highly compact summary of the data OLAP operations on multi-dimensional view of the

data Summary should preserve the path structure of RFID

data It should be possible to efficiently drill down to

individual tags when an interesting pattern is discovered

Page 10: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

10

Why RFID-Warehousing? (1)

Lossless compression Significantly reduce the size of the RFID data set

by redundancy removal and grouping objects that move and stay together

Data cleaning: reasoning based on more complete info Multi-reading, miss-reading, error-reading, bulky

movement, … Multi-dimensional summary: product, location, time, …

Store manager: Check item movements from the backroom to different shelves in his store

Region manager: Collapse intra-store movements and look at distribution centers, warehouses, and stores

Page 11: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

11

Why RFID-Warehousing? (2) Query Processing

Support for OLAP: roll-up, drill-down, slice, and dice

Path query: New to RFID-Warehouses, about the structure of paths

What products that go through quality control have shorter paths?

What locations are common to the paths of a set of defective auto-parts?

Identify containers at a port that have deviated from their historic paths

Data mining Find trends, outliers, frequent, sequential,

flow patterns, …

Page 12: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

12

Example: A Supply Chain Store

A retailer with 3,000 stores, selling 10,000 items a day per store

Each item moves 10 times on average before being sold

Movement recorded as (EPC, location, second) Data volume: 300 million tuples per day (after

redundancy removal) OLAP Query

Avg time for outwear items to move from warehouse to checkout counter in March 2006?

Costly to answer if scanning 1 billion tuples for March

Page 13: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

13

Outline

Introduction to RFID Technology

Motivation: Why RFID-Warehousing?

RFID-Warehouse Architecture

Performance Study

Linking RFID Data Analysis with HPC

Conclusions

Page 14: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

14

Cleaning of RFID Data Records

Raw Data (EPC, location, time) Duplicate records due to multiple

readings of a product at the same location (r1,l1,t1) (r1,l1,t2) ... (r1,l1,t10)

Cleansed Data: Minimal information to store, raw data will be then removed (EPC, Location, time_in, time_out) (r1,l1,t1,t10)

Warehousing can help fill-up missing records and correct wrongly-registered information

Page 15: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

15

Key Compression Ideas (I)

Bulky object movements Objects often move and stay together through the supply

chain If 1000 packs of soda stay together at the distribution center,

register a single record (GID, distribution center, time_in, time_out)

GID is a generalized identifier that represents the 1000 packs that stayed together at the distribution center

Factory

Dist. Center 1

Dist. Center2

10 pallets(1000 cases)

store 1

store 2

20 cases(1000 packs)

shelf 1

shelf 2

10 packs(12 sodas)

Page 16: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

16

Key Compression Ideas (II)

Data generalization Analysis usually takes place at a much higher level of

abstraction than the one present in raw RFID data Aggregate object movements into fewer records

If interested in time at the day level, merge records at the minute level into records at the hour level

Merge and/or collapse of path segments Uninteresting path segments can be ignored or

merged Multiple item movements within the same store

may be uninteresting to a regional manager and thus can be merged

Page 17: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

17

Path-Independent Generalization

Clothing

Outerwear Shoes

Shirt Jacket …SKU level

Type level

Category level

Shirt 1 Shirt n…EPC level Cleansed RFIDDatabase Level

InterestingLevel

Page 18: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

18

Path Generalization

Transportation

dist. center truck backroom shelf checkout

backroom shelf checkout

dist. center truck Store

Store View:

Transportation View:

Page 19: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

19

Why Not Using Traditional Data Cube?

Fact Table: (EPC, location, time_in, time_out) Aggregate: A measure at a single location

e.g., what is the average time that milk stays in the refrigerator in Illinois stores?

What is missing? Measures computed on items that travel

through a series of locations e.g., what is the average time that milk stays at

the refrigerator in Champaign when coming from farm A, and Warehouse B?

Traditional cubes miss the path structure of the data

Page 20: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

20

RFID-Cube Architecture

Page 21: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

21

RFID-Cuboid Architecture (II)

Stay Table: (GIDs, location, time_in, time_out: measures) Records information on items that stay together at a given

location If using record transitions: difficult to answer queries, lots

of intersections needed Map Table: (GID, <GID1,..,GIDn>)

Links together stages that belong to the same path. Provides additional: compression and query processing efficiency

High level GID points to lower level GIDs If saving complete EPC Lists: high costs of IO to retrieve

long lists, costly query processing Information Table: (EPC list, attribute 1,...,attribute n)

Records path-independent attributes of the items, e.g., color, manufacturer, price

Page 22: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

22

RFID-Cuboid Example

epc loc t_in t_out

r1 l1 t1 t10

r1 l2 t20 t30

r2 l1 t1 t10

r2 l3 t20 t30

r3 l1 t1 10

r3 l4 t15 t20

epcs loc t_in t_out

r1,r2,r3 l1 t1 t10

gids

gid gids

g1 g1.1,g1.2

g1.1 r1,r2

g1.2 r3

Cleansed RFID Database Stay Table

Map Table

gid gids

r1,r2 l2 t20 t30

g1

g1.1

r3 l4 t15 t20g1.2

Page 23: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

23

Benefits of the Stay Table (I)

l1

l2

ln

l

ln+1

ln+2

ln+m

… …

Transition Grouping Retrieve all transitions with destination = l Retrieve all transitions with origin = l Intersect results and compute average time IO Cost: n + m retrievals

Prefix Tree Retrieve n records

Stay Grouping Retrieve stay record with location = l IO Cost: 1

Query: What is the average time that items stayat location l ?

Page 24: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

24

Benefits of the Stay Table (II)

(r1,l1,t1,t2)(r1,l2,t3,t4)

…(r2,l1,t1,t2)(r2,l2,t3,t4)

…(rk,l1,t1,t2)(rk,l2,t3,t4)

Query: How many boxes of milk traveled through the locations l1, l7, l13?

Strategy: Retrieve itemsets for

locations l1, l7, l13 Intersect itemsets

IO Cost: One IO per item in

locations l1 or l7 or l13

Observation: Very costly, we retrieve

records at the individual

item level

Strategy: Retrieve the gids for l1, l7, l13 Intersect the gidsIO Cost: One IO per GID in locations l1, l7, and l13Observation: Retrieve records at the group level and thus greatly reduce IO costs

(g1,l1,t1,t2)(g2,l2,t3,t4)

With Cleansed Database With Stay Table

Page 25: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

25

Benefits of the Map Table

l1

l2 l3 l4

l5 l6 l7 l8 l9 l10

#EPCs #GIDs

n

n

n

1

3

6

3n 10+n{r1,..,ri} {ri+1,..,rj} {rj+1,..,rk} {rk+1,..,rl} {rl+1,..,rm} {rm+1,..,rn}

Page 26: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

26

Path-Dependent Naming of GIDs

l1 l2

l3 l4

l5 l6

0.0 0.1

0.0.0 0.1.00.1.1

0.0.0.0 0.1.0.1

Assign to each GID a

unique identifier that

encodes the path traversed

by the items that it points

to

Path-dependent name:

Makes it easy to detect if

locations form a path

Page 27: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

27

RFID-Cuboid Construction Algorithm

1. Build a prefix tree for the paths in the cleansed database

2. For each node, record a separate measure for each group of items that share the same leaf and information record

3. Assign GIDs to each node:

GID = parent GID + unique id4. Each node generates a stay record for each

distinct measure5. If multiple nodes share the same location, time,

and measure, generate a single record with multiple GIDs

Page 28: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

28

RFID-Cube Construction

l1 l2

l3 l4

l5 l6

0.0t1,t10: 3

0.1t1,t8: 3

0.0.0t20,t30: 3

0.1.0t20,t30: 3

0.1.1t10,t20: 2

0.0.0.0t40,t60: 3

0.1.0.1t35,t50: 1

l3

l5

0.1.0.0t40,t60: 2

{r1,r2,r3} {r5,r6} {r7}

{r8,r9}

0.0 l1 t1 t10 3

0.0.0 l3 t20 t30 3

GIDs loc t_in t_out count

0.0.0.0 l5 t40 t60 30.0.0.00.1.0.0

l5 t40 t60 5

Stay Table

0.1 l2 t1 t8 3

Path Tree

0.0.00.1.0

l3 t20 t30 6

0.1.0.1 l6 t35 t50 1

0.1.1 l4 t10 t20 2

Page 29: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

29

RFID-Cube Properties

The RFID-cuboid can be constructed on a single scan of the cleansed RFID database

The RFID-cuboid provides lossless compression at its level of abstraction

The size of the RFID-cuboid is smaller than the cleansed data In our experiments we get 80% lossless

compression at the level of abstraction of the raw data

Page 30: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

30

Query Processing

Traditional OLAP operations Roll up, drill down, slice, and dice Can be implemented efficiently with traditional

optimization techniques, e.g., what is the average time spent by milk at the shelf

Path selection (New operation) Compute an aggregate measure on the tags that

travel through a set of locations and that match a selection criteria on path independent dimensions

stay.location = 'shelf', info.product = 'milk' (stay gid info)

q à < c info,(c1 stage1, ..., ck

stagek) >

Page 31: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

31

Page 32: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

32

Query Processing (II)

Query: What is the average time spent from l3 to l5? GIDs for l3 <0.0.0>, <0.1.0> GIDs for l5 <0.0.0.0>, <0.1.0.1> Prefix pairs: p1: (<0.0.0>,<0.0.0.0>)

p2: (<0.1.0>,<0.1.0.1>) Retrieve stay records for each pair (including

intermediate steps) and compute measure Savings: No EPC list intersection, remember that

each EPC list may contain millions of different tags, and retrieving them is a significant IO cost

Page 33: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

33

From RFID-Cuboids to RFID-Warehouse

Materialize the lowest RFID-cuboid at the minimum level of abstraction interested to a user

Materialize frequently requested RFID-cuboids

Materialization is done from the smallest materialized RFID-Cuboid that is at a lower level of abstraction

Page 34: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

34

Outline

Introduction to RFID Technology

Motivation: Why RFID-Warehousing?

RFID-Warehouse Architecture

Performance Study

Linking RFID Data Analysis with HPC

Conclusions

Page 35: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

35

RFID-Cube Compression (I)

0

50

100

150

200

250

300

350

0.1 0.5 1 5 10

Input Stay Records (millions)

Siz

e (

MB

yte

s)

clean

nomap

map

0

5

10

15

20

25

30

35

a b c d e

Path Bulkiness

Siz

e (

MB

yte

s)

clean

nomap

map

Compression vs. Cleansed data size

P=1000, B=(500,150,40,8,1), k = 5

Lossless compression, cuboid is at the same level of abstraction as cleansed RFID database

Compression vs. Data Bulkiness

P=1000, N = 1,000,000, k= 5

Map gives significant benefits for bulky data

For data where items move individually we are better off using tag lists

Page 36: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

36

RFID-Cube Compression (II)

0

2

4

6

8

10

12

1 2 3 4

Abstraction Level

Siz

e (

MB

yte

s) nomap

map

Compression vs. Abstraction Level

P=1000, B=(500,150,40,8,1), k = 5, N=1,000,000

The map provides significant savings over using tag lists

At very high levels of abstraction the stay table is very small, most of the space is used in recording RFID tags

Page 37: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

37

0

200

400

600

800

1,000

1,200

1,400

1,600

1,800

level 0 level 1 level 2 level 3

Aggregation Level

Tim

e (s

eco

nd

s)100K tags

200K tags

400K tags

800K tags

1600K tags

RFID-Cube Construction Time

Construction Time

P=1000, B=(500,150,40,8,1), k = 5, N=1,000,000

Savings by constructing from lower level cuboid 50% to 80%

Page 38: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

38

Query Processing

1

10

100

1,000

10,000

1 2 3 4 5

Input Stay Records (millions)

I/O O

pe

rati

on

s (

log

sc

ale

)

clean

nomap

map

1

10

100

1,000

10,000

100,000

1 2 3 4 5

Path Bulkiness

I/O O

pe

rati

on

s (

log

sc

ale

)

clean

nomap

map

Time vs. DB SizeP=1000, B=(500,150,40,8,1), k = 5

Speed up due to stay table 1 order of magnitude

Speed up due to stay table and map table 2 orders of magnitude

Time vs. BulkinessP=1000, k = 5

Speed up is most significant for bulky paths

For non-bulky paths performance is not worse than using the clean table

Page 39: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

39

Discussion

Our RFID cube model works well for bulky object movements

But there are many applications where this assumption is not true and other models are needed

We have only focused on warehousing RFID data, a variety of other problems remain open: Path classification and clustering Workflow analysis Trend analysis Sophisticated RFID data cleaning

Page 40: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

40

Outline

Introduction to RFID Technology

Motivation: Why RFID-Warehousing?

RFID-Warehouse Architecture

Performance Study

Linking RFID Data Analysis with HPC

Conclusions

Page 41: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

41

Linking RFID Data Analysis with HPC

High performance computing will play an important role in RFID data warehousing and data analysis

Most of data cleaning process can be done in parallel and distributed manner

Stay and map tables construction can be constructed in parallel

Parallel computation and consolidation of multi-layer and multi-path data cubes

Query and mining can be processed in parallel

Page 42: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

42

Parallel RFID Data Mining: Promising

Parallel computing has been successfully applied to rather sophisticated data mining algorithms

Parallelizing frequent pattern mining (based on FPgrowth) Shengnan Cong, Jiawei Han, Jay Hoeflinger, and David

Padua, “A Sampling-based Framework for Parallel Data Mining,” PPOPP’05 (ACM SIGPLAN Symp. on Principles & Practice of Parallel Programming)

Parallelizing sequential-pattern mining algorithm (based on PrefixSpan)

Shengnan Cong, Jiawei Han, and David Padua, “Parallel Mining of Closed Sequential Patterns”, KDD'05

Parallel FRID data analysis is highly promising

Page 43: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

43

Mining Frequent Patterns

Breadth-first search vs. depth-first search

Depth-first mining algorithm is proved to be more efficient Depth-first mining algorithm is more convenient to be parallelized

null

AB AC AD AE BC BD BE CD CE DE

A B C D E

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

null

AB AC AD AE BC BD BE CD CE DE

A B C D E

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Page 44: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

44

Parallel Frequent-Pattern Mining

Target platform ─ distributed memory system Framework for parallelization

Step 1: Each processor scans local portion of the dataset and

accumulate the numbers of occurrence for each items Reduction to obtain the global numbers of occurrence

Step 2: Partition the frequent items and assign a subset to each

processor Each processor makes projections for the assigned items

Step 3: Each processor mines the local projections

independently

Page 45: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

45

Parallel Frequent-Pattern Mining (2)

Load balancing problem Some projection mining time is too large

relative to the overall mining time

Solution: The large projections must be partitioned Challenge: How to identify the large projections?

7.66%

mushroom

12.4%

connect

14.7%

pumsb

47.6%

pumsb_star

42.1%

T30I0.2D1K

4.15%

T40I10D100K T50I5D500KDataset

3.07%Maximal/Overall

21.4%

C10N0.1T8S8I8

13.8%

C50N10T8S20I2.5

14.2%

C100N5T2.5S10I1.25

10.1%

C200N10T2.5S10I1.25

11.6%

C100N20T2.5S10I1.25Dataset

Maximal/Overall

15.3%

C100S50N10

5.02%

C100S100N5

4.53%

C200S25N9

25.9%

gazelleDataset

Maximal/Overall

(a) datasets for frequent-itemset mining

(b) datasets for sequential-pattern mining

(c) datasets for closed-sequential-pattern mining

Page 46: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

46

How to Identify the Large Projections?

To identify the large projections, we need an estimation of the relative mining time of the projections

Static estimation Study the correlation with the dataset parameters

Number of items, number of records, width of records, …

Study the correlation with the characteristics of the projection

Depth, bushiness, tree size, number of leaves, fan-out/in, …

Result ─ No rule found with the above parameters for the projection mining time

Page 47: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

47

Dynamic Estimation

Runtime sampling Use the relative mining time of a sample to estimate the

relative mining time of the whole dataset. Accuracy vs. overhead

Random sampling: random select a subset of records. Not accurate with small sample size. e.g. Dataset —pumsb 1% random sampling Becomes accurate when sample size > 30%, but sampling overhead is over 50% then

Page 48: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

48

Selective Sampling

Selective sampling: for each record, some items are removed

In frequent-itemset mining: Discard the infrequent items Discard a fraction t of the most frequent items

1 a, c, d, f, m

2 b, c, f, m

3 b, f

4 b, c

5 a, f, m

f :4b :3c :3m :3a: 2d: 1

Support threshold =2, t=20%dataset

1 a, c, m

2 b, c, m

3 b

4 b, c

5 a, m

selective sample

Page 49: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

49

Accuracy of Selective Sampling

Page 50: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

50

Overhead of Selective Sampling

(a) datasets for frequent-itemset mining

(b) datasets for sequential-pattern mining

(c) datasets for closed-sequential-pattern mining

Page 51: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

51

Experimental Setups

Two Linux clusters using up to 64 processors

Cluster A – 1GHz Pentium III processor, 1GB memory

Cluster B – 1.3GHz Intel Itanium2 processor, 2GB memory

Implement with C++ using MPI Dataset generator from IBM Datasets

Page 52: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

52

Experimental Setups

Page 53: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

53

Speedups with One-Level Task Partitioning

Parallel frequent-itemset miningmushroom

1

10

100

1 2 4 8 16 32 64

Processor#

optimal

Par-FP

connect

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-FP

pumsb

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-FP

pumsb_star

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-FP

T40I10D100K

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-FP

T50I5D500K

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-FP

T30I0.2D1K

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-FP

Page 54: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

54

Effectiveness of Selective Sampling

Multi-level task partitioningmushroom

1

10

100

1 2 4 8 16 32 64

optimal

one-level

multi-level

connect

1

10

100

1 2 4 8 16 32 64

optimal

one-level

multi-level

pumsb

1

10

100

1 2 4 8 16 32 64

optimal

one-level

multi-level

pumsb_star

1

10

100

1 2 4 8 16 32 64

optimal

one-level

multi-level

T30I0.2D1K

1

10

100

1 2 4 8 16 32 64

optimal

one-level

multi-level

T40I10D100K

1

10

100

1 2 4 8 16 32 64

optimal

one-level

multi-level

T50I5D500K

1

10

100

1 2 4 8 16 32 64

optimal

one-level

multi-level

Page 55: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

55

Speedups with One-Level Task Partitioning (Sequential Patterns)

Parallel sequential-pattern miningC10N0.1T8S8I8

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-Span

C200N10T2.5S10I1.25

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-Span

C100N5T2.5S10I1.25

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-Span

C100N20T2.5S10I1.25

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-Span

C50N10T8S20I2.5

1

10

100

1 2 4 8 16 32 64Processor#

optimal

Par-Span

Page 56: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

56

Speedups with One-Level Task Partitioning (Closed Sequential Pattern)

Parallel closed-sequential-pattern miningC100S50N10

1

10

100

1 2 4 8 16 32 64

Processor#

optimal

Par-CSP

C100S100N5

1

10

100

1 2 4 8 16 32 64

Processor#

optimal

Par-CSP

C200S25N9

1

10

100

1 2 4 8 16 32 64

Processor#

optimal

Par-CSP

Gazelle

1

10

100

1 2 4 8 16 32 64

Processor#

optimal

Par-CSP

Page 57: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

57

Effectiveness of Selective Sampling

One-level task partitioning with 64 processors The speedups are improved by more than

50% on average.

0

5

10

15

20

25

30

35

40

without sampling

with sampling

Page 58: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

58

Conclusions

A new RFID warehouse model allows efficient and flexible analysis of RFID data

in multidimensional space preserves the structure of the data compresses data by exploiting bulky movements,

concept hierarchies, and path collapsing High-performance computing will benefit RFID data

warehousing and data mining tremendously Efficient and highly parallel algorithms can be

developed for RFID data analysis

Page 59: Additional theme: RFID Data Warehousing and High-Performance

04/10/23 Data Mining: Concepts and Techniques

59