M.Phil Probation Talk: Association Rules Mining of Existentially Uncertain Data
Presenter: Chui Chun Kit. Supervisor: Dr. Benjamin C.M. Kao.
Presentation Outline
Introduction: What are association rules? How to mine association rules from a large database?
Probabilistic Data (Uncertain Data): What is probabilistic/uncertain data? Possible World interpretation of uncertain data
Mining frequent patterns from uncertain data: Present a simple algorithm to mine association rules from uncertain data; identify the computational problem
Efficient methods of mining association rules from uncertain data
Experimental Results and Discussions
Conclusion and Future Work
Section 1: Introduction
What is an association rule?
Introduction
Suppose Peter is a psychologist. He has to judge a list of psychological symptoms to make diagnoses and give treatments to his patients. All diagnosis records are stored in a transaction database.
We call each patient record a transaction, each psychological symptom an attribute with value either yes or no (i.e. a binary attribute), and the collection of patients' records a transaction database.
[Table: Psychological Symptoms Transaction Database. Each row (Patient 1, Patient 2, …) is a transaction; each column is a binary yes/no attribute: Mood Disorder, Anxiety Disorder, Eating Disorder, Obsessive-Compulsive Disorder, Depression, …, Self Destructive Disorder.]
Introduction
One day, while reviewing his patients' records, Peter discovers some patterns in his patients' psychological symptoms. E.g. patients having "mood disorder" are often also associated with "eating disorder". He would like to learn about the associations between different psychological symptoms from his patients.
Introduction
Peter may be interested in the following associations among different psychological symptoms.
Mood disorder => Eating disorder
Mood disorder => Depression
Eating disorder => Depression + Mood disorder
Eating disorder + Depression => Self destructive disorder + Mood disorder
These associations are very useful information to assist diagnosis and treatment.
Association Rules
Introduction
However, the psychological symptoms database is very large; it is impossible to analyze the associations by human inspection.
In Computer Science research, the problem of mining association rules from a transaction database was solved in 1993 by R. Agrawal with the Apriori algorithm.
Basic algorithm for mining association rules
A 2% support value means that 2% of the patients in the database have both psychological symptoms.
A 60% confidence value means that 60% of the patients having eating disorder also have depression.
Introduction - Association Rules
There are two parameters to measure the interestingness of association rules.
Support is the fraction of database transactions that contain the items in the association rule. Support shows how frequent the items in the rule are.
Confidence is the percentage of transactions containing the antecedent that also contain the consequent. Confidence shows the certainty of the rule.
Eating disorder => Depression [Support = 2%, Confidence = 60%]
(Antecedent: Eating disorder; Consequent: Depression)
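The two measures can be sketched in a few lines of Python; the toy transactions below are invented for illustration, not taken from the talk.

```python
# Sketch: computing support and confidence of the rule
# {eating_disorder} => {depression} on a toy transaction database.

def support(db, itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(db, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction
    that also contain the consequent."""
    joint = support(db, set(antecedent) | set(consequent))
    return joint / support(db, antecedent)

db = [
    {"eating_disorder", "depression", "mood_disorder"},
    {"eating_disorder", "depression"},
    {"eating_disorder"},
    {"mood_disorder"},
    {"depression"},
]

sup = support(db, {"eating_disorder", "depression"})        # 2/5 = 0.4
conf = confidence(db, {"eating_disorder"}, {"depression"})  # 2/3
```

With these invented transactions the rule holds in 2 of 5 transactions (support 40%) and in 2 of the 3 transactions containing the antecedent (confidence about 67%).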
Introduction - Association Rules
Two steps for mining association rules:
Step 1: Find ALL frequent itemsets. Itemsets are frequent if their supports are over the user-specified SUPPORT threshold.
Step 2: Generate association rules from the frequent itemsets. An association rule is generated if its confidence is over a user-specified CONFIDENCE threshold.
Given the transaction database, find ALL the association rules with SUPPORT values over 10% and CONFIDENCE values over 60%, please!
Psychological Symptoms Database
Introduction - Association Rules
The overall performance of mining association rules is determined by the first step.
For the sake of discussion, let us focus on the first step in this talk.
Section 1: Introduction
How to mine frequent itemsets from a large database?
Mining Frequent Itemsets - Problem Definition
Given a transaction database D with n attributes and m transactions.
Each transaction t is a Boolean vector representing the presence or absence of items in that transaction.
Minimum support threshold s.
Find ALL itemsets with support values over s.
     I1 I2 I3 I4 I5 … In
t1    1  0  1  1  1 …  1
t2    1  1  1  0  0 …  1
…     …  …  …  …  … …  …
tm    0  1  1  0  0 …  1
Transaction Database D
Brute-force approach: Suppose there are 5 items in the database, i.e. A, B, C, D and E. There are 2^5 = 32 itemsets in total. Scan the database once to count the supports of ALL itemsets together.
If there are n different items, there will be 2^n itemsets to count in total. If there are 20 items, there will be over 1,000,000 itemsets!!! Computationally infeasible.
The Apriori Algorithm
[Figure: the itemset lattice over items A, B, C, D, E, from the empty set (null) through the size-1, size-2, … levels up to ABCDE. Once an itemset is found to be infrequent, all of its supersets are pruned from the lattice.]
Apriori property: all subsets of a frequent itemset must also be frequent. Equivalently, every superset of an infrequent itemset must be infrequent.
The Apriori algorithm adopts an iterative approach that exploits this property to identify infrequent itemsets early, so there is no need to count their frequency.
The Apriori Algorithm - How it works
[Figure: the Apriori loop. The Subset Function counts the supports of the current candidates ({A}, {B}, {C}, {D}, {E} in iteration 1); candidates passing the support threshold become large itemsets, which Apriori-Gen joins into the next round of candidates.]
The Apriori algorithm starts by inspecting ALL size-1 items. The supports of ALL size-1 candidates are obtained by a SUBSET FUNCTION procedure that scans the database once. After obtaining the supports, candidates with support over the support threshold become large (frequent) items.
If item {A} is infrequent, then by the APRIORI PROPERTY, ALL supersets of {A} must NOT be frequent, so they are crossed out of the lattice. The remaining size-2 candidates are {BC}, {BD}, {BE}, {CD}, {CE} and {DE}.
The APRIORI-GEN procedure generates ONLY those size-(k+1) candidates which are potentially frequent.
The Apriori algorithm obtains the frequent itemsets iteratively until no more candidates are generated.
This saves the effort of counting the supports of pruned itemsets.
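The candidate-generate-and-count loop described above can be sketched as a minimal Apriori in Python; this is not the talk's implementation, and the hash-tree subset function is replaced by a plain scan for clarity.

```python
# A minimal Apriori sketch: Apriori-Gen joins frequent itemsets and
# prunes candidates with an infrequent subset; the counting step is a
# straightforward database scan.
from itertools import combinations

def apriori(db, min_sup):
    """Return all itemsets whose support count is >= min_sup."""
    items = sorted({i for t in db for i in t})
    # Size-1 candidates: count each item in one scan.
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in db) >= min_sup}
    all_freq, k = set(freq), 2
    while freq:
        # Apriori-Gen: join size-(k-1) frequent itemsets, then prune
        # candidates having an infrequent size-(k-1) subset.
        cands = {a | b for a in freq for b in freq if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Subset function: count candidate supports in one scan.
        freq = {c for c in cands if sum(c <= t for t in db) >= min_sup}
        all_freq |= freq
        k += 1
    return all_freq

db = [frozenset("ABC"), frozenset("ABD"), frozenset("AB"), frozenset("CD")]
result = apriori(db, 2)
# {A}, {B}, {C}, {D} and {A,B} are frequent at min_sup = 2.
```

On this toy database only {A,B} survives to iteration 2, so no size-3 candidates are generated and the loop stops.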
Important Detail of Apriori - Subset Function
Subset Function: scan the database transaction by transaction to increment the corresponding support counts of the candidates.
Generally there are many candidates, so the Subset Function organizes the candidates in a hash-tree data structure. Each interior node of the hash-tree contains a hash table; each leaf node contains a list of itemsets and their support counts.
Important Detail of Apriori - How are candidates stored in the hash tree?
Hash tree data structure: each interior node contains a hash table; leaf nodes contain a list of itemsets and support counts.
[Figure: a two-level hash tree storing size-2 candidates such as {1,2}, {2,4}, {3,6} and {1,5}. The level-0 hash table routes on the first item and the level-1 tables on the second item, hashing items into the buckets {1,4,7}, {2,5,8} and {3,6,9}. Candidate {1,2} is stored by hashing first on item 1, then on item 2; candidate {2,4} is hashed and stored in its slot in the same way.]
A transaction with 100 items has C(100,2) = 4950 size-2 subsets! To process a transaction, enumerate all its size-2 subsets and traverse the hash tree to increment the corresponding support counts; e.g. hash on subset {1,4} and traverse the tree to search for that candidate.
Important Detail of Apriori - How is a transaction processed with the hash tree?
[Figure: fitting transaction t1 = (1, 2, 4, …, 9924) into the hash tree. The Subset Function routes each size-2 subset of the transaction to its leaf node, where candidates such as {1,2}, {1,4} and {2,4} are stored with their support counts.]
Enumerate ALL size-2 subsets; {1,4} is one of them. When the itemset is found in a leaf, its support count is incremented (here from 0 to 1). The same procedure has to be repeated for ALL size-2 subsets of the transaction, and for ALL transactions!
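A simplified version of this routing can be sketched as follows, assuming the h(item) = item mod 3 bucketing implied by the 1,4,7 / 2,5,8 / 3,6,9 grouping in the figure and flattening each leaf to a dictionary.

```python
# Sketch of a 2-level hash tree for size-2 candidates; buckets use
# h(item) = item % 3, and each leaf maps candidates to support counts.
from collections import defaultdict
from itertools import combinations

def build_hash_tree(candidates):
    """leaf[(h(a), h(b))] -> {candidate: support_count}"""
    tree = defaultdict(dict)
    for a, b in candidates:
        tree[(a % 3, b % 3)][(a, b)] = 0
    return tree

def subset_function(tree, transaction):
    """Route every size-2 subset of the transaction to its leaf and
    increment matching candidates' support counts."""
    for a, b in combinations(sorted(transaction), 2):
        leaf = tree.get((a % 3, b % 3))
        if leaf is not None and (a, b) in leaf:
            leaf[(a, b)] += 1

candidates = [(1, 2), (2, 4), (1, 4), (3, 6)]
tree = build_hash_tree(candidates)
subset_function(tree, {1, 2, 4})
# Candidates {1,2}, {1,4} and {2,4} each get a count of 1; {3,6} stays 0.
```

Only the leaves reached by a subset's hash path are inspected, which is the point of the structure: most candidates are never touched for a given transaction.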
Section 2: Probabilistic Data
What is probabilistic data?
Probabilistic Database or Uncertain Database
In reality, when psychologists make a diagnosis, they estimate the likelihood of presence of each psychological symptom of a patient. The likelihood of presence of each symptom is represented as an existential probability.
Psychological Symptoms Uncertain Transaction Database:
           Mood Disorder | Anxiety Disorder | Eating Disorder | Obsessive-Compulsive Disorder | Depression | … | Self Destructive Disorder
Patient 1:      97%      |        5%        |       84%       |              14%              |    76%     | … |            9%
Patient 2:      90%      |       85%       |      100%       |              86%              |    65%     | … |           48%
How do we mine association rules from an uncertain database?
Other areas of probabilistic databases: pattern recognition (handwriting recognition, speech recognition, etc.), information retrieval, scientific databases.
[Table: a probabilistic database of binary features, e.g. Pattern 1 has Feature 1 with probability 90% and Feature 2 with probability 85%; Pattern 2 has them with probabilities 60% and 5%.]
Section 2: Probabilistic Data
Possible World interpretation of uncertain data, introduced by S. Abiteboul in the paper "On the Representation and Querying of Sets of Possible Worlds", SIGMOD 1987.
Possible World Interpretation - Example
A database with two psychological symptoms and two patients gives 16 possible worlds; we can discuss the supports of itemsets in each individual world.
Psychological symptoms database:
            Depression | Eating Disorder
Patient 1:     90%     |      80%
Patient 2:     40%     |      70%
[Figure: the 16 possible worlds. Each world fixes, for both patients, the presence (√) or absence (×) of S1 (Depression) and S2 (Eating Disorder); e.g. in world 1 both patients have both symptoms, while in world 16 neither patient has either symptom.]
From the uncertain database, one possibility is that both patients actually have both psychological illnesses. On the other hand, the uncertain database also captures the possibility that patient 1 has only the eating disorder while patient 2 has both illnesses. Each such possibility is called a "Possible World". Thus data uncertainty is eliminated once we focus on an individual possible world!
Possible World Interpretation - Support of itemset {Depression, Eating Disorder}
World | Support of {S1,S2} | World likelihood
  1   |         2          | 0.9 × 0.8 × 0.4 × 0.7 = 0.2016
  2   |         1          | 0.1 × 0.8 × 0.4 × 0.7 = 0.0224
  3   |         1          | 0.0504
  4   |         1          | 0.3024
  5   |         1          | 0.0864
  6   |         1          | 0.1296
  7   |         1          | 0.0056
  8   |         0          | 0.0216
  …   |         …          | …
We can discuss the support of itemset {S1,S2} in possible world 1, and also the likelihood of possible world 1 being the true world.
Question: overall, how many occurrences of itemset {S1,S2} do we expect across these possible worlds? We define the expected support as the weighted average of the support counts over ALL the possible worlds. Similarly, we can discuss the support and likelihood of possible world 2, and so on.
Possible World Interpretation
The expected support is the weighted average support count over ALL the possible worlds:
World | Weighted support (support × likelihood)
  1   | 2 × 0.2016 = 0.4032
  2   | 0.0224
  3   | 0.0504
  4   | 0.3024
  5   | 0.0864
  6   | 0.1296
  7   | 0.0056
  8   | 0
  …   | …
Expected support = 1
Notice that the world likelihoods form a discrete probability distribution over the support values of itemset {S1,S2}. Since the possible worlds are mutually exclusive, the distribution of the support of {S1,S2} is: P(support = 0) = 20.16%, P(support = 1) = 59.68%, P(support = 2) = 20.16%.
The expected support is calculated by summing up the weighted support counts of ALL the possible worlds: we expect 1 patient to have both "Eating Disorder" and "Depression".
Possible World Interpretation
Instead of enumerating all possible worlds, the expected support can be calculated directly: for each transaction, multiply the existential probabilities of the items in the itemset, and sum over all transactions:
Expected support(X) = Σ_{t ∈ D} Π_{x ∈ X} P(x ∈ t)
Psychological symptoms database:
            Depression | Eating Disorder | Weighted support
Patient 1:     90%     |      80%        | 0.9 × 0.8 = 0.72
Patient 2:     40%     |      70%        | 0.4 × 0.7 = 0.28
                                           TOTAL SUM = 1
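The possible-world expectation and the per-transaction shortcut can be cross-checked on the two-patient example; this is a sketch, with S1/S2 standing for Depression and Eating Disorder.

```python
# Checking the possible-world expectation against the shortcut formula,
# using the two-patient probabilities from the example.
from itertools import product

# probs[patient][symptom]: S1 = Depression, S2 = Eating Disorder.
probs = [{"S1": 0.9, "S2": 0.8}, {"S1": 0.4, "S2": 0.7}]
itemset = {"S1", "S2"}

# Enumerate all 16 possible worlds: each (patient, symptom) pair is
# either present or absent.
pairs = [(p, s) for p in range(2) for s in ("S1", "S2")]
expected = 0.0
for world in product([True, False], repeat=len(pairs)):
    present = dict(zip(pairs, world))
    likelihood, support = 1.0, 0
    for (p, s), here in present.items():
        likelihood *= probs[p][s] if here else 1 - probs[p][s]
    for p in range(2):
        if all(present[(p, s)] for s in itemset):
            support += 1
    expected += likelihood * support

# Shortcut: sum over transactions of the product of the item
# probabilities within the transaction.
shortcut = sum(probs[p]["S1"] * probs[p]["S2"] for p in range(2))
# Both give 0.72 + 0.28 = 1.0.
```

The agreement is no accident: by linearity of expectation, the expected support is the sum over transactions of the probability that the whole itemset is present in that transaction.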
Mining Frequent Itemsets from Probabilistic Data - Problem Definition
Given an uncertain database D in which each item of a transaction is associated with an existential probability, and a user-specified support threshold s, return ALL the itemsets having expected support greater than or equal to |D| × s.
In other words, find ALL the itemsets that are expected to be frequent according to the existential probabilities in the uncertain database.
Section 3: Mining Frequent Patterns from Uncertain Data
The Uncertain Apriori algorithm
Uncertain Apriori Algorithm
All the procedures are the same as in the conventional association rule mining algorithm; the only difference is in the Subset Function. For example, for a transaction containing item 1 with probability 70% and item 4 with probability 30%, the expected support count of candidate {1,4} is increased by 0.7 × 0.3 = 0.21.
Uncertain Apriori Algorithm
[Figure: transaction t1 = (1 (70%), 2 (50%), 4 (30%), …, 9924 (30%)) being fitted into the hash tree. Instead of integer support counts, each candidate itemset is associated with an expected support count, and the Subset Function increments it by the expected support contributed by the transaction, e.g. candidate {1,4} goes from 0 to 0.21.]
Psychological Symptoms Uncertain Transaction Database:
            Mood Disorder | Anxiety Disorder | Eating Disorder | Obsessive-Compulsive Disorder | Depression | … | Self Destructive Disorder
Patient 1:       90%      |        2%        |       99%       |              97%              |    92%     | … |            5%
Patient 2:       89%      |       96%       |       80%       |               4%              |     8%     | … |            3%
Patient 3:        8%      |        6%        |       79%       |              10%              |     5%     | … |           98%
…
Thus we can apply Uncertain Apriori to an uncertain database to mine ALL the frequent itemsets.
But why does the algorithm execute for so long, and sometimes not even terminate?
Computational Issue
Each item (attribute) of a transaction (object) is associated with an existential probability. Besides the items with very high probability of presence, there are a large number of items with relatively low probability of presence.
Computational Issue
[Figure: a transaction with some low-existential-probability items, t1 = (1 (70%), 2 (50%), 4 (30%), 7 (3%), 10 (2%), …, 991 (60%)), being fitted into the hash tree. The expected supports contributed to the size-2 candidates in one leaf are tiny for the low-probability items: {1,4}: 0.21, {1,7}: 0.021, {1,10}: 0.014, {4,7}: 0.009, {4,10}: 0.006, {7,10}: 0.0006.]
These are many insignificant subset increments: if {7,10} turns out to be infrequent after scanning the database, ALL of these subset increments are redundant.
Computational Issue
Preliminary experiment is conducted to verify the computational bottleneck of mining uncertain database. In general, uncertain database will have “longer”
transactions. (i.e. more items per transaction) Some items with high existential probabilities. Some items with low existential probabilities.
In our current study, we focus on dataset with bimodal existential probability distribution.
Computational Issue
The synthetic datasets simulate a bimodal distribution of existential probabilities: 7 datasets with the same frequent itemsets, varying the percentage of items with low existential probability: 0%, 33.33%, 50%, 60%, 66.67%, 71.4% and 75% for datasets 1 to 7.
Preliminary Study
[Figures: the number of large itemsets in each iteration, the number of candidate itemsets in each iteration, and the time spent on subset checking in each iteration, for the different datasets.]
ALL datasets have the same large itemsets, yet there is a sudden burst in the number of candidates in the second iteration. Comparing the dataset with 0% low-existential-probability items against the one with 75%: since both have the same frequent itemsets, the subset increments for the 75% of low-existential-probability items may actually be redundant, so there is potential to reduce the execution time. The time spent on subset checking in each iteration confirms that the computational bottleneck occurs in iteration 2.
Section 4: Efficient Methods of Mining Frequent Itemsets from Existentially Uncertain Data
Efficient Method 1: Data Trimming
Avoid insignificant subset increments
Method 1 - Data Trimming Strategy
Direction: try to avoid incrementing those insignificant expected support counts. This saves the effort of: traversing the hash tree; computing the expected support (multiplications of floating-point variables); and the I/O for retrieving the items with very low existential probability.
Method 1 - Data Trimming Strategy
Question: which items should be trimmed? Intuitively, items with low existential probability should be trimmed, but how low? For the time being, let us assume there is a user-specified trimming threshold.
Method 1 - Data Trimming Strategy
Create a trimmed database by trimming out all items with existential probability lower than the trimming threshold. During the trimming process, some statistics are kept for error estimation when mining the trimmed database: the total trimmed expected support count of each item; the maximum existential probability among the trimmed occurrences of each item; and other information, e.g. an inverted list, a signature file, etc.
Uncertain database:
      I1   I2   I3  …  I4000
t1:  90%  80%   3% …    1%
t2:  80%   4%  85% …   78%
t3:   2%   5%  86% …   89%
t4:   5%  95%   3% …  100%
t5:  94%  95%  85% …    2%
Trimmed database (showing I1 and I2):
      I1   I2
t1:  90%  80%
t2:  80%
t4:       95%
t5:  94%  95%
Statistics:
      Total expected support trimmed | Maximum existential probability of trimmed item
I1:               1.1               |                    5%
I2:               1.2               |                    3%
The Subset Function scans the trimmed database and counts the expected support of every size-2 candidate. We expect mining the trimmed database to save a lot of I/O and computational cost.
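A minimal sketch of the trimming pass, assuming the statistics kept are exactly the two named above (total trimmed expected support and maximum trimmed probability); the threshold and data below are illustrative, not the talk's.

```python
# Sketch of the trimming module: items below the trimming threshold are
# removed from each transaction, and per-item statistics are kept for
# later error estimation.
def trim(db, threshold):
    trimmed_db, stats = [], {}
    for t in db:
        kept = {}
        for item, p in t.items():
            if p >= threshold:
                kept[item] = p
            else:
                total, max_p = stats.get(item, (0.0, 0.0))
                # (total trimmed expected support, max trimmed prob)
                stats[item] = (total + p, max(max_p, p))
        trimmed_db.append(kept)
    return trimmed_db, stats

db = [
    {"I1": 0.90, "I2": 0.80, "I3": 0.03},
    {"I1": 0.80, "I2": 0.04, "I3": 0.85},
    {"I1": 0.02, "I2": 0.05, "I3": 0.86},
]
trimmed_db, stats = trim(db, 0.5)
# I1 is trimmed from t3 only: total 0.02, max 2%.
# I2 is trimmed from t2 and t3: total 0.09, max 5%.
```

The stats dictionary is what the Pruning Module later consults to bound how much expected support an itemset could have lost to trimming.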
Method 1 - Data Trimming Strategy - Trimming Process
[Figure: the uncertain database is passed through the Trimming Module, producing a trimmed database plus statistics. The Apriori loop (Apriori-Gen, Subset Function with hash tree, k = k + 1) runs on the trimmed database; size-k infrequent itemsets go to the Pruning Module, which outputs size-k potentially frequent itemsets, and the Patch Up Module recovers any missed frequent itemsets.]
The uncertain database is first passed into the Trimming Module, which removes the items with low existential probability and gathers statistics during the trimming process. During trimming, the "true" expected support counts of the size-1 candidates are counted, i.e. the size-1 large itemsets have no false negatives. The size-1 frequent items are then passed into the APRIORI-GEN procedure to generate size-2 candidates.
Notice that the infrequent itemsets are only infrequent in the trimmed database; they may contain some itemsets that are truly frequent in the original database. The Pruning Module uses the statistics gathered by the Trimming Module to estimate the error and identify the potentially frequent itemsets among them. Here there are two strategies: use the potentially frequent itemsets to generate the size-(k+1) candidates, or do not use them. Finally, all the potentially frequent itemsets are checked against the original database to verify their true supports.
Method 1 - Data Trimming Strategy - Pruning Module
The role of the Pruning Module is to identify the itemsets which are infrequent in the trimmed database but frequent in the original database. The expected support of an itemset {A,B} in the original database splits into two parts: the count where both items A and B survive in the trimmed database, which is obtained by mining the trimmed database, and the remaining part involving trimmed occurrences, which has to be estimated.
If the upper bound of the trimmed-database count plus the estimated part is greater than or equal to the minimum expected support requirement, {A,B} is regarded as potentially frequent. Otherwise, {A,B} cannot be frequent in the original database and can be pruned.
Method 1 - Data Trimming Strategy - Max Count Pruning Strategy
The pruning strategy depends on the statistics from the Trimming Module. For each size-1 item, it keeps the total expected support count trimmed and the maximum existential probability among the trimmed occurrences.
Global statistics:
      Total expected support trimmed | Maximum existential probability of trimmed item
I1:               1.5               |                    5%
I2:               1.2               |                    3%
Since these statistics are "global" to the whole database, this method is called the Global Max Count Pruning Strategy. Using global counts to bound the whole database is sometimes loose; we may instead keep "local" statistics to obtain tighter bounds, giving the Local Max Count Pruning Strategy.
Local statistics (the original database is split into parts a to e):
     Part | Total expected support trimmed | Maximum existential probability of trimmed item
I1:   a   |             16.6               |      2%
      b   |             14.2               |      0.5%
      c   |             13                 |      6%
      d   |              0.1               |      1%
      e   |              2.7               |      0.7%
I2:   a   |              2.7               |      1.1%
      b   |             19.5               |      3%
      c   |              2.6               |      7%
      d   |             12.3               |      2.4%
      e   |              0.3               |      0.2%
Method 1 - Data Trimming Strategy - Max Count Pruning Strategy
Upper-bound estimates for the trimmed portions of an itemset's expected support can be derived from the per-item statistics gathered in iteration 1.
Method 1 - Data Trimming Strategy - Patch Up Module
[Figure: the same framework as above; the Pruning Module outputs the size-k potentially frequent itemsets, which the Patch Up Module checks against the original uncertain database to recover any missed frequent itemsets.]
The Pruning Module identifies a set of potentially frequent itemsets; the Patch Up Module verifies their true frequencies. There are two strategies: the One-Pass Patch Up Strategy and the Multiple-Passes Patch Up Strategy.
Method 1 - Data Trimming Strategy - Determining the Trimming Threshold
Question: which items should be trimmed? Before scanning the database and incrementing the support counts of the candidates, we cannot deduce which itemsets are infrequent. We can, however, guess a trimming threshold from the statistics gathered in previous iterations.
Method 1 - Data Trimming Strategy - Determining the Trimming Threshold
[Figure: the cumulative expected support of item A, with A's existential probabilities ordered in descending order.]
From the statistics of the previous iteration, order the existential probabilities of each size-1 item in descending order and plot the cumulative support. E.g. item A has its expected support just over the support threshold: it is marginally frequent, so its supersets are potentially infrequent. If a superset is infrequent, it will not be frequent in the trimmed database either; we want to trim those items such that the error estimation is tight enough to prune the superset in the Pruning Module. Use the existential probability at the point where the cumulative support curve crosses the support threshold as the trimming threshold.
Method 1 - Data Trimming Strategy - Determining the Trimming Threshold
[Figure: the cumulative expected support of item B, with B's existential probabilities ordered in descending order.]
E.g. item B has its expected support much larger than the support threshold, so its supersets are likely to be frequent. The expected support contributed by the items in the low-probability tail is insignificant; use the existential probability at the start of that tail as the trimming threshold.
Efficient Method 2: Decremental Pruning
Identify infrequent candidates during the database scan
Method 2 - Decremental Pruning
In some cases, it is possible to identify an itemset as infrequent before scanning the whole database. For instance, suppose the minimum support threshold is 100 and the expected support of item A is 101.
Uncertain database (100K transactions):
        A    …
t1:    70%   0%
t2:    50%   0%
…       …    …
t100K:  …    …
The total expected support of A is 100.3 from transaction t2 onwards, and 99.8 from transaction t3 onwards. So after scanning transaction t2, we can conclude that item A is effectively infrequent over t3 to t100K, and ALL candidate itemsets containing item A must be infrequent and can be pruned.
Method 2 - Decremental Pruning
Before scanning the database, define two "decremental counters" for itemset {A,B}, one per item; for item A (the notation follows the worked example below):
d_0(A, AB) = expected support of A − minimum support
d_t(A, AB) represents how much larger than the minimum support the expected support count of {A,B} could still be if, from transaction t+1 to the end of the database, ALL occurrences of item A matched with item B and ALL those matching Bs had 100% existential probability.
While scanning the transactions, the decremental counters are updated by:
d_t(A, AB) = d_{t−1}(A, AB) − p_t(A) × (1 − p_t(B))
Once a counter drops below zero, {A,B} can never reach the minimum support and can be pruned.
Method 2 - Decremental Pruning - Brute-Force Method
Example: support threshold 50%, so min_sup = 2. Expected supports: A = 2.6, B = 2.1, C = 2.2.
Uncertain database:
       A     B     C
T1:  100%   50%   30%
T2:   90%   80%   70%
T3:   30%   40%   90%
T4:   40%   40%   30%
For candidate {A,B}, initialize d_0(A, AB) = 2.6 − 2 = 0.6: if ALL occurrences of A matched with B and ALL matching Bs had 100% existential probability over the whole database, the expected support count of {A,B} would be 0.6 larger than min_sup. Updating per transaction: after T1, d_1(A, AB) = 0.6 − [1 × (1 − 0.5)] = 0.1 (if the perfect matching held from transaction 2 to 4, the expected support count of {A,B} would still be 0.1 larger than min_sup). After T2, d_2(A, AB) = 0.1 − [0.9 × (1 − 0.8)] = −0.08 < 0.
We can therefore conclude that candidate {A,B} is infrequent without scanning T3 and T4, which saves computational effort in the subset function.
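The counter updates can be sketched as follows; the update rule d_t = d_{t−1} − p_t(A) × (1 − p_t(B)) is taken from the worked example above, and the data reproduces its four transactions.

```python
# Decremental-counter sketch for candidate {A,B}: min_sup = 2,
# expected support of A = 2.6, following the worked example.
def decremental_scan(db, x, y, exp_sup_x, min_sup):
    """Return the (1-based) transaction index at which candidate {x,y}
    is proven infrequent, or None if the counter never goes negative."""
    d = exp_sup_x - min_sup          # d0(x, xy)
    for i, t in enumerate(db, start=1):
        # Each transaction costs at least p(x) * (1 - p(y)) of the
        # best-case expected support of {x,y}.
        d -= t[x] * (1 - t[y])
        if d < 0:
            return i
    return None

db = [
    {"A": 1.0, "B": 0.5, "C": 0.3},
    {"A": 0.9, "B": 0.8, "C": 0.7},
    {"A": 0.3, "B": 0.4, "C": 0.9},
    {"A": 0.4, "B": 0.4, "C": 0.3},
]
# d0 = 0.6; after T1: 0.1; after T2: -0.08 < 0, so {A,B} is pruned
# without scanning T3 and T4.
pruned_at = decremental_scan(db, "A", "B", 2.6, 2)
```

The counter trace 0.6 → 0.1 → −0.08 matches the hand calculation, and the scan stops at transaction 2.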
Method 2 - Decremental Pruning - Brute-Force Method
This method is infeasible because each candidate has to be associated with at least 2 decremental counters, and even when an itemset is identified as infrequent, the subset function still has to traverse the hash tree down to the leaf nodes to retrieve the corresponding counters before finding out.
[Figure: the hash tree with the decremental counters, e.g. d0(A,AD) and d0(D,AD) for candidate AD, stored alongside the expected support counts in the leaf nodes.]
Method 2 - Decremental Pruning - Aggregate by Item Method
The aggregate by item method aggregates the decremental counters and maintains an upper bound on them. Suppose there are three size-2 candidates {A,B}, {A,C} and {B,C}: the brute-force method keeps 6 decremental counters in total. Aggregating the counters d0(A,AB) and d0(A,AC) into a single counter d0(A) gives an upper bound on the two counters; when d(A) drops below zero, every candidate containing A can be pruned at once.
Method 2 - Decremental Pruning - Aggregate by Item Method
Using the same uncertain database as above, initialize d0(A) = 0.6, then scan the transactions and decrement the aggregated counter by p_t(A) × (1 − max{p_t(B), p_t(C)}), the smallest decrement among the counters it aggregates. After transaction T1: d1(A) = 0.6 − [1 × (1 − 0.5)] = 0.1; since no counter is smaller than zero yet, we cannot conclude that any candidate is infrequent. After transaction T2: d2(A) = 0.1 − [0.9 × (1 − 0.8)] = −0.08. Since d2(A) is smaller than zero, {A,B} and {A,C} are infrequent and can be pruned.
Method 2 - Decremental Pruning - Hash-Tree Integration Method
Rather than loosely aggregating the decremental counters by item, the aggregation can be based on the hash function used in the subset function.
Subset Function: recall that the brute-force approach stores the decremental counters in the leaf nodes. Here, the aggregated decremental counters are stored in the hash (interior) nodes instead. When any of these counters becomes smaller than or equal to zero, the corresponding itemsets in the leaf nodes below cannot be frequent and can be pruned.
Method 2 - Decremental Pruning - Hash-Tree Integration Method
[Figure: the hash-tree integration method aggregates the decremental counters according to the hash function, attaching the aggregated counters to the interior nodes, starting from the root of the hash tree.]
Method 2 - Decremental Pruning - Hash-Tree Integration Method - Improving the Pruning Power
The hash-tree is a prefix tree constructed on the lexicographic order of the items, so an item earlier in the order is the prefix of more itemsets. For example, among the size-2 candidates {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}, 3 itemsets fall under the root's decremental counter for prefix A but only 1 under the counter for prefix C. If the A counter becomes negative during the database scan, we can prune 3 itemsets; if the C counter becomes negative, we can prune only 1 itemset.
Method 2 - Decremental Pruning - Hash-Tree Integration Method
Our strategy is therefore to reorder the items by their expected supports in ascending order. Items with small expected support start with small decremental counters, so placing them first in the order puts the counters most likely to become negative on the prefixes that cover the most itemsets, maximizing the number of candidates pruned.
Efficient Method 3: Candidate Filtering
Identify infrequent candidates before the database scan
Method 3 - Candidate Filtering
It is possible to identify some infrequent candidate itemsets before scanning the database to verify their supports.
A B C
T1 30% 50% 100%
T2 70% 80% 90%
T3 90% 40% 30%
T4 30% 40% 40%
Uncertain Database
Expected Support 2.2 2.1 2.6
Maximum existential probability 90% 80% 100%
min_sup = 2
Size-2 candidate itemsets: {A,B} {A,C} {B,C}
1.76 2.2 2.1
For instance, after scanning the database, the expected supports of items A, B and C are obtained.
During the database scan, keep the maximum existential probability of each item.
Size-2-candidate itemsets are generated.
From the expected supports and maximum existential probabilities obtained above, we can obtain an upper bound on the expected support of each candidate BEFORE scanning the database again.
For {A,B}, even if every occurrence of item A coincided with item B at B's maximum existential probability, {A,B} would have expected support 2.2 * 80% = 1.76.
This is an upper bound on the expected support of {A,B}, and it is smaller than min_sup. Thus {A,B} must be infrequent and can be pruned.
Maximum expected support of size-2-candidates
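The bound on this slide can be reproduced directly from the four-transaction table; a minimal sketch (using, as the slide does, ExpSup of one item times the other item's maximum existential probability):

```python
from itertools import combinations

# Uncertain database from the slide: existential probabilities per transaction.
db = [
    {'A': 0.3, 'B': 0.5, 'C': 1.0},
    {'A': 0.7, 'B': 0.8, 'C': 0.9},
    {'A': 0.9, 'B': 0.4, 'C': 0.3},
    {'A': 0.3, 'B': 0.4, 'C': 0.4},
]
min_sup = 2

# One database scan yields each item's expected support and its maximum
# existential probability.
exp_sup = {i: sum(t[i] for t in db) for i in 'ABC'}
max_p = {i: max(t[i] for t in db) for i in 'ABC'}

# Even if every occurrence of x coincided with y at y's maximum existential
# probability, ExpSup({x,y}) could be at most ExpSup(x) * max_p(y).
for x, y in combinations('ABC', 2):
    bound = exp_sup[x] * max_p[y]
    verdict = 'pruned' if bound < min_sup else 'kept'
    print(f'{{{x},{y}}}: bound {bound:.2f} -> {verdict}')
```

The symmetric bound ExpSup(y) * max_p(x) is equally valid; taking the minimum of the two would only tighten the filter.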
Section 5 - Experimental Results and Discussions
Experiments - Synthetic datasets
Data associations: generated by the IBM Synthetic Generator.
  Average length of each transaction (T)
  Average length of hidden frequent patterns (I)
  Number of transactions (D)
Data uncertainty: we simulate the situation where some items have high existential probabilities while others have low existential probabilities.
  Bimodal distribution:
  Base of high existential probabilities (HB)
  Base of low existential probabilities (LB)
  Standard deviations of high and low existential probabilities (HD, LD)
  Percentage of items with low existential probabilities (R)
T100R75%I6D100K HB90HD5LB10LD5
Experiments - Implementation
Implemented in the C programming language.
Machine: 2.6 GHz CPU, 1 GB memory, Fedora Linux.
Experimental settings: T100R75%I6D100K HB90HD5LB10LD5 (136 MB), support threshold 0.5%.
T100R75%I6D100K HB90HD5LB10LD5
Experimental Results - Trimming Method
[Chart: Execution time (s) vs. iteration (1-8); series: Uncertain Apriori, Trimming - Global Max Count, Trimming - Local Max Count]
T100R75%I6D100K HB90HD5LB10LD5
Since we use the one-pass patch up strategy, the trimming methods have an extra Patch Up phase.
Iteration 2 is computationally expensive because there are many candidates, leading to heavy computational effort in the subset function.
Trimming methods successfully reduce the number of support-count increments in ALL iterations.
Even counting the time spent on the Patch Up phase, the trimming methods still show a significant performance gain.
Execution time of Trimming Methods VS Uncertain Apriori in each iteration
For Uncertain Apriori, each transaction has 100C2 = 4950 size-2 subsets. For Trimming, each trimmed transaction has only about 25C2 = 300 size-2 subsets!
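The subset counts quoted here are simple binomial coefficients; the following sketch just checks the arithmetic and the resulting per-transaction saving:

```python
from math import comb

# Size-2 subsets of a transaction with n items: C(n, 2).
full = comb(100, 2)    # untrimmed transaction of 100 items
trimmed = comb(25, 2)  # trimmed to its ~25 high-probability items
print(full, trimmed, full / trimmed)  # 4950 300 16.5
```

So each trimmed transaction triggers about 16.5 times fewer subset lookups in iteration 2.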
[Chart: CPU cost (s) vs. iteration (1-8); series: Uncertain Apriori, Trimming - Global Max Count, Trimming - Local Max Count]
Experimental Results - CPU Cost Saving by Trimming
[Chart: CPU cost saving (s) vs. iteration (1-8); series: Trimming - Global Max Count, Trimming - Local Max Count]
[Chart: CPU cost saving (%) vs. iteration (2-6); series: Trimming - Global Max Count, Trimming - Local Max Count]
T100R75%I6D100K HB90HD5LB10LD5
CPU Cost of Trimming Methods VS Uncertain Apriori in each iteration
Negative CPU saving in iteration 1 because time is spent on gathering the statistics for the Pruning Module.
CPU Cost Saving in each iteration
Percentage of CPU Cost Saving from iteration 2 to 6
Trimming methods achieve high computational saving in iterations where CPU cost is significant.
Experimental Results - I/O Cost Saving by Trimming
[Chart: I/O cost (s) vs. iteration (1-8); series: Uncertain Apriori, Trimming - Global Max Count, Trimming - Local Max Count]
[Chart: I/O cost saving (s) vs. iteration (1-8); series: Trimming - Global Max Count, Trimming - Local Max Count]
T100R75%I6D100K HB90HD5LB10LD5
I/O Cost of Trimming Methods VS Uncertain Apriori in each iteration
I/O Cost Saving of Trimming Methods in each iteration
Trimming Methods have extra I/O effort in iteration 2 because they have to scan the original database and also create the trimmed database.
I/O cost saving occurs from iteration 3 to iteration 6. Hence, I/O cost saving increases when there are longer frequent itemsets.
Experimental Results - Varying Support Threshold
[Chart: Total execution time (s) vs. expected support threshold (0.1%-1.0%); series: Uncertain Apriori, Trimming - Global Max Count, Trimming - Local Max Count]
T100R75%I6D100K HB90HD5LB10LD5
Execution time of Trimming Methods VS Uncertain Apriori for different support thresholds
The rate of increase in execution time of Trimming Method is smaller than that of Uncertain Apriori.
[Chart: Total execution time (s) vs. percentage of items with low existential probability R (0%-100%); series: Uncertain Apriori, Trimming (Global Max Count), Trimming (Local Max Count)]
T100R?%I6D100K HB90HD5LB10LD5
Execution time of Trimming Methods VS Uncertain Apriori for different percentages of items with low existential probability
All the datasets contain the same set of frequent itemsets.
The execution time of the Trimming Methods scales almost linearly with the percentage of items with low existential probability.
Experimental Results - Varying percentage of items with low existential probability
Experimental Results Decremental Pruning
[Chart: Percentage of candidates pruned vs. fraction of database scanned (0.1-1.0); series: Decremental (Aggregate by items), Decremental (Integrate with Hash tree)]
[Chart: Execution time (s) vs. percentage of items with low existential probability R; series: Uncertain Apriori, Decremental (Aggregate by items), Decremental (Integrate with Hash tree)]
T100R75%I6D100K HB90HD5LB10LD5
Percentage of Candidates Pruned during database scan for 2nd iteration
Execution time of Decremental Pruning VS Uncertain Apriori for different percentages of items with low existential probability
Pruning power of the Decremental Methods in 2nd iteration.
The “Integrate with Hash Tree” method outperforms the “Aggregated by items” method.
Although the "Integrate with Hash Tree" method can prune twice as many candidates as the "Aggregate by items" method, the time saving is not significant. This is because the "Integrate with Hash Tree" method has more overhead.
Experimental Results Varying percentage of items with low existential probability
[Chart: Execution time (s) vs. percentage of items with low existential probability R (0%-100%); series: Uncertain Apriori, Trimming (Global Max Count), Trimming (Local Max Count), Decremental (Aggregate by items), Decremental (Integrate with Hash tree)]
T100R75%I6D100K HB90HD5LB10LD5
The Trimming and Decremental Methods can be combined to form a Hybrid Algorithm.
Execution time of Decremental and Trimming Methods VS Uncertain Apriori for different percentages of items with low existential probability
Experimental Results - Hybrid Algorithms
[Chart: Execution time (s) vs. percentage of items with low existential probability R; series: Uncertain Apriori, Decremental (Aggregate by items), Decremental (Integrate with Hash tree), Decremental (Aggregate by items) + Trimming, Decremental (Integrate with Hash tree) + Trimming, Decremental (Integrate with Hash tree) + Trimming + Candidate Pruning]
T100R75%I6D100K HB90HD5LB10LD5
Execution time of Different Combinations VS Uncertain Apriori for different percentages of items with low existential probability
Combining the 3 proposed methods achieves the smallest execution time.
Experimental Results - Varying percentage of items with low existential probability
[Chart: Total CPU cost saving (%) vs. percentage of items with low existential probability R; series: Decremental (Integrate with Hash tree) + Trimming + Candidate Pruning]
[Chart: Total I/O cost saving (%) vs. percentage of items with low existential probability R; series: Decremental (Integrate with Hash tree) + Trimming + Candidate Pruning]
Overall CPU saving of the Hybrid Algorithm for different percentages of items with low existential probability
Overall I/O saving of the Hybrid Algorithm for different percentages of items with low existential probability
T100R75%I6D100K HB90HD5LB10LD5
CPU cost saving occurs when there are 5% or more items with low existential probability in the dataset.
80% or more CPU cost is saved for dataset with 40% or more items with low existential probability.
I/O cost saving occurs when there are 40% or more items with low existential probability in the dataset.
In fact, this figure only shows that I/O cost saving will increase if more items are trimmed.
Actually the I/O saving should also depend on the length of hidden frequent itemsets, which can be shown by varying the (I) parameter in the dataset generation process.
Conclusion
We have defined the problem of mining frequent itemsets from an uncertain database.
The Possible World interpretation has been adopted as the theoretical foundation of the mining process.
Existing frequent itemset mining algorithms are either inapplicable to uncertain data or unacceptably inefficient on it.
We have identified the computational bottleneck of Uncertain Apriori, and
proposed a number of efficient methods that significantly reduce both CPU and I/O costs.
Future Works
Sensitivity and scalability tests on each parameter (T, I, K, HB, LB, …etc.)
Generating association rules from uncertain data
  What is the meaning of association rules mined from uncertain data?
Real case study
Other types of association rules
  Quantitative association rules
  Multidimensional association rules
Papers…
Now I am interested in this kind of association rule:
80% Eating Disorder => 90% Depression
End
Thank you