M.Phil Probation Talk: Association Rules Mining of Existentially Uncertain Data
Presenter: Chui Chun Kit. Supervisor: Dr. Benjamin C.M. Kao.
Presentation Outline
Introduction: What are association rules? How to mine association rules from a large database?
Probabilistic Data (Uncertain Data): What is probabilistic/uncertain data? Possible World interpretation of uncertain data
Mining frequent patterns from uncertain data: Present a simple algorithm to mine association rules from uncertain data; identify the computational problem
Efficient methods of mining association rules from uncertain data
Experimental Results and Discussions
Conclusion and Future Work
Section 1: Introduction
What is an association rule?
Introduction
Suppose Peter is a psychologist. He has to judge a list of psychological symptoms to make diagnoses and give treatments to his patients. All diagnosis records are stored in a transaction database.
We call each patient record a transaction, each psychological symptom an attribute with value either yes or no (i.e. a binary attribute), and the collection of patients' records a transaction database.
[Table: Psychological Symptoms Transaction Database. Each row (Patient 1, Patient 2, …) is a transaction; each column is a binary yes/no attribute: Mood Disorder, Anxiety Disorder, Eating Disorder, Obsessive-Compulsive Disorder, Depression, …, Self Destructive Disorder.]
Introduction
One day, while reviewing his patients' records, Peter discovers some patterns in his patients' psychological symptoms. E.g. patients having "mood disorder" are often also associated with "eating disorder". He would like to learn about the associations between different psychological symptoms from his patients.
Introduction
Peter may be interested in the following associations among different psychological symptoms.
Mood disorder => Eating disorder
Mood disorder => Depression
Eating disorder => Depression + Mood disorder
Eating disorder + Depression => Self destructive disorder + Mood disorder
These associations are very useful information to assist diagnosis and treatment.
Association Rules
Introduction
However, the psychological symptoms database is very large; it is impossible to analyze the associations by human inspection.
In Computer Science research, the problem of mining association rules from a transaction database was solved in 1993 by R. Agrawal with the Apriori algorithm.
Basic algorithm for mining association rules
A 2% support value means that 2% of the patients in the database have both psychological symptoms.
A 60% confidence value means that 60% of the patients having eating disorder also have depression.
Introduction - Association Rules
There are two parameters to measure the interestingness of association rules.
Support is the fraction of database transactions that contain the items in the association rule. Support shows how frequent the items in the rule are.
Confidence is the percentage of transactions containing the antecedent that also contain the consequent. Confidence shows the certainty of the rule.
Eating disorder => Depression [Support = 2%, Confidence = 60%]
(Antecedent: Eating disorder; Consequent: Depression)
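The two measures can be sketched in a few lines of Python; the toy transactions below are invented for illustration, not taken from the talk.

```python
# Sketch: computing support and confidence of the rule
# {eating_disorder} => {depression} on a toy transaction database.

def support(db, itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(db, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction
    that also contain the consequent."""
    joint = support(db, set(antecedent) | set(consequent))
    return joint / support(db, antecedent)

db = [
    {"eating_disorder", "depression", "mood_disorder"},
    {"eating_disorder", "depression"},
    {"eating_disorder"},
    {"mood_disorder"},
    {"depression"},
]

sup = support(db, {"eating_disorder", "depression"})        # 2/5 = 0.4
conf = confidence(db, {"eating_disorder"}, {"depression"})  # 2/3
```

With these invented transactions the rule holds in 2 of 5 transactions (support 40%) and in 2 of the 3 transactions containing the antecedent (confidence about 67%).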
Introduction - Association Rules
Two steps for mining association rules:
Step 1: Find ALL frequent itemsets. Itemsets are frequent if their supports are over the user-specified SUPPORT threshold.
Step 2: Generate association rules from the frequent itemsets. An association rule is generated if its confidence is over a user-specified CONFIDENCE threshold.
Given the transaction database, find ALL the association rules with SUPPORT values over 10% and CONFIDENCE values over 60%, please!
Psychological Symptoms Database
Introduction - Association Rules
The overall performance of mining association rules is determined by the first step.
For the sake of discussion, let us focus on the first step in this talk.
Section 1: Introduction
How to mine frequent itemsets from a large database?
Mining Frequent Itemsets - Problem Definition
Given a transaction database D with n attributes and m transactions.
Each transaction t is a Boolean vector representing the presence or absence of items in that transaction.
Minimum support threshold s.
Find ALL itemsets with support values over s.
     I1 I2 I3 I4 I5 … In
t1    1  0  1  1  1 …  1
t2    1  1  1  0  0 …  1
…     …  …  …  …  … …  …
tm    0  1  1  0  0 …  1
Transaction Database D
Brute-force approach: Suppose there are 5 items in the database, i.e. A, B, C, D and E. There are 2^5 = 32 itemsets in total. Scan the database once to count the supports of ALL itemsets together.
If there are n different items, there will be 2^n itemsets to count in total. If there are 20 items, there will be over 1,000,000 itemsets!!! Computationally infeasible.
The Apriori Algorithm
[Figure: the itemset lattice over items A, B, C, D, E, from the empty set (null) through the size-1, size-2, … levels up to ABCDE. Once an itemset is found to be infrequent, all of its supersets are pruned from the lattice.]
Apriori property: all subsets of a frequent itemset must also be frequent. Equivalently, every superset of an infrequent itemset must be infrequent.
The Apriori algorithm adopts an iterative approach that exploits this property to identify infrequent itemsets early, so there is no need to count their frequency.
The Apriori Algorithm - How it works
[Figure: the Apriori loop. The Subset Function counts the supports of the current candidates ({A}, {B}, {C}, {D}, {E} in iteration 1); candidates passing the support threshold become large itemsets, which Apriori-Gen joins into the next round of candidates.]
The Apriori algorithm starts by inspecting ALL size-1 items. The supports of ALL size-1 candidates are obtained by a SUBSET FUNCTION procedure that scans the database once. After obtaining the supports, candidates with support over the support threshold become large (frequent) items.
If item {A} is infrequent, then by the APRIORI PROPERTY, ALL supersets of {A} must NOT be frequent, so they are crossed out of the lattice. The remaining size-2 candidates are {BC}, {BD}, {BE}, {CD}, {CE} and {DE}.
The APRIORI-GEN procedure generates ONLY those size-(k+1) candidates which are potentially frequent.
The Apriori algorithm obtains the frequent itemsets iteratively until no more candidates are generated.
This saves the effort of counting the supports of pruned itemsets.
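The candidate-generate-and-count loop described above can be sketched as a minimal Apriori in Python; this is not the talk's implementation, and the hash-tree subset function is replaced by a plain scan for clarity.

```python
# A minimal Apriori sketch: Apriori-Gen joins frequent itemsets and
# prunes candidates with an infrequent subset; the counting step is a
# straightforward database scan.
from itertools import combinations

def apriori(db, min_sup):
    """Return all itemsets whose support count is >= min_sup."""
    items = sorted({i for t in db for i in t})
    # Size-1 candidates: count each item in one scan.
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in db) >= min_sup}
    all_freq, k = set(freq), 2
    while freq:
        # Apriori-Gen: join size-(k-1) frequent itemsets, then prune
        # candidates having an infrequent size-(k-1) subset.
        cands = {a | b for a in freq for b in freq if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Subset function: count candidate supports in one scan.
        freq = {c for c in cands if sum(c <= t for t in db) >= min_sup}
        all_freq |= freq
        k += 1
    return all_freq

db = [frozenset("ABC"), frozenset("ABD"), frozenset("AB"), frozenset("CD")]
result = apriori(db, 2)
# {A}, {B}, {C}, {D} and {A,B} are frequent at min_sup = 2.
```

On this toy database only {A,B} survives to iteration 2, so no size-3 candidates are generated and the loop stops.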
Important Detail of Apriori - Subset Function
Subset Function: scan the database transaction by transaction to increment the corresponding support counts of the candidates.
Generally there are many candidates, so the Subset Function organizes the candidates in a hash-tree data structure. Each interior node of the hash-tree contains a hash table; each leaf node contains a list of itemsets and their support counts.
Important Detail of Apriori - How are candidates stored in the hash tree?
Hash tree data structure: each interior node contains a hash table; leaf nodes contain a list of itemsets and support counts.
[Figure: a two-level hash tree storing size-2 candidates such as {1,2}, {2,4}, {3,6} and {1,5}. The level-0 hash table routes on the first item and the level-1 tables on the second item, hashing items into the buckets {1,4,7}, {2,5,8} and {3,6,9}. Candidate {1,2} is stored by hashing first on item 1, then on item 2; candidate {2,4} is hashed and stored in its slot in the same way.]
A transaction with 100 items has C(100,2) = 4950 size-2 subsets! To process a transaction, enumerate all its size-2 subsets and traverse the hash tree to increment the corresponding support counts; e.g. hash on subset {1,4} and traverse the tree to search for that candidate.
Important Detail of Apriori - How is a transaction processed with the hash tree?
[Figure: fitting transaction t1 = (1, 2, 4, …, 9924) into the hash tree. The Subset Function routes each size-2 subset of the transaction to its leaf node, where candidates such as {1,2}, {1,4} and {2,4} are stored with their support counts.]
Enumerate ALL size-2 subsets; {1,4} is one of them. When the itemset is found in a leaf, its support count is incremented (here from 0 to 1). The same procedure has to be repeated for ALL size-2 subsets of the transaction, and for ALL transactions!
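A simplified version of this routing can be sketched as follows, assuming the h(item) = item mod 3 bucketing implied by the 1,4,7 / 2,5,8 / 3,6,9 grouping in the figure and flattening each leaf to a dictionary.

```python
# Sketch of a 2-level hash tree for size-2 candidates; buckets use
# h(item) = item % 3, and each leaf maps candidates to support counts.
from collections import defaultdict
from itertools import combinations

def build_hash_tree(candidates):
    """leaf[(h(a), h(b))] -> {candidate: support_count}"""
    tree = defaultdict(dict)
    for a, b in candidates:
        tree[(a % 3, b % 3)][(a, b)] = 0
    return tree

def subset_function(tree, transaction):
    """Route every size-2 subset of the transaction to its leaf and
    increment matching candidates' support counts."""
    for a, b in combinations(sorted(transaction), 2):
        leaf = tree.get((a % 3, b % 3))
        if leaf is not None and (a, b) in leaf:
            leaf[(a, b)] += 1

candidates = [(1, 2), (2, 4), (1, 4), (3, 6)]
tree = build_hash_tree(candidates)
subset_function(tree, {1, 2, 4})
# Candidates {1,2}, {1,4} and {2,4} each get a count of 1; {3,6} stays 0.
```

Only the leaves reached by a subset's hash path are inspected, which is the point of the structure: most candidates are never touched for a given transaction.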
Section 2: Probabilistic Data
What is probabilistic data?
Probabilistic Database or Uncertain Database
In reality, when psychologists make a diagnosis, they estimate the likelihood of presence of each psychological symptom of a patient. The likelihood of presence of each symptom is represented as an existential probability.
Psychological Symptoms Uncertain Transaction Database:
           Mood Disorder | Anxiety Disorder | Eating Disorder | Obsessive-Compulsive Disorder | Depression | … | Self Destructive Disorder
Patient 1:      97%      |        5%        |       84%       |              14%              |    76%     | … |            9%
Patient 2:      90%      |       85%       |      100%       |              86%              |    65%     | … |           48%
How do we mine association rules from an uncertain database?
Other areas of probabilistic databases: pattern recognition (handwriting recognition, speech recognition, etc.), information retrieval, scientific databases.
[Table: a probabilistic database of binary features, e.g. Pattern 1 has Feature 1 with probability 90% and Feature 2 with probability 85%; Pattern 2 has them with probabilities 60% and 5%.]
Section 2: Probabilistic Data
Possible World interpretation of uncertain data, introduced by S. Abiteboul in the paper "On the Representation and Querying of Sets of Possible Worlds", SIGMOD 1987.
Possible World Interpretation - Example
A database with two psychological symptoms and two patients gives 16 possible worlds; we can discuss the supports of itemsets in each individual world.
Psychological symptoms database:
            Depression | Eating Disorder
Patient 1:     90%     |      80%
Patient 2:     40%     |      70%
[Figure: the 16 possible worlds. Each world fixes, for both patients, the presence (√) or absence (×) of S1 (Depression) and S2 (Eating Disorder); e.g. in world 1 both patients have both symptoms, while in world 16 neither patient has either symptom.]
From the uncertain database, one possibility is that both patients actually have both psychological illnesses. On the other hand, the uncertain database also captures the possibility that patient 1 has only the eating disorder while patient 2 has both illnesses. Each such possibility is called a "Possible World". Thus data uncertainty is eliminated once we focus on an individual possible world!
Possible World Interpretation - Support of itemset {Depression, Eating Disorder}
World | Support of {S1,S2} | World likelihood
  1   |         2          | 0.9 × 0.8 × 0.4 × 0.7 = 0.2016
  2   |         1          | 0.1 × 0.8 × 0.4 × 0.7 = 0.0224
  3   |         1          | 0.0504
  4   |         1          | 0.3024
  5   |         1          | 0.0864
  6   |         1          | 0.1296
  7   |         1          | 0.0056
  8   |         0          | 0.0216
  …   |         …          | …
We can discuss the support of itemset {S1,S2} in possible world 1, and also the likelihood of possible world 1 being the true world.
Question: overall, how many occurrences of itemset {S1,S2} do we expect across these possible worlds? We define the expected support as the weighted average of the support counts over ALL the possible worlds. Similarly, we can discuss the support and likelihood of possible world 2, and so on.
Possible World Interpretation
The expected support is the weighted average support count over ALL the possible worlds:
World | Weighted support (support × likelihood)
  1   | 2 × 0.2016 = 0.4032
  2   | 0.0224
  3   | 0.0504
  4   | 0.3024
  5   | 0.0864
  6   | 0.1296
  7   | 0.0056
  8   | 0
  …   | …
Expected support = 1
Notice that the world likelihoods form a discrete probability distribution over the support values of itemset {S1,S2}. Since the possible worlds are mutually exclusive, the distribution of the support of {S1,S2} is: P(support = 0) = 20.16%, P(support = 1) = 59.68%, P(support = 2) = 20.16%.
The expected support is calculated by summing up the weighted support counts of ALL the possible worlds: we expect 1 patient to have both "Eating Disorder" and "Depression".
Possible World Interpretation
Instead of enumerating all possible worlds, the expected support can be calculated directly: for each transaction, multiply the existential probabilities of the items in the itemset, and sum over all transactions:
Expected support(X) = Σ_{t ∈ D} Π_{x ∈ X} P(x ∈ t)
Psychological symptoms database:
            Depression | Eating Disorder | Weighted support
Patient 1:     90%     |      80%        | 0.9 × 0.8 = 0.72
Patient 2:     40%     |      70%        | 0.4 × 0.7 = 0.28
                                           TOTAL SUM = 1
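The possible-world expectation and the per-transaction shortcut can be cross-checked on the two-patient example; this is a sketch, with S1/S2 standing for Depression and Eating Disorder.

```python
# Checking the possible-world expectation against the shortcut formula,
# using the two-patient probabilities from the example.
from itertools import product

# probs[patient][symptom]: S1 = Depression, S2 = Eating Disorder.
probs = [{"S1": 0.9, "S2": 0.8}, {"S1": 0.4, "S2": 0.7}]
itemset = {"S1", "S2"}

# Enumerate all 16 possible worlds: each (patient, symptom) pair is
# either present or absent.
pairs = [(p, s) for p in range(2) for s in ("S1", "S2")]
expected = 0.0
for world in product([True, False], repeat=len(pairs)):
    present = dict(zip(pairs, world))
    likelihood, support = 1.0, 0
    for (p, s), here in present.items():
        likelihood *= probs[p][s] if here else 1 - probs[p][s]
    for p in range(2):
        if all(present[(p, s)] for s in itemset):
            support += 1
    expected += likelihood * support

# Shortcut: sum over transactions of the product of the item
# probabilities within the transaction.
shortcut = sum(probs[p]["S1"] * probs[p]["S2"] for p in range(2))
# Both give 0.72 + 0.28 = 1.0.
```

The agreement is no accident: by linearity of expectation, the expected support is the sum over transactions of the probability that the whole itemset is present in that transaction.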
Mining Frequent Itemsets from Probabilistic Data - Problem Definition
Given an uncertain database D in which each item of a transaction is associated with an existential probability, and a user-specified support threshold s, return ALL the itemsets having expected support greater than or equal to |D| × s.
In other words, find ALL the itemsets that are expected to be frequent according to the existential probabilities in the uncertain database.
Section 3: Mining Frequent Patterns from Uncertain Data
The Uncertain Apriori algorithm
Uncertain Apriori Algorithm
All the procedures are the same as in the conventional association rule mining algorithm; the only difference is in the Subset Function. For example, for a transaction containing item 1 with probability 70% and item 4 with probability 30%, the expected support count of candidate {1,4} is increased by 0.7 × 0.3 = 0.21.
Uncertain Apriori Algorithm
[Figure: transaction t1 = (1 (70%), 2 (50%), 4 (30%), …, 9924 (30%)) being fitted into the hash tree. Instead of integer support counts, each candidate itemset is associated with an expected support count, and the Subset Function increments it by the expected support contributed by the transaction, e.g. candidate {1,4} goes from 0 to 0.21.]
Psychological Symptoms Uncertain Transaction Database:
            Mood Disorder | Anxiety Disorder | Eating Disorder | Obsessive-Compulsive Disorder | Depression | … | Self Destructive Disorder
Patient 1:       90%      |        2%        |       99%       |              97%              |    92%     | … |            5%
Patient 2:       89%      |       96%       |       80%       |               4%              |     8%     | … |            3%
Patient 3:        8%      |        6%        |       79%       |              10%              |     5%     | … |           98%
…
Thus we can apply Uncertain Apriori to an uncertain database to mine ALL the frequent itemsets.
But why does the algorithm execute for so long, and sometimes not even terminate?
Computational Issue
Each item (attribute) of a transaction (object) is associated with an existential probability. Besides the items with very high probability of presence, there are a large number of items with relatively low probability of presence.
Computational Issue
[Figure: a transaction with some low-existential-probability items, t1 = (1 (70%), 2 (50%), 4 (30%), 7 (3%), 10 (2%), …, 991 (60%)), being fitted into the hash tree. The expected supports contributed to the size-2 candidates in one leaf are tiny for the low-probability items: {1,4}: 0.21, {1,7}: 0.021, {1,10}: 0.014, {4,7}: 0.009, {4,10}: 0.006, {7,10}: 0.0006.]
These are many insignificant subset increments: if {7,10} turns out to be infrequent after scanning the database, ALL of these subset increments are redundant.
Computational Issue
Preliminary experiment is conducted to verify the computational bottleneck of mining uncertain database. In general, uncertain database will have “longer”
transactions. (i.e. more items per transaction) Some items with high existential probabilities. Some items with low existential probabilities.
In our current study, we focus on dataset with bimodal existential probability distribution.
Computational Issue
The synthetic datasets simulate a bimodal distribution of existential probabilities: 7 datasets with the same frequent itemsets, varying the percentage of items with low existential probability: 0%, 33.33%, 50%, 60%, 66.67%, 71.4% and 75% for datasets 1 to 7.
Preliminary Study
[Figures: the number of large itemsets in each iteration, the number of candidate itemsets in each iteration, and the time spent on subset checking in each iteration, for the different datasets.]
ALL datasets have the same large itemsets, yet there is a sudden burst in the number of candidates in the second iteration. Comparing the dataset with 0% low-existential-probability items against the one with 75%: since both have the same frequent itemsets, the subset increments for the 75% of low-existential-probability items may actually be redundant, so there is potential to reduce the execution time. The time spent on subset checking in each iteration confirms that the computational bottleneck occurs in iteration 2.
Section 4: Efficient Methods of Mining Frequent Itemsets from Existentially Uncertain Data
Efficient Method 1: Data Trimming
Avoid insignificant subset increments
Method 1 - Data Trimming Strategy
Direction: try to avoid incrementing those insignificant expected support counts. This saves the effort of: traversing the hash tree; computing the expected support (multiplications of floating-point variables); and the I/O for retrieving the items with very low existential probability.
Method 1 - Data Trimming Strategy
Question: which items should be trimmed? Intuitively, items with low existential probability should be trimmed, but how low? For the time being, let us assume there is a user-specified trimming threshold.
Method 1 - Data Trimming Strategy
Create a trimmed database by trimming out all items with existential probability lower than the trimming threshold. During the trimming process, some statistics are kept for error estimation when mining the trimmed database: the total trimmed expected support count of each item; the maximum existential probability among the trimmed occurrences of each item; and other information, e.g. an inverted list, a signature file, etc.
Uncertain database:
      I1   I2   I3  …  I4000
t1:  90%  80%   3% …    1%
t2:  80%   4%  85% …   78%
t3:   2%   5%  86% …   89%
t4:   5%  95%   3% …  100%
t5:  94%  95%  85% …    2%
Trimmed database (showing I1 and I2):
      I1   I2
t1:  90%  80%
t2:  80%
t4:       95%
t5:  94%  95%
Statistics:
      Total expected support trimmed | Maximum existential probability of trimmed item
I1:               1.1               |                    5%
I2:               1.2               |                    3%
The Subset Function scans the trimmed database and counts the expected support of every size-2 candidate. We expect mining the trimmed database to save a lot of I/O and computational cost.
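A minimal sketch of the trimming pass, assuming the statistics kept are exactly the two named above (total trimmed expected support and maximum trimmed probability); the threshold and data below are illustrative, not the talk's.

```python
# Sketch of the trimming module: items below the trimming threshold are
# removed from each transaction, and per-item statistics are kept for
# later error estimation.
def trim(db, threshold):
    trimmed_db, stats = [], {}
    for t in db:
        kept = {}
        for item, p in t.items():
            if p >= threshold:
                kept[item] = p
            else:
                total, max_p = stats.get(item, (0.0, 0.0))
                # (total trimmed expected support, max trimmed prob)
                stats[item] = (total + p, max(max_p, p))
        trimmed_db.append(kept)
    return trimmed_db, stats

db = [
    {"I1": 0.90, "I2": 0.80, "I3": 0.03},
    {"I1": 0.80, "I2": 0.04, "I3": 0.85},
    {"I1": 0.02, "I2": 0.05, "I3": 0.86},
]
trimmed_db, stats = trim(db, 0.5)
# I1 is trimmed from t3 only: total 0.02, max 2%.
# I2 is trimmed from t2 and t3: total 0.09, max 5%.
```

The stats dictionary is what the Pruning Module later consults to bound how much expected support an itemset could have lost to trimming.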
Method 1 - Data Trimming Strategy - Trimming Process
[Figure: the uncertain database is passed through the Trimming Module, producing a trimmed database plus statistics. The Apriori loop (Apriori-Gen, Subset Function with hash tree, k = k + 1) runs on the trimmed database; size-k infrequent itemsets go to the Pruning Module, which outputs size-k potentially frequent itemsets, and the Patch Up Module recovers any missed frequent itemsets.]
The uncertain database is first passed into the Trimming Module, which removes the items with low existential probability and gathers statistics during the trimming process. During trimming, the "true" expected support counts of the size-1 candidates are counted, i.e. the size-1 large itemsets have no false negatives. The size-1 frequent items are then passed into the APRIORI-GEN procedure to generate size-2 candidates.
Notice that the infrequent itemsets are only infrequent in the trimmed database; they may contain some itemsets that are truly frequent in the original database. The Pruning Module uses the statistics gathered by the Trimming Module to estimate the error and identify the potentially frequent itemsets among them. Here there are two strategies: use the potentially frequent itemsets to generate the size-(k+1) candidates, or do not use them. Finally, all the potentially frequent itemsets are checked against the original database to verify their true supports.
Method 1 - Data Trimming Strategy - Pruning Module
The role of the Pruning Module is to identify the itemsets which are infrequent in the trimmed database but frequent in the original database. The expected support of an itemset {A,B} in the original database splits into two parts: the count where both items A and B survive in the trimmed database, which is obtained by mining the trimmed database, and the remaining part involving trimmed occurrences, which has to be estimated.
If the upper bound of the trimmed-database count plus the estimated part is greater than or equal to the minimum expected support requirement, {A,B} is regarded as potentially frequent. Otherwise, {A,B} cannot be frequent in the original database and can be pruned.
Method 1 - Data Trimming Strategy - Max Count Pruning Strategy
The pruning strategy depends on the statistics from the Trimming Module. For each size-1 item, it keeps the total expected support count trimmed and the maximum existential probability among the trimmed occurrences.
Global statistics:
      Total expected support trimmed | Maximum existential probability of trimmed item
I1:               1.5               |                    5%
I2:               1.2               |                    3%
Since these statistics are "global" to the whole database, this method is called the Global Max Count Pruning Strategy. Using global counts to bound the whole database is sometimes loose; we may instead keep "local" statistics to obtain tighter bounds, giving the Local Max Count Pruning Strategy.
Local statistics (the original database is split into parts a to e):
     Part | Total expected support trimmed | Maximum existential probability of trimmed item
I1:   a   |             16.6               |      2%
      b   |             14.2               |      0.5%
      c   |             13                 |      6%
      d   |              0.1               |      1%
      e   |              2.7               |      0.7%
I2:   a   |              2.7               |      1.1%
      b   |             19.5               |      3%
      c   |              2.6               |      7%
      d   |             12.3               |      2.4%
      e   |              0.3               |      0.2%
Method 1 - Data Trimming Strategy - Max Count Pruning Strategy
Upper-bound estimates for the trimmed portions of an itemset's expected support can be derived from the per-item statistics gathered in iteration 1.
Method 1 - Data Trimming Strategy - Patch Up Module
[Figure: the same framework as above; the Pruning Module outputs the size-k potentially frequent itemsets, which the Patch Up Module checks against the original uncertain database to recover any missed frequent itemsets.]
The Pruning Module identifies a set of potentially frequent itemsets; the Patch Up Module verifies their true frequencies. There are two strategies: the One-Pass Patch Up Strategy and the Multiple-Passes Patch Up Strategy.
Method 1 - Data Trimming Strategy - Determining the Trimming Threshold
Question: which items should be trimmed? Before scanning the database and incrementing the support counts of the candidates, we cannot deduce which itemsets are infrequent. We can, however, guess a trimming threshold from the statistics gathered in previous iterations.
Method 1 - Data Trimming Strategy - Determining the Trimming Threshold
[Figure: the cumulative expected support of item A, with A's existential probabilities ordered in descending order.]
From the statistics of the previous iteration, order the existential probabilities of each size-1 item in descending order and plot the cumulative support. E.g. item A has its expected support just over the support threshold: it is marginally frequent, so its supersets are potentially infrequent. If a superset is infrequent, it will not be frequent in the trimmed database either; we want to trim those items such that the error estimation is tight enough to prune the superset in the Pruning Module. Use the existential probability at the point where the cumulative support curve crosses the support threshold as the trimming threshold.
Method 1 - Data Trimming Strategy - Determining the Trimming Threshold
[Figure: the cumulative expected support of item B, with B's existential probabilities ordered in descending order.]
E.g. item B has its expected support much larger than the support threshold, so its supersets are likely to be frequent. The expected support contributed by the items in the low-probability tail is insignificant; use the existential probability at the start of that tail as the trimming threshold.
Efficient Method 2: Decremental Pruning
Identify infrequent candidates during the database scan
Method 2 - Decremental Pruning
In some cases, it is possible to identify an itemset as infrequent before scanning the whole database. For instance, suppose the minimum support threshold is 100 and the expected support of item A is 101.
Uncertain database (100K transactions):
        A    …
t1:    70%   0%
t2:    50%   0%
…       …    …
t100K:  …    …
The total expected support of A is 100.3 from transaction t2 onwards, and 99.8 from transaction t3 onwards. So after scanning transaction t2, we can conclude that item A is effectively infrequent over t3 to t100K, and ALL candidate itemsets containing item A must be infrequent and can be pruned.
Method 2 - Decremental Pruning
Before scanning the database, define two "decremental counters" for itemset {A,B}, one per item; for item A (the notation follows the worked example below):
d_0(A, AB) = expected support of A − minimum support
d_t(A, AB) represents how much larger than the minimum support the expected support count of {A,B} could still be if, from transaction t+1 to the end of the database, ALL occurrences of item A matched with item B and ALL those matching Bs had 100% existential probability.
While scanning the transactions, the decremental counters are updated by:
d_t(A, AB) = d_{t−1}(A, AB) − p_t(A) × (1 − p_t(B))
Once a counter drops below zero, {A,B} can never reach the minimum support and can be pruned.
Method 2 - Decremental Pruning - Brute-Force Method
Example: support threshold 50%, so min_sup = 2. Expected supports: A = 2.6, B = 2.1, C = 2.2.
Uncertain database:
       A     B     C
T1:  100%   50%   30%
T2:   90%   80%   70%
T3:   30%   40%   90%
T4:   40%   40%   30%
For candidate {A,B}, initialize d_0(A, AB) = 2.6 − 2 = 0.6: if ALL occurrences of A matched with B and ALL matching Bs had 100% existential probability over the whole database, the expected support count of {A,B} would be 0.6 larger than min_sup. Updating per transaction: after T1, d_1(A, AB) = 0.6 − [1 × (1 − 0.5)] = 0.1 (if the perfect matching held from transaction 2 to 4, the expected support count of {A,B} would still be 0.1 larger than min_sup). After T2, d_2(A, AB) = 0.1 − [0.9 × (1 − 0.8)] = −0.08 < 0.
We can therefore conclude that candidate {A,B} is infrequent without scanning T3 and T4, which saves computational effort in the subset function.
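The counter updates can be sketched as follows; the update rule d_t = d_{t−1} − p_t(A) × (1 − p_t(B)) is taken from the worked example above, and the data reproduces its four transactions.

```python
# Decremental-counter sketch for candidate {A,B}: min_sup = 2,
# expected support of A = 2.6, following the worked example.
def decremental_scan(db, x, y, exp_sup_x, min_sup):
    """Return the (1-based) transaction index at which candidate {x,y}
    is proven infrequent, or None if the counter never goes negative."""
    d = exp_sup_x - min_sup          # d0(x, xy)
    for i, t in enumerate(db, start=1):
        # Each transaction costs at least p(x) * (1 - p(y)) of the
        # best-case expected support of {x,y}.
        d -= t[x] * (1 - t[y])
        if d < 0:
            return i
    return None

db = [
    {"A": 1.0, "B": 0.5, "C": 0.3},
    {"A": 0.9, "B": 0.8, "C": 0.7},
    {"A": 0.3, "B": 0.4, "C": 0.9},
    {"A": 0.4, "B": 0.4, "C": 0.3},
]
# d0 = 0.6; after T1: 0.1; after T2: -0.08 < 0, so {A,B} is pruned
# without scanning T3 and T4.
pruned_at = decremental_scan(db, "A", "B", 2.6, 2)
```

The counter trace 0.6 → 0.1 → −0.08 matches the hand calculation, and the scan stops at transaction 2.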
Method 2 - Decremental Pruning - Brute-Force Method
This method is infeasible because each candidate has to be associated with at least 2 decremental counters, and even when an itemset is identified as infrequent, the subset function still has to traverse the hash tree down to the leaf nodes to retrieve the corresponding counters before finding out.
[Figure: the hash tree with the decremental counters, e.g. d0(A,AD) and d0(D,AD) for candidate AD, stored alongside the expected support counts in the leaf nodes.]
Method 2 - Decremental Pruning - Aggregate by Item Method
The aggregate by item method aggregates the decremental counters and maintains an upper bound on them. Suppose there are three size-2 candidates {A,B}, {A,C} and {B,C}: the brute-force method keeps 6 decremental counters in total. Aggregating the counters d0(A,AB) and d0(A,AC) into a single counter d0(A) gives an upper bound on the two counters; when d(A) drops below zero, every candidate containing A can be pruned at once.
Method 2 - Decremental Pruning - Aggregate by Item Method
Using the same uncertain database as above, initialize d0(A) = 0.6, then scan the transactions and decrement the aggregated counter by p_t(A) × (1 − max{p_t(B), p_t(C)}), the smallest decrement among the counters it aggregates. After transaction T1: d1(A) = 0.6 − [1 × (1 − 0.5)] = 0.1; since no counter is smaller than zero yet, we cannot conclude that any candidate is infrequent. After transaction T2: d2(A) = 0.1 − [0.9 × (1 − 0.8)] = −0.08. Since d2(A) is smaller than zero, {A,B} and {A,C} are infrequent and can be pruned.
Method 2 - Decremental Pruning - Hash-Tree Integration Method
Rather than loosely aggregating the decremental counters by item, the aggregation can be based on the hash function used in the subset function.
Subset Function: recall that the brute-force approach stores the decremental counters in the leaf nodes. Here, the aggregated decremental counters are stored in the hash (interior) nodes instead. When any of these counters becomes smaller than or equal to zero, the corresponding itemsets in the leaf nodes below cannot be frequent and can be pruned.
Method 2 - Decremental Pruning - Hash-Tree Integration Method
[Figure: the hash-tree integration method aggregates the decremental counters according to the hash function, attaching the aggregated counters to the interior nodes, starting from the root of the hash tree.]
Method 2 - Decremental Pruning - Hash-Tree Integration Method - Improving the Pruning Power
The hash-tree is a prefix tree constructed on the lexicographic order of the items, so an item earlier in the order is the prefix of more itemsets. For example, among the size-2 candidates {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}, 3 itemsets fall under the root's decremental counter for prefix A but only 1 under the counter for prefix C. If the A counter becomes negative during the database scan, we can prune 3 itemsets; if the C counter becomes negative, we can prune only 1 itemset.
Method 2 - Decremental Pruning - Hash-Tree Integration Method
Our strategy is therefore to reorder the items by their expected supports in ascending order. Items with small expected support start with small decremental counters, so placing them first in the order puts the counters most likely to become negative on the prefixes that cover the most itemsets, maximizing the number of candidates pruned.
Efficient Method 3: Candidate Filtering
Identify infrequent candidates before the database scan
Method 3 - Candidate Filtering
It is possible to identify some infrequent candidate itemsets before scanning the database to verify their supports.
A B C
T1 30% 50% 100%
T2 70% 80% 90%
T3 90% 40% 30%
T4 30% 40% 40%
Uncertain Database
Expected Support 2.2 2.1 2.6
Maximum existential probability 90% 80% 100%
min_sup = 2
Size-2 candidate itemsets: {A,B} {A,C} {B,C}
1.76 2.2 2.1
For instance, after scanning the database, the expected supports of items A, B and C are obtained.
During the database scan, keep the maximum existential probability of each item.
Size-2-candidate itemsets are generated.
From the expected supports and maximum existential probabilities obtained above, we can obtain an upper bound on the expected support of each candidate BEFORE scanning the database again.
For {A,B}, even if every occurrence of item A coincided with item B at B's maximum existential probability, {A,B} would have expected support 2.2 * 80% = 1.76.
This is an upper bound on the expected support of {A,B}, and it is smaller than min_sup. Thus {A,B} must be infrequent and can be pruned.
Maximum expected support of size-2-candidates
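The bound on this slide can be reproduced directly from the four-transaction table; a minimal sketch (using, as the slide does, ExpSup of one item times the other item's maximum existential probability):

```python
from itertools import combinations

# Uncertain database from the slide: existential probabilities per transaction.
db = [
    {'A': 0.3, 'B': 0.5, 'C': 1.0},
    {'A': 0.7, 'B': 0.8, 'C': 0.9},
    {'A': 0.9, 'B': 0.4, 'C': 0.3},
    {'A': 0.3, 'B': 0.4, 'C': 0.4},
]
min_sup = 2

# One database scan yields each item's expected support and its maximum
# existential probability.
exp_sup = {i: sum(t[i] for t in db) for i in 'ABC'}
max_p = {i: max(t[i] for t in db) for i in 'ABC'}

# Even if every occurrence of x coincided with y at y's maximum existential
# probability, ExpSup({x,y}) could be at most ExpSup(x) * max_p(y).
for x, y in combinations('ABC', 2):
    bound = exp_sup[x] * max_p[y]
    verdict = 'pruned' if bound < min_sup else 'kept'
    print(f'{{{x},{y}}}: bound {bound:.2f} -> {verdict}')
```

The symmetric bound ExpSup(y) * max_p(x) is equally valid; taking the minimum of the two would only tighten the filter.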
Section 5 - Experimental Results and Discussions
Experiments - Synthetic datasets
Data associations: generated by the IBM Synthetic Generator.
  Average length of each transaction (T)
  Average length of hidden frequent patterns (I)
  Number of transactions (D)
Data uncertainty: we simulate the situation where some items have high existential probabilities while others have low existential probabilities.
  Bimodal distribution:
  Base of high existential probabilities (HB)
  Base of low existential probabilities (LB)
  Standard deviations of high and low existential probabilities (HD, LD)
  Percentage of items with low existential probabilities (R)
T100R75%I6D100K HB90HD5LB10LD5
Experiments - Implementation
Implemented in the C programming language.
Machine: 2.6 GHz CPU, 1 GB memory, Fedora Linux.
Experimental settings: T100R75%I6D100K HB90HD5LB10LD5 (136 MB), support threshold 0.5%.
T100R75%I6D100K HB90HD5LB10LD5
Experimental Results - Trimming Method
[Chart: Execution time (s) vs. iteration (1-8); series: Uncertain Apriori, Trimming - Global Max Count, Trimming - Local Max Count]
T100R75%I6D100K HB90HD5LB10LD5
Since we use the one-pass patch up strategy, the trimming methods have an extra Patch Up phase.
Iteration 2 is computationally expensive because there are many candidates, leading to heavy computational effort in the subset function.
Trimming methods successfully reduce the number of support-count increments in ALL iterations.
Even counting the time spent on the Patch Up phase, the trimming methods still show a significant performance gain.
Execution time of Trimming Methods VS Uncertain Apriori in each iteration
For Uncertain Apriori, each transaction has 100C2 = 4950 size-2 subsets. For Trimming, each trimmed transaction has only about 25C2 = 300 size-2 subsets!
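The subset counts quoted here are simple binomial coefficients; the following sketch just checks the arithmetic and the resulting per-transaction saving:

```python
from math import comb

# Size-2 subsets of a transaction with n items: C(n, 2).
full = comb(100, 2)    # untrimmed transaction of 100 items
trimmed = comb(25, 2)  # trimmed to its ~25 high-probability items
print(full, trimmed, full / trimmed)  # 4950 300 16.5
```

So each trimmed transaction triggers about 16.5 times fewer subset lookups in iteration 2.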
[Chart: CPU cost (s) vs. iteration (1-8); series: Uncertain Apriori, Trimming - Global Max Count, Trimming - Local Max Count]
Experimental Results - CPU Cost Saving by Trimming
[Chart: CPU cost saving (s) vs. iteration (1-8); series: Trimming - Global Max Count, Trimming - Local Max Count]
[Chart: CPU cost saving (%) vs. iteration (2-6); series: Trimming - Global Max Count, Trimming - Local Max Count]
T100R75%I6D100K HB90HD5LB10LD5
CPU Cost of Trimming Methods VS Uncertain Apriori in each iteration
Negative CPU saving in iteration 1 because time is spent on gathering the statistics for the Pruning Module.
CPU Cost Saving in each iteration
Percentage of CPU Cost Saving from iteration 2 to 6
Trimming methods achieve high computational saving in iterations where CPU cost is significant.
Experimental Results - I/O Cost Saving by Trimming
[Chart: I/O cost (s) vs. iteration (1-8); series: Uncertain Apriori, Trimming - Global Max Count, Trimming - Local Max Count]
[Chart: I/O cost saving (s) vs. iteration (1-8); series: Trimming - Global Max Count, Trimming - Local Max Count]
T100R75%I6D100K HB90HD5LB10LD5
I/O Cost of Trimming Methods VS Uncertain Apriori in each iteration
I/O Cost Saving of Trimming Methods in each iteration
Trimming Methods have extra I/O effort in iteration 2 because they have to scan the original database and also create the trimmed database.
I/O cost saving occurs from iteration 3 to iteration 6. Hence, I/O cost saving increases when there are longer frequent itemsets.
Experimental Results - Varying Support Threshold
[Chart: Total execution time (s) vs. expected support threshold (0.1%-1.0%); series: Uncertain Apriori, Trimming - Global Max Count, Trimming - Local Max Count]
T100R75%I6D100K HB90HD5LB10LD5
Execution time of Trimming Methods VS Uncertain Apriori for different support thresholds
The rate of increase in execution time of Trimming Method is smaller than that of Uncertain Apriori.
[Chart: Total execution time (s) vs. percentage of items with low existential probability R (0%-100%); series: Uncertain Apriori, Trimming (Global Max Count), Trimming (Local Max Count)]
T100R?%I6D100K HB90HD5LB10LD5
Execution time of Trimming Methods VS Uncertain Apriori for different percentages of items with low existential probability
All the datasets contain the same set of frequent itemsets.
The execution time of the Trimming Methods scales almost linearly with the percentage of items with low existential probability.
Experimental Results - Varying percentage of items with low existential probability
Experimental Results Decremental Pruning
[Chart: Percentage of candidates pruned vs. fraction of database scanned (0.1-1.0); series: Decremental (Aggregate by items), Decremental (Integrate with Hash tree)]
[Chart: Execution time (s) vs. percentage of items with low existential probability R; series: Uncertain Apriori, Decremental (Aggregate by items), Decremental (Integrate with Hash tree)]
T100R75%I6D100K HB90HD5LB10LD5
Percentage of Candidates Pruned during database scan for 2nd iteration
Execution time of Decremental Pruning VS Uncertain Apriori for different percentages of items with low existential probability
Pruning power of the Decremental Methods in 2nd iteration.
The “Integrate with Hash Tree” method outperforms the “Aggregated by items” method.
Although the "Integrate with Hash Tree" method can prune twice as many candidates as the "Aggregate by items" method, the time saving is not significant. This is because the "Integrate with Hash Tree" method has more overhead.
Experimental Results Varying percentage of items with low existential probability
[Chart: Execution time (s) vs. percentage of items with low existential probability R (0%-100%); series: Uncertain Apriori, Trimming (Global Max Count), Trimming (Local Max Count), Decremental (Aggregate by items), Decremental (Integrate with Hash tree)]
T100R75%I6D100K HB90HD5LB10LD5
The Trimming and Decremental Methods can be combined to form a Hybrid Algorithm.
Execution time of Decremental and Trimming Methods VS Uncertain Apriori for different percentages of items with low existential probability
Experimental Results - Hybrid Algorithms
[Chart: Execution time (s) vs. percentage of items with low existential probability R; series: Uncertain Apriori, Decremental (Aggregate by items), Decremental (Integrate with Hash tree), Decremental (Aggregate by items) + Trimming, Decremental (Integrate with Hash tree) + Trimming, Decremental (Integrate with Hash tree) + Trimming + Candidate Pruning]
T100R75%I6D100K HB90HD5LB10LD5
Execution time of Different Combinations VS Uncertain Apriori for different percentages of items with low existential probability
Combining the 3 proposed methods achieves the smallest execution time.
Experimental Results - Varying percentage of items with low existential probability
[Chart: Total CPU cost saving (%) vs. percentage of items with low existential probability R; series: Decremental (Integrate with Hash tree) + Trimming + Candidate Pruning]
[Chart: Total I/O cost saving (%) vs. percentage of items with low existential probability R; series: Decremental (Integrate with Hash tree) + Trimming + Candidate Pruning]
Overall CPU saving of the Hybrid Algorithm for different percentages of items with low existential probability
Overall I/O saving of the Hybrid Algorithm for different percentages of items with low existential probability
T100R75%I6D100K HB90HD5LB10LD5
CPU cost saving occurs when there are 5% or more items with low existential probability in the dataset.
80% or more CPU cost is saved for dataset with 40% or more items with low existential probability.
I/O cost saving occurs when there are 40% or more items with low existential probability in the dataset.
In fact, this figure only shows that I/O cost saving will increase if more items are trimmed.
Actually the I/O saving should also depend on the length of hidden frequent itemsets, which can be shown by varying the (I) parameter in the dataset generation process.
Conclusion
We have defined the problem of mining frequent itemsets from an uncertain database.
The Possible World interpretation has been adopted as the theoretical foundation of the mining process.
Existing frequent itemset mining algorithms are either inapplicable to uncertain data or unacceptably inefficient on it.
We have identified the computational bottleneck of Uncertain Apriori, and
proposed a number of efficient methods that significantly reduce both CPU and I/O costs.
Future Works
Sensitivity and scalability tests on each parameter (T, I, K, HB, LB, …etc.)
Generating association rules from uncertain data
  What is the meaning of association rules mined from uncertain data?
Real case study
Other types of association rules
  Quantitative association rules
  Multidimensional association rules
Papers…
Now I am interested in this kind of association rule:
80% Eating Disorder => 90% Depression
End
Thank you