Efficient discovery of the top-K optimal dependency rules with Fisher's exact test of significance
Wilhelmiina Hämäläinen
Department of Computer Science
University of Helsinki
Finland
ICDM’10 – p.1/23
Problem
Given a set of binary attributes R = {A1, . . . , Ak}, search for the most significant positive and negative dependency rules X → A, where X ⊊ R, A ∈ R \ X!
• Notation: A = 1 ≡ A and A = 0 ≡ ¬A.
• Negative dependency between X and A, if P(XA) < P(X)P(A) ⇔ P(X¬A) > P(X)P(¬A)
• ≡ positive dependency between X and ¬A ⇒ it is enough to search for positive dependencies for any consequence A = a, A ∈ R, a ∈ {0, 1}.
Requirements for X → A = a
• Sufficiently significant by Fisher's exact test:

  pF(X → A = a) = Σᵢ C(m(X), m(XA=a)+i) · C(m(¬X), m(¬X, A≠a)+i) / C(n, m(A=a))

  where C(·,·) is the binomial coefficient and m(·) the number of rows satisfying a condition.
• Non-redundant (X contains no extra attributes which do not improve the dependency):
  ∄Y ⊊ X such that pF(X → A = a) ≥ pF(Y → A = a).
No other restrictions (like minimum frequency thresholds)!
Example
Data: R = {A, B, C, D}, n = 100

  set       freq.
  ABC¬D     10
  A¬B¬CD    85
  ¬AB¬CD    5

Best non-redundant rules:

  rule       pF
  AD → ¬B    3.9 · 10⁻¹⁸
  D → ¬C     5.8 · 10⁻¹⁴
  AB → C     5.8 · 10⁻¹⁴
  AB → ¬D    5.8 · 10⁻¹⁴
  C → B      1.7 · 10⁻¹⁰
  D → ¬B     1.7 · 10⁻¹⁰

e.g. AC → B redundant
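The pF values in the table can be reproduced from the three row counts. For each of these rules the antecedent rows all carry the consequent, so only the i = 0 term of Fisher's sum contributes (a sketch using Python's math.comb, assuming the data above):

```python
from math import comb

# D -> ¬C: m(D) = 90, all 90 D-rows have ¬C, m(¬C) = 90
p_d_notc = comb(90, 90) * comb(10, 10) / comb(100, 90)
# AD -> ¬B: m(AD) = 85, all 85 rows have ¬B, m(¬B) = 85
p_ad_notb = comb(85, 85) * comb(15, 15) / comb(100, 85)
# C -> B: m(C) = 10, all 10 C-rows have B, m(B) = 15, m(¬C¬B) = 85
p_c_b = comb(10, 10) * comb(90, 85) / comb(100, 15)

print(f"{p_d_notc:.1e}, {p_ad_notb:.1e}, {p_c_b:.1e}")
# prints 5.8e-14, 3.9e-18, 1.7e-10 -- matching the table
```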
Algorithm: the main idea
• Generate an enumeration tree and keep a record of the possible consequences in each node.
• Consequence Ai = ai is impossible in set X if, for all sets Q, the rule (X \ {Ai})Q → Ai = ai is insignificant or redundant.
[Tree diagram: enumeration tree over attributes A, B, C, D; each node stores a pos/neg bitvector of possible consequences, e.g. the consequences still possible in node DA.]
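The per-node bookkeeping can be sketched as one bitvector per node, with one bit per candidate consequence (both A = 1 and A = 0 for each attribute). This is an illustrative sketch of the step-1 initialization; the function name and encoding are mine:

```python
def combine_parents(parent_vectors, width):
    """Initialize a new node's possible consequences: a consequence can
    survive in a child node only if it is still possible in EVERY
    parent, i.e. the bitwise AND of the parents' vectors
    (bit = 1 means the consequence is still possible)."""
    v = (1 << width) - 1          # start with all consequences possible
    for pv in parent_vectors:
        v &= pv
    return v

# Example with 4 attributes -> 8 candidate consequences (pos/neg):
# only bits possible in both parents remain set in the child.
child = combine_parents([0b11011010, 0b10011011], 8)
```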
Algorithm: the stopping criterion
A node (and its subtree) can be pruned out when no possible consequences are left.
• The algorithm stops when no more nodes can be created.
• Note: fr = 0 is not a sufficient condition for pruning out a node!
• However, no children are created for such a node.
Algorithm: possible consequences
Given a node (set) X, its parents are Yi = X \ {Ai} for all Ai ∈ X.
1. Initialization: combine the parents' consequence vectors (bit-and).
2. Updating: estimate lower bounds for pF((X \ {Ai})Q → Ai = ai) and decide if Ai = ai is impossible in node X.
   • LB1, when only m(Ai = ai) is known;
   • LB2, when m(Ai = ai) and m(X) are known, Ai ∉ X;
   • LB3, when m(Ai = ai) and m(X) are known, Ai ∈ X.
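The exact forms of LB1–LB3 are not spelled out on the slide. A simplest bound in the LB1 spirit (using only n and m(A = a)) follows because every numerator term of pF is a product of binomial coefficients ≥ 1; this is my own sketch, not necessarily the paper's formula:

```python
from math import comb

def lb1(n, m_a):
    """Lower bound on pF for ANY rule with consequence A = a:
    pF >= 1 / C(n, m(A = a)), since each term's numerator is a product
    of binomial coefficients >= 1. If the bound already exceeds the
    current threshold p_max, consequence A = a is impossible in every
    remaining node."""
    return 1 / comb(n, m_a)

# With the example data (m(A) = 95) and p_max = 1.2e-8 from the
# level-1 simulation: lb1(100, 95) = 1/C(100, 5) ~ 1.3e-8 > p_max,
# so A can be discarded as a consequence everywhere.
```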
Algorithm: possible consequences
3. Minimality condition: if P(Ai = ai | Yi) = 1.0, then Ai, ¬Ai, and all Aj = aj with Aj ∉ X are impossible.
4. Lapis philosophorum principle: if Ai = ai is impossible in X, it is impossible in the parent node Yi = X \ {Ai} and all its children.
   • Efficient pruning! (Chess, Accidents and Pumsb could not be handled without LP.)
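The LP bookkeeping can be sketched with sets (names and encoding are mine; the actual implementation keeps bitvectors per node):

```python
def lp_prune(impossible, node, cons):
    """Lapis philosophorum (step 4): a consequence Ai = ai found
    impossible in `node` is also impossible in the parent node \ {Ai};
    through the parent's consequence vector (combined by bit-and), the
    parent's other children inherit the pruning as well."""
    attr, _value = cons                   # cons = (attribute, 0 or 1)
    parent = node - frozenset([attr])
    impossible.setdefault(parent, set()).add(cons)

impossible = {}
# e.g. ¬D impossible in node CD  =>  also impossible in parent node C
lp_prune(impossible, frozenset({'C', 'D'}), ('D', 0))
```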
Simulation (level 1)
Search for the 10 best rules, when pmax = 1.2 · 10⁻⁸.
[Tree diagram, level 1: LB1 makes A and ¬A impossible consequences (with m(A) = 95, 1/C(100, 5) ≈ 1.3 · 10⁻⁸ > pmax); otherwise the possible consequences are determined by LB2.]
Simulation (level 2)
[Tree diagram: each new level-2 node's possible consequences are initialized by combining the parents' bitvectors (bit-and).]
Simulation (level 2)
[Tree diagram: rule C → B is minimal (more specific rules would be redundant); B and D become impossible consequences.]
Simulation (level 2)
[Tree diagram: LP sets B as impossible in node C.]
Simulation (level 2)
[Tree diagram: rules D → ¬C and C → ¬D are minimal; C, ¬C and ¬D become impossible consequences; node CD is removed.]
Simulation (level 2)
[Tree diagram: node CD removed; LP sets ¬D as impossible in node C, and C and ¬C as impossible in node D.]
Simulation (level 2)
[Tree diagram: node CA: no rules found.]
Simulation (level 2)
[Tree diagram: LB3 makes B an impossible consequence; rule D → ¬B is found (not minimal).]
Simulation (level 2)
[Tree diagram: LP: B is an impossible consequence also in node D.]
Simulation (level 2)
[Tree diagram: nodes BA and DA: no rules found.]
Simulation (level 3)
[Tree diagram: rule AB → C is minimal; C becomes an impossible consequence; the node is removed (LP).]
Simulation (level 3)
[Tree diagram: rules AD → ¬B and AB → ¬D are minimal; the node is removed (LP).]
Results
Implementation: Kingfisher, http://www.cs.helsinki.fi/~whamalai
• The algorithm itself suits all common goodness measures for statistical dependencies.
• Currently only pF and χ² are implemented, but it is easy to add new measures.
• Very efficient without any extra restrictions (1 GB of memory suffices for Pumsb, Accidents, Retail, Chess, etc.).
• pF is more efficient and produces more reliable results than χ²!
Thank you!
Possible questions
• Why are the rules called dependency rules and not association rules?
• Are there any previous solutions to the problem?
• Why do the consequence vectors contain all attributes in all nodes? Wouldn't it be enough to store only those attributes which can be added under a node?