Efficient discovery of the top-K optimal dependency rules with Fisher's exact test of significance
Wilhelmiina Hämäläinen
Department of Computer Science
University of Helsinki
Finland
ICDM’10 – p.1/23
Problem
Given a set of binary attributes R = {A1, . . . , Ak}, search for the most significant positive and negative dependency rules X → A, where X ⊊ R, A ∈ R \ X!
• Notation: A = 1 ≡ A and A = 0 ≡ ¬A.
• Negative dependency between X and A, if P(XA) < P(X)P(A) ⇔ P(X¬A) > P(X)P(¬A)
• ≡ positive dependency between X and ¬A ⇒ it is enough to search for positive dependencies for any consequence A = a, A ∈ R, a ∈ {0, 1}.
Requirements for X → A = a
• Sufficiently significant by Fisher's exact test:

  pF(X → A = a) = Σᵢ C(m(X), m(XA=a)+i) · C(m(¬X), m(¬X, A≠a)+i) / C(n, m(A=a))

  where C(·,·) is the binomial coefficient and m(·) the number of rows satisfying a condition.
• Non-redundant (X contains no extra attributes which do not improve the dependency):
  ∄Y ⊊ X such that pF(X → A = a) ≥ pF(Y → A = a).
No other restrictions (like minimum frequency thresholds)!
Example
Data: R = {A, B, C, D}, n = 100

  set       freq.
  ABC¬D     10
  A¬B¬CD    85
  ¬AB¬CD    5

Best non-redundant rules:

  rule       pF
  AD → ¬B    3.9 · 10⁻¹⁸
  D → ¬C     5.8 · 10⁻¹⁴
  AB → C     5.8 · 10⁻¹⁴
  AB → ¬D    5.8 · 10⁻¹⁴
  C → B      1.7 · 10⁻¹⁰
  D → ¬B     1.7 · 10⁻¹⁰

e.g. AC → B redundant
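The pF values in the table can be reproduced from the three row counts. For each of these rules the antecedent rows all carry the consequent, so only the i = 0 term of Fisher's sum contributes (a sketch using Python's math.comb, assuming the data above):

```python
from math import comb

# D -> ¬C: m(D) = 90, all 90 D-rows have ¬C, m(¬C) = 90
p_d_notc = comb(90, 90) * comb(10, 10) / comb(100, 90)
# AD -> ¬B: m(AD) = 85, all 85 rows have ¬B, m(¬B) = 85
p_ad_notb = comb(85, 85) * comb(15, 15) / comb(100, 85)
# C -> B: m(C) = 10, all 10 C-rows have B, m(B) = 15, m(¬C¬B) = 85
p_c_b = comb(10, 10) * comb(90, 85) / comb(100, 15)

print(f"{p_d_notc:.1e}, {p_ad_notb:.1e}, {p_c_b:.1e}")
# prints 5.8e-14, 3.9e-18, 1.7e-10 -- matching the table
```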
Algorithm: the main idea
• Generate an enumeration tree and keep a record of the possible consequences in each node.
• Consequence Ai = ai is impossible in set X if, for all sets Q, the rule (X \ {Ai})Q → Ai = ai is insignificant or redundant.
[Tree diagram: enumeration tree over attributes A, B, C, D; each node stores a pos/neg bitvector of possible consequences, e.g. the consequences still possible in node DA.]
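The per-node bookkeeping can be sketched as one bitvector per node, with one bit per candidate consequence (both A = 1 and A = 0 for each attribute). This is an illustrative sketch of the step-1 initialization; the function name and encoding are mine:

```python
def combine_parents(parent_vectors, width):
    """Initialize a new node's possible consequences: a consequence can
    survive in a child node only if it is still possible in EVERY
    parent, i.e. the bitwise AND of the parents' vectors
    (bit = 1 means the consequence is still possible)."""
    v = (1 << width) - 1          # start with all consequences possible
    for pv in parent_vectors:
        v &= pv
    return v

# Example with 4 attributes -> 8 candidate consequences (pos/neg):
# only bits possible in both parents remain set in the child.
child = combine_parents([0b11011010, 0b10011011], 8)
```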
Algorithm: the stopping criterion
A node (and its subtree) can be pruned out when no possible consequences are left.
• The algorithm stops when no more nodes can be created.
• Note: fr = 0 is not a sufficient condition for pruning out a node!
• However, no children are created for such a node.
Algorithm: possible consequences
Given a node (set) X, its parents are Yi = X \ {Ai} for all Ai ∈ X.
1. Initialization: combine the parents' consequence vectors (bit-and).
2. Updating: estimate lower bounds for pF((X \ {Ai})Q → Ai = ai) and decide if Ai = ai is impossible in node X.
   • LB1, when only m(Ai = ai) is known;
   • LB2, when m(Ai = ai) and m(X) are known, Ai ∉ X;
   • LB3, when m(Ai = ai) and m(X) are known, Ai ∈ X.
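The exact forms of LB1–LB3 are not spelled out on the slide. A simplest bound in the LB1 spirit (using only n and m(A = a)) follows because every numerator term of pF is a product of binomial coefficients ≥ 1; this is my own sketch, not necessarily the paper's formula:

```python
from math import comb

def lb1(n, m_a):
    """Lower bound on pF for ANY rule with consequence A = a:
    pF >= 1 / C(n, m(A = a)), since each term's numerator is a product
    of binomial coefficients >= 1. If the bound already exceeds the
    current threshold p_max, consequence A = a is impossible in every
    remaining node."""
    return 1 / comb(n, m_a)

# With the example data (m(A) = 95) and p_max = 1.2e-8 from the
# level-1 simulation: lb1(100, 95) = 1/C(100, 5) ~ 1.3e-8 > p_max,
# so A can be discarded as a consequence everywhere.
```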
Algorithm: possible consequences
3. Minimality condition: if P(Ai = ai | Yi) = 1.0, then Ai, ¬Ai, and all Aj = aj with Aj ∉ X are impossible.
4. Lapis philosophorum principle: if Ai = ai is impossible in X, it is impossible in the parent node Yi = X \ {Ai} and all its children.
   • Efficient pruning! (Chess, Accidents and Pumsb could not be handled without LP.)
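The LP bookkeeping can be sketched with sets (names and encoding are mine; the actual implementation keeps bitvectors per node):

```python
def lp_prune(impossible, node, cons):
    """Lapis philosophorum (step 4): a consequence Ai = ai found
    impossible in `node` is also impossible in the parent node \ {Ai};
    through the parent's consequence vector (combined by bit-and), the
    parent's other children inherit the pruning as well."""
    attr, _value = cons                   # cons = (attribute, 0 or 1)
    parent = node - frozenset([attr])
    impossible.setdefault(parent, set()).add(cons)

impossible = {}
# e.g. ¬D impossible in node CD  =>  also impossible in parent node C
lp_prune(impossible, frozenset({'C', 'D'}), ('D', 0))
```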
Simulation (level 1)
Search for the 10 best rules, when pmax = 1.2 · 10⁻⁸.
[Tree diagram, level 1: LB1 makes A and ¬A impossible consequences (with m(A) = 95, 1/C(100, 5) ≈ 1.3 · 10⁻⁸ > pmax); otherwise the possible consequences are determined by LB2.]
Simulation (level 2)
[Tree diagram: each new level-2 node's possible consequences are initialized by combining the parents' bitvectors (bit-and).]
Simulation (level 2)
[Tree diagram: rule C → B is minimal (more specific rules would be redundant); B and D become impossible consequences.]
Simulation (level 2)
[Tree diagram: LP sets B as impossible in node C.]
Simulation (level 2)
[Tree diagram: rules D → ¬C and C → ¬D are minimal; C, ¬C and ¬D become impossible consequences; node CD is removed.]
Simulation (level 2)
[Tree diagram: node CD removed; LP sets ¬D as impossible in node C, and C and ¬C as impossible in node D.]
Simulation (level 2)
[Tree diagram: node CA: no rules found.]
Simulation (level 2)
[Tree diagram: LB3 makes B an impossible consequence; rule D → ¬B is found (not minimal).]
Simulation (level 2)
[Tree diagram: LP: B is an impossible consequence also in node D.]
Simulation (level 2)
[Tree diagram: nodes BA and DA: no rules found.]
Simulation (level 3)
[Tree diagram: rule AB → C is minimal; C becomes an impossible consequence; the node is removed (LP).]
Simulation (level 3)
[Tree diagram: rules AD → ¬B and AB → ¬D are minimal; the node is removed (LP).]
Results
Implementation: Kingfisher, http://www.cs.helsinki.fi/~whamalai
• The algorithm itself suits all common goodness measures for statistical dependencies.
• Currently only pF and χ² are implemented, but it is easy to add new measures.
• Very efficient without any extra restrictions (1 GB of memory suffices for Pumsb, Accidents, Retail, Chess, etc.).
• pF is more efficient and produces more reliable results than χ²!
Thank you!
Possible questions
• Why are the rules called dependency rules and not association rules?
• Are there any previous solutions to the problem?
• Why do the consequence vectors contain all attributes in all nodes? Wouldn't it be enough to store only those attributes which can be added under a node?