Intelligent Systems
Exploratory pattern discovery
Geoff Webb
http://www.csse.monash.edu.au/~webb
Copyright © Geoffrey I Webb 2006
Outline
• Tutorial covers
  • Data Mining
  • Exploratory Pattern Discovery
  • Association rules
  • Interestingness (objective functions)
  • False discoveries
  • Limitations of minimum support
  • K-most interesting pattern discovery
  • Itemset discovery
  • Contrast rule discovery
  • Impact rules
Part 1:
Data Mining
Data mining
• Data mining seeks to discover unanticipated knowledge from data
• Exponential growth in the quantity of data stored gives urgency to the pursuit of practical analytic approaches that address
  • Large volumes of data
  • Low quality data
  • Post-hoc analysis
  • Loosely defined analytical objectives
So what’s the big deal?
• Don’t statistics identify patterns in data?
• Conventional statistics do not address
  • searching quintillions of potential correlations, eg.
    • market basket data: 2^100,000
    • US phone calls: 2^100,000,000
    • human genome: 2^3,000,000,000
• selecting most interesting from millions of correlations
Example: Should we stock vitamins?
• Major national retailer with detailed records of customer purchasing behaviour
• Considering deleting a low volume product line
• Does data provide evidence of indirect contribution to bottom line?
Example: Steel rolling mill
• Complex control problem for expensive production process influenced by input materials, desired output and state of equipment
• Currently uses imperfect model
• Objective: use data to identify circumstances in which the model is deficient
Photo courtesy G.C. Goodwin, S. Graebe and M. Salgado. Control System Design, Prentice Hall, 2000.
Example: Synchrotron x-ray data analysis
• Synchrotron x-ray scatter patterns reflect micro-structure of material analysed.
[Figure: x-ray scatter patterns, Normal vs Malignant]
• Can x-ray scatter plots be used for cancer diagnosis?
A growth area
• The sum of human data stored doubles every 7 years
• Data mining is critical to commerce
  • Fraud detection
  • Information retrieval
and to science
  • Bioinformatics
  • Mass data analysis
Large unmet demand for good PhDs!
Beyond statistics
• Data mining goes beyond the traditional realm of statistics by encompassing
  • problem formulation
  • interactions between the business process and the analytic process
  • knowledge management
  • data manipulation
[Diagram: Analytics interacting with business processes, data, and other knowledge sources]
Generating models
• The core of the data mining process is generating models from data
Eg neural networks, support vector machines, decision trees
• Most research concentrates on this aspect
• Surrounding activities are also very important
  • Defining analytic task
  • Sourcing data
  • Preprocessing data
  • Identifying appropriate forms of model
  • Identifying appropriate techniques for generating models
  • Interpreting models
  • Applying models
Part 2:
Exploratory Pattern Discovery
The perils of model selection
• Many data mining techniques seek to identify a single model that best fits the observed data.
• In many applications, many models will fit the data (almost) equally well
bruises=f & gill-attachment=f & gill-spacing=c & ring-number=o → poisonous[Coverage=0.406 (3296); Support=0.388 (3152); Confidence=0.956]
bruises=f & gill-spacing=c & veil-color=w & ring-number=o → poisonous [Coverage=0.406 (3296); Support=0.388 (3152); Confidence=0.956]
Perils of model selection (cont.)
• Data mining systems often make arbitrary choices
  • without warning
• A system may have no basis on which to select models, but an expert often will
  • ease / cost of operationalisation
• comprehensibility / compatibility with existing knowledge and beliefs
• social / legal / ethical / political acceptability
Exploratory pattern discovery
• Exploratory pattern discovery seeks all patterns that satisfy user-defined constraints
• The user can select from these patterns
  • can use criteria that might be infeasible to quantify
Patterns
• Rules
  • <antecedent> → <consequent>
• Itemsets
  • <condition1> & <condition2> & …
• Sequences
  • <event1>, <event2>, …
• Structures
Rules
• <antecedent> → <consequent>
  • IF <antecedent> THEN <consequent>
  • IF temp > 36.8 AND pulse > 120 THEN call doctor
• Antecedent
  = condition
  = left hand side, LHS
  = conditions under which the rule holds / applies
• Consequent
  = conclusion
  = right hand side, RHS
  = action to perform or conclusion to reach
Theoretical foundations
• Substantial bodies of theory in Formal Logic, Computational Logic, and Artificial Intelligence can be brought to bear to utilise rules once they are inferred.
• If the antecedent entails the consequent and the antecedent is known (believed) then the consequent can be concluded.
• Can be extended to probabilistic basis.
• Supports complex reasoning.
• Modular knowledge representation.
• can capture knowledge nuggets
Rule discovery as search
• Rule discovery can be viewed as search through a space of expressible rules.
• The rule space (search space / description space) can be partially ordered on generality.
• A → C is a generalisation of B → C iff B entails A (A must be true if B is true)
  • a proper generalisation iff A does not also entail B
• If A → C is a generalisation of B → C then B → C is a specialisation of A → C.
• Eg. IF age > 30 THEN X is a generalisation of
  • IF age > 31 THEN X
  • IF age > 30 AND gender = male THEN X
Generalization lattice for antecedents

{}
{A} {B} {C} {D}
{A,B} {A,C} {A,D} {B,C} {B,D} {C,D}
{A,B,C} {A,B,D} {A,C,D} {B,C,D}
{A,B,C,D}
Search tree for antecedents

{}
{A} {B} {C} {D}
{A,B} {A,C} {A,D} {B,C} {B,D} {C,D}
{A,B,C} {A,B,D} {A,C,D} {B,C,D}
{A,B,C,D}
Search tree with consequent propagation

{} : {A,B,C,D}
{A} : {B,C,D}   {B} : {A,C,D}   {C} : {A,B,D}   {D} : {A,B,C}
{A,B} : {C,D}   {A,C} : {B,D}   {A,D} : {B,C}   {B,C} : {A,D}   {B,D} : {A,C}   {C,D} : {A,B}
{A,B,C} : {D}   {A,B,D} : {C}   {A,C,D} : {B}   {B,C,D} : {A}
{A,B,C,D} : {}
Propositional rule discovery
• Antecedent and consequent are propositions
• Often restricted to antecedent and consequent both conjunctions of Boolean terms
  • IF temp > 36.8 AND pulse > 120 THEN blood pressure > 140 AND condition = critical
Rule discovery is inherently intractable
• If
  • there are n propositions,
  • antecedents can be any set of propositions and
  • consequents are a single proposition
then
  • size of search space ≈ n × 2^n
• It is essential to use powerful pruning techniques to limit the search space
Part 3:
Association rules
Association rule discovery
• Developed for market basket analysis
  • a basket is a collection of products purchased in a single transaction
  • an itemset is a set of products
    • all baskets are itemsets
  • market basket analysis seeks to identify products that are associated with each other
    • diapers and beer
• Can generalize to itemset = any conjunction of Boolean terms
Transaction and tabular data
• Transaction data
  • Each record is a set of items involved in a single transaction
  • Eg. market basket, web site traversal, amino acids in a protein
• Tabular data
  • Each record consists of a vector of values for the predefined attributes or fields
  • Eg. a patient’s signs and symptoms, employee details, the amino acids at each site in a protein
• While association rules were developed for transaction data they generalise directly to attribute-value data
Support and confidence
• F(X) = proportion of records that satisfy condition X
• Coverage(A → C) = F(A)
• Support(A → C) = F(A & C)
• Confidence(A → C) = Support(A → C) / Coverage(A → C)
  • Maximum likelihood estimate of P(C | A)
[Venn diagram: overlap between A and C]
Frequent itemsets
• An itemset is frequent if its cover equals or exceeds a user defined minimum
• Downward closure
  • frequency is anti-monotone
  • if an itemset I is not frequent then no specialization of I is frequent
Association rules
• Antecedent and consequent are frequent itemsets
• An association rule indicates that the presence of the antecedent increases the probability that the consequent will be present
  • bread & butter → honey
Association rule discovery
• Requires minimum support constraint
• Finds all rules that satisfy minimum support together with other user-specified constraints such as minimum confidence
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • support(bread → honey) = 0.05
  • confidence(bread → honey) = 0.50
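As a sketch, the three measures can be computed directly from transaction data. The dataset below is a hypothetical reconstruction of the bread-and-honey example (1000 transactions, 100 bread, 100 honey, 50 both); the function names are illustrative, not from any particular system.

```python
def coverage(transactions, antecedent):
    """F(A): proportion of transactions containing every antecedent item."""
    return sum(antecedent <= t for t in transactions) / len(transactions)

def support(transactions, antecedent, consequent):
    """F(A & C): proportion containing both antecedent and consequent."""
    return sum((antecedent | consequent) <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Maximum likelihood estimate of P(C | A)."""
    return support(transactions, antecedent, consequent) / coverage(transactions, antecedent)

# Hypothetical 1000 transactions: 50 bread & honey, 50 bread only, 50 honey only.
data = ([{"bread", "honey"}] * 50 + [{"bread"}] * 50
        + [{"honey"}] * 50 + [{"milk"}] * 850)
print(support(data, {"bread"}, {"honey"}))     # 0.05
print(confidence(data, {"bread"}, {"honey"}))  # 0.5
```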
The frequent itemset approach
• Find all frequent itemsets
• Generate all association rules therefrom
• Assumes
  • a minimum support constraint
  • sparse data
Finding frequent itemsets
• Once frequent itemsets are found rule generation is straightforward
• Research has concentrated on efficient frequent itemset generation
The Apriori algorithm
Apriori(T, ε)
  L1 ← frequent 1-itemsets relative to T
  k ← 2
  while Lk−1 ≠ ∅
    Ck ← Generate(Lk−1)
    for t ∈ T
      for c ∈ Subsets(Ck, t)
        count[c]++
    Lk ← { c ∈ Ck | count[c] ≥ ε }
    k++
  return ∪k Lk
TRANSACTIONS
a,b,c
a,b,d
a,d
PROCESS, ε=2
L1 {{a},{b},{d}}
C2 {{a,b},{a,d},{b,d}}
L2 {{a,b},{a,d}}
C3 {{a,b,d}}
L3 {}
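A minimal Python sketch of the level-wise algorithm, run on the worked example above. The candidate-generation step here also prunes candidates with an infrequent subset, which the slide's Generate step may or may not do; treat it as an illustration rather than a faithful reimplementation.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise frequent itemset mining, following the pseudocode above.
    Returns a dict mapping each frequent itemset (frozenset) to its count."""
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    while level:
        counts = {c: 0 for c in level}
        for t in transactions:
            for c in level:
                if c <= t:
                    counts[c] += 1
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Join surviving k-itemsets into (k+1)-candidates, keeping only
        # those whose k-subsets are all frequent (downward closure).
        prev = list(survivors)
        level = {a | b for a, b in combinations(prev, 2)
                 if len(a | b) == len(a) + 1
                 and all(frozenset(s) in survivors
                         for s in combinations(a | b, len(a)))}
    return frequent

# The worked example: transactions {a,b,c}, {a,b,d}, {a,d} with ε = 2.
T = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "d"}]
print(sorted("".join(sorted(s)) for s in apriori(T, 2)))
# -> ['a', 'ab', 'ad', 'b', 'd']
```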
Closed itemsets
• In practice many itemsets cover exactly the same items
  • Eg pregnant, pregnant & woman
• A closed itemset is the most specific itemset that covers a particular set of items
• More efficient to find all closed frequent itemsets than all frequent itemsets
• Can generate all association rules from closed itemsets
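A brute-force way to check closedness, assuming the definition above (the most specific itemset for its cover): an itemset is closed iff adding any further item shrinks its cover. The records below are hypothetical, echoing the pregnant / pregnant & woman example.

```python
def cover(transactions, itemset):
    """Indices of the transactions covered by the itemset."""
    return {i for i, t in enumerate(transactions) if itemset <= t}

def is_closed(transactions, itemset):
    """Closed iff no strict superset covers exactly the same transactions,
    i.e. the itemset is the most specific one for its cover."""
    items = {i for t in transactions for i in t}
    return all(cover(transactions, itemset | {x}) != cover(transactions, itemset)
               for x in items - itemset)

# Every record containing "pregnant" also contains "woman",
# so {pregnant} is not closed but {pregnant, woman} is.
recs = [{"pregnant", "woman"}, {"pregnant", "woman"}, {"woman", "married"}]
print(is_closed(recs, {"pregnant"}))           # False
print(is_closed(recs, {"pregnant", "woman"}))  # True
```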
Closed Itemsets Example
Full set of itemsets for gill-size=n, gill-color=b & spore-print-color=w
gill-size=n [Coverage=2512]
spore-print-color=w [Coverage=2388]
gill-size=n & spore-print-color=w [Coverage=1824]
gill-color=b [Coverage=1728]
gill-color=b & spore-print-color=w [Coverage=1728]
gill-size=n & gill-color=b [Coverage=1728]
gill-size=n & gill-color=b & spore-print-color=w [Coverage=1728]
Closed itemsets
gill-size=n [Coverage=2512]
spore-print-color=w [Coverage=2388]
gill-size=n & spore-print-color=w [Coverage=1824]
gill-size=n & gill-color=b & spore-print-color=w [Coverage=1728]
Part 4:
Interestingness (objective functions)
Interestingness (Objective Functions)
• Need some means of selecting the most (potentially) interesting patterns
• Many different measures of interestingness may be relevant
• Most measures relate to the degree to which the antecedent and consequent are interdependent
  • P(A & C) − P(A) P(C)
Interestingness measures: lift
• lift = confidence / (cover(consequent) / n)
  • proportional increase in confidence in context of antecedent
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • confidence(bread → honey) = 0.50
  • lift(bread → honey) = 5.00
M-estimates
• Problem: many rules with low support will have unrealistically high confidence and lift
• Example: 1000 records, 500 females, 1 age>=90, 1 female & age>=90
• confidence(age>=90 → female) = 1.00
• lift(age>=90 → female) = 2.00
• M-estimate is Bayesian estimate of true confidence and lift
  • biases confidence toward prior
  • confidence estimate = (support + m × prior) / (coverage + m)
  • lift estimate = confidence estimate / prior
  • Eg confidence estimate = (1 + 2 × 0.5) / (1 + 2) = 0.667
    lift estimate = 0.667 / 0.500 = 1.333
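The m-estimate arithmetic above is easy to sketch; the function name and default m are illustrative, following the slide's example of 1 covered record, 1 supporting record and a prior of 0.5.

```python
def m_estimate(support_count, coverage_count, prior, m=2):
    """M-estimate of confidence: shrink the raw confidence toward the
    prior, weighting the prior as m virtual observations."""
    confidence = (support_count + m * prior) / (coverage_count + m)
    lift = confidence / prior
    return confidence, lift

# 1 of 1 records with age>=90 is female; prior P(female) = 0.5.
conf, lft = m_estimate(support_count=1, coverage_count=1, prior=0.5, m=2)
print(round(conf, 3), round(lft, 3))  # 0.667 1.333
```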
Interestingness measures: leverage
• leverage = support − cover(antecedent) × cover(consequent) / n
  • absolute increase in comparison to expected cases if antecedent and consequent independent
• Also known as interest
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • confidence(bread → honey) = 0.50
  • lift(bread → honey) = 5.00
  • leverage(bread → honey) = 0.04
• Example 2: 1000 transactions, 10 batteries, 5 vodka, 1 batteries & vodka
  • lift(batteries → vodka) = 20.00
  • leverage(batteries → vodka) = 0.0009
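Both measures can be sketched from counts, reproducing the two examples above (the vodka-and-batteries leverage comes out near the slide's rounded 0.0009). Function names are illustrative.

```python
def lift(n, cover_a, cover_c, support_ac):
    """Confidence divided by the consequent's base rate (counts in, ratio out)."""
    confidence = support_ac / cover_a
    return confidence / (cover_c / n)

def leverage(n, cover_a, cover_c, support_ac):
    """Support minus the support expected if antecedent and consequent
    were independent, as a proportion of n."""
    return support_ac / n - (cover_a / n) * (cover_c / n)

print(lift(1000, 100, 100, 50), leverage(1000, 100, 100, 50))  # lift 5.0, leverage ≈ 0.04
print(lift(1000, 10, 5, 1), leverage(1000, 10, 5, 1))          # lift 20.0, leverage ≈ 0.00095
```

Note how leverage penalises the batteries-and-vodka rule despite its huge lift: it affects almost no transactions.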
Spurious rules
• If condition X is unrelated to conditions A and B,
  • confidence(A & X → B) ≈ confidence(A → B)
  • lift(A & X → B) ≈ lift(A → B)
  • Eg pregnant & AI Researcher → oedema
• One core rule can result in many spurious rules
• If the problem is ignored, the majority of rules can be spurious!
Need to test up the generalization lattice

{} : {A,B,C,D}
{A} : {B,C,D}   {B} : {A,C,D}   {C} : {A,B,D}   {D} : {A,B,C}
{A,B} : {C,D}   {A,C} : {B,D}   {A,D} : {B,C}   {B,C} : {A,D}   {B,D} : {A,C}   {C,D} : {A,B}
{A,B,C} : {D}   {A,B,D} : {C}   {A,C,D} : {B}   {B,C,D} : {A}
{A,B,C,D} : {}
Minimum Improvement
• The improvement of rule X → Y [conf = c] = min(c − k | Z ⊂ X, Z → Y [conf = k])
• A minimum improvement constraint can eliminate many spurious rules
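Given the confidences of a rule's proper generalisations, improvement is just the gap to the best of them; the confidences below are hypothetical.

```python
def improvement(rule_conf, generalisation_confs):
    """improvement(X -> Y) = conf(X -> Y) minus the best confidence of any
    proper generalisation Z -> Y with Z a strict subset of X
    (equivalently, the minimum of c - k over those generalisations)."""
    return rule_conf - max(generalisation_confs)

# The specialised rule barely beats its generalisations, so a minimum
# improvement constraint of, say, 0.01 would discard it as spurious.
print(improvement(0.956, [0.951, 0.940]))  # ≈ 0.005
```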
Non redundant rules
• ∀x,y,z,s,c: x → y [conf = 1.0] ∧ x → z [supp = s, conf = c] ⟹ x, y → z [supp = s, conf = c]
  Eg pregnant → oedema [supp = 0.1, conf = 0.2]
     pregnant, female → oedema [supp = 0.1, conf = 0.2]
• A rule X → Y [supp = s, conf = c] is redundant iff
  ∃x ∈ X: X \ x → Y [supp = s, conf = c] or ∃y ∈ Y: X → Y \ y [supp = s, conf = c]
  Eg, pregnant, female → oedema
• Closed itemset approaches lead to efficient generation of non-redundant rules because a rule is non-redundant iff all immediate specialisations are closed itemsets.
• Note, redundant rules have improvement of 0.0.
Effect

dataset              no filter   non-redundant    %    improvement > 0    %
bms webview             170          170         100        155          91
covtype                 998          815          82        143          14
ipums.la.99             973          959          99        481          49
kddcup98                995          992         100        939          94
letter-recognition      541          524          97        421          78
mush                    891          469          53        128          14
retail                  590          590         100        519          88
shuttle                 666          595          89        312          47
splice-junction         748          727          97        699          93
ticdata-2000            996          996         100        988          99
Part 5:
False discoveries
False discoveries
• Massive search leads to high risk of false discoveries
  • eg 100 observations, two independent events each occurring with 0.5 probability,
    • the probability of perfect correlation is 7.8 × 10^−31
  • if there are 1000 events then there are 2^1000 ≈ 1.07 × 10^301 antecedent–consequent pairs
• What constitutes a false discovery depends upon the analytic objective
• Usually should include rules where
  • antecedent and consequent are independent
  • antecedent and consequent are independent given a generalisation of the antecedent
Testing independence
• Cannot perform simple test of independence because of multiple comparisons problem
  • used previously (eg Webb, Butler & Newlands, 2003) as a statistically unsound filter
Standard statistical correction
• Bonferroni
  • To maintain experimentwise risk ≤ α for n tests
  • use critical value = α / n
• Holm procedure
  • To maintain experimentwise risk ≤ α for n tests with p values ordered from lowest to highest p1 … pn
  • Accept tests corresponding to p1 … pk, where k is the highest value such that ∀ 1 ≤ i ≤ k: pi ≤ α / (n − i + 1)
• Example (α = 0.05, n = 4):
  p values        0.0100, 0.0200, 0.0400, 0.0400
  critical values 0.0125, 0.0167, 0.0250, 0.0500
  decision        accept, accept, reject, reject
Direct adjustment
• I used to think “cannot perform simple adjustment such as Bonferroni or Holm because rule spaces are so large, eg 2^1000 (> 1.0E+301)
  • would result in unacceptable type-2 error
  • eg critical value = 5.0E-303”
• However, search is often restricted to small antecedents (eg ≤ 4), resulting in Bonferroni-adjusted critical values of magnitude 1.0E-10 … 1.0E-20
• With such adjustments often many rules can be found
• Cannot order p values to apply Holm procedure
Discovery as hypothesis generation
• Important to trade-off the risks of both type-1 and type-2 errors
• Perhaps best viewed as hypothesis generation, recognising that ‘discovered’ patterns require independent assessment
Hypothesis testing: proposal
• Why not automate such assessment?

[Flow diagram: Data is split into an exploratory set and a holdout set. Exploratory pattern discovery on the exploratory set produces patterns (a small set is preferable); statistical evaluation on the holdout set — any hypothesis test, with Holm adjustment and limited type-2 error — yields sound patterns.]
Direct adjustment vs Holdout
Direct adjustment
• All data used for exploration and evaluation
• Bonferroni adjustment
• Larger adjustment
• Adjustment alters with size of search space

Holdout
• Half data used for each of exploration and evaluation
• Holm procedure
• Smaller adjustment
• Adjustment alters with number of rules found
Case study: Ten widely used data sets
Name             Description                               Records   Attribute-values
BMS webview      products viewed at a commercial website    59,601        497
covtype          forest cover data                         581,012        125
ipums.la.99      Los Angeles census data                    88,443      1,874
kddcup98         charity donors                             52,256     19,662
letter-recog’n   digital image recognition                  20,000         74
mush             identification of poisonous mushrooms       8,124        127
retail           retail market basket data                  88,162     16,470
shuttle          records of space shuttle flight data       58,000         34
splice-junction  DNA sequence records                        3,177        243
ticdata-2000     insurance risk assessment                   5,822        689
Detecting spurious rules
• Assuming interest only in positive associations
  • P(C | A) > P(C)
• For any rule A → C, want to assess whether it has higher confidence than all its generalisations
• Eg, is confidence(pregnant & female → B) >
  • confidence(pregnant → B)
  • confidence(female → B)
  • confidence(true → B)
Detecting spurious rules (cont)
• Perform one-tailed Fisher exact tests with respect to each generalisation
  • Reject if any test does not exceed critical value
  • no need to adjust for multiple comparisons with respect to the multiple tests for a single rule
• Use Holm adjustment for strict control of type-1 error
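The one-tailed Fisher exact test can be computed from the hypergeometric distribution using only the standard library; this is a generic sketch, not the tutorial's implementation. Here `a` counts records satisfying both antecedent and consequent, with `b`, `c`, `d` the remaining cells of the 2×2 contingency table, and the toy table is hypothetical.

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher exact test on the 2x2 table [[a, b], [c, d]]:
    probability of a top-left count of at least `a` given the fixed
    margins (upper hypergeometric tail)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    return sum(comb(col1, x) * comb(n - col1, row1 - x)
               for x in range(a, min(row1, col1) + 1)) / comb(n, row1)

# Toy table: 3 of 4 covered records satisfy the consequent vs 1 of 4 uncovered.
print(fisher_one_tailed(3, 1, 1, 3))  # 17/70 ≈ 0.243
```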
Spurious rules case study: high support & confidence non-redundant rules
Name                Records   Attribute-values   # Rules   # Accepted    %
bms webview          59,601         497           22,135      1,747      8
covtype             581,012         125           10,018          0      0
ipums.la.99          88,443       1,874            9,857        288      3
kddcup98             52,256      19,662            9,863         40     <1
letter-recognition   20,000          74            7,978        952     12
mush                  8,124         127            8,957      1,266     14
retail               88,162      16,470           11,656         97      1
shuttle              58,000          34            9,760        876      9
splice-junction       3,177         243            8,937        132      1
ticdata-2000          5,822         689           10,438         30     <1
KDDCUP98: 99.5% of rules rejected
The following 40 rules passed holdout evaluation
…
ETH12<=0 → HC15<=0 [Coverage=0.987 (25786); Support=0.946 (24722); Confidence=0.959; Lift=1.00]
…
The following 9843 rules failed holdout evaluation, adjusted critical value = 5.09E-06
…
NOEXCH=0 & ETH12<=0 → HC15<=0 [Coverage=0.984 (25703); Support=0.943 (24644); Confidence=0.959; Lift=1.00]
…
NOEXCH=0 & ETH12<=0 & MDMAUD_F=X → HC15<=0 [Coverage=0.981 (25629); Support=0.940 (24573); Confidence=0.959; Lift=1.00]
…
NOEXCH=0 & ETH12<=0 & ADATE_2>=9706 & MDMAUD_R=X → HC15<=0 [Coverage=0.981 (25623); Support=0.940 (24567); Confidence=0.959; Lift=1.00]
…
Comparison of direct adjustment and holdout tests on artificial data
[Charts: true discoveries, false discoveries and experimentwise error for the holdout and direct-adjustment approaches. Averages over 100 runs, 84 true rules at antecedent size 4.]
Comparison on real data
[Chart: Letter Recognition — number of rules found by direct adjustment vs holdout as search space size grows from 2.33E+03 to 1.47E+09.]

[Chart: Retail — number of rules found by direct adjustment vs holdout as search space size grows from 1.36E+08 to 4.56E+26.]
Part 6:
Limitations of minimum support
Limitations of minimum support
• Discontinuity in ‘interestingness’ function
• The vodka and caviar problem
  • some high value associations are infrequent
• Feast or famine
  • minimum support is a crude control mechanism
  • often results in too few or too many associations
• Cannot handle dense data
• Cannot prune search space using constraints on relationship between antecedent and consequent
  • eg confidence
• Minimum support may not be relevant
  • cannot be sufficiently low to capture all valid rules
  • cannot be sufficiently high to exclude all spurious rules
Very low support rules can be significant
Data file: Brijs retail.itl [50% sample]
44081 cases / 44081 holdout cases / 16470 items
The following 5 rules passed holdout evaluation
168 & 4685 → 1 [Coverage=0.000 (3); Support=0.000 (3); Confidence estimate=0.601; Lift estimate=192.06]
168 & 3021 → 1 [Coverage=0.000 (3); Support=0.000 (3); Confidence estimate=0.601; Lift estimate=192.06]
1476 & 4685 → 1 [Coverage=0.000 (2); Support=0.000 (2); Confidence estimate=0.502; Lift estimate=160.21]
168 & 783 → 1 [Coverage=0.000 (4); Support=0.000 (3); Confidence estimate=0.501; Lift estimate=160.05]
3021 & 4685 → 1 [Coverage=0.000 (4); Support=0.000 (3); Confidence estimate=0.501; Lift estimate=160.05]
Very high support rules can be spurious
Data file: covtype.data 581012 cases / 125 values
ST15=0 → ST07=0 [Coverage=1.000 (581009); Support=1.000 (580904); Confidence=1.000; Lift=1.00]
ST07=0 → ST15=0 [Coverage=1.000 (580907); Support=1.000 (580904); Confidence=1.000; Lift=1.00]
ST15=0 → ST36=0 [Coverage=1.000 (581009); Support=1.000 (580890); Confidence=1.000; Lift=1.00]
ST36=0 → ST15=0 [Coverage=1.000 (580893); Support=1.000 (580890); Confidence=1.000; Lift=1.00]
ST15=0 → ST08=0 [Coverage=1.000 (581009); Support=1.000 (580830); Confidence=1.000; Lift=1.00]
ST08=0 → ST15=0 [Coverage=1.000 (580833); Support=1.000 (580830); Confidence=1.000; Lift=1.00]
…
197,183,686 such rules have highest support
Roles of constraints
1. Select most relevant patterns
  • patterns that are likely to be interesting
2. Control the number of patterns that the user must consider
3. Make computation feasible
Minimum support can get overloaded!
• Select most relevant
• Control the number
• Make computation feasible
Part 7:
K-most interesting pattern discovery
K-most interesting pattern discovery
• Find k patterns that maximise a measure of interest within other constraints that the user may specify
  • removes need for minimum support constraint
  • efficient with dense data
  • empowers user to use relevant measure of interest
  • user specifies number of patterns to be returned
  • does not require either monotone or anti-monotone constraints
• Relies on efficient search
  • must be able to retain all data in memory
  • constraints must sufficiently constrain the search space
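The objective can be sketched by exhaustively scoring every small rule by leverage and keeping the k best. Real systems (eg Webb's OPUS-based search) prune the search space rather than enumerating it, so this brute-force version, with illustrative names and data, only shows the idea of optimising a measure of interest with no minimum support.

```python
import heapq
from itertools import combinations

def k_most_interesting(transactions, k, max_antecedent=2):
    """Score every rule with a small antecedent by leverage and return
    the k best as (leverage, antecedent, consequent) tuples."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # Precompute counts for all itemsets up to max_antecedent + 1 items.
    freq = {frozenset(s): sum(set(s) <= t for t in transactions)
            for size in range(1, max_antecedent + 2)
            for s in combinations(items, size)}
    rules = []
    for ante_size in range(1, max_antecedent + 1):
        for ante in combinations(items, ante_size):
            for cons in items:
                if cons in ante:
                    continue
                a, c = frozenset(ante), frozenset([cons])
                lev = freq[a | c] / n - (freq[a] / n) * (freq[c] / n)
                rules.append((lev, ante, cons))
    return heapq.nlargest(k, rules)

# Hypothetical transactions: bread and honey co-occur strongly.
T = [{"bread", "honey"}, {"bread", "honey"}, {"bread"}, {"milk"}]
for lev, ante, cons in k_most_interesting(T, 2):
    print(ante, "->", cons, round(lev, 3))
```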
Part 8:
Itemset discovery
Itemset discovery
• In some contexts it is the collection of correlated variables that is of interest, and the rule structure is superfluous.
• If A is associated with B then B must be associated with A (in the sense of the presence of the antecedent increasing the probability of the presence of the consequent).
• Discovering interesting itemsets is an area that has been little explored.
Part 9:
Contrast discovery
Contrast sets (emerging patterns)
• Sometimes it is interesting to identify differences between contrasting groups
• Eg: how do purchasing patterns differ on weekends to weekdays?
• Contrast sets find sets of conditions that differ significantly between groups
  ∃ i,j: P(cset | Gi) ≠ P(cset | Gj)
  max over i,j of |support(cset, Gi) − support(cset, Gj)| ≥ δ
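A sketch of the contrast-set criterion, assuming it compares the support of a condition set across groups against a minimum-difference threshold (the formula on this slide is partly garbled in this copy); the group data and threshold below are hypothetical.

```python
def contrast(records_by_group, cset, min_diff=0.1):
    """Support of the condition set in each group, and whether the spread
    between the extreme groups meets the minimum difference threshold."""
    supports = {g: sum(cset <= r for r in recs) / len(recs)
                for g, recs in records_by_group.items()}
    spread = max(supports.values()) - min(supports.values())
    return supports, spread >= min_diff

# Hypothetical weekend vs weekday baskets.
groups = {
    "weekend": [{"beer", "chips"}, {"beer"}, {"milk"}, {"beer", "milk"}],
    "weekday": [{"milk"}, {"bread"}, {"milk", "bread"}, {"beer"}],
}
supports, interesting = contrast(groups, {"beer"}, min_diff=0.25)
print(supports, interesting)  # weekend 0.75, weekday 0.25 -> True
```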
Contrast sets (cont.)
• Different analytic objective to association rules
  • more directed
  • focus on differences between groups instead of associations between variables
• Different to classification rules
  • not discriminative
  • no attempt to distinguish all individuals of each group
  • find all contrasts rather than sufficient discriminators
Can be discovered by existing techniques!
• Contrast / emerging pattern discovery is strictly equivalent to standard exploratory rule discovery with the consequent restricted to the group variable
  ∃ i,j: P(cset | Gi) ≠ P(cset | Gj)  ⟺  ∃ i,j: P(Gi | cset) ≠ P(Gj | cset)
Part 10:
Impact rules
Impact rules (quantitative association rules)
• Most rule discovery techniques require that numeric variables be discretised.
• This often loses important information.
• Impact rules associate an antecedent with a distribution on a numeric variable.
• The user specifies what makes a distribution interesting
  • eg largest mean, smallest standard deviation, …
• System finds rules that maximise the measure of interest within other user-specified constraints
Impact rule discovery example
LengthOfStay: mean = 10.6; min = -6; max = 1687; sum = 367781
COUNTRYOFBIRTH=1100 -> LengthOfStay: Coverage=0.054 (1861); Mean=22.2; Min=-4; Max=1687; Sum=41314; Impact=21612.4
ADMITDay=Wednesday -> LengthOfStay: Coverage=0.159 (5518); Mean=13.3; Min=0; Max=1548; Sum=73389; Impact=15307.6
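The statistics reported above can be reproduced on hypothetical data. The impact computation here (sum over covered records minus overall mean times cover count) is inferred from the figures on the slide and should be treated as an assumption about the system's definition.

```python
from statistics import mean

def impact_rule(records, antecedent, target):
    """Summarise the target distribution under the antecedent.
    Impact = (sum over covered records) - (overall mean x cover count),
    i.e. the excess of the target attributable to the covered records."""
    values = [r[target] for r in records]
    covered = [r[target] for r in records if antecedent(r)]
    overall_mean = mean(values)
    return {"coverage": len(covered) / len(records),
            "mean": mean(covered),
            "min": min(covered), "max": max(covered),
            "sum": sum(covered),
            "impact": sum(covered) - overall_mean * len(covered)}

# Hypothetical patient records: overall mean stay 15, covered sum 50,
# so impact = 50 - 15 x 2 = 20.
patients = [{"country": 1100, "stay": 30}, {"country": 1100, "stay": 20},
            {"country": 2100, "stay": 5}, {"country": 2100, "stay": 5}]
print(impact_rule(patients, lambda r: r["country"] == 1100, "stay"))
```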
Summary
• Exploratory pattern discovery empowers the user to select the patterns that are most useful
• Rules provide a modular and powerful knowledge representation formalism
• Association rules discover associations between qualitative variables that are frequent
• K-optimal rules discover associations between qualitative variables that optimise a measure of interest
• Impact rules discover associations between qualitative and quantitative variables
• Contrasts discover differences in distributions over variables between different groups
• If you mine for patterns without appropriate statistical evaluation, expect to find fool’s gold!
Recommended