Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin

Outline

Motivation
Objective
Research Review
Search for Contrast Sets
Filtering for Summarizing Contrast Sets
Evaluation
Conclusion

Motivation

Learning group differences is a central problem in many domains.

Contrasting groups is especially important in social science research.

Objective

Automatically detect differences between contrasting groups from observational multivariate data

Research Review

Time series research: multiple observations; traditional statistical methods

Rule learners and decision trees: can miss group differences

Association rule mining: multiple groups and different search criteria

Problem Definition

The itemset concept extends to contrast sets.

Definition 1:

Let A1,A2,...,Ak be a set of k variables called attributes.

Each Ai can take on values from the set {Vi1,Vi2,...,Vim}.

A contrast set is a conjunction of attribute-value pairs defined on groups G1,G2,...,Gn, with no Ai occurring more than once.

Define the support of a contrast set

Definition 2:

The support of a contrast set with respect to a group G is the percentage of examples in G where the contrast set is true.

The minimum support difference δ is a user-defined threshold.
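Definitions 1 and 2 can be illustrated with a small sketch. It assumes a contrast set is stored as a dictionary from attribute to value (which also guarantees that no attribute occurs twice) and that a contrast set counts as large when the biggest gap in support between any two groups is at least δ. The function names, the `>=` comparison, and the toy data are illustrative assumptions, not taken from the slides.

```python
from typing import Dict, List

Example = Dict[str, str]      # one row: attribute name -> value
ContrastSet = Dict[str, str]  # conjunction of attribute-value pairs


def support(cset: ContrastSet, group: List[Example]) -> float:
    """Fraction of examples in `group` for which every pair in `cset` holds."""
    if not group:
        return 0.0
    hits = sum(all(row.get(a) == v for a, v in cset.items()) for row in group)
    return hits / len(group)


def max_support_difference(cset: ContrastSet,
                           groups: Dict[str, List[Example]]) -> float:
    """Largest gap in support between any two groups."""
    supports = [support(cset, rows) for rows in groups.values()]
    return max(supports) - min(supports)


def is_large(cset: ContrastSet, groups: Dict[str, List[Example]],
             delta: float) -> bool:
    """Assumed criterion: the support difference reaches the threshold delta."""
    return max_support_difference(cset, groups) >= delta


# Hypothetical toy data: two groups of four examples each.
groups = {
    "G1": [{"degree": "PhD", "sex": "male"}, {"degree": "PhD", "sex": "female"},
           {"degree": "BSc", "sex": "male"}, {"degree": "PhD", "sex": "male"}],
    "G2": [{"degree": "BSc", "sex": "male"}, {"degree": "BSc", "sex": "female"},
           {"degree": "PhD", "sex": "female"}, {"degree": "BSc", "sex": "male"}],
}
cset = {"degree": "PhD"}
print(support(cset, groups["G1"]), support(cset, groups["G2"]))
print(is_large(cset, groups, delta=0.1))
```

Here degree=PhD holds for 3 of 4 examples in G1 and 1 of 4 in G2, so the support difference is 0.5.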

Search for Contrast Sets

Find the contrast sets that meet our criteria through search.

Explore all possible contrast sets and return only the sets that meet our criteria.

STUCCO (Search and Testing for Understandable Consistent Contrasts): a breadth-first search that incorporates several techniques for efficient mining.

Framework

use set-enumeration trees
use breadth-first search
counting phase: organize nodes into candidate groups
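A rough sketch of a breadth-first walk over a set-enumeration tree follows. It assumes a fixed ordering of the attributes, so each conjunction is generated exactly once and no attribute repeats. The `keep` hook stands in for whatever pruning test is applied, and the function name and toy domains are hypothetical. The grouping of nodes into candidate groups for counting is not modelled.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]   # (attribute, value)
Node = Tuple[Pair, ...]  # a conjunction of attribute-value pairs


def breadth_first_levels(domains: Dict[str, List[str]], max_depth: int,
                         keep: Callable[[Node], bool] = lambda node: True
                         ) -> List[List[Node]]:
    """Return the tree level by level; children of dropped nodes are never built."""
    attrs = sorted(domains)                        # fixed attribute ordering
    order = {a: i for i, a in enumerate(attrs)}
    level: List[Node] = [()]                       # root: the empty conjunction
    levels: List[List[Node]] = []
    for _ in range(max_depth):
        next_level: List[Node] = []
        for node in level:
            # Only extend with attributes that come after the node's last one.
            start = order[node[-1][0]] + 1 if node else 0
            for attr in attrs[start:]:
                for value in domains[attr]:
                    child = node + ((attr, value),)
                    if keep(child):
                        next_level.append(child)
        if not next_level:
            break
        levels.append(next_level)
        level = next_level
    return levels


# Hypothetical attribute domains.
domains = {"sex": ["male", "female"], "degree": ["BSc", "PhD"]}
levels = breadth_first_levels(domains, max_depth=2)
for depth, nodes in enumerate(levels, start=1):
    print(depth, nodes)
```

Processing one whole level before moving to the next means every node's parents have already been counted and tested, which is what lets a pruning decision at one level remove an entire subtree.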

Finding Significant Contrast Sets

test the null hypothesis that the contrast set's support is the same across all groups
support counts come from contingency tables
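As a hedged illustration of the test on this slide, the sketch below builds the 2 x n contingency table (contrast set true or false, by group) from per-group counts and computes the Pearson chi-square statistic. The function names and the counts are made up for the example. The p-value helper uses a closed form that holds only for one degree of freedom, that is, for exactly two groups.

```python
import math
from typing import List, Tuple


def chi_square(true_counts: List[int], group_sizes: List[int]) -> Tuple[float, int]:
    """Pearson chi-square for a 2 x n table: rows = set true/false, columns = groups."""
    n = sum(group_sizes)
    total_true = sum(true_counts)
    row_totals = (total_true, n - total_true)
    stat = 0.0
    for t, size in zip(true_counts, group_sizes):
        for observed, row_total in ((t, row_totals[0]), (size - t, row_totals[1])):
            expected = row_total * size / n
            if expected > 0:  # skip empty rows to avoid dividing by zero
                stat += (observed - expected) ** 2 / expected
    dof = len(group_sizes) - 1  # (rows - 1) * (columns - 1) with 2 rows
    return stat, dof


def p_value_two_groups(stat: float) -> float:
    """Upper-tail probability of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: the contrast set is true for 30 of 100 examples in one
# group and for 10 of 100 in the other.
stat, dof = chi_square([30, 10], [100, 100])
print(stat, dof, p_value_two_groups(stat))
```

With these counts the expected cells are 20 and 80 in each group, giving a statistic of 12.5 on one degree of freedom.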

Controlling Search Error

data mining tests many hypotheses
treat them as a family of tests and control the Type I error
Bonferroni inequality: given any set of events e1,e2,...,en, the probability of their union is less than or equal to the sum of the individual probabilities
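The inequality turns into a simple rule: if each of n tests is run at level α/n, the probability that at least one true null hypothesis is rejected is at most α. The sketch below applies that plain Bonferroni correction. It is not necessarily the exact schedule STUCCO uses to spread α over search levels, and the function names and p-values are illustrative.

```python
from typing import List


def bonferroni_cutoff(alpha: float, num_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at most alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def significant(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag the tests whose p-value clears the corrected cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# Hypothetical p-values from five tests; the corrected cutoff is 0.05 / 5 = 0.01.
p_values = [0.001, 0.02, 0.04, 0.3, 0.0004]
print(significant(p_values))
```

Only the first and last tests survive here, although three p-values fall below the uncorrected 0.05.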

Pruning

Prune nodes when the contrast sets fail to meet the effect size or statistical significance criteria.

Prune nodes when they can lead only to uninteresting contrast sets.

Effect Size Pruning

prune a node when the bound on the maximum support difference between groups is below δ

Statistical Significance Pruning

prune a node when there are too few data or the maximum possible value of χ2 is too small
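One way to read the effect size rule: specializing a contrast set can only lower its support in each group, and support cannot drop below zero, so no specialization can show a support difference larger than the largest group support of the current node. Assuming that reading (the slides do not spell out the bound), a pruning check could look like the sketch below. The function name and the numbers are hypothetical.

```python
from typing import Sequence


def can_prune_by_effect_size(group_supports: Sequence[float], delta: float) -> bool:
    """True when no specialization of this node can reach a support difference of delta.

    Upper bound used: the largest support in any group, reached only if some
    specialization keeps that support while dropping to zero in another group.
    """
    return max(group_supports) < delta


delta = 0.05
# Supports of one node in three groups: every group is below delta, so prune.
print(can_prune_by_effect_size([0.03, 0.01, 0.04], delta))
# This node is not large itself (difference 0.02), but a specialization could
# still be, so it must be kept and expanded.
print(can_prune_by_effect_size([0.40, 0.38], delta))
```

The second call shows why the test looks at the bound rather than at the node's own support difference.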

Interest Based Pruning

contrast sets are not interesting when they have identical support or when the relation between groups is fixed

Specializations with Identical Support

marital-status=husband
marital-status=husband ^ Sex = male
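The husband example can be checked mechanically: if adding Sex = male leaves the count unchanged in every group, the longer conjunction carries no extra information. The sketch below is a minimal illustration under that reading. The helper names and the toy rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def count(cset: Dict[str, str], rows: List[Row]) -> int:
    """Number of rows in which every attribute-value pair of `cset` holds."""
    return sum(all(r.get(a) == v for a, v in cset.items()) for r in rows)


def is_redundant_specialization(parent: Dict[str, str], child: Dict[str, str],
                                groups: Dict[str, List[Row]]) -> bool:
    """True when the child matches exactly as many rows as the parent in every group."""
    return all(count(parent, rows) == count(child, rows) for rows in groups.values())


# Hypothetical data in which every husband is male.
groups = {
    "G1": [{"marital-status": "husband", "sex": "male"},
           {"marital-status": "wife", "sex": "female"}],
    "G2": [{"marital-status": "husband", "sex": "male"},
           {"marital-status": "husband", "sex": "male"},
           {"marital-status": "unmarried", "sex": "male"}],
}
parent = {"marital-status": "husband"}
child = {"marital-status": "husband", "sex": "male"}
print(is_redundant_specialization(parent, child, groups))
```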

Fixed Relations

prune the node, since specializations of the contrast set do not add new information

Relation to Itemset Mining

The minimum support difference criterion implies constraints on the support levels in individual groups.

eliminate large portions of the search space based on:

subset infrequency pruning: analogous to effect size pruning
superset frequency pruning: analogous to interest based pruning

Filtering for Summarizing Contrast Sets

Past approaches:
limit the rules shown by constraining the variables or items
compare discovered rules and show only unexpected results

New methods:
expectation-based statistical approach
identify and select linear trend contrast sets

Statistical Surprise

Show the most general contrast sets first; show more complicated conjunctions only if they are surprising given the sets already shown.

IPF (Iterative Proportional Fitting): finds the maximum likelihood estimates
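For readers unfamiliar with IPF, here is a minimal two-way sketch: starting from a seed table, rows and columns are rescaled in turn until the table matches the given marginal totals. In the setting of this slide the fitted table would play the role of the expected counts against which a more specific contrast set is judged surprising. The multi-way case and the choice of margins are not shown, and the function name and numbers are illustrative.

```python
from typing import List


def ipf_2d(seed: List[List[float]], row_targets: List[float],
           col_targets: List[float], iterations: int = 50,
           tol: float = 1e-9) -> List[List[float]]:
    """Rescale `seed` until its row and column sums match the targets."""
    table = [row[:] for row in seed]
    for _ in range(iterations):
        # Step 1: scale each row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            if s > 0:
                table[i] = [x * target / s for x in table[i]]
        # Step 2: scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # Columns are now exact; stop once the rows agree as well.
        if max(abs(sum(row) - t) for row, t in zip(table, row_targets)) < tol:
            break
    return table


# Hypothetical margins: rows sum to 40 and 60, columns to 30 and 70.
fitted = ipf_2d([[1.0, 1.0], [1.0, 1.0]], [40.0, 60.0], [30.0, 70.0])
print(fitted)
```

With a uniform seed and only the one-way margins fitted, the result is the independence table (12, 28 / 18, 42), which is the maximum likelihood estimate under the model that rows and columns are independent.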

Detecting Linear Trends

identical to finding change over time
detect significant contrast sets by using the chi-square test
use regression techniques to find the portion of the χ2 that is due to the linear trend
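The idea of splitting the chi-square into a trend part and a remainder can be sketched with the standard decomposition used in trend tests for proportions: regress the per-group proportions on numeric group scores, weighted by group size, and express the fitted slope as a chi-square with one degree of freedom. Whether this matches the exact procedure behind the slide is an assumption, and the scores (for example, survey years), counts, and function name are hypothetical.

```python
from typing import List, Tuple


def trend_chi_square(scores: List[float], true_counts: List[int],
                     group_sizes: List[int]) -> Tuple[float, float, float]:
    """Return (total chi-square, linear-trend chi-square, fitted slope)."""
    n = sum(group_sizes)
    p_bar = sum(true_counts) / n                      # overall proportion
    var = p_bar * (1.0 - p_bar)
    x_bar = sum(x * m for x, m in zip(scores, group_sizes)) / n
    props = [t / m for t, m in zip(true_counts, group_sizes)]
    sxx = sum(m * (x - x_bar) ** 2 for x, m in zip(scores, group_sizes))
    if var == 0 or sxx == 0:
        raise ValueError("need varying outcomes and at least two distinct scores")
    # Pearson chi-square for the 2 x k table, written in terms of proportions.
    total = sum(m * (p - p_bar) ** 2 for p, m in zip(props, group_sizes)) / var
    # Weighted least-squares slope of proportion on score.
    sxy = sum(m * (p - p_bar) * (x - x_bar)
              for p, x, m in zip(props, scores, group_sizes))
    slope = sxy / sxx
    trend = slope ** 2 * sxx / var                    # 1 degree of freedom
    return total, trend, slope


# Hypothetical counts over four ordered groups of 100 examples each.
linear = trend_chi_square([0, 1, 2, 3], [10, 20, 30, 40], [100] * 4)
bump = trend_chi_square([0, 1, 2, 3], [10, 40, 40, 10], [100] * 4)
print(linear)
print(bump)
```

In the first call the proportions rise by exactly 0.1 per step, so the trend part accounts for the whole statistic. In the second the groups differ strongly but symmetrically, so the trend part is zero and all of the chi-square is departure from linearity.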

Evaluation

three research points:

low support difference: with few high-support attribute-value pairs, the lower bounds can't be taken advantage of
pruning rules: as δ -> 0, statistical significance pruning is more important
filtering rules

Conclusion

The STUCCO algorithm combines statistical hypothesis testing with search to mine contrast sets.

STUCCO has:
pruning rules for efficient mining at low support differences
guaranteed control over false positives
linear trend detection
compact summarization of results