Chapter 26: Data Mining Part II

Instructor: Prof. Marina Gavrilova



Page 1: Chapter 26: Data Mining Part II

Instructor: Prof. Marina Gavrilova

Page 2: Chapter 26: Data Mining Part II

Goal

The goal of this presentation is to discuss in detail how data mining methods are used in market analysis.

Page 3: Chapter 26: Data Mining Part II

Outline of Presentation

Motivation based on types of learning (supervised/unsupervised)
Market Basket Analysis
Association Rule Algorithms
Abstract problem redux
Breadth-first search
Depth-first search
Summary

Page 4: Chapter 26: Data Mining Part II

What to Learn/Discover?

Statistical Summaries
Generators
Density Estimation
Patterns/Rules
Associations
Clusters/Groups
Exceptions/Outliers
Changes in Patterns Over Time or Location

Page 5: Chapter 26: Data Mining Part II

Market Basket Analysis

Consider a shopping cart filled with several items. Market basket analysis tries to answer the following questions:
Who makes purchases?
What do customers buy together?
In what order do customers purchase items?

Page 6: Chapter 26: Data Mining Part II

Market Basket Analysis

Given: a database of customer transactions, where each transaction is a set of items.
Example: the transaction with TID 111 contains the items {Pen, Ink, Milk, Juice}.

TID  CID  Date    Item   Qty
111  201  5/1/99  Pen    2
111  201  5/1/99  Ink    1
111  201  5/1/99  Milk   3
111  201  5/1/99  Juice  6
112  105  6/3/99  Pen    1
112  105  6/3/99  Ink    1
112  105  6/3/99  Milk   1
113  106  6/5/99  Pen    1
113  106  6/5/99  Milk   1
114  201  7/1/99  Pen    2
114  201  7/1/99  Ink    2
114  201  7/1/99  Juice  4

Page 7: Chapter 26: Data Mining Part II

Market Basket Analysis (Contd.)

Co-occurrences: 80% of all customers purchase items X, Y and Z together.
Association rules: 60% of all customers who purchase X and Y also buy Z.
Sequential patterns: 60% of customers who first buy X also purchase Y within three weeks.

Example: face recognition for vending machine product recommendation.

Page 8: Chapter 26: Data Mining Part II

Confidence and Support

We prune the set of all possible association rules using two interestingness measures:

Support of a rule: X => Y has support s : P(X ∪ Y) = s (the fraction of transactions in which X and Y are purchased together).

Confidence of a rule: X => Y has confidence c : P(Y | X) = c (among the transactions containing X, the fraction that also contain Y).
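To make the two measures concrete, here is a minimal sketch (not from the slides; the helper names are mine), assuming each transaction is represented as a Python set of items:

```python
def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # P(Y | X): among transactions containing X, the fraction that also contain Y.
    containing_x = [t for t in transactions if x <= t]
    return sum(1 for t in containing_x if y <= t) / len(containing_x)
```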

Page 9: Chapter 26: Data Mining Part II

Example

Using the transaction table from Page 6:

{Pen} => {Milk}
Support: 75%, Confidence: 75%

{Ink} => {Pen}
Support: 75%, Confidence: 100%
(Ink appears in 3 of the 4 transactions, so the rule's support is 75%, not 100%; Pen appears in all of them, so the confidence is 100%.)
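These numbers can be checked directly with the helpers sketched above, transcribing the four transactions from the Page 6 table:

```python
transactions = [
    {"Pen", "Ink", "Milk", "Juice"},  # TID 111
    {"Pen", "Ink", "Milk"},           # TID 112
    {"Pen", "Milk"},                  # TID 113
    {"Pen", "Ink", "Juice"},          # TID 114
]

print(support({"Pen", "Milk"}, transactions))       # 0.75
print(confidence({"Pen"}, {"Milk"}, transactions))  # 0.75
print(support({"Ink", "Pen"}, transactions))        # 0.75
print(confidence({"Ink"}, {"Pen"}, transactions))   # 1.0
```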

Page 10: Chapter 26: Data Mining Part II

Example

Find all itemsets with support >= 75% in the transaction table from Page 6.
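One way to answer the exercise is brute force. A sketch, reusing `support` and `transactions` from above: enumerate every itemset over the observed items and keep those meeting the threshold.

```python
from itertools import combinations

items = sorted({i for t in transactions for i in t})
frequent = [set(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(set(c), transactions) >= 0.75]
# frequent -> [{'Ink'}, {'Milk'}, {'Pen'}, {'Ink', 'Pen'}, {'Milk', 'Pen'}]
```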

Page 11: Chapter 26: Data Mining Part II

Example

Find all association rules with support >= 50% in the transaction table from Page 6.
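A corresponding rule-search sketch, again reusing the pieces above: enumerate itemsets with support >= 50%, then split each into a left and right hand side.

```python
from itertools import combinations

for r in range(2, len(items) + 1):
    for c in combinations(items, r):
        full = set(c)
        if support(full, transactions) < 0.5:
            continue  # the combined itemset is too rare; no rule qualifies
        for k in range(1, r):
            for lhs in combinations(c, k):
                x = set(lhs)
                y = full - x
                print(f"{x} => {y}  conf={confidence(x, y, transactions):.2f}")
```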

Page 12: Chapter 26: Data Mining Part II

Market Basket Analysis: Applications

Sample applications:
Direct marketing
Fraud detection for medical insurance
Floor/shelf planning
Web site layout
Cross-selling

Page 13: Chapter 26: Data Mining Part II

Applications of Frequent Itemsets

Market basket analysis
Association rules
Classification (especially: text, rare classes)
Seeds for construction of Bayesian networks
Web log analysis
Collaborative filtering

Page 14: Chapter 26: Data Mining Part II

Association Rule Algorithms

Abstract problem redux
Breadth-first search
Depth-first search

Page 15: Chapter 26: Data Mining Part II

Problem Redux

Abstract:
A set of items {1, 2, ..., k}
A database of transactions (itemsets) D = {T1, T2, ..., Tn}, Tj ⊆ {1, 2, ..., k}

GOAL: Find all itemsets that appear in at least x transactions.
("appears in" == "is a subset of": I ⊆ T means T supports I)

For an itemset I, the number of transactions it appears in is called the support of I. x is called the minimum support.

Concrete:
I = {milk, bread, cheese, ...}
D = { {milk, bread, cheese}, {bread, cheese, juice}, ... }

GOAL: Find all itemsets that appear in at least 1000 transactions.
{milk, bread, cheese} supports {milk, bread}

Page 16: Chapter 26: Data Mining Part II

Problem Redux (Cont.)

Definitions:
An itemset is frequent if it is a subset of at least x transactions. (FI)
An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI)

GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).

Obvious relationship: MFI ⊆ FI

Example:
D = { {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }
Minimum support x = 3
{1,2} is frequent; Support({1,2}) = 4
{1,2,3} is maximally frequent
All maximally frequent itemsets: {1,2,3}
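A sketch reproducing this example (not from the slides): compute FI by brute force over the item universe, then keep the sets with no frequent proper superset to get MFI. Support is an absolute count, as on the slide.

```python
from itertools import combinations

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
x = 3  # minimum support (absolute count)
universe = sorted({i for t in D for i in t})

fi = [set(c)
      for r in range(1, len(universe) + 1)
      for c in combinations(universe, r)
      if sum(1 for t in D if set(c) <= t) >= x]
# A frequent set is maximal iff no other frequent set strictly contains it.
mfi = [s for s in fi if not any(s < t for t in fi)]
# fi  -> {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}
# mfi -> [{1, 2, 3}]
```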

Page 17: Chapter 26: Data Mining Part II

The Itemset Lattice

[Figure: the lattice of all itemsets over {1,2,3,4}, from the empty set {} at the top, through the 1-, 2-, and 3-itemsets, down to {1,2,3,4} at the bottom.]

Page 18: Chapter 26: Data Mining Part II

Frequent Itemsets

[Figure: the itemset lattice over {1,2,3,4}, partitioned into frequent itemsets (upper region) and infrequent itemsets (lower region).]

Page 19: Chapter 26: Data Mining Part II

Breadth First Search: 1-Itemsets

[Figure: the itemset lattice with the 1-itemsets currently being examined; legend: infrequent / frequent / currently examined / don't know.]

Page 20: Chapter 26: Data Mining Part II

Breadth First Search: 2-Itemsets

[Figure: the itemset lattice with the 2-itemsets currently being examined; legend: infrequent / frequent / currently examined / don't know.]

Page 21: Chapter 26: Data Mining Part II

Breadth First Search: 3-Itemsets

[Figure: the itemset lattice with the 3-itemsets currently being examined; legend: infrequent / frequent / currently examined / don't know.]

Page 22: Chapter 26: Data Mining Part II

Breadth First Search: Remarks

We prune infrequent itemsets and avoid counting them.
To find an itemset with k items, we need to count all 2^k of its subsets.
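A minimal level-wise (breadth-first) sketch in the spirit of Apriori (not the slides' own code): `minsup` is an absolute count, and candidate (k+1)-itemsets are generated only from frequent k-itemsets, which is exactly the pruning described above.

```python
def apriori(transactions, minsup):
    def count(s):
        return sum(1 for t in transactions if s <= t)

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if count(frozenset([i])) >= minsup]
    frequent = list(level)
    while level:
        k = len(level[0]) + 1
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        # Supersets of pruned (infrequent) itemsets are never generated.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates if count(c) >= minsup]
        frequent.extend(level)
    return frequent
```

For instance, `apriori(transactions, 3)` on the Page 6 data (3 of 4 transactions = 75%) yields the same five frequent itemsets as the brute-force check on Page 10.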

Page 23: Chapter 26: Data Mining Part II

Depth First Search (1)

[Figure: the itemset lattice at step 1 of a depth-first traversal; legend: infrequent / frequent / currently examined / don't know.]

Page 24: Chapter 26: Data Mining Part II

Depth First Search (2)

[Figure: the itemset lattice at step 2 of a depth-first traversal; legend: infrequent / frequent / currently examined / don't know.]

Page 25: Chapter 26: Data Mining Part II

Depth First Search (3)

[Figure: the itemset lattice at step 3 of a depth-first traversal; legend: infrequent / frequent / currently examined / don't know.]

Page 26: Chapter 26: Data Mining Part II

Depth First Search (4)

[Figure: the itemset lattice at step 4 of a depth-first traversal; legend: infrequent / frequent / currently examined / don't know.]

Page 27: Chapter 26: Data Mining Part II

Depth First Search (5)

[Figure: the itemset lattice at step 5 of a depth-first traversal; legend: infrequent / frequent / currently examined / don't know.]

Page 28: Chapter 26: Data Mining Part II

BFS Versus DFS

Breadth First Search
Prunes infrequent itemsets
Uses anti-monotonicity: every superset of an infrequent itemset is infrequent

Depth First Search
Prunes frequent itemsets
Uses monotonicity: every subset of a frequent itemset is frequent
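For contrast with the level-wise sketch on Page 22, here is a depth-first sketch (again mine, not the slides'): grow one branch of the lattice at a time and backtrack as soon as an extension is infrequent, which anti-monotonicity justifies.

```python
def dfs_frequent(transactions, minsup, prefix=frozenset(), items=None, out=None):
    if items is None:
        items = sorted({i for t in transactions for i in t})
    if out is None:
        out = []
    for idx, item in enumerate(items):
        candidate = prefix | {item}
        if sum(1 for t in transactions if candidate <= t) >= minsup:
            out.append(candidate)
            # Extend only with items that sort after `item`, so each
            # itemset is visited exactly once.
            dfs_frequent(transactions, minsup, candidate, items[idx + 1:], out)
        # Otherwise backtrack: no superset of an infrequent set can be frequent.
    return out
```

This simple version still enumerates every frequent itemset; MFI miners additionally skip counting subsets of itemsets already known to be frequent, the "positive pruning" illustrated on Pages 33 to 35.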

Page 29: Chapter 26: Data Mining Part II

Extensions

Imposing constraints:
Only find rules involving the dairy department
Only find rules involving expensive products
Only find "expensive" rules
Only find rules with "whiskey" on the right hand side
Only find rules with "milk" on the left hand side
Hierarchies on the items
Calendars (every Sunday, every 1st of the month)

Page 30: Chapter 26: Data Mining Part II

Itemset Constraints

Definition: a constraint is an arbitrary property of itemsets.

Examples:
The itemset has support greater than 1000.
No element of the itemset costs more than $40.
The items in the set average more than $20.

Goal: find all itemsets satisfying a given constraint P.
"Solution": if P is a support constraint, use the Apriori algorithm.
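The example constraints are easy to state as predicates on itemsets. A sketch with a hypothetical `price` lookup (the items and dollar figures are illustrative, not from the slides):

```python
price = {"Pen": 2.00, "Ink": 15.00, "Milk": 3.50, "Juice": 45.00}  # hypothetical

def no_item_over_40(itemset):
    # "No element of the itemset costs more than $40." (anti-monotone)
    return all(price[i] <= 40 for i in itemset)

def average_over_20(itemset):
    # "The items in the set average more than $20." (neither monotone
    # nor anti-monotone: adding an item can raise or lower the average)
    return sum(price[i] for i in itemset) / len(itemset) > 20
```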

Page 31: Chapter 26: Data Mining Part II

Two Trivial Observations

Apriori can be applied to any anti-monotone constraint P (if P holds for an itemset, it holds for all of its subsets). Start from the empty set. Prune supersets of sets that do not satisfy P.

The itemset lattice is a Boolean algebra, so Apriori also applies to any monotone constraint Q (if Q holds for an itemset, it holds for all of its supersets). Start from the set of all items instead of the empty set. Prune subsets of sets that do not satisfy Q.

Page 32: Chapter 26: Data Mining Part II

Negative Pruning a Monotone Q

[Figure: the itemset lattice over {1,2,3,4} during negative pruning of a monotone constraint Q; legend: satisfies Q / doesn't satisfy Q / currently examined / don't know.]

Page 33: Chapter 26: Data Mining Part II

Positive Pruning in Apriori

[Figure: the itemset lattice over {1,2,3,4} during positive pruning in Apriori, step 1; legend: frequent / infrequent / currently examined / don't know.]

Page 34: Chapter 26: Data Mining Part II

Positive Pruning in Apriori

[Figure: the itemset lattice over {1,2,3,4} during positive pruning in Apriori, step 2; legend: frequent / infrequent / currently examined / don't know.]

Page 35: Chapter 26: Data Mining Part II

Positive Pruning in Apriori

[Figure: the itemset lattice over {1,2,3,4} during positive pruning in Apriori, step 3; legend: frequent / infrequent / currently examined / don't know.]

Page 36: Chapter 26: Data Mining Part II

The Problem

Current techniques: approximate the difficult constraints.

New goal: given constraints P (a support constraint) and Q (a statistical constraint), find all itemsets that satisfy both P and Q.

Recent solutions: newer algorithms can handle both P and Q.

Page 37: Chapter 26: Data Mining Part II

[Figure: the itemset lattice between {} and the full item set D, showing a region satisfying P (all subsets of its members satisfy P), a region satisfying Q (all supersets of its members satisfy Q), and their overlap satisfying both P and Q.]

Page 38: Chapter 26: Data Mining Part II

Applications

Spatial association rules
Web mining
Market basket analysis
User/customer profiling

Page 39: Chapter 26: Data Mining Part II

Review Questions

What are supervised and unsupervised learning? Is clustering a supervised or an unsupervised type of learning?
What are association rule algorithms?
Differentiate, with the help of an example, breadth-first search and depth-first search.

Page 40: Chapter 26: Data Mining Part II

Useful links

http://www.oracle.com/technology/industries/life_sciences/pdf/ls_sup_unsup_dm.pdf
http://www.autonlab.org/tutorials/
http://www.bandmservices.com/Clustering/Clustering.htm
http://www.cs.sunysb.edu/~skiena/combinatorica/animations/search.html
http://www.codeproject.com/KB/java/BFSDFS.aspx