DATA MINING
Introductory and Advanced Topics
Part II
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.

Data Mining Outline
PART I
– Introduction
– Related Concepts
– Data Mining Techniques
PART II
– Classification
– Clustering
– Association Rules
PART III
– Web Mining
– Spatial Mining
– Temporal Mining

Classification Outline
Classification Problem Overview
Classification Techniques
– Regression
– Distance
– Decision Trees
– Rules
– Neural Networks
Goal: Provide an overview of the classification problem and introduce some of the basic algorithms.

Classification Problem
Given a database D={t1,t2,…,tn} and a set of classes C={C1,…,Cm}, the Classification Problem is to define a mapping f:D→C where each ti is assigned to one class.
It actually divides D into equivalence classes.
Prediction is similar, but may be viewed as having an infinite number of classes.

Classification Examples
Teachers classify students’ grades as A, B, C, D, or F.
Identify mushrooms as poisonous or edible.
Predict when a river will flood.
Identify individuals with credit risks.
Speech recognition
Pattern recognition

Classification Ex: Grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.

[Decision tree figure: successive splits on x at 90, 80, 70, and 60 lead to the leaf classes A, B, C, D, and F.]

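As a minimal sketch, the rule chain above translates directly into code (the function name is illustrative):

```python
def grade(x):
    """Map a numeric score to one of the five grade classes."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # B
```
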
Classification Ex: Letter Recognition
View letters as constructed from 5 components:
[Figure: the letters A, B, C, D, E, and F built from the five strokes.]

Classification Techniques
Approach:
1. Create specific model by evaluating training data (or using domain experts’ knowledge).
2. Apply model developed to new data.
Classes must be predefined.
Most common techniques use DTs, NNs, or are based on distances or statistical methods.

Defining Classes
[Figures: partitioning-based and distance-based definitions of class regions.]

Issues in Classification
Missing Data
– Ignore
– Replace with assumed value
Measuring Performance
– Classification accuracy on test data
– Confusion matrix
– OC Curve

Height Example Data

Name       Gender  Height  Output1  Output2
Kristina   F       1.6m    Short    Medium
Jim        M       2m      Tall     Medium
Maggie     F       1.9m    Medium   Tall
Martha     F       1.88m   Medium   Tall
Stephanie  F       1.7m    Short    Medium
Bob        M       1.85m   Medium   Medium
Kathy      F       1.6m    Short    Medium
Dave       M       1.7m    Short    Medium
Worth      M       2.2m    Tall     Tall
Steven     M       2.1m    Tall     Tall
Debbie     F       1.8m    Medium   Medium
Todd       M       1.95m   Medium   Medium
Kim        F       1.9m    Medium   Tall
Amy        F       1.8m    Medium   Medium
Wynette    F       1.75m   Medium   Medium

Classification Performance
[Figure: the four classification outcomes — true positive, false positive, true negative, false negative.]

Confusion Matrix Example
Using the height data example, with Output1 correct and Output2 the actual assignment:

                 Actual Assignment
Membership   Short   Medium   Tall
Short          0        4       0
Medium         0        5       3
Tall           0        1       2

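A small sketch of how such a matrix can be tallied; the pairs below re-enter the (Output1, Output2) labels from the height example data:

```python
from collections import Counter

# (correct, assigned) label pairs: Output1 vs. Output2 for the 15 tuples.
pairs = [("Short", "Medium"), ("Tall", "Medium"), ("Medium", "Tall"),
         ("Medium", "Tall"), ("Short", "Medium"), ("Medium", "Medium"),
         ("Short", "Medium"), ("Short", "Medium"), ("Tall", "Tall"),
         ("Tall", "Tall"), ("Medium", "Medium"), ("Medium", "Medium"),
         ("Medium", "Tall"), ("Medium", "Medium"), ("Medium", "Medium")]

counts = Counter(pairs)
labels = ["Short", "Medium", "Tall"]
for correct in labels:
    # One row of the confusion matrix: how this class's tuples were assigned.
    print(correct, [counts[(correct, assigned)] for assigned in labels])
```
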
Operating Characteristic Curve

Regression
Assume data fits a predefined function.
Determine best values for regression coefficients c0, c1, …, cn.
Assume an error: y = c0 + c1x1 + … + cnxn + ε
Estimate error using the mean squared error over the training set: MSE = (1/n) Σi (yi − ŷi)²

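A brief sketch of fitting the coefficients by least squares and computing the training-set MSE; the data points here are hypothetical:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # single input attribute
y = np.array([1.9, 4.1, 6.2, 7.8, 10.1])  # observed outputs

# Design matrix with a leading column of ones for the intercept c0.
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimates of the regression coefficients c0, c1.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Mean squared error over the training set.
mse = np.mean((y - X @ coeffs) ** 2)
print(coeffs, mse)
```
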
Linear Regression Poor Fit

Classification Using Regression
Division: Use regression function to divide area into regions.
Prediction: Use regression function to predict a class membership function. Input includes desired class.

Division
Prediction

Classification Using Distance
Place items in the class to which they are “closest”.
Must determine distance between an item and a class.
Classes represented by:
– Centroid: central value.
– Medoid: representative point.
– Individual points
Algorithm: KNN

K Nearest Neighbor (KNN):
Training set includes classes.
Examine the K items nearest to the item being classified.
The new item is placed in the class with the greatest number of close items.
O(q) for each tuple to be classified. (Here q is the size of the training set.)

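A compact KNN sketch under these definitions; the training tuples and the choice k = 3 are illustrative:

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """Vote among the k training tuples nearest to `query` (Euclidean distance)."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Training set of (feature vector, class) pairs, here height alone.
train = [((1.6,), "Short"), ((1.7,), "Short"), ((1.9,), "Medium"),
         ((1.95,), "Medium"), ((2.1,), "Tall"), ((2.2,), "Tall")]
print(knn_classify(train, (1.85,), k=3))  # Medium
```
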
KNN
KNN Algorithm

Classification Using Decision Trees
Partitioning based: Divide search space into rectangular regions.
Tuple placed into class based on the region within which it falls.
DT approaches differ in how the tree is built: DT Induction
Internal nodes associated with an attribute and arcs with values for that attribute.
Algorithms: ID3, C4.5, CART

Decision Tree
Given:
– D = {t1, …, tn} where ti = <ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}
A Decision or Classification Tree is a tree associated with D such that:
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate which can be applied to the attribute at its parent
– Each leaf node is labeled with a class, Cj

DT Induction

DT Splits Area
[Figure: rectangular regions defined by splits on Gender (M/F) and Height.]

Comparing DTs
[Figure: a deep tree vs. a balanced tree.]

DT Issues
Choosing Splitting Attributes
Ordering of Splitting Attributes
Splits
Tree Structure
Stopping Criteria
Training Data
Pruning

Decision Tree Induction is often based on Information Theory.
So:

Information

DT Induction
When all the marbles in the bowl are mixed up, little information is given.
When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given.
Use this approach with DT Induction!

Information/Entropy
Given probabilities p1, p2, …, ps whose sum is 1, Entropy is defined as:
H(p1, p2, …, ps) = Σi pi log(1/pi)
Entropy measures the amount of randomness or surprise or uncertainty.
Goal in classification:
– no surprise
– entropy = 0

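A one-function sketch of this definition; note that the ID3 example two slides ahead evaluates its logarithms in base 10, so the base is left as a parameter:

```python
import math

def entropy(probs, base=10):
    """H = sum over classes of p * log(1/p); zero-probability classes contribute 0."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([1.0]))               # one class, no surprise: 0.0
print(entropy([4/15, 8/15, 3/15]))  # height data starting state: ~0.4384
```
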
Entropy
[Figure: log(1/p) and the binary entropy H(p, 1−p) plotted against p.]

ID3
Creates tree using information theory concepts and tries to reduce the expected number of comparisons.
ID3 chooses the split attribute with the highest information gain:
Gain(D, S) = H(D) − Σi P(Di) H(Di), where the split S divides D into subsets D1, …, Ds.

ID3 Example (Output1)
Starting state entropy:
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
Gain using gender:
– Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
– Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
– Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
– Gain: 0.4384 – 0.34152 = 0.09688
Gain using height:
0.4384 – (2/15)(0.301) = 0.3983
Choose height as the first splitting attribute.

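The gender branch of this computation can be checked in a few lines (log base 10, matching the slide's numbers):

```python
import math

def entropy(probs, base=10):
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

start  = entropy([4/15, 8/15, 3/15])  # 0.4384
female = entropy([3/9, 6/9])          # 0.2764 (3 Short, 6 Medium)
male   = entropy([1/6, 2/6, 3/6])     # 0.4392
gain   = start - (9/15) * female - (6/15) * male
print(gain)  # ~0.0969, the slide's 0.09688 up to intermediate rounding
```
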
C4.5
ID3 favors attributes with a large number of divisions.
Improved version of ID3:
– Missing Data
– Continuous Data
– Pruning
– Rules
– GainRatio: GainRatio(D, S) = Gain(D, S) / H(P(D1), …, P(Ds))

CART
Create binary tree.
Uses entropy.
Formula to choose split point, s, for node t:
Φ(s|t) = 2 PL PR Σj |P(Cj, tL) − P(Cj, tR)|
where P(Cj, tL) is the fraction of training tuples in class Cj that fall on the left of the split, and PL, PR are the probabilities that a tuple in the training set will be on the left or right side of the tree.

CART Example
At the start, there are six choices for split point (right branch on equality):
– Φ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.288
– Φ(1.6) = 0
– Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
– Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
– Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.288
– Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
Split at 1.8, the largest value.

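A sketch of the split measure exactly as the worked example computes it, from the class counts on each side of a candidate split (the helper name is illustrative):

```python
def phi(left_counts, right_counts, n):
    """2 * P_L * P_R * sum over classes of |left count - right count| / n."""
    p_left = sum(left_counts.values()) / n
    p_right = sum(right_counts.values()) / n
    classes = set(left_counts) | set(right_counts)
    diffs = sum(abs(left_counts.get(c, 0) - right_counts.get(c, 0))
                for c in classes) / n
    return 2 * p_left * p_right * diffs

# Split the height data at 1.8 (left: < 1.8, right: >= 1.8):
print(phi({"Short": 4, "Medium": 1}, {"Medium": 7, "Tall": 3}, 15))  # ~0.385
```
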
Classification Using Neural Networks
Typical NN structure for classification:
– One output node per class
– Output value is class membership function value
Supervised learning
For each tuple in the training set, propagate it through the NN. Adjust weights on edges to improve future classification.
Algorithms: Propagation, Backpropagation, Gradient Descent

NN Issues
Number of source nodes
Number of hidden layers
Training data
Number of sinks
Interconnections
Weights
Activation Functions
Learning Technique
When to stop learning

Decision Tree vs. Neural Network

Propagation
[Figure: a tuple's inputs propagated through the network to the output.]

NN Propagation Algorithm
Example Propagation

NN Learning
Adjust weights to perform better with the associated test data.
Supervised: Use feedback from knowledge of the correct classification.
Unsupervised: No knowledge of the correct classification needed.

NN Supervised Learning
Supervised Learning
Possible error values assuming the output from node i is yi but should be di:
Change weights on arcs based on estimated error.

NN Backpropagation
Propagate changes to weights backward from the output layer to the input layer.
Delta Rule: Δwij = c xij (dj − yj)
Gradient Descent: technique to modify the weights in the graph.

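The delta rule for a single arc as a sketch; c is the learning-rate constant and the numbers are hypothetical:

```python
def delta_rule(w, x, d, y, c=0.1):
    """One weight update: w_ij <- w_ij + c * x_ij * (d_j - y_j)."""
    return w + c * x * (d - y)

# Output 0.2 where 1.0 was desired nudges the weight upward:
print(delta_rule(w=0.5, x=1.0, d=1.0, y=0.2))  # 0.58
```
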
Backpropagation
[Figure: error propagated backward through the network.]

Backpropagation Algorithm
Gradient Descent
Gradient Descent Algorithm
Output Layer Learning
Hidden Layer Learning

Types of NNs
Different NN structures are used for different problems.
Perceptron
Self Organizing Feature Map
Radial Basis Function Network

Perceptron
Perceptron is one of the simplest NNs.
No hidden layers.

Perceptron Example
Suppose:
– Summation: S = 3x1 + 2x2 − 6
– Activation: if S > 0 then 1 else 0

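This example perceptron transcribed directly (two inputs, no hidden layer):

```python
def perceptron(x1, x2):
    s = 3 * x1 + 2 * x2 - 6      # summation from the slide
    return 1 if s > 0 else 0     # threshold activation

print(perceptron(1, 1))  # S = -1 -> class 0
print(perceptron(2, 1))  # S =  2 -> class 1
```
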
Self Organizing Feature Map (SOFM)
Competitive unsupervised learning
Observe how neurons work in the brain:
– Firing impacts firing of those near
– Neurons far apart inhibit each other
– Neurons have specific nonoverlapping tasks
Ex: Kohonen Network

Kohonen Network
Competitive layer – viewed as a 2D grid
Similarity between competitive nodes and input nodes:
– Input: X = <x1, …, xh>
– Weights: <w1i, …, whi>
– Similarity defined based on dot product
Competitive node most similar to input “wins”
Winning node weights (as well as surrounding node weights) increased.

Radial Basis Function Network
RBF function has Gaussian shape
RBF Networks:
– Three layers
– Hidden layer – Gaussian activation function
– Output layer – linear activation function

Classification Using Rules
Perform classification using If-Then rules.
Classification Rule: r = <a, c>, with antecedent a and consequent c.
May generate rules from other techniques (DT, NN) or generate directly.
Algorithms: Gen, RX, 1R, PRISM

Generating Rules from DTs
Generating Rules Example
Generating Rules from NNs
1R Algorithm
1R Example
PRISM Algorithm
PRISM Example

Decision Tree vs. Rules
– Tree has implied order in which splitting is performed.
– Tree created based on looking at all classes.
– Rules have no ordering of predicates.
– Only need to look at one class to generate its rules.

Clustering Outline
Clustering Problem Overview
Clustering Techniques
– Hierarchical Algorithms
– Partitional Algorithms
– Genetic Algorithm
– Clustering Large Databases
Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms.

Clustering Examples
Segment customer database based on similar buying patterns.
Group houses in a town into neighborhoods based on similar features.
Identify new plant species.
Identify similar Web usage patterns.

Clustering Example
Clustering Houses
[Figure: the same houses clustered by geographic distance vs. by size.]

Clustering vs. Classification
No prior knowledge:
– Number of clusters
– Meaning of clusters
Unsupervised learning

Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability

Impact of Outliers on Clustering

Clustering Problem
Given a database D={t1,t2,…,tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f:D→{1,…,k} where each ti is assigned to one cluster Kj, 1<=j<=k.
A Cluster, Kj, contains precisely those tuples mapped to it.
Unlike the classification problem, clusters are not known a priori.

Types of Clustering
Hierarchical – Nested set of clusters created.
Partitional – One set of clusters created.
Incremental – Each element handled one at a time.
Simultaneous – All elements handled together.
Overlapping/Non-overlapping

Clustering Approaches
[Figure: taxonomy — hierarchical (agglomerative, divisive), partitional, categorical, and large DB (sampling, compression).]

Cluster Parameters
Distance Between Clusters
Single Link: smallest distance between points
Complete Link: largest distance between points
Average Link: average distance between points
Centroid: distance between centroids

Hierarchical Clustering
Clusters are created in levels, actually creating sets of clusters at each level.
Agglomerative:
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom up
Divisive:
– Initially all items in one cluster
– Large clusters are successively divided
– Top down

Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link

Dendrogram
Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
Each level shows clusters for that level.
– Leaf – individual clusters
– Root – one cluster
A cluster at level i is the union of its children clusters at level i+1.

Levels of Clustering

Agglomerative Example

   A  B  C  D  E
A  0  1  2  2  3
B  1  0  2  4  3
C  2  2  0  1  5
D  2  4  1  0  3
E  3  3  5  3  0

[Dendrogram: A-B and C-D merge at threshold 1; {A,B} and {C,D} merge at threshold 2; E joins at threshold 3.]

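A sketch that reproduces these levels: at each threshold, the clusters are the connected components of the graph whose edges join items at or under that distance (a small union-find does the merging):

```python
dist = {("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
        ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
        ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3}

def clusters_at(threshold, items="ABCDE"):
    """Connected components using only edges of distance <= threshold."""
    parent = {i: i for i in items}
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for (a, b), d in dist.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # union the two components
    groups = {}
    for i in items:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

for t in (1, 2, 3):
    print(t, clusters_at(t))  # the dendrogram levels above
```
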
MST Example

   A  B  C  D  E
A  0  1  2  2  3
B  1  0  2  4  3
C  2  2  0  1  5
D  2  4  1  0  3
E  3  3  5  3  0

[Figure: a minimum spanning tree over the five items.]

Agglomerative Algorithm

Single Link
View all items with links (distances) between them.
Finds maximal connected components in this graph.
Two clusters are merged if there is at least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.

MST Single Link Algorithm
Single Link Clustering

Partitional ClusteringPartitional Clustering
NonhierarchicalNonhierarchical Creates clusters in one step as Creates clusters in one step as
opposed to several steps.opposed to several steps. Since only one set of clusters is output, Since only one set of clusters is output,
the user normally has to input the the user normally has to input the desired number of clusters, k.desired number of clusters, k.
Usually deals with static sets.Usually deals with static sets.
Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA

MST Algorithm
Squared Error
Minimize the squared error: se = Σj Σt∈Kj ||t − mj||², where mj is the center of cluster Kj.

Squared Error Algorithm

K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets of clusters until the desired set is reached.
High degree of similarity among elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).

K-Means Example
Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3, m2=4
K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16
K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
Stop, as the clusters with these means are the same.

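A one-dimensional sketch that reproduces this walkthrough; a real implementation must also guard against empty clusters, as done here:

```python
def k_means_1d(points, means):
    """Lloyd's iteration in one dimension; stops when the means are stable."""
    while True:
        clusters = [[] for _ in means]
        for p in points:  # assign each point to the nearest mean
            i = min(range(len(means)), key=lambda j: abs(p - means[j]))
            clusters[i].append(p)
        new_means = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, means)]
        if new_means == means:  # assignments can no longer change
            return clusters, means
        means = new_means

clusters, means = k_means_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [3.0, 4.0])
print(clusters, means)  # {2,3,4,10,11,12} and {20,25,30} with means 7 and 25
```
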
K-Means Algorithm

Nearest Neighbor
Items are iteratively merged into the existing clusters that are closest.
Incremental
A threshold, t, is used to determine if items are added to existing clusters or a new cluster is created.

Nearest Neighbor Algorithm

PAM
Partitioning Around Medoids (PAM) (K-Medoids)
Handles outliers well.
Ordering of input does not impact results.
Does not scale well.
Each cluster represented by one item, called the medoid.
Initial set of k medoids randomly chosen.

PAM

PAM Cost Calculation
At each step in the algorithm, medoids are changed if the overall cost is improved.
Cjih – cost change for an item tj associated with swapping medoid ti with non-medoid th.

PAM Algorithm

BEA
Bond Energy Algorithm
Database design (physical and logical)
Vertical fragmentation
Determine affinity (bond) between attributes based on common usage.
Algorithm outline:
1. Create affinity matrix
2. Convert to BOND matrix
3. Create regions of close bonding

BEA
Modified from [OV99]

Genetic Algorithm Example
{A,B,C,D,E,F,G,H}
Randomly choose initial solution:
{A,C,E} {B,F} {D,G,H} or 10101000, 01000100, 00010011
Suppose crossover at point four and choose 1st and 3rd individuals:
10100011, 01000100, 00011000
What should the termination criteria be?

GA Algorithm

Clustering Large Databases
Most clustering algorithms assume a large data structure which is memory resident.
Clustering may be performed first on a sample of the database and then applied to the entire database.
Algorithms:
– BIRCH
– DBSCAN
– CURE

Desired Features for Large Databases
One scan (or less) of DB
Online
Suspendable, stoppable, resumable
Incremental
Work with limited main memory
Different techniques to scan (e.g. sampling)
Process each tuple once

BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
Incremental, hierarchical, one scan
Save clustering information in a tree
Each entry in the tree contains information about one cluster
New nodes inserted in closest entry in tree

Clustering Feature
CF Triple: (N, LS, SS)
– N: Number of points in cluster
– LS: Sum of points in the cluster
– SS: Sum of squares of points in the cluster
CF Tree:
– Balanced search tree
– Node has CF triple for each child
– Leaf node represents a cluster and has a CF value for each subcluster in it.
– Subcluster has maximum diameter

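A sketch of the CF triple for one-dimensional points; merging two clusters is componentwise addition, which is what lets the tree be maintained in a single scan:

```python
def cf(points):
    """CF triple (N, LS, SS) summarizing a cluster of 1-D points."""
    return (len(points),
            sum(points),                  # linear sum LS
            sum(p * p for p in points))   # sum of squares SS

def merge(cf_a, cf_b):
    """Merging clusters adds their CF triples componentwise."""
    return tuple(a + b for a, b in zip(cf_a, cf_b))

n, ls, ss = merge(cf([1.0, 2.0]), cf([4.0]))
print(n, ls / n)  # size 3 and centroid ~2.33 of the merged cluster
```
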
BIRCH Algorithm
Improve Clusters

DBSCANDBSCAN
Density Based Spatial Clustering of Density Based Spatial Clustering of Applications with NoiseApplications with Noise
Outliers will not effect creation of cluster.Outliers will not effect creation of cluster. InputInput
– MinPts MinPts – minimum number of points in – minimum number of points in clustercluster
– EpsEps – for each point in cluster there must – for each point in cluster there must be another point in it less than this distance be another point in it less than this distance away.away.
DBSCAN Density Concepts
Eps-neighborhood: Points within Eps distance of a point.
Core point: Eps-neighborhood dense enough (MinPts).
Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (Eps) and q is a core point.
Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points.

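The first two concepts in code form, as a sketch; the points, Eps, and MinPts values are illustrative:

```python
import math

def eps_neighborhood(points, p, eps):
    """All points within Eps of p (p itself included)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """Core point: its Eps-neighborhood is dense enough."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

pts = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(is_core(pts, (0, 0), eps=1.5, min_pts=3))  # True
print(is_core(pts, (5, 5), eps=1.5, min_pts=3))  # False: an outlier
```
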
Density Concepts
DBSCAN Algorithm

CURECURE
Clustering Using RepresentativesClustering Using Representatives Use many points to represent a cluster Use many points to represent a cluster
instead of only oneinstead of only one Points will be well scatteredPoints will be well scattered
CURE Approach
CURE Algorithm
CURE for Large Databases
Comparison of Clustering Techniques

Association Rules Outline
Goal: Provide an overview of basic Association Rule mining techniques.
Association Rules Problem Overview
– Large itemsets
Association Rules Algorithms
– Apriori
– Sampling
– Partitioning
– Parallel Algorithms
Comparing Techniques
Incremental Algorithms
Advanced AR Techniques

Example: Market Basket Data
Items frequently purchased together: Bread ⇒ PeanutButter
Uses:
– Placement
– Advertising
– Sales
– Coupons
Objective: increase sales and reduce costs

Association Rule Definitions
Set of items: I = {I1, I2, …, Im}
Transactions: D = {t1, t2, …, tn}, tj ⊆ I
Itemset: {Ii1, Ii2, …, Iik} ⊆ I
Support of an itemset: Percentage of transactions which contain that itemset.
Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.

Association Rules Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%.

Association Rule Definitions
Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X ⇒ Y: Percentage of transactions that contain X ∪ Y
Confidence of AR (α) X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X

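Both measures in code, over hypothetical transactions chosen to be consistent with the running example (so support({Bread, PeanutButter}) = 60%):

```python
txns = [frozenset(t) for t in (
    {"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"})]

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """support(X u Y) / support(X) for the rule X => Y."""
    return support(x | y, transactions) / support(x, transactions)

x, y = frozenset({"Bread"}), frozenset({"PeanutButter"})
print(support(x | y, txns))    # 0.6
print(confidence(x, y, txns))  # 0.75
```
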
Association Rules Ex (cont’d)

Association Rule Problem
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
Link Analysis
NOTE: Support of X ⇒ Y is the same as the support of X ∪ Y.

Association Rule Techniques
1. Find large itemsets.
2. Generate rules from frequent itemsets.

Algorithm to Generate ARs

Apriori
Large Itemset Property: Any subset of a large itemset is large.
Contrapositive: If an itemset is not large, none of its supersets are large.

Large Itemset Property
Apriori Ex (cont’d)
s = 30%, α = 50%

Apriori Algorithm
1. C1 = Itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5.   i = i + 1;
6.   Ci = Apriori-Gen(Li-1);
7.   Count Ci to determine Li;
8. until no more large itemsets found;

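A compact sketch of this loop, including the Apriori-Gen join and prune described on the next slide; the transactions are the same hypothetical five used for support and confidence above:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search for all large (frequent) itemsets."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n
    singles = {i for t in transactions for i in t}
    large = {frozenset([i]) for i in singles
             if support(frozenset([i])) >= min_support}
    result, k = set(large), 2
    while large:
        # Join size-(k-1) itemsets agreeing on k-2 items, then prune any
        # candidate with a subset that is not large.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, k - 1))}
        large = {c for c in candidates if support(c) >= min_support}
        result |= large
        k += 1
    return result

txns = [frozenset(t) for t in (
    {"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"})]
print(apriori(txns, min_support=0.3))  # large singletons plus {Bread, PeanutButter}
```
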
Apriori-Gen
Generate candidates of size i+1 from large itemsets of size i.
Approach used: join large itemsets of size i if they agree on i−1 items.
May also prune candidates which have subsets that are not large.

Apriori-Gen Example
Apriori-Gen Example (cont’d)

Apriori Adv/DisadvApriori Adv/Disadv
Advantages:Advantages:– Uses large itemset property.Uses large itemset property.– Easily parallelizedEasily parallelized– Easy to implement.Easy to implement.
Disadvantages:Disadvantages:– Assumes transaction database is memory Assumes transaction database is memory
resident.resident.– Requires up to m database scans.Requires up to m database scans.
Sampling
Large databases
Sample the database and apply Apriori to the sample.
Potentially Large Itemsets (PL): Large itemsets from sample
Negative Border (BD-):
– Generalization of Apriori-Gen applied to itemsets of varying sizes.
– Minimal set of itemsets which are not in PL, but whose subsets are all in PL.

Negative Border Example
[Figure: itemset lattice showing PL and BD-(PL).]

Sampling Algorithm
1. Ds = sample of Database D;
2. PL = Large itemsets in Ds using smalls;
3. C = PL ∪ BD-(PL);
4. Count C in Database using s;
5. ML = large itemsets in BD-(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD-;
8. Count C in Database;

Sampling Example
Find AR assuming s = 20%
Ds = {t1, t2}
smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
BD-(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD- generates all remaining itemsets.

Sampling Adv/Disadv
Advantages:
– Reduces number of database scans to one in the best case and two in the worst.
– Scales better.
Disadvantages:
– Potentially large number of candidates in second pass

Partitioning
Divide database into partitions D1, D2, …, Dp
Apply Apriori to each partition
Any large itemset must be large in at least one partition.

Partitioning Algorithm
1. Divide D into partitions D1, D2, …, Dp;
2. For i = 1 to p do
3.   Li = Apriori(Di);
4. C = L1 ∪ … ∪ Lp;
5. Count C on D to generate L;

Partitioning Example
s = 10%
D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}

Partitioning Adv/Disadv
Advantages:
– Adapts to available main memory
– Easily parallelized
– Maximum number of database scans is two.
Disadvantages:
– May have many candidates during second scan.

Parallelizing AR Algorithms
Based on Apriori
Techniques differ:
– What is counted at each site
– How data (transactions) are distributed
Data Parallelism:
– Data partitioned
– Count Distribution Algorithm
Task Parallelism:
– Data and candidates partitioned
– Data Distribution Algorithm

Count Distribution Algorithm (CDA)
1. Place data partition at each site.
2. In parallel at each site do
3.   C1 = Itemsets of size one in I;
4.   Count C1;
5.   Broadcast counts to all sites;
6.   Determine global large itemsets of size 1, L1;
7.   i = 1;
8.   Repeat
9.     i = i + 1;
10.    Ci = Apriori-Gen(Li-1);
11.    Count Ci;
12.    Broadcast counts to all sites;
13.    Determine global large itemsets of size i, Li;
14.  until no more large itemsets found;

CDA Example

Data Distribution Algorithm (DDA)
1. Place data partition at each site.
2. In parallel at each site do
3.   Determine local candidates of size 1 to count;
4.   Broadcast local transactions to other sites;
5.   Count local candidates of size 1 on all data;
6.   Determine large itemsets of size 1 for local candidates;
7.   Broadcast large itemsets to all sites;
8.   Determine L1;
9.   i = 1;
10.  Repeat
11.    i = i + 1;
12.    Ci = Apriori-Gen(Li-1);
13.    Determine local candidates of size i to count;
14.    Count, broadcast, and find Li;
15.  until no more large itemsets found;

DDA Example

Comparing AR Techniques
Target
Type
Data Type
Data Source
Technique
Itemset Strategy and Data Structure
Transaction Strategy and Data Structure
Optimization
Architecture
Parallelism Strategy

Comparison of AR Techniques
Hash Tree

Incremental Association Rules
Generate ARs in a dynamic database.
Problem: algorithms assume a static database.
Objective:
– Know large itemsets for D
– Find large itemsets for D ∪ ΔD
Must be large in either D or ΔD.
Save Li and counts.

Note on ARs
Many applications outside market basket data analysis:
– Prediction (telecom switch failure)
– Web usage mining
Many different types of association rules:
– Temporal
– Spatial
– Causal

Advanced AR Techniques
Generalized Association Rules
Multiple-Level Association Rules
Quantitative Association Rules
Using multiple minimum supports
Correlation Rules

Measuring Quality of Rules
Support
Confidence
Interest
Conviction
Chi Squared Test