CSE 8392 Spring 1999 1
DATA MINING: CORE TOPICS
Classification
Professor Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas 75275
(214) 768-3087
fax: (214) 768-3085
email: [email protected]
www: http://www.seas.smu.edu/~mhd
January 1999
Classification
• “Classify a set of data based on their values in certain attributes” (R[2], p868)
• Each grouping is a class
• Equivalence Class
• Given a set K = {k1, k2, k3, …, ky},
  kx -> (A, B, C, D, E, …) where this mapping partitions K
• Similar to estimation
Classification Examples
• Financial market trends (bull, bear)
• Images (raster, vector)
• Loan approval (yes, no)
• Medical diagnosis
• Detecting faults in industry applications
Basic Classification Techniques (Kennedy,Ch3)
• Boundaries for decision regions
  – Ex: Loan threshold
• Probability Techniques
  – p(x|C)
  – Figure 3-4, p 3-5, Kennedy
  – May also use prior knowledge of class membership probabilities P(C)
  – Select class that maximizes P(C)p(x|C)
  – Probability x is in class C is proportional to the probability that any input is in C and that class C contains x
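The "select the class that maximizes P(C)p(x|C)" rule can be sketched as follows. The loan-style class names, the priors, and the likelihood model are all hypothetical values chosen for illustration, not taken from Kennedy.

```python
# Bayes decision rule: assign x to the class maximizing P(C) * p(x|C).
# Priors and the likelihood model below are hypothetical, for illustration.

priors = {"approve": 0.7, "deny": 0.3}   # P(C): prior class probabilities

def likelihood(x, c):
    """p(x|C): class-conditional likelihood of input x (toy model)."""
    # 'approve' favors high scores, 'deny' favors low ones, on a 0-100 scale.
    return x / 100.0 if c == "approve" else (100.0 - x) / 100.0

def classify(x):
    # Pick the class with the largest P(C) * p(x|C).
    return max(priors, key=lambda c: priors[c] * likelihood(x, c))

print(classify(90))
print(classify(10))
```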
Supervised Induction Techniques
• Output the probability of class membership based on input values using some estimation technique (Decision Trees, Neural Nets)
  – Use a sample database as a training set
  – Analyze training data
  – Develop model using attributes of data
  – Use these class descriptions on rest of data
• Note: May be many different class descriptions
Posterior Probabilities (Kennedy)
• P(C|x) - Probability input x belongs to class C
• Suppose m classes; look at P(C1|x), …, P(Cm|x)
• Classify by assigning x to the class with the highest posterior probability
• Look at training data and assign posterior probabilities to example patterns
• May not work well with complex tuples in large databases
• Fig 3-7, p3-9
Linear Regression
• Linear mapping of input attributes to desired output
• Error may exist:
  y = w0 + w1 x1 + … + wn xn + error
  Here the xi are input attributes
• Least-Squares Minimization: sum of the squared error terms is minimized over the database (training set)
• Find weights: EQ3, EQ4, Kennedy, p 10-14
• May be used as a baseline comparison approach
• May not work well with complex databases:
  – Not all data values known
  – Values may not be numeric
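Least-squares minimization can be sketched for a single input attribute, where the closed-form solution is well known; the toy training data below are hypothetical.

```python
# Least-squares fit for y = w0 + w1*x + error over a toy training set.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]          # roughly y = 1 + 2x, with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form solution minimizing the sum of squared errors:
#   w1 = cov(x, y) / var(x),  w0 = mean(y) - w1 * mean(x)
w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
w0 = mean_y - w1 * mean_x

print(round(w0, 2), round(w1, 2))
```

With more than one attribute the same idea leads to the normal equations (the EQ3/EQ4 in Kennedy), typically solved with a linear-algebra library.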
Similarity Measures
• Describe each input tuple as a vector: D1 = <d11, …, d1n>
• Define Sim(D1, D2) where:
  – Normalized (0 - no similarity, 1 - identical)
  – Usually assumes all values are numeric
• Represent each class with a vector Ci
  – May be determined as the centroid of the class's vectors from the training set
• Assign each tuple Dj to the class i for which Sim(Dj, Ci) is maximized
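A minimal sketch of this centroid scheme, using cosine similarity as one common choice of Sim that is normalized to [0, 1] for non-negative attribute values; the training vectors and class names are hypothetical.

```python
import math

# Similarity-based classification: represent each class by the centroid of
# its training vectors, then assign a tuple to the class whose centroid it
# is most similar to. All data below are hypothetical.

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_sim(a, b):
    # In [0, 1] for non-negative vectors: 0 = no similarity, 1 = identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

training = {
    "low":  [[1.0, 0.0], [2.0, 1.0]],
    "high": [[0.0, 3.0], [1.0, 4.0]],
}
centroids = {c: centroid(vs) for c, vs in training.items()}

def classify(d):
    # Assign d to the class i for which Sim(d, Ci) is maximized.
    return max(centroids, key=lambda c: cosine_sim(d, centroids[c]))

print(classify([2.0, 0.5]))
print(classify([0.5, 2.0]))
```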
K Nearest Neighbors (Kennedy)
• Store all input-output pairs in training set
• Define distance function
• When a tuple needs to be classified, determine distance between it and all items in the training set
• Fig 10-12, p10-38
• Base answer on K nearest items in training set
• Algorithm p 10-39
• Memory intensive & Slow
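The steps above can be sketched directly; the training pairs and the use of Euclidean distance are hypothetical choices for illustration.

```python
import math
from collections import Counter

# K nearest neighbors: store all input-output pairs, compute the distance
# from the query to every stored tuple, and vote among the K closest.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 5.5), "B"), ((7.0, 6.5), "B"),
]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(x, k=3):
    # Sorting the whole training set per query is what makes the method
    # memory intensive and slow, as the slide notes.
    nearest = sorted(training, key=lambda pair: euclidean(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((2.0, 2.0)))
print(knn_classify((6.0, 6.2)))
```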
Decision Trees (Kennedy, Section 10.4.10)
• Similar to Twenty Questions: Fig 8-5, p144, Barquin
• Internal Nodes: decision points based on one attribute
• Leaves: identify classes
• Classification Process: input tuple moves through the tree based on its attribute values
• Difficult Part - constructing the tree (so that it is efficient)
(Try playing twenty questions with a young child!)
• Training Set used to build tree
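The classification process itself is simple once a tree exists; the tiny loan-style tree below is hypothetical, and building a good tree from a training set is the hard part the slide points to.

```python
# Classification with a decision tree: start at the root and follow branches
# chosen by attribute values until a leaf identifies the class.
# The tree and attribute names below are hypothetical.

tree = {
    "attr": "income",
    "branches": {
        "high": {"class": "approve"},
        "low": {
            "attr": "debt",
            "branches": {"low": {"class": "approve"}, "high": {"class": "deny"}},
        },
    },
}

def classify(tuple_, node=tree):
    # Internal nodes test one attribute; leaves name the class.
    while "class" not in node:
        node = node["branches"][tuple_[node["attr"]]]
    return node["class"]

print(classify({"income": "high", "debt": "high"}))
print(classify({"income": "low", "debt": "high"}))
```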
Decision Tree Issues
• Attributes
• Splits (Categorical, Discrete, Continuous)
• Ordering of attributes in tree
• Determining when to stop
• Must perform perfectly on training set
• Pruning of tree (remove branches)
Decision Tree Advantages/Disadvantages
• Advantages
  – Easy to understand
  – Efficient (time)
• Disadvantages
  – May be difficult to use with continuous data
  – Limited to problems solved by dividing into subrectangles
  – Not flexible (no automatic revisions if incorrect)
  – No way to handle missing data
  – May have overfitting
  – Pruning combats overfitting (but it may induce other errors)
ID-3 (R[2])
• Decision Tree learning system based on information theory
• Attempts to minimize expected number of tests on a tuple
• Formalizes the approach adults have to twenty questions!
• Picks attributes with highest information gain first
• Entropy: -Σi pi ln(pi)
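Entropy and the information gain of a split can be computed directly; the class counts below are hypothetical, and log base 2 is used here (the slide's ln works equally well and changes only the units).

```python
import math

# ID3's splitting criterion: entropy of a class distribution, and the
# information gain of a split. Counts are hypothetical.

def entropy(counts):
    """Entropy = -sum_i p_i * log2(p_i) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder

# A balanced 50/50 node has entropy 1 bit; a perfect split gains all of it.
print(entropy([5, 5]))
print(information_gain([5, 5], [[5, 0], [0, 5]]))
```

ID3 picks the attribute with the highest such gain first, which is what keeps the expected number of tests on a tuple small.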
CART (Kennedy,p10-56)
• Builds binary decision tree
• Exhaustive search to determine best tree, where best is defined by goodness of split:
  Φ(s/t) = 2 PL PR Σj |P(j/tL) - P(j/tR)|, summed over the classes j
• Optimal splitting:
  Φ(s'/t) = Maxi Φ(si/t)
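The goodness-of-split measure can be sketched as a small function over per-side class counts; the counts are hypothetical.

```python
# CART's goodness of split at node t for candidate split s:
#   phi(s/t) = 2 * P_L * P_R * sum_j |P(j/t_L) - P(j/t_R)|
# P_L, P_R are the fractions of tuples sent left/right; P(j/t_L), P(j/t_R)
# are the class-j proportions on each side. Counts below are hypothetical.

def goodness(left_counts, right_counts):
    n_left, n_right = sum(left_counts), sum(right_counts)
    total = n_left + n_right
    p_l, p_r = n_left / total, n_right / total
    spread = sum(abs(l / n_left - r / n_right)
                 for l, r in zip(left_counts, right_counts))
    return 2 * p_l * p_r * spread

# A perfect binary split of two balanced classes scores 1.0 ...
print(goodness([10, 0], [0, 10]))
# ... and a split that separates nothing scores 0.0.
print(goodness([5, 5], [5, 5]))
```

The optimal split s' is then the candidate si maximizing this value.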
Neural Networks
• Determine “predictions using ‘neurons’ (computations) and their interconnections (weighted inputs).” (p142, Barquin)
• Example: Fig 8-4, p143, Barquin
  – Input values of attributes at left
  – Weights associated with links between nodes
  – Classification produced at output on right
• Neural Net and Decision Tree: Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees, and Alessandro Zanasi, Discovering Data Mining: From Concept to Implementation, Prentice-Hall, 1998, p71, p74.
Neural Nets
• Number of processing layers between input and output
• Each processing unit (node) connected to all units in the next layer
• Construct neural net:
  – Determine network structure (modeler)
  – Weights “learned” by applying the net to the training set
  – Backpropagation used to adjust weights:
    • Desired output provided with training data
    • Actual network output subtracted from desired output to produce the error
    • Connection weights changed based on a minimization method called gradient descent
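The error-driven weight update can be sketched for a single linear unit (real nets repeat this across layers via backpropagation); the inputs, initial weights, target, and learning rate are all hypothetical.

```python
# One gradient-descent weight update for a single linear unit:
# error = desired - actual, and each weight moves down the gradient of the
# squared error. All values below are hypothetical.

inputs = [1.0, 0.5]          # attribute values presented to the unit
weights = [0.2, -0.1]        # current connection weights
desired = 1.0                # target output supplied with the training data
rate = 0.1                   # gradient-descent step size (learning rate)

actual = sum(w * x for w, x in zip(weights, inputs))
error = desired - actual     # actual output subtracted from desired output

# Gradient descent on squared error: w_i <- w_i + rate * error * x_i
weights = [w + rate * error * x for w, x in zip(weights, inputs)]

print(error)
print(weights)
```

Repeating this over many training passes is what makes the approach expensive, as noted on the next slide.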
Historical Note:
• Neural network weights are adjusted based on whether or not the prediction is good.
• Earlier IR systems used “feedback” to adjust weights in document vectors based on precision/recall values
Neural Nets Advantages/Disadvantages
• Advantages
  – Low classification error rates
  – Robust in noisy environments
  – Rules may be more concise if there are strong relationships in the attributes
  – Provides high degree of accuracy
• Disadvantages
  – Multiple passes over database - very expensive
  – Classification process not apparent - “black box”
    • Embedded within graph structure and link weights
    • May be difficult to generate rules
    • Difficult to infuse domain knowledge into the net
  – May have overfitting
  – May fail to converge
Rule Extraction (RX) Algorithm (R[4])
• Cluster
• Enumerate ruleset for network outputs
• Enumerate ruleset for network inputs
• Merge input and output rulesets
• Algorithm p958
• Neural Net Mining Example
  – Lu, section 2.2
Neural Net Data Mining Research Issues
• Network training time
• Possible methods for incremental training of the network
• Use of domain experts
• Reduction of inputs to the network
Bayesian Classification (Fayyad, Ch6)
• In Bayesian statistics, we measure the likelihood of observed data y given each value of x, e.g. f(y|x)
• Data mining goal: find the most probable set of class descriptions
• NASA AutoClass (Fayyad, Ch6)
  – Discovers automatic classifications in data
  – Like clustering
  – No prior definition of classes
  – Classes may be unknown to experts
AutoClass
• Records represented as vectors of values
• pdf: gives the “probability of observing an instance possessing any particular attribute value vector.” (p156, Cheeseman)
• Model: finite mixture distribution
  – Interclass mixture probability P(Xi ∈ Cj | Vc, Tc, S, I)
• Two levels of search: maximum a posteriori (MAP) parameters, most probable density function
• Each model is a product of independent probability distributions over attribute subsets
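The finite-mixture idea can be illustrated in a highly simplified form: given fixed class weights and per-class distributions, class-membership probabilities follow by normalizing weight times likelihood. All parameters below are hypothetical, and this omits AutoClass's actual parameter search entirely.

```python
import math

# A finite mixture in miniature: each class j has a mixture weight pi_j and a
# (here one-dimensional Gaussian) attribute distribution. The probability that
# instance x belongs to class j is proportional to pi_j * p(x | class j).
# AutoClass searches for these parameters; here they are simply fixed.

classes = {
    "C1": {"weight": 0.6, "mean": 0.0, "std": 1.0},
    "C2": {"weight": 0.4, "mean": 5.0, "std": 1.0},
}

def gaussian(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def membership(x):
    # Normalize weight * likelihood over all classes.
    joint = {j: c["weight"] * gaussian(x, c["mean"], c["std"]) for j, c in classes.items()}
    total = sum(joint.values())
    return {j: v / total for j, v in joint.items()}

probs = membership(0.2)
print(max(probs, key=probs.get))
```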
AutoClass Case Studies
• IRAS: Spectra Classification
  – “Minimum” information vs. full access
  – “Good” outputs are domain specific
• DNA Codes
  – Results may extend beyond limits of database
• LandSat Pixels
  – Parallelization
  – Undocumented preprocessing
AutoClass Issues
• Interaction between domain experts and machine
  – Essential for good results
  – Experts and machine each have unique strengths
  – Iterative process
• Ockham Factor
  – If a particular parameter does not noticeably improve the model, reject models with this parameter
Classification Summary
• Watch out for preprocessing; key words:
– Calibration
– Corrected
– Averaged
– Normalized
– Adjusted
– Compressed
– Subsets
– Partitioned
– Representative
• Uncover biases in data collection
• Engage in full partnership with experts and obtain domain-specific results!