
Page 1: Review

Review

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 2: Review

PatReco: Introduction

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 3: Review

PatReco: Applications

Speech/audio/music/sounds: speech recognition, speaker verification/identification

Image/video: OCR, AVASR, face identification, fingerprint identification, video segmentation

Text/Language: machine translation, document classification, language modeling, text understanding

Medical/Biology: disease diagnosis, DNA sequencing, gene-disease models

Other data: user modeling (books/music), linguistic analysis (web), games

Page 4: Review

Basic Concepts

Why statistical modeling?

Variability: differences between two examples of the same class in training

Mismatch: differences between two examples of the same class (one in training, one in testing)

Learning modes:

Supervised learning: class labels known

Unsupervised learning: class labels unknown

Reinforcement learning: only positive/negative feedback

Page 5: Review

Basic Concepts

Feature selection Separate classes, Low correlation

Model selection Model type, Model order

Prior knowledge E.g., a priori class probability

Missing features/observations Modeling of time series

Correlation in time (model?), segmentation

Page 6: Review

PatReco: Algorithms

Parametric vs non-parametric

Supervised vs unsupervised

Basic algorithms: Bayesian, non-parametric, discriminant functions, non-metric methods

Page 7: Review

PatReco: Algorithms

Bayesian methods: formulation (describe class characteristics), Bayes classifier, maximum likelihood estimation, Bayesian learning, Expectation-Maximization, Markov models, hidden Markov models, Bayesian nets

Non-parametric: Parzen windows, nearest neighbour

Page 8: Review

PatReco: Algorithms

Discriminant functions: formulation (describe boundary); learning: gradient descent, perceptron, MSE = minimum squared error, LMS = least mean squares, neural net generalizations, support vector machines

Non-metric methods: classification and regression trees, string matching

Page 9: Review

PatReco: Algorithms

Unsupervised learning: mixture of Gaussians, K-means

Other (not covered): multi-layered neural nets, stochastic learning (simulated annealing), genetic algorithms, fuzzy algorithms, etc.

Page 10: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 11: Review
Page 12: Review
Page 13: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 14: Review
Page 15: Review
Page 16: Review
Page 17: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 18: Review
Page 19: Review
Page 20: Review
Page 21: Review
Page 22: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 23: Review

Evaluation

Training data set: 1234 examples of class 1 and class 2

Testing/evaluation data set: 134 examples of class 1 and class 2

Misclassification error rate: training 11.61% (150 errors), testing 13.43% (18 errors)

Correct for chance (training 22%, testing 26%). Why?

Page 24: Review

PatReco: Discriminant Functions for Gaussians

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 25: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 26: Review

Discriminant Functions

Define class boundaries (instead of class characteristics)

Dualism:

Parametric class description ↔ Bayes classifier

Decision boundary ↔ Parametric discriminant functions

Page 27: Review

Normal Density

1-D and multi-D

Full covariance, diagonal covariance, diagonal covariance + univariate

Mixture of Gaussians (usually diagonal covariance)

Page 28: Review
Page 29: Review

Gaussian Discriminant Functions

Same variance for ALL classes → hyper-planes

Different variance among classes → hyper-quadratics (hyper-parabolas, hyper-ellipses, etc.)

Page 30: Review
Page 31: Review
Page 32: Review

Hyper-Planes

When the covariance matrix is common across the Gaussian classes, the decision boundary is a hyper-plane perpendicular to the line connecting the means of the Gaussian distributions

If the a priori probabilities of the classes are equal, the hyper-plane cuts the line connecting the Gaussian means in the middle (Euclidean classifier)
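As an illustration of this case, here is a minimal NumPy sketch (all numerical values are assumed, not from the slides): for two Gaussian classes with the same isotropic covariance σ²I and equal priors, the boundary is the hyper-plane w·x + w0 = 0, perpendicular to μ1 − μ2 and passing through the midpoint of the means.

    import numpy as np

    # Two Gaussian classes with the SAME isotropic covariance sigma^2 * I and
    # equal priors (all numbers are illustrative assumptions).
    mu1 = np.array([0.0, 0.0])
    mu2 = np.array([3.0, 1.0])
    sigma2 = 1.5

    # Boundary: w.x + w0 = 0 with w = (mu1 - mu2) / sigma^2, so the hyper-plane
    # is perpendicular to the line connecting the means; for equal priors it
    # passes through the midpoint x0 = (mu1 + mu2) / 2 (Euclidean classifier).
    w = (mu1 - mu2) / sigma2
    x0 = 0.5 * (mu1 + mu2)
    w0 = -w @ x0

    def g(x):
        """Linear discriminant: positive -> class 1, negative -> class 2."""
        return float(w @ x + w0)

    print("g(mu1)      =", g(mu1))   # > 0
    print("g(mu2)      =", g(mu2))   # < 0
    print("g(midpoint) =", g(x0))    # ~ 0: the plane cuts the segment in the middle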

Page 33: Review

Gaussian Discriminant Functions

Same variance for ALL classes → hyper-planes

Different variance among classes → hyper-quadratics (hyper-parabolas, hyper-ellipses, etc.)

Page 34: Review
Page 35: Review
Page 36: Review
Page 37: Review
Page 38: Review

Hyper-Quadratics

When the Gaussian class variances are different, the boundary can be a hyper-plane, multiple hyper-planes, a hyper-sphere, a hyper-parabola, a hyper-ellipsoid, etc.

In general the boundary is NOT perpendicular to the line connecting the Gaussian means

If the a priori probabilities of the classes are equal, the resulting classifier is a Mahalanobis classifier

Page 39: Review

Conclusions

Parametric statistical models describe class characteristics x by modeling the observation probabilities p(x|class)

Discriminant functions describe class boundaries parametrically

Parametric statistical models have an equivalent parametric discriminant function

For Gaussian p(x|class) distributions the decision boundaries are hyper-planes or hyper-quadratics

Page 40: Review

PatReco: Detection

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 41: Review

Detection

Goal: detect an event. Possible outcomes: hit (success), false alarm, miss (failure), false reject

Page 42: Review
Page 43: Review
Page 44: Review

PatReco: Estimation/Training

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 45: Review

Estimation/Training

Goal: given observed data, (re-)estimate the parameters of the model, e.g., for a Gaussian model estimate the mean and variance for each class

Page 46: Review

Supervised-Unsupervised

Supervised training: All data has been (manually) labeled, i.e., assigned to classes

Unsupervised training: Data is not assigned a class label

Page 47: Review

Observable data

Fully observed data: all information necessary for training is available (features, class labels etc.)

Partially observed data: some of the features or some of the class labels are missing

Page 48: Review

Supervised Training (fully observable data)

Maximum likelihood estimation (ML)

Maximum a posteriori estimation (MAP)

Bayesian estimation (BE)

Page 49: Review

Training process

The collected training data consists of the examples D = {x1, x2, … xN}

Step 1: Label each example with the corresponding class label ω1, ω2, ... ωK

Step 2: For each class separately, estimate the model parameters using ML, MAP, or BE and the corresponding training examples D1, D2, ... DK

Page 50: Review

Training Process: Step 1

D = {x1, x2, x3, x4, x5, … xN}

Label manually with ω1, ω2, ... ωK

D1 = {x11, x12, x13, … x1N1}
D2 = {x21, x22, x23, … x2N2}
…
DK = {xK1, xK2, xK3, … xKNK}

Page 51: Review

Training Process: Step 2

Maximum likelihood: θ1 = argmaxθ P(D1|θ1)

Maximum a posteriori: θ1 = argmaxθ P(D1|θ1) P(θ1)

Bayesian estimation: P(x|ω1) = ∫ P(x|θ1) P(θ1|D1) dθ1

Page 52: Review

ML Estimation Assumptions

1. P(x|ωi) follows a parametric distribution with parameters θ

2. Dj tells us nothing about P(x|ωi) (functional independence)

3. Observations x1, x2, x3, … xN are iid (independent, identically distributed)

4a (ML only!) θ is a quantity whose value is fixed but unknown

Page 53: Review

ML estimation

θ = argmaxθ P(θ|D) = argmaxθ P(D|θ) P(θ)

=(4) argmaxθ P(D|θ) = argmaxθ P(x1, x2, … xN|θ) =(3) argmaxθ Πj P(xj|θ)

=> ∂[Πj P(xj|θ)] / ∂θ = 0 => θ = …

(the equalities marked (3) and (4) use assumptions 3 and 4)

Page 54: Review

ML estimate for Gaussian pdf

If P(x|ω) = N(μ, σ²) and θ = (μ, σ²), then in 1-D:

μ = (1/N) Σj=1..N xj

σ² = (1/N) Σj=1..N (xj − μ)²

Multi-D, θ = (μ, Σ):

μ = (1/N) Σj=1..N xj

Σ = (1/N) Σj=1..N (xj − μ)(xj − μ)^T
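A small NumPy sketch of these ML formulas on synthetic data (the data set and its true parameters are assumed for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic multi-D training data for one class (assumed for illustration).
    true_mu = np.array([1.0, -2.0])
    true_Sigma = np.array([[1.0, 0.3],
                           [0.3, 0.5]])
    X = rng.multivariate_normal(true_mu, true_Sigma, size=1000)   # N x d

    N = X.shape[0]
    mu_ml = X.mean(axis=0)              # mu = (1/N) sum_j x_j
    diff = X - mu_ml
    Sigma_ml = (diff.T @ diff) / N      # Sigma = (1/N) sum_j (x_j - mu)(x_j - mu)^T

    print("ML mean:", mu_ml)
    print("ML covariance:\n", Sigma_ml)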

Page 55: Review

Bayesian Estimation Assumptions

1. P(x|ωi) follows a parametric distribution with parameters θ

2. Dj tells us nothing about P(x|ωi) (functional independence)

3. Observations x1, x2, x3, … xN are iid (independent identically distributed)

4b (MAP, BE) θ is a random variable whose prior distribution p(θ) is known

Page 56: Review

Bayesian Estimation

P(x|D) = ∫ P(x,θ|D) dθ = ∫ P(x|θ,D) P(θ|D) dθ = ∫ P(x|θ) P(θ|D) dθ

STEP 1: P(θ) → P(θ|D), using P(θ|D) = P(D|θ) P(θ) / P(D)

STEP 2: P(x|θ) → P(x|D)

Page 57: Review

Bayesian Estimate for Gaussian pdf and priors

If P(x|θ) = N(μ, σ²) and p(θ) = N(μ0, σ0²) then

STEP 1: P(θ|D) = N(μn, σn²)

STEP 2: P(x|D) = N(μn, σ² + σn²)

μn = [σ0² / (n σ0² + σ²)] (Σj xj) + [σ² / (n σ0² + σ²)] μ0

σn² = σ² σ0² / (n σ0² + σ²)

For a large number n of training samples, maximum likelihood and Bayesian estimation are equivalent!!!
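A short sketch of these update formulas in 1-D (synthetic data and an assumed prior), also illustrating the convergence towards the ML estimate as n grows:

    import numpy as np

    rng = np.random.default_rng(1)

    sigma2 = 4.0               # known data variance sigma^2 (assumed)
    mu0, sigma0_2 = 0.0, 1.0   # prior p(theta) = N(mu0, sigma0^2) (assumed)
    true_mean = 2.5

    for n in (5, 50, 5000):
        x = rng.normal(true_mean, np.sqrt(sigma2), size=n)
        s = x.sum()
        mu_n = (sigma0_2 / (n * sigma0_2 + sigma2)) * s \
             + (sigma2   / (n * sigma0_2 + sigma2)) * mu0
        sigma_n_2 = sigma2 * sigma0_2 / (n * sigma0_2 + sigma2)
        mu_ml = x.mean()
        # For large n the Bayesian estimate mu_n approaches the ML estimate.
        print(f"n={n:5d}  mu_ML={mu_ml:.3f}  mu_n={mu_n:.3f}  sigma_n^2={sigma_n_2:.4f}")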

Page 58: Review

Conclusions

Maximum likelihood estimation is simple and gives good estimates when the number of training samples is large

Bayesian adaptation gives good estimates even for small amounts of training data provided that a good prior is selected

Bayesian adaptation is hard and often does not have a closed form solution (in which case try: iterative recursive Bayesian estimation)

Page 59: Review

PatReco: Model and Feature Selection

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 60: Review

Breakdown of Classification Error

Bayes error, model selection error, model estimation error, data mismatch error (training-testing)

Page 61: Review
Page 62: Review

True statements about Bayes error (valid within statistical significance)

The Bayes error is ALWAYS smaller than the total (empirical) classification error

If the model, estimation, and mismatch errors are zero, then the total classification error equals the Bayes error

The ONLY way to reduce the Bayes error is to add new features to the classifier design

Page 63: Review

More true statements

Adding new features can only reduce the Bayes error (this is not true about the total classification error!!!)

Adding new features will NOT reduce the Bayes error if the new features are:

very bad at discriminating between classes (feature pdfs overlapping)

highly correlated with existing features

Page 64: Review

Gaussian classification Bayes Error

For two classes ω1 and ω2 following Gaussian distributions with means μ1, μ2 and the same variance σ², the Bayes error is:

P(error) = [1/(2π)^0.5] ∫ from r/2 to ∞ of exp{−u²/2} du

where r = |μ1−μ2|/σ
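A quick numerical check of this formula (the values of μ1, μ2, σ are assumed): the tail integral is evaluated both in closed form via erfc and by brute-force numerical integration.

    import math

    mu1, mu2, sigma = 0.0, 2.0, 1.0     # assumed example values
    r = abs(mu1 - mu2) / sigma

    # Closed form: P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du
    #                       = 0.5 * erfc(r / (2*sqrt(2)))
    p_err = 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

    # Brute-force numerical integration of the same tail integral as a sanity check.
    du, u, acc = 1e-4, r / 2.0, 0.0
    while u < 10.0:                      # the tail beyond u = 10 is negligible
        acc += math.exp(-u * u / 2.0) * du
        u += du
    p_err_numeric = acc / math.sqrt(2.0 * math.pi)

    print("r =", r)
    print("P(error), closed form :", p_err)           # ~0.1587 for r = 2
    print("P(error), numeric     :", p_err_numeric)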

Page 65: Review

Feature Selection

If we had infinite amounts of data then The more features the better!

However in practice finite data: More features more parameters to train!!!

Good features: Uncorrelated Able to discriminate among classes

Page 66: Review

Model selection

Model order: the number of parameters that need to be estimated

Overfitting: too many parameters, too little data!!!

Model selection for Gaussian models: single Gaussians, mixture of Gaussians, fixed variance, tied variance, diagonal variance

Page 67: Review

Conclusion

Introducing more features and/or more complex models can only reduce the classification error (if infinite amounts of training data are available)

In practice, the number of features and the number of model parameters are a function of the amount of training data available (avoid overfitting!)

Good features are uncorrelated and discriminative

Page 68: Review

PatReco: Expectation Maximization

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 69: Review

When do we use EM?

Partially observable data:

Missing some features from some samples, e.g., D = {(1,2), (2,3), (?,4)}

Missing class labels, e.g., hidden states of HMMs

Missing class sub-labels, e.g., mixture label for mixture of Gaussian models

Page 70: Review

The EM algorithm

The Expectation Maximization algorithm (EM) consists of alternating expectation and maximization steps

During expectation steps the “best estimates of the missing information” are computed

During the maximization step, maximum likelihood training on all data is performed

Page 71: Review

EM

Initialization: θ(0)

for i = 1..iterno    // usually iterno = 2 or 3

E step: Q(θ; θ(i−1)) = E_Dbad{ log p(D; θ) | x, θ(i−1) }

M step: θ(i) = argmax_θ { Q(θ; θ(i−1)) }

end

Page 72: Review

Pseudo-EM

Initialization: θ(0)

for i = 1..iterno    // usually iterno = 2 or 3

Expectation step: D̂bad = E{ Dbad | θ(i−1) }

Maximization step: θ(i) = argmax_θ { p(D | θ) }

end
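A toy pseudo-EM sketch in the spirit of the loop above, for 1-D Gaussian data with missing values (the data set is assumed): missing samples are filled with their expected value under the current estimate, then the mean and variance are re-estimated by ML on the completed data.

    # 1-D training data; None marks a missing observation (toy example, assumed).
    data = [1.8, 2.2, None, 2.9, 3.1, None, 2.4]

    observed = [x for x in data if x is not None]
    mu = sum(observed) / len(observed)      # theta(0) from the observed data only
    var = sum((x - mu) ** 2 for x in observed) / len(observed)

    for i in range(3):                      # usually 2 or 3 iterations
        # "Expectation" step: fill each missing value with its expected value
        # under the current Gaussian estimate (here simply the current mean).
        completed = [x if x is not None else mu for x in data]
        # "Maximization" step: ML re-estimation on the completed data set.
        mu = sum(completed) / len(completed)
        var = sum((x - mu) ** 2 for x in completed) / len(completed)
        print(f"iter {i + 1}: mu = {mu:.3f}  var = {var:.3f}")   # converges quickly here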

Page 73: Review

Convergence

EM is guaranteed to converge to a local optimum (NOT the global optimum!)

Pseudo-EM has no convergence guarantees but is used often in practice

Page 74: Review

Conclusions

EM is an iterative algorithm used when there are missing or partially observable training data

EM is a generalization of ML training

EM is guaranteed to converge to a local optimum (NOT the global optimum!)

Page 75: Review

PatReco: Bayesian Networks

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 76: Review

Definitions

Bayesian networks consist of nodes and (usually directional) arcs

Nodes or states represent a classification class or in general events and are described with a pdf

Arcs represent relations between nodes, e.g., cause and effect, time sequence

Two nodes that are connected via another node are conditionally independent (given that node)

Page 77: Review

When to use Bayesian nets

Bayesian networks (or networks of inference) are statistical models that are used for classification (or in general pattern recognition) problems where there are dependencies among classes, e.g., time dependencies, cause and effect dependencies

Page 78: Review

Conditional Independence

Full independence between A and B: P(A|B) = P(A), or P(A,B) = P(A) P(B)

Conditional independence of A, B given C: P(A|B,C) = P(A|C), or P(A,B|C) = P(A|C) P(B|C)

Page 79: Review

Conditional Independence

A, C independent given B: P(C|B,A) = P(C|B)

B, C independent given A: P(B,C|A) = P(B|A) P(C|A)

A, C dependent given B: P(A,C|B) cannot be reduced!

[Diagrams of the three corresponding network topologies over the nodes A, B, C]

Page 80: Review

Three problems

1. Probability computation (use independence)

2. Training/parameter estimation: maximum likelihood (ML) if everything is observable, expectation maximization (EM) if data are missing

3. Inference (testing): diagnosis P(cause|effect), bottom-up; prediction P(effect|cause), top-down

Page 81: Review

Probability Computation

For a Bayesian network that consists of N nodes:

1. Compute P(n1, n2, .., nN) using the chain rule, starting from the “last/bottom” node and working your way up:

P(n1, n2, .., nN) = P(nN|n1, n2, .., nN−1) P(nN−1|n1, n2, .., nN−2) … P(n2|n1) P(n1)

2. Identify conditional independence conditions from the Bayesian network topology

3. Simplify the conditional probabilities using the independence conditions

Page 82: Review

Probability Computation

Topology: [Bayesian network over the nodes C, S, R, W with arcs C → S, C → R, S → W, R → W]

P(C,S,R,W) = P(W|C,S,R) P(S|C,R) P(R|C) P(C)

Independent: (W,C)|S,R and (S,R)|C

Dependent: (S,R)|W

P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)
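A minimal sketch of this factorization with binary variables; the conditional probability tables below are made-up illustrative numbers, not values from the course.

    # Joint pdf of the network: P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C),
    # for binary variables. All CPT numbers below are hypothetical.
    P_C1 = 0.5                                     # P(C = 1)
    P_S1_given_C = {1: 0.1, 0: 0.5}                # P(S = 1 | C)
    P_R1_given_C = {1: 0.8, 0: 0.2}                # P(R = 1 | C)
    P_W1_given_SR = {(1, 1): 0.99, (1, 0): 0.9,    # P(W = 1 | S, R)
                     (0, 1): 0.9,  (0, 0): 0.0}

    def bern(p1, value):
        """P(X = value) for a binary variable with P(X = 1) = p1."""
        return p1 if value == 1 else 1.0 - p1

    def joint(c, s, r, w):
        """P(C=c, S=s, R=r, W=w) via the simplified factorization."""
        return (bern(P_W1_given_SR[(s, r)], w) *
                bern(P_S1_given_C[c], s) *
                bern(P_R1_given_C[c], r) *
                bern(P_C1, c))

    print("P(C=1, S=0, R=1, W=1) =", joint(1, 0, 1, 1))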

Page 83: Review

Probability Computation

There are general algorithms for identifying cliques in the Bayesian net

Cliques are islands of conditional dependence, i.e., terms in the probability computation that cannot be further reduced

[Clique diagram for the example network: cliques {S,C}, {R,C}, {W,S,R}]

Page 84: Review

Training/Parameter Estimation

Instead of estimating the joint pdf of the whole network the joint pdf of each of the cliques is estimated

For example if the network joint pdf is

P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)

instead of computing P(C,S,R,W) we compute each of P(W|S,R), P(S|C), P(R|C), P(C) for all possible values of W, S, R, C (much simpler)

Page 85: Review

Training/Parameter Estimation

For fully observable data and discrete probabilities compute maximum likelihood estimates of parameters, e.g., for discrete probs

P(W=1|S=1,R=0)_ML = counts(W=1, S=1, R=0) / counts(W=*, S=1, R=0)

Page 86: Review

Training/Parameter Estimation

Example: the following observations are given for (W,C,S,R): (1,0,1,0), (0,0,1,0), (1,1,1,0), (0,1,1,0), (1,0,1,0), (0,1,0,0), (1,0,0,1), (0,1,1,1), (1,1,1,0)

Using maximum likelihood estimation:

P(W=1|S=1,R=0)_ML = #(1, *, 1, 0) / #(*, *, 1, 0) = 2/5 = 0.4

Page 87: Review

Training/Parameter Estimation

When data is non observable or missing the EM algorithm is employed

There are efficient implementations of the EM algorithm for Bayesian nets that operate on the clique network

When the topology of the Bayesian network is not known structural EM can be used

Page 88: Review

Inference

There are two types of inference (testing): diagnosis P(cause|effect), bottom-up, and prediction P(effect|cause), top-down

Once the parameters of the network are estimated the joint network pdf can be estimated for ALL possible network values

Inference is simply probability computation using the network pdf

Page 89: Review

Inference

For example

P(W=1|C=1) = P(W=1,C=1) / P(C=1)

where

P(W=1,C=1) = Σ_{R,S} P(W=1, C=1, R, S)

P(C=1) = Σ_{W,R,S} P(W, C=1, R, S)
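Continuing the earlier sketch (same hypothetical CPTs), the conditional P(W=1|C=1) can be computed by brute-force marginalization of the network pdf:

    from itertools import product

    # Same hypothetical CPTs as in the earlier sketch (made-up numbers).
    P_C1 = 0.5
    P_S1_given_C = {1: 0.1, 0: 0.5}
    P_R1_given_C = {1: 0.8, 0: 0.2}
    P_W1_given_SR = {(1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.9, (0, 0): 0.0}

    def bern(p1, v):
        return p1 if v == 1 else 1.0 - p1

    def joint(c, s, r, w):
        """P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)."""
        return (bern(P_W1_given_SR[(s, r)], w) * bern(P_S1_given_C[c], s) *
                bern(P_R1_given_C[c], r) * bern(P_C1, c))

    # P(W=1, C=1): sum the joint over R and S; P(C=1): sum over W, R and S.
    p_w1_c1 = sum(joint(1, s, r, 1) for s, r in product((0, 1), repeat=2))
    p_c1 = sum(joint(1, s, r, w) for s, r, w in product((0, 1), repeat=3))

    print("P(W=1 | C=1) =", p_w1_c1 / p_c1)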

Page 90: Review

Inference

Efficient algorithms exist for performing inference in large networks which operate on the clique network

Inference is often shown as a probability maximization problem, e.g., what is the most probable cause or effect?

argmaxW P(W|C=1)

Page 91: Review

Continuous Case

In our examples the network nodes represented discrete events (states or classes)

Network nodes often hold continuous variables (observations), e.g., length, energy

For the continuous case, parametric pdfs are introduced and their parameters are estimated using ML (observed) or EM (hidden)

Page 92: Review

Some Applications

Medical diagnosis, computer problem diagnosis (MS), Markov chains, hidden Markov models (HMMs)

Page 93: Review

Conclusions

Bayesian networks are used to represent dependencies between classes

Network topology defines conditional independence conditions that simplify the network pdf modeling and computation

Three problems: probability computation, estimation/training, inference/testing

Page 94: Review

PatReco: Hidden Markov Models

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 95: Review

Markov Models: Definition

Markov chains are Bayesian networks that model sequences of events (states)

Sequential events are dependent

Two non-sequential events are conditionally independent given the intermediate events (MM-1)

Page 96: Review

Markov chains

[Graphical models of a state sequence q0, q1, q2, q3, q4 for MM-0, MM-1, MM-2 and MM-3, with arcs showing each state's dependence on 0, 1, 2 or 3 previous states]

Page 97: Review

Markov Chains

MM-0: P(q1, q2, .., qN) = Π_{n=1..N} P(qn)

MM-1: P(q1, q2, .., qN) = Π_{n=1..N} P(qn|qn−1)

MM-2: P(q1, q2, .., qN) = Π_{n=1..N} P(qn|qn−1, qn−2)

MM-3: P(q1, q2, .., qN) = Π_{n=1..N} P(qn|qn−1, qn−2, qn−3)
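A tiny sketch of the MM-1 case (the initial and transition probabilities are assumed illustrative values):

    # MM-1: P(q1,...,qN) = P(q1) * prod_n P(q_n | q_{n-1}),
    # where P(q1) plays the role of P(q1|q0). Two states (0 and 1); hypothetical numbers.
    P_init = [0.6, 0.4]                 # P(q1)
    P_trans = [[0.7, 0.3],              # P(q_n = j | q_{n-1} = i) = P_trans[i][j]
               [0.4, 0.6]]

    def mm1_probability(states):
        p = P_init[states[0]]
        for prev, cur in zip(states, states[1:]):
            p *= P_trans[prev][cur]
        return p

    print(mm1_probability([0, 0, 1, 1, 0]))   # 0.6 * 0.7 * 0.3 * 0.6 * 0.4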

Page 98: Review

Hidden Markov Models

Hidden Markov chains model sequences of events and corresponding sequences of observations

Events form a Markov chain (MM-1)

Observations are conditionally independent given the sequence of events

Each observation is directly connected with a single event (and is conditionally independent of the rest of the events in the network)

Page 99: Review

Hidden Markov Models

[Graphical model: state chain q0 → q1 → q2 → q3 → q4 → …, with each observation on attached to its state qn]

P(o0, o1, .., oN, q0, q1, .., qN) = Π_{n=0..N} P(qn|qn−1) P(on|qn)

HMM-1

Page 100: Review

Parameter Estimation

The parameters that have to be estimated are: the a priori probabilities P(q0), the transition probabilities P(qn|qn−1), and the observation probabilities P(on|qn)

For example, if there are 3 types of events and continuous 1-D observations that follow a Gaussian distribution, there are 18 parameters to estimate: 3 a priori probabilities, a 3x3 transition probability matrix, and 3 means and 3 variances (observation probabilities)

Page 101: Review

Parameter Estimation

If both the sequence of events and the sequence of observations are fully observable, then ML is used

Usually the sequence of events q0, q1, .., qN is not observable, in which case EM is used

The EM algorithm for HMMs is the Baum-Welch or forward-backward algorithm

Page 102: Review

Inference/Decoding

The main inference problem for HMMs is known as the decoding problem: given a sequence of observations find the best sequence of states:

q = argmaxq P(q|O) = argmaxq P(q,O)

An efficient decoding algorithm is the Viterbi algorithm

Page 103: Review

Viterbi algorithm

maxq P(q,O) = maxq P(o0, o1, .., oN, q0, q1, .., qN) = maxq Π_{n=0..N} P(qn|qn−1) P(on|qn)

= max_qN { P(oN|qN) max_qN−1 { P(qN|qN−1) P(oN−1|qN−1) … max_q2 { P(q3|q2) P(o2|q2) max_q1 { P(q2|q1) P(o1|q1) max_q0 { P(q1|q0) P(o0|q0) P(q0) } } } … } }

Page 104: Review

Viterbi algorithm

[Trellis diagram: states 1, 2, 3, 4, …, K (vertical axis) plotted against time (horizontal axis)]

At each node keep only the best (most probable) path from all the paths passing through that node
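A compact Viterbi sketch for a discrete-observation HMM (all model parameters below are assumed for illustration); at each time step and for each state, only the most probable incoming path is kept, exactly as described above.

    import numpy as np

    # Discrete-observation HMM with K states; all numbers are hypothetical.
    K = 3
    pi = np.array([0.5, 0.3, 0.2])              # a priori probabilities P(q0)
    A = np.array([[0.6, 0.3, 0.1],              # transition probabilities P(q_n | q_{n-1})
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
    B = np.array([[0.7, 0.2, 0.1],              # observation probabilities P(o_n | q_n), 3 symbols
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])

    def viterbi(obs):
        T = len(obs)
        delta = np.zeros((T, K))                # best path score ending in each state
        psi = np.zeros((T, K), dtype=int)       # back-pointers
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            for j in range(K):
                scores = delta[t - 1] * A[:, j]
                psi[t, j] = np.argmax(scores)   # keep only the best path into state j
                delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
        # Backtrack the most probable state sequence.
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(delta[-1].max())

    states, score = viterbi([0, 1, 1, 2, 0])
    print("best state sequence:", states, " max_q P(q,O) =", score)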

Page 105: Review

Deep Thoughts

HMM-0 (an HMM with an MM-0 event chain) is the Bayes classifier!!!

MMs and HMMs are poor models but are simple and computationally efficient. How do you fix this? (dependent observations?)

Page 106: Review

Some Applications

Speech Recognition

Optical Character Recognition

Part-of-Speech Tagging

Page 107: Review

Conclusions

HMMs and MMs are useful modeling tools for dependent sequences of events (states or classes)

Efficient algorithms exist for training HMM parameters (Baum-Welch) and for decoding the most probable sequence of states given an observation sequence (Viterbi)

HMMs have many applications

Page 108: Review

Non Parametric Classifiers

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 109: Review

Histograms-Parzen Windows

Main idea:

Instead of selecting a parametric distribution (e.g., Gaussian) to describe the properties of the features of a class, compute the empirical distribution directly (class feature histogram)

Page 110: Review

Feature Histogram Example

[Histogram of feature X: number of samples in each bin]

Normalize the histogram curve to get the feature PDF

Page 111: Review

Parzen Windows: Issues

When compared to parametric methods, empirical distributions are: better, because no specific form of the PDF is assumed; worse, because over-fitting can easily occur (too small a histogram bin)

Parzen proposed rules for adapting the bin size based on the number of samples in each bin, to avoid over-fitting

Page 112: Review

Nearest Neighbor Rule

Main idea (1-NNR): no explicit model (i.e., no training)

For each test sample x, the “nearest” sample x’ in the training set is found, i.e., argmin_x’ d(x, x’), and x is classified to the class where x’ belongs

Page 113: Review

Generalizations

k-NNR: Instead of finding the nearest neighbor, we find the k nearest neighbors from the training set; the sample x is classified to the class where most of the k neighbors belong

k-l-NNR: Like k-NNR, but at least l of the k nearest neighbors must belong to the same class for a classification decision to be taken (else no decision)

Page 114: Review

Example

Training set: D1 = {0, −1, −2} and D2 = {1, 1, 1}

[Number line from −2 to 3 marking the 1-NNR decision boundary, the 3-NNR decision boundary, and the 3-3-NNR no-decision region]
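A small sketch of the NNR rules applied to this training set (the test points are chosen for illustration):

    # Training set from the example: D1 = {0, -1, -2} (class 1), D2 = {1, 1, 1} (class 2).
    train = [(0.0, 1), (-1.0, 1), (-2.0, 1), (1.0, 2), (1.0, 2), (1.0, 2)]

    def knn_classify(x, k, l=None):
        """k-NNR: majority vote among the k nearest training samples.
        If l is given (k-l-NNR), return None unless at least l of the k
        neighbors agree on the winning class."""
        neighbors = sorted(train, key=lambda s: abs(s[0] - x))[:k]
        votes = {}
        for _, label in neighbors:
            votes[label] = votes.get(label, 0) + 1
        label, count = max(votes.items(), key=lambda kv: kv[1])
        if l is not None and count < l:
            return None          # no decision
        return label

    for x in (-0.3, 0.3, 0.7):
        print(f"x={x:+.1f}  1-NNR -> {knn_classify(x, 1)}  "
              f"3-NNR -> {knn_classify(x, 3)}  3-3-NNR -> {knn_classify(x, 3, l=3)}")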

Page 115: Review

Computational Efficiency

To speed up NNR classification the training set size can be reduced using the condensing algorithm:

The training set is classified using the NNR rule; misclassified samples are added to the new (condensed) training set one by one until all training samples are correctly classified

Page 116: Review

Conclusions

Non-parametric classification algorithms: are easy to implement, are computationally efficient (in training), don’t make any assumptions, are prone to over-fitting, and are hard to adapt (no detailed model)

Page 117: Review

Discriminant Functions

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 118: Review

Discriminant Functions

Main Idea:

Describe the decision boundary parametrically (instead of the properties of the class), e.g., the two classes are separated by a straight line a x1 + b x2 + c = 0 with parameters (a, b, c), instead of the feature PDFs being 2-D Gaussians

Page 119: Review

Example: Two classes, two features

[Two plots of classes w1 and w2 in the (x1, x2) feature space: on the left the class boundary a x1 + b x2 + c = 0 is modeled directly; on the right the class characteristics are modeled by Gaussians N(μ1, Σ1) and N(μ2, Σ2)]

Model class boundary vs model class characteristics

Page 120: Review

Duality

Dualism:

Parametric class description ↔ Bayes classifier

Decision boundary ↔ Parametric discriminant functions

For example modeling class features by Gaussians with same (across-class) variance results in hyper-plane discriminant functions

Page 121: Review
Page 122: Review

Discriminant Functions

Discriminant functions gi(x) are functions of the features x of a class i

A sample x is classified to the class c for which gi(x) is maximized, i.e., c = argmax_i { gi(x) }

The equation gi(x) = gj(x) defines the class boundary for each pair of (different) classes i and j

Page 123: Review

Linear Discriminant Functions

Two-class problem: a single discriminant function is defined as g(x) = g1(x) − g2(x)

If g(x) is a linear function, g(x) = w^T x + w0, then the boundary is a hyper-plane (a point, line, or plane for 1-D, 2-D, 3-D features respectively)

Page 124: Review

Linear Discriminant Functions

[Plot of the line a x1 + b x2 + c = 0 in the (x1, x2) plane, with normal vector w = (a, b) and axis intercepts −c/a and −c/b]

Page 125: Review

Non Linear Discriminant Functions

Quadratic discriminant functions:

g(x) = w0 + Σi wi xi + Σij wij xi xj

for example, for a two-class 2-D problem:

g(x) = a + b x1 + c x2 + d x1²

Any non-linear discriminant function can become linear by increasing the dimensionality, e.g., y1 = x1, y2 = x2, y3 = x1² (2-D non-linear → 3-D linear):

g(y) = a + b y1 + c y2 + d y3

Page 126: Review

Parameter Estimation

The parameters w are estimated by functional minimization

The function J to be minimized models the average distance of training samples from the decision boundary, computed over either the misclassified training samples only or all training samples

The function J is minimized using gradient descent

Page 127: Review

Gradient Descent

Iterative procedure towards a local minimum

a(k+1) = a(k) − η(k) ∇J(a(k))

where k is the iteration number, η(k) is the learning rate and ∇J(a(k)) is the gradient of the function to be minimized, evaluated at a(k)

Newton descent is the gradient descent with learning rate equal to the inverse Hessian matrix

Page 128: Review

Distance Functions

Perceptron criterion function:

Jp(a) = Σ_{misclassified} (−a^T y)

Relaxation with margin b:

Jr(a) = Σ_{misclassified} (a^T y − b)² / ||y||²

Least mean squares (LMS):

Js(a) = Σ_{all samples} (a^T yi − bi)²

Ho-Kashyap rule:

Js(a,b) = Σ_{all samples} (a^T yi − bi)²
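A minimal sketch of gradient descent on the perceptron criterion Jp (the 2-D toy data and the learning rate are assumed): since ∇Jp(a) = −Σ_{misclassified} y, the update adds the misclassified (augmented, sign-normalized) samples to a.

    import numpy as np

    # Toy 2-D training data (assumed): class 1 and class 2, linearly separable.
    X1 = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 1.5]])
    X2 = np.array([[-1.0, -0.5], [-2.0, -1.0], [-0.5, -2.0]])

    # Augment with a constant 1 (bias) and negate class-2 samples, so that a
    # sample y is correctly classified when a^T y > 0.
    Y = np.vstack([np.hstack([X1, np.ones((len(X1), 1))]),
                   -np.hstack([X2, np.ones((len(X2), 1))])])

    a = np.zeros(3)
    eta = 0.1                                   # learning rate n(k), kept constant here
    for k in range(100):
        miscl = Y[Y @ a <= 0]                   # misclassified samples
        if len(miscl) == 0:
            break                               # separable set: the perceptron has converged
        # Jp(a) = sum_miscl (-a^T y)  =>  grad Jp = -sum_miscl y
        a = a - eta * (-miscl.sum(axis=0))
    print("iterations:", k, " weight vector a =", a)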

Page 129: Review

Discriminant Functions

Working on misclassified samples only (Perceptron, Relaxation with Margin) provides better results but converges only for separable training sets

Page 130: Review

High Dimensionality

Using non-linear discriminant functions and linearizing them in a high-dimensional space can make ANY training set separable

→ large number of parameters (curse of dimensionality)

Support vector machines: a smart way to select appropriate terms (dimensions) is needed

Page 131: Review

Non-Metric Methods: Decision Trees

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 132: Review

Decision Trees

Motivation: There are features (discrete) that don’t have an obvious notion of similarity or ordering (nominal data), e.g., book type, shape, sound type

Taxonomies (i.e., trees with is-a relationship) are the oldest form of classification

Page 133: Review

Decision Trees: Definition

Decision Trees are classifiers that classify samples based on a set of questions that are asked hierarchically (tree of questions)

Example questions: is color red? is x < 0.5?

Terminology: root, leaf, node, arc, branch, parent, children, branching factor, depth

Page 134: Review

Fruit classifier

[Decision tree diagram: the root question Color? branches to green / yellow / red; second-level questions are Size?, Shape?, Size?; further questions include Size? and Taste?, with answers such as big / med / small, round / thin, sweet / sour]

Page 135: Review

Fruit classification

[The same tree; the following slides highlight the traversal of one test sample question by question, ending at the leaf CHERRY]

Page 136: Review
Page 137: Review
Page 138: Review

Page 139: Review

Fruit classifier

[The same decision tree with class labels at the leaves: watermelon, grape, grapefruit, cherry, grape]

Page 140: Review

Binary Trees

Binary trees: each parent node has exactly two children nodes (branching factor = 2)

Any tree can be represented as a binary tree by changing the set of questions and increasing the tree depth,

e.g., the three-way question Color? (green / yellow / red) becomes two binary questions: Color = green? (Y/N) and, if N, Color = yellow? (Y/N)

Page 141: Review

Decision Trees: Problems

1. List of questions (features): all possible questions are considered

2. Which question to split on first (best split): the questions that split the data best (reduce impurity at each node) are asked first

3. Stopping criterion (pruning criterion): stop when further splits don’t reduce impurity

Page 142: Review

Best Split example

Two-class problem with 100 examples from each of w1 and w2

Three binary questions Q1, Q2 and Q3 split the data as follows:

Q1:  Node 1: (50,50)   Node 2: (50,50)

Q2:  Node 1: (100,0)   Node 2: (0,100)

Q3:  Node 1: (80,0)    Node 2: (20,100)

Page 143: Review

Impurity Measures

Impurity measures the degree of homogeneity of a node; a node is pure if it consists of training examples from a single class

Entropy impurity: i(N) = −Σi P(wi) log2 P(wi)

Variance (two-class): i(N) = P(w1) P(w2)

Gini impurity: i(N) = 1 − Σi P²(wi)

Misclassification: i(N) = 1 − maxi P(wi)

Page 144: Review

Total Impurity

Total Impurity at Depth 0:

i(depth =0) = i(N)

Total Impurity at Depth 1:

i(depth =1) = p(NL ) i(NL) + p(NR ) i(NR)

[Diagram: node N at depth 0 splits (yes / no) into children NL and NR at depth 1]

Page 145: Review

Impurity Example

Node 1: (80,0)   Node 2: (20,100)

I(node 1) = 0

I(node 2) = −(20/120) log2(20/120) − (100/120) log2(100/120) = 0.65

P(node 1) = 80/200 = 0.4

P(node 2) = 120/200 = 0.6

I(total) = P(node 1) I(node 1) + P(node 2) I(node 2) = 0 + 0.6 × 0.65 = 0.39
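The same computation in a few lines of Python, applied to all three candidate splits of the earlier best-split example (200 training samples in total):

    import math

    def entropy_impurity(counts):
        """i(N) = -sum_i P(w_i) log2 P(w_i) for a node with per-class counts."""
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def total_impurity(left, right):
        """Weighted impurity after a binary split: p(NL) i(NL) + p(NR) i(NR)."""
        n = sum(left) + sum(right)
        return (sum(left) / n) * entropy_impurity(left) + \
               (sum(right) / n) * entropy_impurity(right)

    splits = {"Q1": ((50, 50), (50, 50)),
              "Q2": ((100, 0), (0, 100)),
              "Q3": ((80, 0), (20, 100))}
    for name, (left, right) in splits.items():
        print(name, "total impurity = %.2f" % total_impurity(left, right))
    # Q1 -> 1.00 (useless split), Q2 -> 0.00 (perfect split), Q3 -> 0.39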

Page 146: Review

Continuous Example

For continuous features: questions are of the type x<a where x is the feature and a is a constant

Decision Boundaries (two class, 2-D example):

[2-D example: axis-parallel decision boundaries of the form x < a partition the (x1, x2) plane into rectangular regions labeled R1 and R2]

Page 147: Review

Summary

Decision trees are useful categorical classification tools especially for nominal (non-metric) data

CART creates trees that minimize impurity on the training set at each node

Decision region shape

CART is a useful tool for feature selection

Page 148: Review

Unsupervised Training and Clustering

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 149: Review

Unsupervised Training

Definition: The training set samples are unlabelled (unclassified)

Motivation: labeling is hard/time consuming; fully automatic adaptation of models (in the field)

Page 150: Review

Maximum Likelihood Training

Given: N training examples drawn from c classes, i.e., D = {x1, x2, … xN} (no class assignments are given!)

Estimate: class priors p(wi) and feature PDF parameters θi: p(x|θi, wi)

Sometimes the number of classes c is not given and also has to be estimated

Page 151: Review

Unsupervised ML estimation

Σk P(wi|xk, θ) ∇θi log p(xk|wi, θi) = 0

Compared with supervised ML: additional term P(wi|xk,θ)

P(wi|xk,θ) class membership function for each sample xk

Unsupervised ML is a version of EM

Pseudo-EM: P(wi|xk,θ) is binary 0 or 1

Page 152: Review

Mixture of Gaussians Estimates

Linear combination of Gaussians with weights ai:

p(xk) = Σi ai N(xk; μi, Σi)

ML estimates:

ai = (1/N) Σk P(wi|xk)

μi = Σk P(wi|xk) xk / Σk P(wi|xk)

Σi = Σk P(wi|xk) (xk − μi)(xk − μi)^T / Σk P(wi|xk)
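A compact sketch of these update equations for a 1-D, two-component mixture (synthetic data assumed); the membership term P(wi|xk) is recomputed before each update, which is exactly the EM loop for a mixture of Gaussians:

    import numpy as np

    rng = np.random.default_rng(2)
    # Synthetic 1-D data from two Gaussians (assumed for illustration).
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])
    N = len(x)

    # Initial parameter guesses: weights a_i, means mu_i, variances var_i.
    a = np.array([0.5, 0.5])
    mu = np.array([-1.0, 1.0])
    var = np.array([1.0, 1.0])

    def gauss(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    for it in range(20):
        # Membership: P(w_i | x_k) proportional to a_i N(x_k; mu_i, var_i).
        resp = np.stack([a[i] * gauss(x, mu[i], var[i]) for i in range(2)])
        resp /= resp.sum(axis=0, keepdims=True)
        # ML updates, following the formulas on the slide:
        a = resp.sum(axis=1) / N
        mu = (resp * x).sum(axis=1) / resp.sum(axis=1)
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / resp.sum(axis=1)

    print("weights:", a.round(3))
    print("means:", mu.round(3))
    print("variances:", var.round(3))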

Page 153: Review

Clustering

Basic Isodata:

1. Select an initial partition of the data into c classes and compute the cluster means

2. Classify the training samples using a classification criterion (Euclidean distance)

3. Recompute the cluster means based on the training-set classification decisions

4. If there is no change in the sample means, stop; else go to step 2
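A minimal NumPy sketch of this loop (synthetic 2-D data and c = 2 are assumed):

    import numpy as np

    rng = np.random.default_rng(3)
    # Synthetic unlabelled 2-D data from two groups (assumed for illustration).
    X = np.vstack([rng.normal([0, 0], 1.0, (150, 2)),
                   rng.normal([5, 4], 1.0, (150, 2))])
    c = 2

    # Step 1: initial partition -> initial cluster means (here: c random samples).
    means = X[rng.choice(len(X), size=c, replace=False)]

    while True:
        # Step 2: classify every sample to the nearest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the cluster means from the classification decisions.
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        # Step 4: stop if the means did not change, else go back to step 2.
        if np.allclose(new_means, means):
            break
        means = new_means

    print("cluster means:\n", means)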

Page 154: Review

Iterative clustering algorithms

Top-down algorithms:

Start from a single class (all data)

Split a class (e.g., based on its standard deviation)

Continue splitting the “largest” class until the desired number of clusters is reached

Bottom-up algorithms:

Start with each training sample as a different class

Merge classes (e.g., using a NNR criterion) until the desired number of classes is reached