
Page 1: Review

Review

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 2: Review

PatReco: Introduction

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 3: Review

PatReco: Applications

Speech/audio/music/sounds: speech recognition, speaker verification/identification

Image/video: OCR, AVASR, face identification, fingerprint identification, video segmentation

Text/Language: machine translation, document classification, language modeling, text understanding

Medical/Biology: disease diagnosis, DNA sequencing, gene-disease models

Other data: user modeling (books/music), linguistic analysis (web), games

Page 4: Review

Basic Concepts

Why statistical modeling?

Variability: differences between two examples of the same class in training

Mismatch: differences between two examples of the same class (one in training, one in testing)

Learning modes:

Supervised learning: class labels known

Unsupervised learning: class labels unknown

Reinforcement learning: only positive/negative feedback

Page 5: Review

Basic Concepts

Feature selection Separate classes, Low correlation

Model selection Model type, Model order

Prior knowledge E.g., a priori class probability

Missing features/observations Modeling of time series

Correlation in time (model?), segmentation

Page 6: Review

PatReco: Algorithms

Parametric vs non-parametric

Supervised vs unsupervised

Basic algorithms: Bayesian, non-parametric, discriminant functions, non-metric methods

Page 7: Review

PatReco: Algorithms

Bayesian methods: formulation (describe class characteristics), Bayes classifier, maximum likelihood estimation, Bayesian learning, Expectation-Maximization, Markov models, hidden Markov models, Bayesian nets

Non-parametric: Parzen windows, nearest neighbour

Page 8: Review

PatReco: Algorithms

Discriminant functions: formulation (describe boundary); learning: gradient descent, perceptron, MSE = minimum squared error, LMS = least mean squares, neural net generalizations, support vector machines

Non-metric methods: classification and regression trees, string matching

Page 9: Review

PatReco: Algorithms

Unsupervised learning: mixture of Gaussians, K-means

Other (not covered): multi-layered neural nets, stochastic learning (simulated annealing), genetic algorithms, fuzzy algorithms, etc.

Page 10: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 11: Review
Page 12: Review
Page 13: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 14: Review
Page 15: Review
Page 16: Review
Page 17: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 18: Review
Page 19: Review
Page 20: Review
Page 21: Review
Page 22: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 23: Review

Evaluation

Training data set: 1234 examples of class 1 and class 2

Testing/evaluation data set: 134 examples of class 1 and class 2

Misclassification error rate: training 11.61% (150 errors), testing 13.43% (18 errors)

Correct for chance (training 22%, testing 26%). Why?

Page 24: Review

PatReco: Discriminant Functions for Gaussians

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 25: Review

PatReco: Problem Solving

1. Data Collection  2. Data Analysis  3. Feature Selection  4. Model Selection  5. Model Training  6. Classification  7. Classifier Evaluation

Page 26: Review

Discriminant Functions

Define class boundaries (instead of class characteristics)

Dualism:

Parametric class description ↔ Bayes classifier

Decision boundary ↔ Parametric discriminant functions

Page 27: Review

Normal Density

1-D and multi-D

Full covariance, diagonal covariance, diagonal covariance + univariate

Mixture of Gaussians (usually diagonal covariance)

Page 28: Review
Page 29: Review

Gaussian Discriminant Functions

Same variance for ALL classes → hyper-planes

Different variance among classes → hyper-quadratics (hyper-parabolas, hyper-ellipses, etc.)

Page 30: Review
Page 31: Review
Page 32: Review

Hyper-Planes

When the covariance matrix is common across the Gaussian classes, the decision boundary is a hyper-plane perpendicular to the line connecting the means of the Gaussian distributions

If the a priori probabilities of the classes are equal, the hyper-plane cuts the line connecting the Gaussian means in the middle (Euclidean classifier)
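As an illustration of this case, here is a minimal NumPy sketch (all numerical values are assumed, not from the slides): for two Gaussian classes with the same isotropic covariance σ²I and equal priors, the boundary is the hyper-plane w·x + w0 = 0, perpendicular to μ1 − μ2 and passing through the midpoint of the means.

    import numpy as np

    # Two Gaussian classes with the SAME isotropic covariance sigma^2 * I and
    # equal priors (all numbers are illustrative assumptions).
    mu1 = np.array([0.0, 0.0])
    mu2 = np.array([3.0, 1.0])
    sigma2 = 1.5

    # Boundary: w.x + w0 = 0 with w = (mu1 - mu2) / sigma^2, so the hyper-plane
    # is perpendicular to the line connecting the means; for equal priors it
    # passes through the midpoint x0 = (mu1 + mu2) / 2 (Euclidean classifier).
    w = (mu1 - mu2) / sigma2
    x0 = 0.5 * (mu1 + mu2)
    w0 = -w @ x0

    def g(x):
        """Linear discriminant: positive -> class 1, negative -> class 2."""
        return float(w @ x + w0)

    print("g(mu1)      =", g(mu1))   # > 0
    print("g(mu2)      =", g(mu2))   # < 0
    print("g(midpoint) =", g(x0))    # ~ 0: the plane cuts the segment in the middle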

Page 33: Review

Gaussian Discriminant Functions

Same variance for ALL classes → hyper-planes

Different variance among classes → hyper-quadratics (hyper-parabolas, hyper-ellipses, etc.)

Page 34: Review
Page 35: Review
Page 36: Review
Page 37: Review
Page 38: Review

Hyper-Quadratics

When the Gaussian class variances are different, the boundary can be a hyper-plane, multiple hyper-planes, a hyper-sphere, a hyper-parabola, a hyper-ellipsoid, etc.

In general the boundary is NOT perpendicular to the line connecting the Gaussian means

If the a priori probabilities of the classes are equal, the resulting classifier is a Mahalanobis classifier

Page 39: Review

Conclusions

Parametric statistical models describe class characteristics x by modeling the observation probabilities p(x|class)

Discriminant functions describe class boundaries parametrically

Parametric statistical models have an equivalent parametric discriminant function

For Gaussian p(x|class) distributions the decision boundaries are hyper-planes or hyper-quadratics

Page 40: Review

PatReco: Detection

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 41: Review

Detection

Goal: detect an event. Possible outcomes: hit (success), false alarm, miss (failure), false reject

Page 42: Review
Page 43: Review
Page 44: Review

PatReco: Estimation/Training

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 45: Review

Estimation/Training

Goal: given observed data, (re-)estimate the parameters of the model, e.g., for a Gaussian model estimate the mean and variance for each class

Page 46: Review

Supervised-Unsupervised

Supervised training: All data has been (manually) labeled, i.e., assigned to classes

Unsupervised training: Data is not assigned a class label

Page 47: Review

Observable data

Fully observed data: all information necessary for training is available (features, class labels etc.)

Partially observed data: some of the features or some of the class labels are missing

Page 48: Review

Supervised Training (fully observable data)

Maximum likelihood estimation (ML)

Maximum a posteriori estimation (MAP)

Bayesian estimation (BE)

Page 49: Review

Training process

The collected training data consists of the examples D = {x1, x2, … xN}

Step 1: Label each example with the corresponding class label ω1, ω2, ... ωK

Step 2: For each class separately, estimate the model parameters using ML, MAP, or BE and the corresponding training examples D1, D2, ... DK

Page 50: Review

Training Process: Step 1

D = {x1, x2, x3, x4, x5, … xN}

Label manually with ω1, ω2, ... ωK

D1 = {x11, x12, x13, … x1N1}
D2 = {x21, x22, x23, … x2N2}
…
DK = {xK1, xK2, xK3, … xKNK}

Page 51: Review

Training Process: Step 2

Maximum likelihood: θ1 = argmaxθ P(D1|θ1)

Maximum a posteriori: θ1 = argmaxθ P(D1|θ1) P(θ1)

Bayesian estimation: P(x|ω1) = ∫ P(x|θ1) P(θ1|D1) dθ1

Page 52: Review

ML Estimation Assumptions

1. P(x|ωi) follows a parametric distribution with parameters θ

2. Dj tells us nothing about P(x|ωi) (functional independence)

3. Observations x1, x2, x3, … xN are iid (independent, identically distributed)

4a (ML only!) θ is a quantity whose value is fixed but unknown

Page 53: Review

ML estimation

θ = argmaxθ P(θ|D) = argmaxθ P(D|θ) P(θ)

=(4) argmaxθ P(D|θ) = argmaxθ P(x1, x2, … xN|θ) =(3) argmaxθ Πj P(xj|θ)

=> ∂[Πj P(xj|θ)] / ∂θ = 0 => θ = …

(the equalities marked (3) and (4) use assumptions 3 and 4)

Page 54: Review

ML estimate for Gaussian pdf

If P(x|ω) = N(μ, σ²) and θ = (μ, σ²), then in 1-D:

μ = (1/N) Σj=1..N xj

σ² = (1/N) Σj=1..N (xj − μ)²

Multi-D, θ = (μ, Σ):

μ = (1/N) Σj=1..N xj

Σ = (1/N) Σj=1..N (xj − μ)(xj − μ)^T
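A small NumPy sketch of these ML formulas on synthetic data (the data set and its true parameters are assumed for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic multi-D training data for one class (assumed for illustration).
    true_mu = np.array([1.0, -2.0])
    true_Sigma = np.array([[1.0, 0.3],
                           [0.3, 0.5]])
    X = rng.multivariate_normal(true_mu, true_Sigma, size=1000)   # N x d

    N = X.shape[0]
    mu_ml = X.mean(axis=0)              # mu = (1/N) sum_j x_j
    diff = X - mu_ml
    Sigma_ml = (diff.T @ diff) / N      # Sigma = (1/N) sum_j (x_j - mu)(x_j - mu)^T

    print("ML mean:", mu_ml)
    print("ML covariance:\n", Sigma_ml)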

Page 55: Review

Bayesian Estimation Assumptions

1. P(x|ωi) follows a parametric distribution with parameters θ

2. Dj tells us nothing about P(x|ωi) (functional independence)

3. Observations x1, x2, x3, … xN are iid (independent identically distributed)

4b (MAP, BE) θ is a random variable whose prior distribution p(θ) is known

Page 56: Review

Bayesian Estimation

P(x|D) = ∫ P(x,θ|D) dθ = ∫ P(x|θ,D) P(θ|D) dθ = ∫ P(x|θ) P(θ|D) dθ

STEP 1: P(θ) → P(θ|D), using P(θ|D) = P(D|θ) P(θ) / P(D)

STEP 2: P(x|θ) → P(x|D)

Page 57: Review

Bayesian Estimate for Gaussian pdf and priors

If P(x|θ) = N(μ, σ²) and p(θ) = N(μ0, σ0²) then

STEP 1: P(θ|D) = N(μn, σn²)

STEP 2: P(x|D) = N(μn, σ² + σn²)

μn = [σ0² / (n σ0² + σ²)] (Σj xj) + [σ² / (n σ0² + σ²)] μ0

σn² = σ² σ0² / (n σ0² + σ²)

For a large number n of training samples, maximum likelihood and Bayesian estimation are equivalent!!!
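A short sketch of these update formulas in 1-D (synthetic data and an assumed prior), also illustrating the convergence towards the ML estimate as n grows:

    import numpy as np

    rng = np.random.default_rng(1)

    sigma2 = 4.0               # known data variance sigma^2 (assumed)
    mu0, sigma0_2 = 0.0, 1.0   # prior p(theta) = N(mu0, sigma0^2) (assumed)
    true_mean = 2.5

    for n in (5, 50, 5000):
        x = rng.normal(true_mean, np.sqrt(sigma2), size=n)
        s = x.sum()
        mu_n = (sigma0_2 / (n * sigma0_2 + sigma2)) * s \
             + (sigma2   / (n * sigma0_2 + sigma2)) * mu0
        sigma_n_2 = sigma2 * sigma0_2 / (n * sigma0_2 + sigma2)
        mu_ml = x.mean()
        # For large n the Bayesian estimate mu_n approaches the ML estimate.
        print(f"n={n:5d}  mu_ML={mu_ml:.3f}  mu_n={mu_n:.3f}  sigma_n^2={sigma_n_2:.4f}")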

Page 58: Review

Conclusions

Maximum likelihood estimation is simple and gives good estimates when the number of training samples is large

Bayesian adaptation gives good estimates even for small amounts of training data provided that a good prior is selected

Bayesian adaptation is hard and often does not have a closed form solution (in which case try: iterative recursive Bayesian estimation)

Page 59: Review

PatReco: Model and Feature Selection

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 60: Review

Breakdown of Classification Error

Bayes error, model selection error, model estimation error, data mismatch error (training-testing)

Page 61: Review
Page 62: Review

True statements about Bayes error (valid within statistical significance)

The Bayes error is ALWAYS smaller than the total (empirical) classification error

If the model, estimation, and mismatch errors are zero, then the total classification error equals the Bayes error

The ONLY way to reduce the Bayes error is to add new features to the classifier design

Page 63: Review

More true statements

Adding new features can only reduce the Bayes error (this is not true about the total classification error!!!)

Adding new features will NOT reduce the Bayes error if the new features are:

very bad at discriminating between classes (feature pdfs overlapping)

highly correlated with existing features

Page 64: Review

Gaussian classification Bayes Error

For two classes ω1 and ω2 following Gaussian distributions with means μ1, μ2 and the same variance σ², the Bayes error is:

P(error) = [1/(2π)^0.5] ∫ from r/2 to ∞ of exp{−u²/2} du

where r = |μ1−μ2|/σ
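A quick numerical check of this formula (the values of μ1, μ2, σ are assumed): the tail integral is evaluated both in closed form via erfc and by brute-force numerical integration.

    import math

    mu1, mu2, sigma = 0.0, 2.0, 1.0     # assumed example values
    r = abs(mu1 - mu2) / sigma

    # Closed form: P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du
    #                       = 0.5 * erfc(r / (2*sqrt(2)))
    p_err = 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

    # Brute-force numerical integration of the same tail integral as a sanity check.
    du, u, acc = 1e-4, r / 2.0, 0.0
    while u < 10.0:                      # the tail beyond u = 10 is negligible
        acc += math.exp(-u * u / 2.0) * du
        u += du
    p_err_numeric = acc / math.sqrt(2.0 * math.pi)

    print("r =", r)
    print("P(error), closed form :", p_err)           # ~0.1587 for r = 2
    print("P(error), numeric     :", p_err_numeric)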

Page 65: Review

Feature Selection

If we had infinite amounts of data then The more features the better!

However in practice finite data: More features more parameters to train!!!

Good features: Uncorrelated Able to discriminate among classes

Page 66: Review

Model selection

Model order: the number of parameters that need to be estimated

Overfitting: too many parameters, too little data!!!

Model selection for Gaussian models: single Gaussians, mixture of Gaussians, fixed variance, tied variance, diagonal variance

Page 67: Review

Conclusion

Introducing more features and/or more complex models can only reduce the classification error (if infinite amounts of training data are available)

In practice, the number of features and the number of model parameters are a function of the amount of training data available (avoid overfitting!)

Good features are uncorrelated and discriminative

Page 68: Review

PatReco: Expectation Maximization

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 69: Review

When do we use EM?

Partially observable data:

Missing some features from some samples, e.g., D = {(1,2), (2,3), (?,4)}

Missing class labels, e.g., hidden states of HMMs

Missing class sub-labels, e.g., mixture label for mixture of Gaussian models

Page 70: Review

The EM algorithm

The Expectation Maximization algorithm (EM) consists of alternating expectation and maximization steps

During expectation steps the “best estimates of the missing information” are computed

During the maximization step, maximum likelihood training on all data is performed

Page 71: Review

EM

Initialization: θ(0)

for i = 1..iterno    // usually iterno = 2 or 3

E step: Q(θ; θ(i−1)) = E_Dbad{ log p(D; θ) | x, θ(i−1) }

M step: θ(i) = argmax_θ { Q(θ; θ(i−1)) }

end

Page 72: Review

Pseudo-EM

Initialization: θ(0)

for i = 1..iterno    // usually iterno = 2 or 3

Expectation step: D̂bad = E{ Dbad | θ(i−1) }

Maximization step: θ(i) = argmax_θ { p(D | θ) }

end
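A toy pseudo-EM sketch in the spirit of the loop above, for 1-D Gaussian data with missing values (the data set is assumed): missing samples are filled with their expected value under the current estimate, then the mean and variance are re-estimated by ML on the completed data.

    # 1-D training data; None marks a missing observation (toy example, assumed).
    data = [1.8, 2.2, None, 2.9, 3.1, None, 2.4]

    observed = [x for x in data if x is not None]
    mu = sum(observed) / len(observed)      # theta(0) from the observed data only
    var = sum((x - mu) ** 2 for x in observed) / len(observed)

    for i in range(3):                      # usually 2 or 3 iterations
        # "Expectation" step: fill each missing value with its expected value
        # under the current Gaussian estimate (here simply the current mean).
        completed = [x if x is not None else mu for x in data]
        # "Maximization" step: ML re-estimation on the completed data set.
        mu = sum(completed) / len(completed)
        var = sum((x - mu) ** 2 for x in completed) / len(completed)
        print(f"iter {i + 1}: mu = {mu:.3f}  var = {var:.3f}")   # converges quickly here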

Page 73: Review

Convergence

EM is guaranteed to converge to a local optimum (NOT the global optimum!)

Pseudo-EM has no convergence guarantees but is used often in practice

Page 74: Review

Conclusions

EM is an iterative algorithm used when there are missing or partially observable training data

EM is a generalization of ML training

EM is guaranteed to converge to a local optimum (NOT the global optimum!)

Page 75: Review

PatReco: Bayesian Networks

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 76: Review

Definitions

Bayesian networks consist of nodes and (usually directional) arcs

Nodes or states represent a classification class or in general events and are described with a pdf

Arcs represent relations between nodes, e.g., cause and effect, time sequence

Two nodes that are connected via another node are conditionally independent (given that node)

Page 77: Review

When to use Bayesian nets

Bayesian networks (or networks of inference) are statistical models that are used for classification (or in general pattern recognition) problems where there are dependencies among classes, e.g., time dependencies, cause and effect dependencies

Page 78: Review

Conditional Independence

Full independence between A and B: P(A|B) = P(A), or P(A,B) = P(A) P(B)

Conditional independence of A, B given C: P(A|B,C) = P(A|C), or P(A,B|C) = P(A|C) P(B|C)

Page 79: Review

Conditional Independence

A, C independent given B: P(C|B,A) = P(C|B)

B, C independent given A: P(B,C|A) = P(B|A) P(C|A)

A, C dependent given B: P(A,C|B) cannot be reduced!

[Diagrams of the three corresponding network topologies over the nodes A, B, C]

Page 80: Review

Three problems

1. Probability computation (use independence)

2. Training/parameter estimation: maximum likelihood (ML) if everything is observable, expectation maximization (EM) if data are missing

3. Inference (testing): diagnosis P(cause|effect), bottom-up; prediction P(effect|cause), top-down

Page 81: Review

Probability Computation

For a Bayesian network that consists of N nodes:

1. Compute P(n1, n2, .., nN) using the chain rule, starting from the “last/bottom” node and working your way up:

P(n1, n2, .., nN) = P(nN|n1, n2, .., nN−1) P(nN−1|n1, n2, .., nN−2) … P(n2|n1) P(n1)

2. Identify conditional independence conditions from the Bayesian network topology

3. Simplify the conditional probabilities using the independence conditions

Page 82: Review

Probability Computation

Topology: [Bayesian network over the nodes C, S, R, W with arcs C → S, C → R, S → W, R → W]

P(C,S,R,W) = P(W|C,S,R) P(S|C,R) P(R|C) P(C)

Independent: (W,C)|S,R and (S,R)|C

Dependent: (S,R)|W

P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)
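A minimal sketch of this factorization with binary variables; the conditional probability tables below are made-up illustrative numbers, not values from the course.

    # Joint pdf of the network: P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C),
    # for binary variables. All CPT numbers below are hypothetical.
    P_C1 = 0.5                                     # P(C = 1)
    P_S1_given_C = {1: 0.1, 0: 0.5}                # P(S = 1 | C)
    P_R1_given_C = {1: 0.8, 0: 0.2}                # P(R = 1 | C)
    P_W1_given_SR = {(1, 1): 0.99, (1, 0): 0.9,    # P(W = 1 | S, R)
                     (0, 1): 0.9,  (0, 0): 0.0}

    def bern(p1, value):
        """P(X = value) for a binary variable with P(X = 1) = p1."""
        return p1 if value == 1 else 1.0 - p1

    def joint(c, s, r, w):
        """P(C=c, S=s, R=r, W=w) via the simplified factorization."""
        return (bern(P_W1_given_SR[(s, r)], w) *
                bern(P_S1_given_C[c], s) *
                bern(P_R1_given_C[c], r) *
                bern(P_C1, c))

    print("P(C=1, S=0, R=1, W=1) =", joint(1, 0, 1, 1))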

Page 83: Review

Probability Computation

There are general algorithms for identifying cliques in the Bayesian net

Cliques are islands of conditional dependence, i.e., terms in the probability computation that cannot be further reduced

[Clique diagram for the example network: cliques {S,C}, {R,C}, {W,S,R}]

Page 84: Review

Training/Parameter Estimation

Instead of estimating the joint pdf of the whole network the joint pdf of each of the cliques is estimated

For example if the network joint pdf is

P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)

instead of computing P(C,S,R,W) we compute each of P(W|S,R), P(S|C), P(R|C), P(C) for all possible values of W, S, R, C (much simpler)

Page 85: Review

Training/Parameter Estimation

For fully observable data and discrete probabilities compute maximum likelihood estimates of parameters, e.g., for discrete probs

P(W=1|S=1,R=0)_ML = counts(W=1, S=1, R=0) / counts(W=*, S=1, R=0)

Page 86: Review

Training/Parameter Estimation

Example: the following observations are given for (W,C,S,R): (1,0,1,0), (0,0,1,0), (1,1,1,0), (0,1,1,0), (1,0,1,0), (0,1,0,0), (1,0,0,1), (0,1,1,1), (1,1,1,0)

Using maximum likelihood estimation:

P(W=1|S=1,R=0)_ML = #(1, *, 1, 0) / #(*, *, 1, 0) = 2/5 = 0.4

Page 87: Review

Training/Parameter Estimation

When data is non observable or missing the EM algorithm is employed

There are efficient implementations of the EM algorithm for Bayesian nets that operate on the clique network

When the topology of the Bayesian network is not known structural EM can be used

Page 88: Review

Inference

There are two types of inference (testing): diagnosis P(cause|effect), bottom-up, and prediction P(effect|cause), top-down

Once the parameters of the network are estimated the joint network pdf can be estimated for ALL possible network values

Inference is simply probability computation using the network pdf

Page 89: Review

Inference

For example

P(W=1|C=1) = P(W=1,C=1) / P(C=1)

where

P(W=1,C=1) = Σ_{R,S} P(W=1, C=1, R, S)

P(C=1) = Σ_{W,R,S} P(W, C=1, R, S)
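Continuing the earlier sketch (same hypothetical CPTs), the conditional P(W=1|C=1) can be computed by brute-force marginalization of the network pdf:

    from itertools import product

    # Same hypothetical CPTs as in the earlier sketch (made-up numbers).
    P_C1 = 0.5
    P_S1_given_C = {1: 0.1, 0: 0.5}
    P_R1_given_C = {1: 0.8, 0: 0.2}
    P_W1_given_SR = {(1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.9, (0, 0): 0.0}

    def bern(p1, v):
        return p1 if v == 1 else 1.0 - p1

    def joint(c, s, r, w):
        """P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)."""
        return (bern(P_W1_given_SR[(s, r)], w) * bern(P_S1_given_C[c], s) *
                bern(P_R1_given_C[c], r) * bern(P_C1, c))

    # P(W=1, C=1): sum the joint over R and S; P(C=1): sum over W, R and S.
    p_w1_c1 = sum(joint(1, s, r, 1) for s, r in product((0, 1), repeat=2))
    p_c1 = sum(joint(1, s, r, w) for s, r, w in product((0, 1), repeat=3))

    print("P(W=1 | C=1) =", p_w1_c1 / p_c1)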

Page 90: Review

Inference

Efficient algorithms exist for performing inference in large networks which operate on the clique network

Inference is often shown as a probability maximization problem, e.g., what is the most probable cause or effect?

argmaxW P(W|C=1)

Page 91: Review

Continuous Case

In our examples the network nodes represented discrete events (states or classes)

Network nodes often hold continuous variables (observations), e.g., length, energy

For the continuous case, parametric pdfs are introduced and their parameters are estimated using ML (observed) or EM (hidden)

Page 92: Review

Some Applications

Medical diagnosis, computer problem diagnosis (MS), Markov chains, hidden Markov models (HMMs)

Page 93: Review

Conclusions

Bayesian networks are used to represent dependencies between classes

Network topology defines conditional independence conditions that simplify the network pdf modeling and computation

Three problems: probability computation, estimation/training, inference/testing

Page 94: Review

PatReco: Hidden Markov Models

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 95: Review

Markov Models: Definition

Markov chains are Bayesian networks that model sequences of events (states)

Sequential events are dependent

Two non-sequential events are conditionally independent given the intermediate events (MM-1)

Page 96: Review

Markov chains

[Graphical models of a state sequence q0, q1, q2, q3, q4 for MM-0, MM-1, MM-2 and MM-3, with arcs showing each state's dependence on 0, 1, 2 or 3 previous states]

Page 97: Review

Markov Chains

MM-0: P(q1, q2, .., qN) = Π_{n=1..N} P(qn)

MM-1: P(q1, q2, .., qN) = Π_{n=1..N} P(qn|qn−1)

MM-2: P(q1, q2, .., qN) = Π_{n=1..N} P(qn|qn−1, qn−2)

MM-3: P(q1, q2, .., qN) = Π_{n=1..N} P(qn|qn−1, qn−2, qn−3)
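A tiny sketch of the MM-1 case (the initial and transition probabilities are assumed illustrative values):

    # MM-1: P(q1,...,qN) = P(q1) * prod_n P(q_n | q_{n-1}),
    # where P(q1) plays the role of P(q1|q0). Two states (0 and 1); hypothetical numbers.
    P_init = [0.6, 0.4]                 # P(q1)
    P_trans = [[0.7, 0.3],              # P(q_n = j | q_{n-1} = i) = P_trans[i][j]
               [0.4, 0.6]]

    def mm1_probability(states):
        p = P_init[states[0]]
        for prev, cur in zip(states, states[1:]):
            p *= P_trans[prev][cur]
        return p

    print(mm1_probability([0, 0, 1, 1, 0]))   # 0.6 * 0.7 * 0.3 * 0.6 * 0.4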

Page 98: Review

Hidden Markov Models

Hidden Markov chains model sequences of events and corresponding sequences of observations

Events form a Markov chain (MM-1)

Observations are conditionally independent given the sequence of events

Each observation is directly connected with a single event (and is conditionally independent of the rest of the events in the network)

Page 99: Review

Hidden Markov Models

[Graphical model: state chain q0 → q1 → q2 → q3 → q4 → …, with each observation on attached to its state qn]

P(o0, o1, .., oN, q0, q1, .., qN) = Π_{n=0..N} P(qn|qn−1) P(on|qn)

HMM-1

Page 100: Review

Parameter Estimation

The parameters that have to be estimated are: the a priori probabilities P(q0), the transition probabilities P(qn|qn−1), and the observation probabilities P(on|qn)

For example, if there are 3 types of events and continuous 1-D observations that follow a Gaussian distribution, there are 18 parameters to estimate: 3 a priori probabilities, a 3x3 transition probability matrix, and 3 means and 3 variances (observation probabilities)

Page 101: Review

Parameter Estimation

If both the sequence of events and the sequence of observations are fully observable, then ML is used

Usually the sequence of events q0, q1, .., qN is not observable, in which case EM is used

The EM algorithm for HMMs is the Baum-Welch or forward-backward algorithm

Page 102: Review

Inference/Decoding

The main inference problem for HMMs is known as the decoding problem: given a sequence of observations find the best sequence of states:

q = argmaxq P(q|O) = argmaxq P(q,O)

An efficient decoding algorithm is the Viterbi algorithm

Page 103: Review

Viterbi algorithm

maxq P(q,O) = maxq P(o0, o1, .., oN, q0, q1, .., qN) = maxq Π_{n=0..N} P(qn|qn−1) P(on|qn)

= max_qN { P(oN|qN) max_qN−1 { P(qN|qN−1) P(oN−1|qN−1) … max_q2 { P(q3|q2) P(o2|q2) max_q1 { P(q2|q1) P(o1|q1) max_q0 { P(q1|q0) P(o0|q0) P(q0) } } } … } }

Page 104: Review

Viterbi algorithm

[Trellis diagram: states 1, 2, 3, 4, …, K (vertical axis) plotted against time (horizontal axis)]

At each node keep only the best (most probable) path from all the paths passing through that node
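A compact Viterbi sketch for a discrete-observation HMM (all model parameters below are assumed for illustration); at each time step and for each state, only the most probable incoming path is kept, exactly as described above.

    import numpy as np

    # Discrete-observation HMM with K states; all numbers are hypothetical.
    K = 3
    pi = np.array([0.5, 0.3, 0.2])              # a priori probabilities P(q0)
    A = np.array([[0.6, 0.3, 0.1],              # transition probabilities P(q_n | q_{n-1})
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
    B = np.array([[0.7, 0.2, 0.1],              # observation probabilities P(o_n | q_n), 3 symbols
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])

    def viterbi(obs):
        T = len(obs)
        delta = np.zeros((T, K))                # best path score ending in each state
        psi = np.zeros((T, K), dtype=int)       # back-pointers
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            for j in range(K):
                scores = delta[t - 1] * A[:, j]
                psi[t, j] = np.argmax(scores)   # keep only the best path into state j
                delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
        # Backtrack the most probable state sequence.
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(delta[-1].max())

    states, score = viterbi([0, 1, 1, 2, 0])
    print("best state sequence:", states, " max_q P(q,O) =", score)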

Page 105: Review

Deep Thoughts

HMM-0 (an HMM with an MM-0 event chain) is the Bayes classifier!!!

MMs and HMMs are poor models but are simple and computationally efficient. How do you fix this? (dependent observations?)

Page 106: Review

Some Applications

Speech Recognition

Optical Character Recognition

Part-of-Speech Tagging

Page 107: Review

Conclusions

HMMs and MMs are useful modeling tools for dependent sequences of events (states or classes)

Efficient algorithms exist for training HMM parameters (Baum-Welch) and for decoding the most probable sequence of states given an observation sequence (Viterbi)

HMMs have many applications

Page 108: Review

Non Parametric Classifiers

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 109: Review

Histograms-Parzen Windows

Main idea:

Instead of selecting a parametric distribution (e.g., Gaussian) to describe the properties of the features of a class, compute the empirical distribution directly (class feature histogram)

Page 110: Review

Feature Histogram Example

[Histogram of feature X: number of samples in each bin]

Normalize the histogram curve to get the feature PDF

Page 111: Review

Parzen Windows: Issues

When compared to parametric methods, empirical distributions are: better, because no specific form of the PDF is assumed; worse, because over-fitting can easily occur (too small a histogram bin)

Parzen proposed rules for adapting the bin size based on the number of samples in each bin, to avoid over-fitting

Page 112: Review

Nearest Neighbor Rule

Main idea (1-NNR): no explicit model (i.e., no training)

For each test sample x, the “nearest” sample x’ in the training set is found, i.e., argmin_x’ d(x, x’), and x is classified to the class where x’ belongs

Page 113: Review

Generalizations

k-NNR: Instead of finding the nearest neighbor, we find the k nearest neighbors from the training set; the sample x is classified to the class where most of the k neighbors belong

k-l-NNR: Like k-NNR, but at least l of the k nearest neighbors must belong to the same class for a classification decision to be taken (else no decision)

Page 114: Review

Example

Training set: D1 = {0, −1, −2} and D2 = {1, 1, 1}

[Number line from −2 to 3 marking the 1-NNR decision boundary, the 3-NNR decision boundary, and the 3-3-NNR no-decision region]
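A small sketch of the NNR rules applied to this training set (the test points are chosen for illustration):

    # Training set from the example: D1 = {0, -1, -2} (class 1), D2 = {1, 1, 1} (class 2).
    train = [(0.0, 1), (-1.0, 1), (-2.0, 1), (1.0, 2), (1.0, 2), (1.0, 2)]

    def knn_classify(x, k, l=None):
        """k-NNR: majority vote among the k nearest training samples.
        If l is given (k-l-NNR), return None unless at least l of the k
        neighbors agree on the winning class."""
        neighbors = sorted(train, key=lambda s: abs(s[0] - x))[:k]
        votes = {}
        for _, label in neighbors:
            votes[label] = votes.get(label, 0) + 1
        label, count = max(votes.items(), key=lambda kv: kv[1])
        if l is not None and count < l:
            return None          # no decision
        return label

    for x in (-0.3, 0.3, 0.7):
        print(f"x={x:+.1f}  1-NNR -> {knn_classify(x, 1)}  "
              f"3-NNR -> {knn_classify(x, 3)}  3-3-NNR -> {knn_classify(x, 3, l=3)}")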

Page 115: Review

Computational Efficiency

To speed up NNR classification the training set size can be reduced using the condensing algorithm:

The training set is classified using the NNR rule; misclassified samples are added to the new (condensed) training set one by one until all training samples are correctly classified

Page 116: Review

Conclusions

Non-parametric classification algorithms: are easy to implement, are computationally efficient (in training), don’t make any assumptions, are prone to over-fitting, and are hard to adapt (no detailed model)

Page 117: Review

Discriminant Functions

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 118: Review

Discriminant Functions

Main Idea:

Describe the decision boundary parametrically (instead of the properties of the class), e.g., the two classes are separated by a straight line a x1 + b x2 + c = 0 with parameters (a, b, c), instead of the feature PDFs being 2-D Gaussians

Page 119: Review

Example: Two classes, two features

[Two plots of classes w1 and w2 in the (x1, x2) feature space: on the left the class boundary a x1 + b x2 + c = 0 is modeled directly; on the right the class characteristics are modeled by Gaussians N(μ1, Σ1) and N(μ2, Σ2)]

Model class boundary vs model class characteristics

Page 120: Review

Duality

Dualism:

Parametric class description ↔ Bayes classifier

Decision boundary ↔ Parametric discriminant functions

For example modeling class features by Gaussians with same (across-class) variance results in hyper-plane discriminant functions

Page 121: Review
Page 122: Review

Discriminant Functions

Discriminant functions gi(x) are functions of the features x of a class i

A sample x is classified to the class c for which gi(x) is maximized, i.e., c = argmax_i { gi(x) }

The equation gi(x) = gj(x) defines the class boundary for each pair of (different) classes i and j

Page 123: Review

Linear Discriminant Functions

Two-class problem: a single discriminant function is defined as g(x) = g1(x) − g2(x)

If g(x) is a linear function, g(x) = w^T x + w0, then the boundary is a hyper-plane (a point, line, or plane for 1-D, 2-D, 3-D features respectively)

Page 124: Review

Linear Discriminant Functions

[Plot of the line a x1 + b x2 + c = 0 in the (x1, x2) plane, with normal vector w = (a, b) and axis intercepts −c/a and −c/b]

Page 125: Review

Non Linear Discriminant Functions

Quadratic discriminant functions:

g(x) = w0 + Σi wi xi + Σij wij xi xj

for example, for a two-class 2-D problem:

g(x) = a + b x1 + c x2 + d x1²

Any non-linear discriminant function can become linear by increasing the dimensionality, e.g., y1 = x1, y2 = x2, y3 = x1² (2-D non-linear → 3-D linear):

g(y) = a + b y1 + c y2 + d y3

Page 126: Review

Parameter Estimation

The parameters w are estimated by functional minimization

The function J to be minimized models the average distance of training samples from the decision boundary, computed over either the misclassified training samples only or all training samples

The function J is minimized using gradient descent

Page 127: Review

Gradient Descent

Iterative procedure towards a local minimum

a(k+1) = a(k) − η(k) ∇J(a(k))

where k is the iteration number, η(k) is the learning rate and ∇J(a(k)) is the gradient of the function to be minimized, evaluated at a(k)

Newton descent is the gradient descent with learning rate equal to the inverse Hessian matrix

Page 128: Review

Distance Functions

Perceptron criterion function:

Jp(a) = Σ_{misclassified} (−a^T y)

Relaxation with margin b:

Jr(a) = Σ_{misclassified} (a^T y − b)² / ||y||²

Least mean squares (LMS):

Js(a) = Σ_{all samples} (a^T yi − bi)²

Ho-Kashyap rule:

Js(a,b) = Σ_{all samples} (a^T yi − bi)²
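A minimal sketch of gradient descent on the perceptron criterion Jp (the 2-D toy data and the learning rate are assumed): since ∇Jp(a) = −Σ_{misclassified} y, the update adds the misclassified (augmented, sign-normalized) samples to a.

    import numpy as np

    # Toy 2-D training data (assumed): class 1 and class 2, linearly separable.
    X1 = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 1.5]])
    X2 = np.array([[-1.0, -0.5], [-2.0, -1.0], [-0.5, -2.0]])

    # Augment with a constant 1 (bias) and negate class-2 samples, so that a
    # sample y is correctly classified when a^T y > 0.
    Y = np.vstack([np.hstack([X1, np.ones((len(X1), 1))]),
                   -np.hstack([X2, np.ones((len(X2), 1))])])

    a = np.zeros(3)
    eta = 0.1                                   # learning rate n(k), kept constant here
    for k in range(100):
        miscl = Y[Y @ a <= 0]                   # misclassified samples
        if len(miscl) == 0:
            break                               # separable set: the perceptron has converged
        # Jp(a) = sum_miscl (-a^T y)  =>  grad Jp = -sum_miscl y
        a = a - eta * (-miscl.sum(axis=0))
    print("iterations:", k, " weight vector a =", a)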

Page 129: Review

Discriminant Functions

Working on misclassified samples only (Perceptron, Relaxation with Margin) provides better results but converges only for separable training sets

Page 130: Review

High Dimensionality

Using non-linear discriminant functions and linearizing them in a high-dimensional space can make ANY training set separable

→ large number of parameters (curse of dimensionality)

Support vector machines: a smart way to select appropriate terms (dimensions) is needed

Page 131: Review

Non-Metric Methods: Decision Trees

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 132: Review

Decision Trees

Motivation: There are features (discrete) that don’t have an obvious notion of similarity or ordering (nominal data), e.g., book type, shape, sound type

Taxonomies (i.e., trees with is-a relationship) are the oldest form of classification

Page 133: Review

Decision Trees: Definition

Decision Trees are classifiers that classify samples based on a set of questions that are asked hierarchically (tree of questions)

Example questions: is color red? is x < 0.5?

Terminology: root, leaf, node, arc, branch, parent, children, branching factor, depth

Page 134: Review

Fruit classifier

[Decision tree diagram: the root question Color? branches to green / yellow / red; second-level questions are Size?, Shape?, Size?; further questions include Size? and Taste?, with answers such as big / med / small, round / thin, sweet / sour]

Page 135: Review

Fruit classification

[The same tree; the following slides highlight the traversal of one test sample question by question, ending at the leaf CHERRY]

Page 136: Review
Page 137: Review
Page 138: Review

Page 139: Review

Fruit classifier

[The same decision tree with class labels at the leaves: watermelon, grape, grapefruit, cherry, grape]

Page 140: Review

Binary Trees

Binary trees: each parent node has exactly two children nodes (branching factor = 2)

Any tree can be represented as a binary tree by changing the set of questions and increasing the tree depth,

e.g., the three-way question Color? (green / yellow / red) becomes two binary questions: Color = green? (Y/N) and, if N, Color = yellow? (Y/N)

Page 141: Review

Decision Trees: Problems

1. List of questions (features): all possible questions are considered

2. Which question to split on first (best split): the questions that split the data best (reduce impurity at each node) are asked first

3. Stopping criterion (pruning criterion): stop when further splits don’t reduce impurity

Page 142: Review

Best Split example

Two-class problem with 100 examples from each of w1 and w2

Three binary questions Q1, Q2 and Q3 split the data as follows:

Q1:  Node 1: (50,50)   Node 2: (50,50)

Q2:  Node 1: (100,0)   Node 2: (0,100)

Q3:  Node 1: (80,0)    Node 2: (20,100)

Page 143: Review

Impurity Measures

Impurity measures the degree of homogeneity of a node; a node is pure if it consists of training examples from a single class

Entropy impurity: i(N) = −Σi P(wi) log2 P(wi)

Variance (two-class): i(N) = P(w1) P(w2)

Gini impurity: i(N) = 1 − Σi P²(wi)

Misclassification: i(N) = 1 − maxi P(wi)

Page 144: Review

Total Impurity

Total Impurity at Depth 0:

i(depth =0) = i(N)

Total Impurity at Depth 1:

i(depth =1) = p(NL ) i(NL) + p(NR ) i(NR)

[Diagram: node N at depth 0 splits (yes / no) into children NL and NR at depth 1]

Page 145: Review

Impurity Example

Node 1: (80,0)   Node 2: (20,100)

I(node 1) = 0

I(node 2) = −(20/120) log2(20/120) − (100/120) log2(100/120) = 0.65

P(node 1) = 80/200 = 0.4

P(node 2) = 120/200 = 0.6

I(total) = P(node 1) I(node 1) + P(node 2) I(node 2) = 0 + 0.6 × 0.65 = 0.39
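The same computation in a few lines of Python, applied to all three candidate splits of the earlier best-split example (200 training samples in total):

    import math

    def entropy_impurity(counts):
        """i(N) = -sum_i P(w_i) log2 P(w_i) for a node with per-class counts."""
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def total_impurity(left, right):
        """Weighted impurity after a binary split: p(NL) i(NL) + p(NR) i(NR)."""
        n = sum(left) + sum(right)
        return (sum(left) / n) * entropy_impurity(left) + \
               (sum(right) / n) * entropy_impurity(right)

    splits = {"Q1": ((50, 50), (50, 50)),
              "Q2": ((100, 0), (0, 100)),
              "Q3": ((80, 0), (20, 100))}
    for name, (left, right) in splits.items():
        print(name, "total impurity = %.2f" % total_impurity(left, right))
    # Q1 -> 1.00 (useless split), Q2 -> 0.00 (perfect split), Q3 -> 0.39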

Page 146: Review

Continuous Example

For continuous features: questions are of the type x<a where x is the feature and a is a constant

Decision Boundaries (two class, 2-D example):

[2-D example: axis-parallel decision boundaries of the form x < a partition the (x1, x2) plane into rectangular regions labeled R1 and R2]

Page 147: Review

Summary

Decision trees are useful categorical classification tools especially for nominal (non-metric) data

CART creates trees that minimize impurity on the training set at each node

Decision region shape

CART is a useful tool for feature selection

Page 148: Review

Unsupervised Training and Clustering

Alexandros Potamianos

Dept of ECE, Tech. Univ. of Crete

Fall 2004-2005

Page 149: Review

Unsupervised Training

Definition: The training set samples are unlabelled (unclassified)

Motivation: labeling is hard/time consuming; fully automatic adaptation of models (in the field)

Page 150: Review

Maximum Likelihood Training

Given: N training examples drawn from c classes, i.e., D = {x1, x2, … xN} (no class assignments are given!)

Estimate: class priors p(wi) and feature PDF parameters θi: p(x|θi, wi)

Sometimes the number of classes c is not given and also has to be estimated

Page 151: Review

Unsupervised ML estimation

Σk P(wi|xk, θ) ∇θi log p(xk|wi, θi) = 0

Compared with supervised ML: additional term P(wi|xk,θ)

P(wi|xk,θ) class membership function for each sample xk

Unsupervised ML is a version of EM

Pseudo-EM: P(wi|xk,θ) is binary 0 or 1

Page 152: Review

Mixture of Gaussians Estimates

Linear combination of Gaussians with weights ai:

p(xk) = Σi ai N(xk; μi, Σi)

ML estimates:

ai = (1/N) Σk P(wi|xk)

μi = Σk P(wi|xk) xk / Σk P(wi|xk)

Σi = Σk P(wi|xk) (xk − μi)(xk − μi)^T / Σk P(wi|xk)
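A compact sketch of these update equations for a 1-D, two-component mixture (synthetic data assumed); the membership term P(wi|xk) is recomputed before each update, which is exactly the EM loop for a mixture of Gaussians:

    import numpy as np

    rng = np.random.default_rng(2)
    # Synthetic 1-D data from two Gaussians (assumed for illustration).
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])
    N = len(x)

    # Initial parameter guesses: weights a_i, means mu_i, variances var_i.
    a = np.array([0.5, 0.5])
    mu = np.array([-1.0, 1.0])
    var = np.array([1.0, 1.0])

    def gauss(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    for it in range(20):
        # Membership: P(w_i | x_k) proportional to a_i N(x_k; mu_i, var_i).
        resp = np.stack([a[i] * gauss(x, mu[i], var[i]) for i in range(2)])
        resp /= resp.sum(axis=0, keepdims=True)
        # ML updates, following the formulas on the slide:
        a = resp.sum(axis=1) / N
        mu = (resp * x).sum(axis=1) / resp.sum(axis=1)
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / resp.sum(axis=1)

    print("weights:", a.round(3))
    print("means:", mu.round(3))
    print("variances:", var.round(3))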

Page 153: Review

Clustering

Basic Isodata:

1. Select an initial partition of the data into c classes and compute the cluster means

2. Classify the training samples using a classification criterion (Euclidean distance)

3. Recompute the cluster means based on the training-set classification decisions

4. If there is no change in the sample means, stop; else go to step 2
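A minimal NumPy sketch of this loop (synthetic 2-D data and c = 2 are assumed):

    import numpy as np

    rng = np.random.default_rng(3)
    # Synthetic unlabelled 2-D data from two groups (assumed for illustration).
    X = np.vstack([rng.normal([0, 0], 1.0, (150, 2)),
                   rng.normal([5, 4], 1.0, (150, 2))])
    c = 2

    # Step 1: initial partition -> initial cluster means (here: c random samples).
    means = X[rng.choice(len(X), size=c, replace=False)]

    while True:
        # Step 2: classify every sample to the nearest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the cluster means from the classification decisions.
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        # Step 4: stop if the means did not change, else go back to step 2.
        if np.allclose(new_means, means):
            break
        means = new_means

    print("cluster means:\n", means)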

Page 154: Review

Iterative clustering algorithms

Top-down algorithms:

Start from a single class (all data)

Split a class (e.g., based on its standard deviation)

Continue splitting the “largest” class until the desired number of clusters is reached

Bottom-up algorithms:

Start with each training sample as a different class

Merge classes (e.g., using a NNR criterion) until the desired number of classes is reached