Page 1:

Sample-Separation-Margin Based Minimum Classification Error Training of Pattern Classifiers

with Quadratic Discriminant Functions

Yongqiang Wang 1,2, Qiang Huo 1

1 Microsoft Research Asia, Beijing, China
2 The University of Hong Kong, Hong Kong, China

([email protected])

ICASSP-2010, Dallas, Texas, U.S.A., March 14-19, 2010

Page 2:

Outline

• Background

• What’s our new approach?

• How does it work?

• Conclusions

Page 3:

Background of Minimum Classification Error (MCE) Formulation for Pattern Classification

• Pioneered by Amari and Tsypkin in the late 1960s
  – S. Amari, “A theory of adaptive pattern classifiers,” IEEE Trans. on Electronic Computers, Vol. EC-16, No. 3, pp. 299-307, 1967.
  – Y. Z. Tsypkin, Adaptation and Learning in Automatic Systems, 1971.
  – Y. Z. Tsypkin, Foundations of the Theory of Learning Systems, 1973.

• Proposed originally for supervised online adaptation of a pattern classifier
  – to minimize the expected risk (cost)
  – via a sequential probabilistic descent (PD) algorithm

• Extended by Juang and Katagiri in the early 1990s
  – B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Trans. on Signal Processing, Vol. 40, No. 12, pp. 3043-3054, 1992.

Page 4:

MCE Formulation by Juang and Katagiri (1)

• Define a proper discriminant function of an observation for each pattern class

• To enable a maximum discriminant decision rule for pattern classification

• Largely an art and application-dependent
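In one standard formulation (notation ours), with discriminant functions $g_j(x; \Lambda)$ for classes $\omega_1, \dots, \omega_M$ and classifier parameters $\Lambda$, the maximum discriminant decision rule reads

$$ C(x) = \omega_k, \qquad k = \arg\max_{j} g_j(x; \Lambda). $$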

Page 5:

MCE Formulation by Juang and Katagiri (2)

• Define a misclassification measure for each observation
  – to embed the decision process in the overall MCE formulation
  – to characterize the degree of confidence (or margin) in making a decision for this observation
  – a differentiable function of the classifier parameters

• A popular choice (e.g., in Juang & Katagiri, 1992):

$$ d_i(x; \Lambda) = -g_i(x; \Lambda) + \left[ \frac{1}{M-1} \sum_{j \neq i} g_j(x; \Lambda)^{\eta} \right]^{1/\eta}, \qquad \eta > 0, $$

where $M$ is the number of classes; as $\eta \to \infty$, the competitor term approaches $\max_{j \neq i} g_j(x; \Lambda)$, the score of the most competing class, so $d_i(x; \Lambda) > 0$ indicates a misclassification.

• Many possible ways => which one is better? => an open problem!

Page 6:

MCE Formulation by Juang and Katagiri (3)

• Define a loss (cost) function for each observation
  – a differentiable and monotonically increasing function of the misclassification measure
  – many possibilities => the sigmoid function is the most popular for approximating the classification error count

• MCE training via minimizing
  – the empirical average loss (cost), by an appropriate optimization procedure, e.g., gradient descent (GD), Quickprop, Rprop, etc., or
  – the expected loss (cost), by a sequential probabilistic descent (PD) algorithm (a.k.a. GPD)
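One standard instantiation (a sketch; not necessarily the exact choices used in the paper): the sigmoid loss and the empirical average loss over training samples $x_1, \dots, x_N$ with class labels $c_1, \dots, c_N$ are

$$ \ell_i(x; \Lambda) = \frac{1}{1 + \exp(-\alpha\, d_i(x; \Lambda))} \;\; (\alpha > 0), \qquad L(\Lambda) = \frac{1}{N} \sum_{n=1}^{N} \ell_{c_n}(x_n; \Lambda), $$

while the expected loss is the expectation of the same sample loss over the observation distribution, minimized sequentially by PD/GPD.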

Page 7:

Some Remarks

• Combinations of different choices for each of the previous three steps and optimization methods lead to various MCE training algorithms.

• The power of MCE training has been demonstrated by many research groups for different pattern classifiers in different applications.

• How to improve the generalization capability of an MCE-trained classifier?

Page 8:

One Possible Solution: SSM-based MCE Training

• Sample Separation Margin (SSM)
  – Defined as the smallest distance from an observation to the classification boundary formed by the true class and the most competing class
  – There is a closed-form solution for piecewise linear classifiers

• Define the misclassification measure as the negative SSM
  – The other parts of the formulation are the same as in “traditional” MCE

• A happy result:
  – Minimized empirical error rate, and
  – Improved generalization
    • Correctly recognized training samples have a large margin from the decision boundaries!

• For more info:
  – T. He and Q. Huo, “A study of a new misclassification measure for minimum classification error training of prototype-based pattern classifiers,” in Proc. ICPR-2008.
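To make the closed form concrete (a sketch under a nearest-prototype assumption with Euclidean distance; notation ours): if $m_i$ is the nearest prototype of the true class and $m_j$ that of the most competing class, the local decision boundary is the perpendicular bisector of $m_i$ and $m_j$, so

$$ \mathrm{SSM}(x) = \frac{\bigl|\, \|x - m_j\|^2 - \|x - m_i\|^2 \,\bigr|}{2\, \|m_i - m_j\|}. $$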

Page 9:

What’s New in This Study?

• Extend SSM-based MCE training to pattern classifiers with quadratic discriminant functions (QDFs)
  – No closed-form solution exists for calculating the SSM

• Demonstrate its effectiveness on a large-scale Chinese handwriting recognition task
  – The modified QDF (MQDF) is widely used in state-of-the-art Chinese handwriting recognition systems

Page 10:

Two Technical Issues

• How to calculate the SSM efficiently?
  – Formulated as a nonlinear programming problem (sketched below)
  – Can be solved efficiently because it is a quadratically constrained quadratic programming (QCQP) problem with a very special structure:
    • A convex objective function with one quadratic equality constraint

• How to calculate the derivative of the SSM?
  – Using a technique known as sensitivity analysis in nonlinear programming
  – Calculated using the solution to the problem in Eq. (1)

• Please refer to our paper for details
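For a sample $x$ of true class $i$ with most competing class $j$, the underlying problem (our reconstruction, in our own notation) is to find the nearest point on the decision boundary:

$$ \min_{y} \; \|y - x\|^2 \quad \text{s.t.} \quad g_i(y; \Lambda) - g_j(y; \Lambda) = 0, $$

with $\mathrm{SSM}(x) = \|y^* - x\|$ at the minimizer $y^*$. The paper exploits the special QCQP structure for an efficient solver; purely as an illustration, here is a generic numerical sketch (hypothetical helper names; SciPy’s general-purpose SLSQP, not the paper’s method):

```python
import numpy as np
from scipy.optimize import minimize


def make_qdf(mu, precision, bias):
    """Toy quadratic discriminant g(y) = -(y - mu)^T P (y - mu) + bias."""
    return lambda y: -(y - mu) @ precision @ (y - mu) + bias


def ssm(x, g_true, g_comp):
    """Unsigned distance from x to the boundary {y : g_true(y) = g_comp(y)},
    via the QCQP  min ||y - x||^2  s.t.  g_true(y) - g_comp(y) = 0.
    The sign of the misclassification measure then follows from whether
    g_true(x) > g_comp(x), i.e., whether x is correctly classified."""
    boundary = {"type": "eq", "fun": lambda y: g_true(y) - g_comp(y)}
    res = minimize(lambda y: float(np.sum((y - x) ** 2)),
                   x0=np.asarray(x, dtype=float),
                   constraints=[boundary], method="SLSQP")
    return float(np.sqrt(res.fun))


# Usage: two 2-D quadratic discriminants with different curvatures.
g1 = make_qdf(np.array([0.0, 0.0]), np.eye(2), 0.0)
g2 = make_qdf(np.array([3.0, 0.0]), 2.0 * np.eye(2), 0.0)
print(ssm(np.array([0.5, 0.2]), g1, g2))
```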

Page 11:

Experimental Setup

• Vocabulary:
  – 6763 simplified Chinese characters

• Dataset:
  – Training: 9,447,328 character samples
    • # of samples per class: 952 – 5,600
  – Testing: 614,369 character samples

• Feature extraction:
  – 512 “8-directional features”
  – Use LDA to reduce the dimension to 128

• Use MQDF for each character class (a standard form is given below)
  – # of retained eigenvectors: 5 and 10

• SSM-based MCE training
  – Use a maximum likelihood (ML) trained model as the seed model
  – Update mean vectors only in MCE training
  – Optimize the MCE objective function by batch-mode Quickprop (20 epochs)

[Pie chart] Distribution of writing styles in testing data: Regular 30%, Cursive 70%
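For reference, one standard form of the MQDF (following Kimura et al.; the paper’s exact variant may differ in details) for class $j$ with mean $\mu_j$, leading eigenvalues $\lambda_{jk}$ and eigenvectors $\varphi_{jk}$ ($k = 1, \dots, K$), minor-eigenvalue constant $\delta_j$, and feature dimension $D$ (here 128):

$$ g_j(x) = -\sum_{k=1}^{K} \frac{[\varphi_{jk}^{\top}(x - \mu_j)]^2}{\lambda_{jk}} - \frac{1}{\delta_j} \Bigl[ \|x - \mu_j\|^2 - \sum_{k=1}^{K} [\varphi_{jk}^{\top}(x - \mu_j)]^2 \Bigr] - \sum_{k=1}^{K} \log \lambda_{jk} - (D - K) \log \delta_j, $$

treated as a discriminant to be maximized.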

Page 12:

Experimental Results (1)

• MQDF, K=5

[Bar chart] Error rate (%) by writing style for ML, MCE, and SSM-MCE (MQDF, K=5):

          Regular (error in %)   Cursive (error in %)
ML        1.73                   8.34
MCE       1.29                   7.34
SSM-MCE   1.19                   7.00

Page 13:

Experimental Results (2)

• MQDF, K=10

[Bar chart] Error rate (%) by writing style for ML, MCE, and SSM-MCE (MQDF, K=10):

          Regular (error in %)   Cursive (error in %)
ML        1.39                   7.03
MCE       1.30                   6.54
SSM-MCE   1.07                   6.29

Page 14:

Experimental Results (3)

• Histogram of SSMs on the training set
  – SSM-based MCE-trained classifier vs. the conventional MCE-trained one
  – Training samples are pushed away from the decision boundaries
  – The bigger the SSM, the better the generalization

Page 15:

Conclusions and Discussion

• SSM-based MCE training offers an implicit way of minimizing the empirical error rate and maximizing the sample separation margin simultaneously
  – Verified for quadratic classifiers in this study
  – Verified for piecewise linear classifiers previously (He & Huo, ICPR-2008)

• Ongoing and future work
  – SSM-based MCE training for discriminative feature extraction
  – SSM-based MCE training for more flexible classifiers based on GMMs and HMMs
  – Searching for other (hopefully better) methods to combine MCE training and maximum margin training