Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
AUTOMATIC FEATURE LEARNING AND PARAMETER ESTIMATION FOR HIDDENMARKOV MODELS USING MCE AND GIBBS SAMPLING
By
XUPING ZHANG
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOLOF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
c© 2009 Xuping Zhang
2
To my Parents because they taught me everything,
to my Brother because he was there when I needed help,
to my professors because they pointed me in the right direction when I was lost,
thank you
3
ACKNOWLEDGMENTS
I would like to thank Dr. Paul Gader, Dr. Joe Wilson, Dr. Gerhard Ritter, Dr. David
Wilson, and Dr. Tamer Kahveci for their patience and understanding. I would also like to
thank my co-workers at the lab, Raazia Mazhar, Alina Zare, Jeremy Bolton, Seniha Esen
Yuksel, Gyeongyong Heo, Andres Mendez-Vazquez, Arthur Barnes, Ryan Close, Ryan
Busser, Kenneth Watford, Hyo-Jin Suh, Wen-Hsiung Lee, John McElroy, Taylor Glenn,
Sean Matthews, and Ganesan Ramachandran, for their help and understanding.
4
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF SYMBOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Statement of Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.2 Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.3 Overview of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Feature Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.1.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.1.2 Feature Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.1.3 Sparsity Promotion . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.1.4 Information Theoretic Learning . . . . . . . . . . . . . . . . . . . . 222.1.5 Transformation Methods . . . . . . . . . . . . . . . . . . . . . . . . 242.1.6 Convolutional Neural Network and Shared-Weight Neural Network 292.1.7 Morphological Transform . . . . . . . . . . . . . . . . . . . . . . . . 302.1.8 Bayesian Nonparametric Latent Feature Model . . . . . . . . . . . 31
2.2 Hidden Markov Model (HMM) . . . . . . . . . . . . . . . . . . . . . . . . . 322.2.1 Definition and Basic Concepts . . . . . . . . . . . . . . . . . . . . . 322.2.2 Applications of the HMM . . . . . . . . . . . . . . . . . . . . . . . . 332.2.3 Learning HMM Parameters . . . . . . . . . . . . . . . . . . . . . . 332.2.4 Minimum Classification Error (MCE) for HMM . . . . . . . . . . . . 34
2.3 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 CONCEPTUAL APPROACH . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.2 Feature Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Ordered Weighted Average (OWA)-based Generalized MorphologicalFeature Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.2 Convolutional Feature Models . . . . . . . . . . . . . . . . . . . . . 423.2.2.1 Feature model for loosely coupled sampling feature learning
HMM (LSampFealHMM) . . . . . . . . . . . . . . . . . . 42
5
3.2.2.2 Feature model for tightly coupled sampling feature learningHMM (TSampFealHMM) . . . . . . . . . . . . . . . . . . 43
3.3 Feature Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.4 Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 MCE-HMM Model for Feature Learning . . . . . . . . . . . . . . . . 463.4.2 Gibbs Sampling Method for Continuous HMM with Multivariate
Gaussian Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.4.3 Loosely Coupled Gibbs Sampling Model for Feature Learning . . . 52
3.4.3.1 Gibbs sampler for LSampFeaLHMM . . . . . . . . . . . . 523.4.3.2 Initialization and modified Viterbi learning . . . . . . . . . 57
3.4.4 Tightly Coupled Gibbs Sampling Model for Feature Learning . . . . 59
4 EMPIRICAL ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.1 SynData1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714.2.2 GPRArid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.2.3 GPRTwoSite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734.2.4 SynData2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734.2.5 GPRTemp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.2.6 Handwritten Digits . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . 93
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6
LIST OF TABLES
Table page
4-1 Algorithms and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-2 Confusion matrix for digit pair 0 and 1 for TSampFeaLHMM . . . . . . . . . . . 78
4-3 Confusion matrix for digits 0, 1, 2, 4 for TSampFeaLHMM . . . . . . . . . . . . 78
4-4 Confusion matrix for digits 0, 1, 2, 4 for HMM with human-made masks . . . . 78
7
LIST OF FIGURES
Figure page
1-1 General classification model with diagram. . . . . . . . . . . . . . . . . . . . . . 15
2-1 The dashed line is the PCA projection, but the vertical dotted line representsthe best projection to separate two clusters . . . . . . . . . . . . . . . . . . . . 38
2-2 The top plot has β close to zero to maximize the variation of projections (horizontalaxis) of all observations, and the bottom plot has β close to one to minimizethe variation of the projections (vertical axis) of the observations in same cluster. 38
3-1 Feature extraction process for feature learning . . . . . . . . . . . . . . . . . . 63
3-2 MCE-based training process for feature learning . . . . . . . . . . . . . . . . . 64
3-3 Gibbs sampling HMM training process . . . . . . . . . . . . . . . . . . . . . . . 64
3-4 Feature model for LSampFealHMM . . . . . . . . . . . . . . . . . . . . . . . . . 65
3-5 LSampFealHMM Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 65
3-6 Initial feature model for TSampFealHMM . . . . . . . . . . . . . . . . . . . . . . 66
3-7 Final feature model for TSampFealHMM . . . . . . . . . . . . . . . . . . . . . . 66
3-8 TSampFealHMM Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 67
4-1 Ten samples from each class of dataset SynData1. . . . . . . . . . . . . . . . . 79
4-2 Samples from each class of dataset GPRArid. . . . . . . . . . . . . . . . . . . 80
4-3 Hit-miss pairs for initial masks. . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-4 Initial OWA weights for hit and miss. The top weights correspond to the hitmask and the bottom weights correspond to the miss mask. . . . . . . . . . . . 81
4-5 Hit-miss masks after feature learning corresponding to initial masks in Figure4-3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4-6 OWA weights after feature learning. . . . . . . . . . . . . . . . . . . . . . . . . 81
4-7 Final masks learned for the landmine data. Each row represents different feature.Row 1 positive, Row 2 positive, Row 3 negative, Row 4 negative. . . . . . . . . 82
4-8 Receiver operating characteristic curves comparing McFeaLHMM with twodifferent initializations to the standard HMM. . . . . . . . . . . . . . . . . . . . . 82
4-9 Left: ascending edge, flag edge, and descending edge sequences. Right: sequencesfrom noise background. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8
4-10 Hit masks after Gibbs feature learning. . . . . . . . . . . . . . . . . . . . . . . . 83
4-11 Result for 135-degree state after Gibbs feature learning with hit-miss masks. . 84
4-12 Hit-miss masks after Gibbs feature learning. . . . . . . . . . . . . . . . . . . . . 84
4-13 Result with shifted training images after Gibbs feature learning with hit-missmasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4-14 Hit-miss masks after Gibbs feature learning with shifted training images. . . . . 85
4-15 25 sample sequences extracted from mine images from dataset GPRTemp. . . 86
4-16 Hit masks after Gibbs feature learning with four-state HMM setting. . . . . . . . 86
4-17 Hit masks after Gibbs feature learning with three-state HMM setting. . . . . . . 87
4-18 Receiver operating characteristic curves comparing LSampFeaLHMM andSampHMM algorithms with the standard HMM algorithm. . . . . . . . . . . . . 87
4-19 18 samples for each digit from MNIST. . . . . . . . . . . . . . . . . . . . . . . . 88
4-20 Two samples for each digit to show zone splitting. . . . . . . . . . . . . . . . . 89
4-21 Hit masks and transition matrix after Gibbs feature learning for digits. . . . . . . 89
4-22 Ten human-made masks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4-23 HMMSamp vs DTXTHMM on GPR A1 at site S1 while clutter == mine. . . . . . 90
4-24 HMMSamp vs DTXTHMM on GPR A1 at site S1 while clutter ∼= mine. . . . . 90
4-25 HMMSamp vs DTXTHMM on GPR A2 at site S1 while clutter ∼= mine. . . . . 91
4-26 HMMSamp vs DTXTHMM on GPR A2 at site S2 while clutter ∼= mine. . . . . 91
4-27 HMMSamp vs DTXTHMM on GPR A2 at site S1 while clutter == mine. . . . . . 92
4-28 HMMSamp vs DTXTHMM on GPR A2 at site S2 while clutter == mine. . . . . . 92
9
LIST OF SYMBOLS, NOMENCLATURE, OR ABBREVIATIONS
EM expectation-maximization
GPR ground penetrating radar
HMM hidden Markov model
LDA linear discriminant analysis
LSampFealHMM loosely coupled sampling feature learning HMM
MCE minimum classification error
McFeaLHMM MCE feature learning HMM
MCMC Markov chain Monte Carlo
ML maximum-likelihood
OWA ordered weighted averaging
PD probability of detection
PFA probability of false alarm
ROC receiver operating characteristic
SampHMM sampling HMM
TSampFeaLHMM tightly coupled sampling feature learning HMM
10
Abstract of Dissertation Presented to the Graduate Schoolof the University of Florida in Partial Fulfillment of theRequirements for the Degree of Doctor of Philosophy
AUTOMATIC FEATURE LEARNING AND PARAMETER ESTIMATION FOR HIDDENMARKOV MODELS USING MCE AND GIBBS SAMPLING
By
Xuping Zhang
December 2009
Chair: Dr. Paul GaderMajor: Computer Engineering
Hidden Markov models (HMMs) are useful tools for landmine detection using
Ground Penetrating Radar (GPR), as well as many other applications. The performance
of HMMs and other feature-based methods depends not only on the design of the
classifier but also on the features. Few studies have investigated both classifiers and
feature sets as a whole system. Features that accurately and succinctly represent
discriminating information in an image or signal are very important to any classifier.
In addition, when the system that generated those original images has to change to
fit different environments, the features usually have to be modified. The process of
modification can be laborious and may require a great deal of application domain
specific knowledge. Hence, it is worthwhile to investigate methods of automatic feature
learning for purposes of automating algorithm development. Even if the discrimination
capability is unchanged, there is still value to feature learning in terms of time saved.
In this dissertation, two new methods are explored to simultaneously learn
parameters intended to extract features and learn parameters for image-based
classifiers. The notion of an image is general here. For example, a sequence of time or
frequency domain features. We have developed a generalized, parameterized model
of feature extraction based on morphological operations. More specifically, the model
includes hit-and-miss masks to extract the shape of interests in the images. In one
method, we use the minimum classification error (MCE) method with generalized
11
probabilistic descent algorithm to learn the parameters. Since our model is based
on gradient decent methods, the MCE method cannot guarantee a global optimal
solution and is very sensitive to initialization. We propose a new learning method
based on Gibbs sampling to learn the parameters. The new learning method samples
parameters from their individual conditional probability distribution instead to maximize
the probability directly. This new method is more robust to initialization, and can
generally find a better solution.
We also developed a new learning method based on Gibbs sampling to learn
parameters for continuous hidden Markov models with multivariate Gaussian mixtures.
Because hidden Markov models with multivariate Gaussian mixtures are commonly
used HMM models in applications, we propose a learning method based on Gibbs
sampling. The proposed method is empirically shown to be more robust than comparable
expectation-maximization algorithms.
We performed experiments using both synthetic and real data. The results show
that both methods work better than the standard HMM methods used in landmine
detection applications. Experiments with handwritten digits are also presented. The
results show that the HMM-model framework with the automatic learning feature
algorithm again performed better than the same framework with the man-made feature.
12
CHAPTER 1INTRODUCTION
1.1 Statement of Problem
A classifier is an algorithm that takes a feature vector as input and produces class
labels or confidences. Significant effort has been expended in the area of classifier
design and learning algorithms to determine parameters of classifiers. Less attention
has been focused, however, on the equally important problem of learning features
for classification. For most image- and signal-based classifier problems, experts use
their knowledge to find the features for the shapes of objects of interest in the images.
Despite this, these features may not be optimal, and human-based feature selection is
time-consuming, ad-hoc, and expensive. The problem addressed in this research is the
problem of automatically identifying and learning features for image-based classification.
1.2 Classification Model
The general classification problem (Figure 1-1) is difficult to investigate in its full
generality. Hence, the sub-problem involving a hidden Markov model (HMM) with
morphological features is considered since this methodology type is based on ad-hoc
methods that have shown promise in the past (Gader et al.; Zhao et al., 2003).
1.3 Overview of Research
The proposed approach involves developing a generalized, parameterized model of
feature extraction based on morphological or linear convolution operations. Two feature
learning algorithms are proposed to learn the parameters of the feature model and the
HMM. One algorithm is based on minimum classification error and the other is based on
Gibbs sampling. Additionally, a new method based on Gibbs sampling is proposed to
learn the parameters of continuous HMM with multivariate Gaussian mixtures.
The feature learning approach involves parameterizing the feature extraction
process. In landmine detection, the existing feature extraction methodology for the
HMMs proceeds by performing morphological operations on small windows of positive
13
and negative derivative images, so the degree to which a diagonal or anti-diagonal
shape fits within the window is computed. Specifically, the morphological operation
is an erosion, which can be calculated as a local minimum operation. These degrees
are aggregated using a maximum operation over a vertical column (Gader et al.). In
addition, linear convolution-based features will also be considered.
In one proposed algorithm, we define a minimum classification error (MCE)
objective function. The function depends not only on the HMM classification parameters,
but also on the feature extraction parameters. Optimizing this objective function over
both parameter sets simultaneously yields both feature extraction and classification
parameters. The feature model generalizes the morphological model using ordered
weighted average (OWA) operations. OWA operators are used to parameterize both
the morphological operations and the aggregation operations as they form a family of
operators that can represent maximum, minimum, and many other operators.
For another proposed algorithm, we apply a Bayesian framework to the feature
learning models. Rather than defining an objective function and maximizing that
function, we try to find the full probabilities distribution of parameters and data. By
sampling the parameters from the individual conditional probability distributions, we can
obtain better solutions. We use the Gibbs sampler as the tool to simulate the probability
distribution. Gibbs sampling is a common Markov Chain Monte Carlo (MCMC) sampling
method. It is a straightforward, powerful sampling method (Section 2.3).
Next, a new learning method for continuous HMM with multivariate Gaussian
mixtures is proposed. HMMs with multivariate Gaussian mixtures are widely used
models in many applications (Rabiner, 1989; Zhao et al., 2003). MCMC sampling
methods have the advantage of generally finding better optima than traditional methods,
such as expectation maximization algorithm (Dunmur and Titterington, 1997; Ryden
and Titterington, 1998). Although there are some learning methods proposed for HMM
based on MCMC sampling, these methods are either for a discrete HMM problem, or
14
for some uncommon HMM models, such as trajectory HMM models (Zen et al., 2006),
non-parametric HMM models (Thrun et al., 1999), or nonstationary HMM models (Zhu
et al., 2007). We propose a new learning method based on Gibbs sampling that focuses
on this specific HMM problem.
The rest of the dissertation is organized as follows. In Chapter 2, we review the
literature of various feature learning methods, HMM algorithms, and Gibbs sampling.
In Chapter 3, we present the three new learning algorithms. In Chapter 4, we show the
results of applying these algorithms. Conclusions and future work are in Chapter 5.
Data Features Classifier Class
Confidence
Figure 1-1. General classification model with diagram.
15
CHAPTER 2LITERATURE REVIEW
2.1 Feature Learning
What is a feature? We give the definition of features here. A signal X is a function
of N variables with real or complex or vectors of real or complex values. A feature
is a real or complex value calculated from a signal X. Through mapping function
f : X → Cn, features act as representatives of signals for later processing. Feature
learning defines or finds the map function to get the appropriate data representatives
with respect to the goal of later processing. Feature learning is the focus of research
in many applications (Belongie et al., 1998; Gader and Khabou, 1996; Guyon and
Elisseeff, 2003; Tamburino and Rizki, 1989a; Yu and Bhanu, 2006). Good features
result in better classification accuracy. The classifiers with better features are fast
to compute, and are more cost effective. In addition, better features help humans
understand the underlying process of data generation. For instance, identifying the
genes responsible for certain diseases can help humans to better understand the cause
of some cancers (Lee et al., 2003). We will give an overview of the most commonly used
algorithms in feature extraction in this section. Feature learning can involve selecting a
subset of features from a large set of candidates, learning or estimating parameters of
parameterized operations, or selecting operators from a set of candidates and learning
parameters. The first is referred to as feature selection. The focus of this research is the
second category, learning or estimating parameters of parameterized operations. We
provide reviews of all three types.
2.1.1 Feature Selection
Many applications already have tens, hundreds, or more variables ready to
represent the data itself, whether they are available directly from hardware output or
defined by the expert. One way to obtain the data representatives is to directly select
a subset of variables as features. Generally, there are three major feature selection
16
methods (Blum and Langley, 1997; Guyon and Elisseeff, 2003; Guyon et al., 2004; Liu
and Motoda, 2007): the filter, the wrapper, and the embedded method.
The filter method is independent of later processing. The simplest approach is to
rank the features via some criteria. The rank criteria can be the correlation between the
variables and the labels (Guyon and Elisseeff, 2003), or the mutual information between
variable and the labels (Globerson and Tishby, 2003). The high rank variables are
selected as the final data representatives. The algorithm is fast and easy to implement
and has been successfully used in many cases (Rogati and Yang, 2002; Stoppiglia
et al., 2003). However, some high rank features can be redundant because they can
be highly correlated. Some low rank features may help to improve the performance of
classifiers when they are included (Stoppiglia et al., 2003).
Instead of ranking variables individually, another class of filter methods involves
ranking subsets of all the variables. Some rank criteria are based on mutual information,
such as the minimal-redundancy-maximal-relevance (mRMR) criterion method (Peng
et al., 2005). One variable is chosen in each step to increase the size of the feature
subset, then the mutual information D(S, c) of the label c and the subset S with the next
new variable is computed as D = 1|S|
∑xi∈S I(xi; c), and the mutual information R(S)
of variables in the subset S is computed as R = 1|S|2
∑xi,xj∈S I(xi; xj), where I(x, y) is
mutual information of variables x and y. The mRMR criterion (maxS(D(S, c) − R(S)))
is used to do this incremental search. The result is not a globally optimal solution with
respect to maximizing mutual information among all subsets, as exhaustive evaluation
is not possible, but the method does reduce the number of redundant features and may
select helpful features that are useless by themselves.
Another set-based filter method, based on forward orthogonal search, has been
proposed (Wei and Billings, 2007). An incremental search is conducted using the
squared-correlation between the two variables as the ranking criterion. Suppose we
have a data set of N data samples, and each sample has n feature candidates. The
17
goal is to select d features from these n feature candidates. At first we have a set of
n variables ~xi = [xi(1), ..., xi(N)]T for i = 1, . . . , n, where xi(k) is the i-th feature of
the k-th sample. Now given the current subset of variables that are from the current
best m features (m ¿ n), these variables are transformed into a group of orthogonal
accessory variables. For a new variable with a new feature candidate, the residue of
the variable over the projection of those orthogonal accessory variables is computed.
Then the average value of the squared-correlation between each variable in the current
subset and the new variable’s residues are ranked. The highest one is included in the
new subset. In this method, the efficient feature subsets are selected with clear physical
interpretation. However, the algorithm assumes there is a linear relationship between
variables and sometimes this assumption may not hold.
Since a filter method is independent of later processing, it may not improve the final
performance. On the other hand, wrapper methods incorporate later processing directly.
Wrapper methods select subsets of variables based on their effect on performance of
later processing (Guyon and Elisseeff, 2003).
Since any classifier or other learning algorithm can be used in later processing, the
wrapper method is powerful when applied to the selection of features. However, a filter
method is usually faster than a wrapper method, because wrapper methods need to
perform learning algorithms for every feature subset candidate. Therefore, an efficient
search strategy is needed. Greedy search strategies are commonly used in wrapper
methods.
For supervised learning, the class label is given. It is natural to use the classification
performance to evaluate the relevance of feature sets (Najjar et al., 2003). Unsupervised
learning, which usually involves clustering, is not as straightforward. Instead of
classification errors, other criteria are used, such as maximum likelihood (ML), scatter
separability (Dy and Brodley, 2004), or a discriminant criterion (Roth and Lange, 2004).
ML criterion maximizes the likelihood of the data given the model (feature set and
18
parameters). Scatter separability criterion uses a within-class scatter matrix and a
between-class scatter matrix to measure separation of clusters. Discriminant criterion
uses linear discriminant analysis (LDA) to find the optimal solution.
Embedded methods: Embedded methods are similar to wrapper methods, except
that embedded methods perform feature selection in the process of training. They are
usually specific to a given learning algorithm such as finding features for support vector
machines (SVMs) (Weston et al., 2001).
2.1.2 Feature Weighting
Sometimes we do not explicitly select a subset of features, but assign a real-valued
number to each variable. The value represents the degree to which the variable is
relevant or important. Feature selection can be thought of as a special case of feature
weighting in which the weights are 0 or 1. However, the ability to permit the weights
to vary continuously allows for a wider variety of techniques. Perhaps the best known
algorithms for feature weights are the Winnow algorithm (Littlestone, 1987) and the
RELIEF algorithm (Kira and Rendell, 1992).
The Winnow algorithm was developed to update the weights in a multiplicative
manner. The idea of the Winnow algorithm is to update the weights by presenting the
positive and negative examples iteratively. Given an example denoted by (x1, ..., xn) and
the weights denoted by (w1, ..., wn), where n is the number of features, the algorithm
predicts 1 if w1x1 + ... + wnxn > θthreshold, otherwise it predicts 0. Then, in each iteration,
the weights are updated if the prediction of the algorithm is incorrect. If the algorithm
predicts a negative value for the positive example, the value of wi is increased by the
scale of a promotion parameter for each xi equal to 1. If the algorithm predicts a positive
value for a negative example, the value of wi is decreased by the scale of a demotion
parameter for each xi equal to 1. The promotion and demotion parameters are set by
experiments. The Winnow algorithm is not difficult to implement, and it scales well to
19
high dimensional space, but the convergence of the algorithm is only guaranteed for
linearly separable data (Golding and Roth, 1999).
The RELIEF algorithm (Dietterich, 1997) estimates feature weights iteratively
according to their ability to discriminate between neighboring patterns. In each iteration,
a pattern x is randomly selected. Then the two nearest neighbors of x are found. One is
from the same class as x (termed the nearest hit, denoted by NH(x)) and the other is
from a different class (termed the nearest miss, denoted by NM(x)). The weight of the
i-th feature is then updated as: wnewi = wold
i + |x(i) − NM(i)(x)| − |x(i) − NH(i)(x)|. It is
proven (Sun, 2007) that RELIEF is an online algorithm that solves a convex optimization
problem with a 1-Nearest Neighborhood objective function. Therefore, the RELIEF
algorithm performs better as a nonlinear classifier to search for informative features
compared with filter methods. In addition, it can be implemented very efficiently, as no
exhaustive search is applied. This makes it suitable for large-scale problems. However, it
calculates nearest neighbors in the original feature space rather than in weighted feature
space, which hurts its performance. Moreover, it is not robust with respect to outliers.
The IRELIEF algorithm (Sun, 2007) was proposed to improve the RELIEF algorithm.
It calculates the probabilities of data points in NM(x) and NH(x), and represents the
probability that a data point is an outlier as a binary random hidden variable. It updates
the weights following the principle of the expectation-maximization (EM) algorithm.
The result shows that IRELIEF improves the RELIEF algorithm because it is robust
against mislabeling noise, and is able to find useful features. Because it follows the
EM algorithm, the proper choice of tuning parameters is important to achieve good
performance.
2.1.3 Sparsity Promotion
Sparsity promotion is very important in feature learning. It can control the
complexity of learning functions and can avoid over fitting, yet achieve good generalization.
Moreover, a sparse model is easy to implement, easy to interpret, more stable, and
20
more robust to noise, but it increases the complexity of computation, and sometimes the
optimal function is analytically unsolvable.
Penalization-based or regularization methods are common sparsity promotion
techniques. These methods rely on minimizing a penalty term applied to a set of
parameters. The L2 norm is a very commonly used penalty term. Suppose we promote
the sparsity over a set of n parameters w1, ..., wn. The penalty is defined as α∑
i w2i ,
where α is a decay constant, also termed a weight decay penalty. In a linear model, this
form of weight decay penalty is equivalent to ridge regression. It is good at controlling
model complexity by shrinking all coefficients toward zero, but they all stay in the model,
since it is rare that parameters go to zero.
The L0 norm is defined as the number of nonzero parameters. The L0 norm penalty
is simple to apply, and it promotes sparsity directly (Wipf and Rao, 2005). However,
in general, solving an optimization problem with an L0 norm penalty is an NP-hard
problem; thus convex relaxation regularization by the L1 norm is often considered
(Mørup et al., 2008; Wolf and Shashua, 2003). The L1 norm is defined as∑
i |wi|. The
least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) imposes
the L1 norm constraint on the parameters of a problem. It shows that the L1 norm
constraint is equivalent to assuming the parameters have Laplace priors and the L2
norm constraint is equivalent to assuming the parameters have Gaussian priors. Since
Laplace functions quickly peak at zero, the tail of Laplace functions drops slowly, and the
L1 norm constraint would push the parameters to either zero or large values.
The Laplace prior is commonly used for sparsity promotion in the Bayesian
approach (Williams, 1995). Because there is a computational difficulty due to non-
differentiability of the Laplace function at the origin, an alternative hierarchical formulation
was proposed (Figueiredo, 2003), where it is shown that a zero-mean Gaussian prior
is equivalent to a Laplace prior when the variance of the Gaussian prior has a regular
21
exponential distribution. This model has good computability, and promotes great sparsity
(Krishnapuram, 2004; Krishnapuram et al., 2004).
There are other penalties used to promote sparsity. Instead of selecting a subset
of features, the minimum message length (MML) criterion is used to estimate a set of
real-valued (usually in [0, 1]) quantities (one for each feature), which are called feature
saliencies. An MML penalty is adopted to prevent all the feature saliencies from reaching
their maximum possible value (Law et al., 2004; Mackey, 2003). The MML criterion is given
(Figueiredo and Jain, 2000) by

− log p(θ) − log p(Y|θ) + (1/2) log |I(θ)| + (c/2)(1 + log(1/12)),

where Y is the set of data samples, θ is the set of parameters of the model, c is the dimension of
θ, and I(θ) = −E[D²_θ log p(Y|θ)] is the Fisher information matrix (the negative expected
value of the Hessian of the log-likelihood). The MML criterion encourages the saliencies
of irrelevant features to go to zero, and allows the method to prune the feature set.
However, the Fisher information matrix used in the MML criterion is very difficult to
obtain analytically. Approximate methods are usually needed to find the optimal solution
(Figueiredo and Jain, 2000).
2.1.4 Information Theoretic Learning
Information theory is a mathematical theory that originated in the study of the
communication process. Information theoretic learning (ITL) (Mackey, 2003; Principe
et al., 1998, 2000) applies its concepts to machine learning. The notions of entropy
and mutual information are needed to pose and solve optimization problems
with information theoretic criteria.
Here we review the expressions for entropy and mutual information according
to Shannon (1948) and Kullback and Leibler (1951). Shannon's entropy is defined as
H(y) = −∫ p(y) log p(y) dy, where p(y) is the PDF of the random variable y. Although
Shannon's entropy is the only one that possesses all the postulated properties for an
information measure, other forms, such as Renyi's quadratic entropy, H_R(y) = −log ∫ p²(y) dy, are
equivalent with respect to entropy maximization. The conditional entropy of a random
variable x given a random variable y is defined as H(x|y) = −∫∫ p(x, y) log p(x|y) dx dy.
The mutual information between random variables x and y is I(x, y) = H(x) − H(x|y),
which can be written as I(x, y) = ∫∫ p(x, y) log [p(x, y) / (p(x)p(y))] dy dx. The mutual information can
also be seen as the Kullback-Leibler divergence between p(x, y) and p(x)p(y).
In general, for two probability densities f(y) and g(y), this divergence is written as
K(f, g) = ∫ f(y) log [f(y) / g(y)] dy.
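These definitions can be checked numerically on a small discrete joint distribution; the 2x2 joint below is an illustrative example, not data from the cited works:

```python
import math

# Numerical sketch of the definitions above on a discrete joint p(x, y).
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (a, _), p in pxy.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in pxy.items() if b == y) for y in (0, 1)}

def H(dist):
    # Shannon entropy H = -sum p log p (natural log, discrete case)
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Mutual information as the KL divergence between p(x,y) and p(x)p(y)
I = sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in pxy.items())

# Check the identity I(x, y) = H(x) + H(y) - H(x, y)
print(abs(I - (H(px) + H(py) - H(pxy))) < 1e-12)  # True
```

The positive value of I reflects the dependence built into the joint (mass concentrated on the diagonal), and the identity ties the mutual information back to the entropy definitions.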
Mutual information is commonly used in feature selection or feature extraction
methods (Hild II et al., 2006; Leiva-Murillo and Artes-Rodriguez, 2007; Torkkola, 2003).
These methods usually train feature extractors by maximizing an approximation of
mutual information between the class labels and the output of the feature extractor.
Methods differ in the entropy form, the computational formula for mutual
information, and the maximization method used. Hild II et al. (2006) presented
a method using a nonparametric estimate of Renyi's entropy to learn features and
train the classifier. They use Parzen windows to estimate the probability density p(x)
of data X, where the density of X is estimated as a sum of spherical Gaussians, each
centered at a sample x_i. So, p(x) ≈ (1/N) ∑_{i=1}^{N} G(x − x_i, σ²I), where N is the number
of samples and G(x, σ²I) is a Gaussian kernel with diagonal, isotropic covariance.
The entropy estimator H₂(x) is given by H₂(x) ≈ −log (1/N) ∑_i G(x_i − x_{i−1}, 2σ²I).
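A minimal 1-D sketch of this kind of Parzen-based entropy estimate follows, using the consecutive-sample form quoted above; the kernel width and the synthetic data are illustrative choices, not values from the cited work:

```python
import math, random

def gauss(d, var):
    # 1-D Gaussian kernel value at difference d with variance var
    return math.exp(-d * d / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def renyi_entropy_estimate(xs, sigma):
    # Parzen-window estimate of Renyi's quadratic entropy using
    # consecutive-sample differences, as in the estimator quoted above.
    n = len(xs)
    s = sum(gauss(xs[i] - xs[i - 1], 2.0 * sigma ** 2) for i in range(1, n))
    return -math.log(s / (n - 1))

random.seed(0)
narrow = [random.gauss(0.0, 0.5) for _ in range(500)]
wide = [random.gauss(0.0, 2.0) for _ in range(500)]
# A more spread-out density should yield a larger entropy estimate
print(renyi_entropy_estimate(narrow, 0.3) < renyi_entropy_estimate(wide, 0.3))  # True
```

The estimator correctly ranks the wider distribution as higher-entropy, which is the property such ITL criteria exploit when training feature extractors.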
Torkkola (2003) presented a method for learning discriminative feature transforms.
Instead of a commonly used mutual information measure based on Kullback-Leibler
divergence, a quadratic divergence measure is used, defined as D(f, g) = ∫ (f(y) − g(y))² dy.
A linear feature extraction method for classification, based on maximization of the
mutual information between the extracted features and the classes, was proposed by
Leiva-Murillo and Artes-Rodriguez (2007). They use a Gram-Charlier expansion to
estimate the probability density of the data. The entropy h(z) is computed as
h(z) = h_G(z) − J(z), where h_G(z) ≈ log((2πe)^{1/2} σ_z) is the entropy under a Gaussian
assumption and J(z) is the negentropy, approximated as

J(z) ≈ k₁ (E{z exp(−z²/2)})² + k₂ (E{exp(−z²/2)} − √(1/2))²,

with constants k₁ and k₂.
Experimental results (Leiva-Murillo and Artes-Rodriguez, 2007) show that mutual
information-based methods can outperform existing supervised feature extraction
methods and require no prior assumptions about the data or class densities. However,
since these methods usually rely on nonparametric density estimation, such as Parzen
windows, they require large data sets and incur high computational cost. They are also
sensitive to the choice of window size and do not work well on high-dimensional data.
2.1.5 Transformation Methods
Feature extraction via linear transform uses the following methodology. A matrix
T is used to transform original vector ~x into vectors ~y = T~x. The vector ~y may be
lower-dimensional than ~x, in which case ~y is usually chosen as the feature vector. For
other transforms, ~y may have the same dimensionality as ~x. If so, another schema
is used to select elements of ~y as features. Here we review some commonly used
transformations: principal component analysis (PCA), singular value decomposition
(SVD), Fisher linear discriminant analysis (FLDA), independent component analysis (ICA),
the discrete Fourier transform (DFT), the discrete cosine transform (DCT), and the Gabor filter.
Karhunen-Loève Transform (or PCA). PCA is the best known linear transformation
method (Burges, 2004; Smith, 2002; Torkkola, 2003; Wang and Paliwal, 2003; Yang
and Yang, 2002). The PCA method uses the eigenvectors corresponding to the
largest eigenvalues of the covariance matrix of the data as the transform matrix.
After transformation, it generates mutually uncorrelated features. This transformation
is optimal in terms of minimal mean-square error between the original data and the
data reconstructed from the features. It also maximizes the mutual information between
the original data vector and its feature representation when the data follow a Gaussian
distribution, but it can be shown that it is not well suited to unsupervised clustering (Yeung
and Ruzzo, 2001), as shown in Figure 2-1.
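As a small illustrative sketch (synthetic data, pure Python), the top principal component can be found by power iteration on the sample covariance matrix, which is equivalent to taking the eigenvector of the largest eigenvalue:

```python
import random

def top_principal_component(data, iters=200):
    # Estimate the dominant eigenvector of the 2x2 sample covariance
    # of 2-D data by power iteration (PCA's leading direction).
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cxx = sum((x - mx) ** 2 for x, _ in data) / n
    cyy = sum((y - my) ** 2 for _, y in data) / n
    cxy = sum((x - mx) * (y - my) for x, y in data) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

random.seed(1)
# Synthetic data stretched along the line y = x, with small isotropic noise
pts = [(t + random.gauss(0, 0.1), t + random.gauss(0, 0.1))
       for t in [random.uniform(-3, 3) for _ in range(400)]]
v = top_principal_component(pts)
print(abs(abs(v[0]) - abs(v[1])) < 0.05)  # direction close to (1,1)/sqrt(2): True
```

The recovered direction aligns with the axis of maximum variance, which is exactly the reconstruction-optimal but not necessarily discrimination-optimal behavior discussed above.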
There are some variations of PCA to improve the performance or to fit different
learning frameworks. The probabilistic principal component analysis (PPCA) method
(Tipping and Bishop, 1998) modifies the original PCA to fit it into a Bayesian framework.
PPCA introduces a zero-mean Gaussian latent variable ~y into the regular
PCA model, such that ~x = W~y + ~µ + ~ε, where the vector ~x is the observation, ~µ is the
vector parameter representing the mean of the data, ~y ∼ N(0, I), and ~ε ∼ N(0, σ²I).
Given the model, the EM algorithm is used to find the optimal transform matrix to
maximize the likelihood of the observations. EM-PCA (Roweis, 1998) uses similar
ideas. PPCA and EM-PCA share the advantages of PCA, since they also find
informative, uncorrelated features, and they can assign low probabilities to
outliers far from most of the data. Unfortunately, they also share PCA's disadvantages:
they are not good at finding features that are optimal for classification. Another
shortcoming of these two methods is that PPCA and EM-PCA are batch algorithms
(Choi, 2004).
Informed PCA (Cohn, 2003) is another variation of PCA that incorporates the
information of labels or categories into the definition of the transformation. PCA only
penalizes according to squared distance of an observation from its projection. Informed
PCA is based on the assumption that if a set of observations Si = {x1, x2, . . . , xn} are in
the same class i, then they should share a common source. For a hyperplane H defined
by the orthogonal matrix C, which consists of the eigenvectors of the covariance matrix
of Si, the maximum likelihood source is the mean of Si’s projections onto H, denoted
by Si. If we denote xj as the projection of the jth observation by the transform matrix
C, the likelihood should be penalized not only based on the variance of observations
around their projections (∑
j ||xj − xj||2), which is same as PCA, but also on the variance
of the projections around their set means (∑
i
∑xj∈Si
||xj − Si||2). With a trade-off hyper
25
parameter β, the square error term is Eβ = (1−β)∑
j ||xj− xj||2 +β∑
i
∑xj∈Si
||xj− Si||2.The EM algorithm is used to find the optimal transform matrix C. Informed PCA uses
the label information to get better features for classification performance than PCA,
but it assumes that clusters are compact, which is not always true. The trade-off
hyperparameter has to be carefully tuned to achieve good performance, as shown in Figure
2-2.
Singular Value Decomposition (SVD). The SVD method (Wall et al., 2003) has
the same goal as PCA in that it finds projections that minimize the squared error in
reconstructing original data. It calculates the eigenvectors of the covariance matrix of
the original data by singular value decomposition (X = USV T , where U and V are the
orthogonal matrix, S is the diagonal matrix, which has nonzero diagonal elements.). It
has more efficient algorithms available than PCA to find the eigenvectors, and some
implementations find just the top N eigenvectors. However, it is still computationally
expensive in the case of high dimension data. It has the same disadvantages as PCA in
that it is not accurate in finding optimal features for classification performance.
Fisher Linear Discriminant Analysis (FLDA). The Fisher linear discriminant
transformation method is a commonly used linear transformation method for supervised
learning (Chen and Yang, 2004; Petridis and Perantonis, 2004; Yang and Yang, 2003;
Zhao et al., 2006). It is optimally discriminative for certain cases (Torkkola, 2003). FLDA
finds the eigenvectors of C = S_w⁻¹ S_b, where S_b is the between-class covariance matrix
and S_w is the sum of the within-class covariance matrices. The matrix S_w captures the
compactness of each class, and S_b represents the separation of the class means.
Eigenvectors corresponding to the largest eigenvalues of C form the columns of the
transform matrix W. New discriminative features y are derived from the original data
x by y = Wᵀx. FLDA performs best when each class has a Gaussian density and the
class means are well separated. In addition, since the Fisher criterion is not directly
related to classification accuracy, it is not optimal in terms of classification error.
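For the two-class case, the FLDA direction reduces to w = S_w⁻¹(m₁ − m₂); the sketch below illustrates this on synthetic 2-D data (classes separated along x, with large within-class spread along y), using a closed-form 2x2 inverse:

```python
import random

def mean2(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def scatter2(pts, m):
    # Within-class scatter entries (sxx, sxy, syy) around mean m
    sxx = sum((p[0] - m[0]) ** 2 for p in pts)
    syy = sum((p[1] - m[1]) ** 2 for p in pts)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
    return sxx, sxy, syy

random.seed(2)
c1 = [(random.gauss(0, 1), random.gauss(0, 3)) for _ in range(300)]
c2 = [(random.gauss(4, 1), random.gauss(0, 3)) for _ in range(300)]
m1, m2 = mean2(c1), mean2(c2)
a1, a2 = scatter2(c1, m1), scatter2(c2, m2)
sxx, sxy, syy = a1[0] + a2[0], a1[1] + a2[1], a1[2] + a2[2]
det = sxx * syy - sxy * sxy
dx, dy = m1[0] - m2[0], m1[1] - m2[1]
# w = Sw^{-1} (m1 - m2), using the closed-form inverse of a 2x2 matrix
w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
w = (w[0] / norm, w[1] / norm)
print(abs(w[0]) > 0.95)  # discriminant direction dominated by the x axis: True
```

Because the within-class scatter is much larger along y, FLDA down-weights that axis and projects almost entirely onto x, the direction that actually separates the class means.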
Independent Component Analysis (ICA). ICA is a relatively recent method. The
goal is to find a linear representation of non-Gaussian data so that the components
of the representation are statistically independent or as independent as possible.
Such a representation seems to capture the essential structure of the data in many
applications, including feature extraction and signal separation (Hyvarinen and Oja,
2000). The fundamental restriction in ICA is that the independent components must
be non-Gaussian, since the key to estimating the ICA model is non-Gaussianity.
To use non-Gaussianity in ICA estimation, there should be a quantitative measure
of non-Gaussianity of a random variable. The classical measure of non-Gaussianity is
kurtosis, or the fourth-order cumulant. Another very important measure of non-Gaussianity
is given by negentropy. Negentropy is based on the information-theoretic quantity
of differential entropy. Because a fundamental result of information theory is that a
Gaussian variable has the largest entropy among all random variables of equal variance,
entropy could be used as a measure of non-Gaussianity.
ICA can also be considered a variant of projection pursuit (Huber, 1985). Projection
pursuit is a technique developed in statistics for finding the most “interesting” projections
of multidimensional data. Some researchers (Huber, 1985; Jones and Sibson, 1987)
argued that Gaussian distribution is the least interesting one, and that the most
interesting directions are those that show the least Gaussian distribution. By computing
the non-Gaussian projection pursuit directions, the independent components can be
estimated; this is the link between projection pursuit and ICA. However, if the
non-Gaussianity assumption does not hold for the data, ICA does not work.
Discrete Fourier Transform (DFT) and Discrete Cosine Transform (DCT). One
of the most frequently used transformations in signal/image processing is DFT (Wang
et al., 2007). Given a data sample f(x), x = 0, 1, ..., N−1, the discrete Fourier
transform pair is

F(u) = (1/N) ∑_{x=0}^{N−1} f(x)[cos(2πux/N) − j sin(2πux/N)], u = 0, 1, ..., N−1,

f(x) = ∑_{u=0}^{N−1} F(u)[cos(2πux/N) + j sin(2πux/N)], x = 0, 1, ..., N−1.

It transforms the original spatial or time domain data into the frequency domain. It uses
fixed basis vectors, and thus has low computational cost. Other linear transformation
methods in image processing, such as the DCT (Saha, 2000), use cosine functions as
basis functions. The discrete cosine transform pair is

D(u) = α(u) ∑_{x=0}^{N−1} f(x) cos[(2x+1)uπ / (2N)],

f(x) = ∑_{u=0}^{N−1} α(u) D(u) cos[(2x+1)uπ / (2N)],

where α(0) = √(1/N) and α(u) = √(2/N) for 0 < u < N. The discrete wavelet transform (DWT) (Nanavati and Panigrahi, 2005; Saha,
u < N . The discrete wavelet transform (DWT) (Nanavati and Panigrahi, 2005; Saha,
2000) is another widely used transform in image processing. It uses wavelets as basis
functions. Wavelets are functions of limited duration, and have an average value of
zero. These basis functions are obtained from a single prototype function by dilations,
or contractions (scaling) and translations (shifts). The coefficients of these basis
vectors/functions are used as the representatives of original signals, such as F (u)
and D(u). Because these basis functions, such as cosine functions, have compact
energy on low frequencies, and natural images have mostly low-frequency features,
images can be represented by a small number of coefficients without much loss of
information. They have good information packing properties for signal compression
and reconstruction. One disadvantage of the DFT and DCT is that their basis functions are
periodic continuous functions, such as sinusoids, so they may not be good at generating
more localized features such as edge information. One disadvantage of the DWT
(Nanavati and Panigrahi, 2005) is the problem of selecting basis functions for a given
application, because a particular wavelet is suited only to a particular purpose.
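The DCT pair quoted above can be verified directly: with the α(u) normalization, applying the inverse transform to the forward transform recovers the original samples (the test signal below is arbitrary):

```python
import math

def dct(f):
    # Forward DCT with the alpha(u) normalization quoted in the text
    N = len(f)
    def alpha(u):
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
    return [alpha(u) * sum(f[x] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                           for x in range(N))
            for u in range(N)]

def idct(D):
    # Inverse DCT, summing alpha(u) D(u) cos((2x+1)u*pi / 2N) over u
    N = len(D)
    def alpha(u):
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
    return [sum(alpha(u) * D[u] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                for u in range(N))
            for x in range(N)]

f = [1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 3.0, 1.0]
g = idct(dct(f))
print(all(abs(a - b) < 1e-9 for a, b in zip(f, g)))  # exact round trip: True
```

This normalization also makes the transform orthonormal, so signal energy is preserved across the transform, which is why truncating small coefficients loses little information.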
Gabor Filter. Gabor (1946) formulated an oriented bandpass filter that achieves
the optimal compromise in the uncertainty relation between spatial and spatial-frequency
localization. Gabor functions g(x, y) = s(x, y) w_r(x, y) are defined as the
product of a radially symmetric Gaussian function w_r(x, y) and a complex sinusoidal
wave function s(x, y) (Movellan, 2006; Rizki et al., 1993). The complex sinusoid
is defined as s(x, y) = exp(j(2π(u₀x + v₀y) + P)), where (u₀, v₀) and P define the
spatial frequency and the phase of the sinusoid, respectively. A distinct advantage of
Gabor functions is their optimality in time and frequency, or in two dimensions, space
and spatial-frequency. They can provide the smallest possible pieces of information
about time-frequency events. Any well-behaved function can be represented as a linear
combination of Gabor functions. With properly chosen filter parameters, Gabor filters
have characteristics similar to Gabor functions (Kyrki et al., 2004). Gabor filter
features are useful for texture analysis, as they have tunable orientations, radial frequency
bandwidths, and center frequencies (Greenspan et al., 1991; Yu and Bhanu,
2006). A disadvantage of the Gabor transform is that the outputs of Gabor filter
banks are not mutually orthogonal, so the extracted features may be correlated. In
addition, Gabor filters usually require proper tuning of the filter parameters.
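A real-valued Gabor kernel can be built directly from the definition above, as a radially symmetric Gaussian envelope times the cosine (real) part of the sinusoid s(x, y); all parameter values here (sigma, u₀, v₀, P) are illustrative:

```python
import math

def gabor_kernel(size, sigma, u0, v0, P=0.0):
    # Real part of a Gabor function: Gaussian envelope * cos carrier.
    # (u0, v0) set the spatial frequency, P the phase; all values illustrative.
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * (u0 * x + v0 * y) + P)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

k = gabor_kernel(size=9, sigma=2.0, u0=0.25, v0=0.0)
# The envelope peaks at the center; the carrier makes the response
# oscillate along x, which is what gives the filter its orientation tuning.
print(k[4][4] == 1.0, k[4][4] > k[4][2])  # (True, True)
```

Convolving an image with a bank of such kernels at several orientations and frequencies yields the (generally correlated) Gabor features discussed above.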
2.1.6 Convolutional Neural Network and Shared-Weight Neural Network
Convolutional neural networks (CNN) provide an efficient method to constrain
the complexity of feedforward neural networks by weight sharing and restriction to
local connections. This network topology has been applied in particular to image
classification in order to avoid sophisticated preprocessing and to classify raw images
directly (Nebauer, 1998). It has been suggested that this topology is more similar to
biological networks based on receptive fields and improves tolerance to local distortions
(Nebauer, 1998). In addition, the number of weights and the complexity of models are
efficiently reduced by weight sharing. When images with high-dimensional input vectors
are presented directly to the network, this method can avoid explicitly defined feature
extraction and data reduction methods usually applied before classification (LeCun
et al., 1989, 1998). In other words, this method does feature extraction and classification
simultaneously. The disadvantages of this method are that the network can easily overfit,
that it is difficult to interpret the meaning of the inner nodes and the structure of the network,
and that it is not easy to incorporate prior knowledge into the network.
Convolutional networks usually have three architectural schemes: local receptive
fields, shared weights, and subsampling. Some degree of scale, shift, and distortion
invariance is accomplished by combining these three schemes. One typical
convolutional network for recognizing characters is LeNet-5 (LeCun et al., 1998).
The images of characters in the input layer are approximately size normalized and
centered first. Each node in a layer is connected to a set of nodes located in a small
neighborhood in the previous layer, and then all the weights are learned with back
propagation. Because the number of free parameters is reduced by the weight-sharing
technique, the “capacity” of the machine and the gap between test error and training
error are reduced (Chen et al., 2006; Garcia and Delakis, 2004).
A shared weight network (Gader et al., 1995; Porter et al., 2003) is another name
for a CNN and emphasizes the weight-sharing properties of networks. The network is
viewed as nonlinear combinations of linear filters.
2.1.7 Morphological Transform
Erosion (Tamburino and Rizki, 1989a,b), dilation, and hit-miss (Gader and Khabou,
1996; Zmuda and Tamburino, 1996) are commonly used morphological transforms in
image processing. Some other mathematical morphologies, known as granulometries,
are studied as well (Serra, 1983; Urbach et al., 2007).
Morphological transforms can be applied to neural networks, yielding morphological
neural networks: networks with a morphological feature extraction
layer. An example is described by Haun et al. (2000). Raw images in the input layer
are first undersampled to decrease computation intensity. Then hit/miss transforms are
used to map the pixels within the images to feature maps. Each feature map is produced
by one hit/miss weight matrix pair. The transforms are essentially the targets eroded
(hit) and the backgrounds dilated (miss). Assume both the hit matrix H and the miss matrix
M are 3×3 matrices. During the transform, a 3×3 window slides over the input
image. Given a 3×3 matrix I of pixels from the input image, where I₂₂ is the origin, a
difference matrix D is produced by D = I − H and a sum matrix S is produced by
S = I + M. The value f of the pixel at the origin in the feature map is then computed by
f = min D − max S. A regular feed-forward network operating on the feature maps is
used for the classification.
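The feature-map computation just described can be sketched as follows; H and M below are toy all-zero matrices (so the output degenerates to min − max over each window), chosen only to make the arithmetic easy to follow:

```python
def hit_miss(image, H, M):
    # Slide a 3x3 window over the image; at each position compute
    # D = I - H (erosion by the hit matrix) and S = I + M (dilation by
    # the miss matrix), and emit f = min(D) - max(S).
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            h = [H[i // 3][i % 3] for i in range(9)]
            m = [M[i // 3][i % 3] for i in range(9)]
            d = min(w - hh for w, hh in zip(window, h))
            s = max(w + mm for w, mm in zip(window, m))
            row.append(d - s)
        out.append(row)
    return out

H = [[0, 0, 0]] * 3  # toy hit matrix
M = [[0, 0, 0]] * 3  # toy miss matrix
img = [[0, 0, 0, 0],
       [0, 9, 9, 0],
       [0, 9, 9, 0],
       [0, 0, 0, 0]]
print(hit_miss(img, H, M))  # [[-9, -9], [-9, -9]]
```

With trained, nonzero H and M, the response is large (near zero) only where the window matches the target shape eroded by H against a background dilated by M.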
To improve the computation speed of morphological transforms, fast algorithms
have been studied for computing min, median, max, and other order statistic filter
transforms (Gil and Werman, 1993). An efficient, deterministic algorithm has also been
proposed to compute the one-dimensional dilation and erosion (max and min) sliding
window filters (Gil and Kimmel, 2002).
Morphological Share-weighted Neural Network (MSNN). MSNNs combine the
feature extraction capability of mathematical morphology with the function-mapping
capability of neural networks in a single trainable architecture (Gader et al., 2000;
Khabou and Gader, 2000; Khabou et al., 2000; Sahoolizadeh et al., 2008; Won and
Gader, 1995). It is a two-stage network, with a feature extraction stage followed by a
standard feed-forward classification stage. The feature extraction stage is composed of one
or more feature extraction layers, each composed of one or more feature maps.
Associated with each feature map is a pair of structuring elements, one for erosion
and one for dilation. The values of a feature map are the result of performing a hit-miss
operation with the pair of structuring elements on a map in the previous layer. The
values of the feature maps on the last layer are fed to the feed-forward classification
stage of the MSNN with gray-scale hit-miss transform (Gader and Khabou, 1996; Gader
et al., 1995; Haun et al., 2000).
2.1.8 Bayesian Nonparametric Latent Feature Model
A Bayesian nonparametric latent feature model (Ghahramani et al., 2007; Griffiths
and Ghahramani, 2005) is a flexible nonparametric approach to latent variable modeling
in which the number of latent variables is unbounded. This approach is based on a
probability distribution over equivalence classes of binary matrices with a finite number
of rows, corresponding to the data points, and an unbounded number of columns,
corresponding to the latent variables. Each data point can be associated with a subset
of the possible latent variables, which are referred to as the latent features of that data
point. The binary variables in the matrix indicate which latent feature is possessed by
which data point, and there is a potentially infinite array of features. The distribution over
unbounded binary matrices is derived by taking the limit of a distribution over N × K
binary matrices as K →∞.
2.2 Hidden Markov Model (HMM)
2.2.1 Definition and Basic Concepts
HMMs are stochastic models for complex, nonstationary doubly stochastic
processes with an underlying process (Markov chain) and another stochastic process
that produces time sequences of random observations according to the Markov chain.
At each observation time, the Markov chain may be in one of several states, and given
that the chain is in a certain state, there are probabilities of moving to other states.
These probabilities are called transition probabilities. The word “hidden” in HMM comes
from the fact that the states are hidden, or not observable. Given an observation vector
at a specific time, there are probabilities that the chain is in each state, and the
observations are described by probability density functions conditioned on the chain
being in the associated state. Thus, an HMM is characterized
by three sets of probability distributions: the transition probabilities, the state
probability density functions, and the initial probabilities.
The notation we use generally follows Rabiner (1989) and is as follows:
• T, the length of the observation sequence or state sequence, with the time instances denoted by t = 1, 2, ..., T.

• N, the total number of states in the model.

• S, the set of states, with S = {S_1, S_2, ..., S_N}.

• Q, a state sequence, with Q = {q_1, q_2, ..., q_T}.

• A, the state transition probability distribution, with A = {a_ij}.

• B, the set of emitting probability densities, with B = {b_j(x_t)}.

• π, the initial state distribution.

• x, the observation sequence, with x = x_1, x_2, ..., x_T.
An HMM is generally represented as a three-tuple Λ = (A, B, π).
2.2.2 Applications of the HMM
HMMs have been applied to many areas, such as speech recognition
(Rabiner, 1989), machine translation (Inoue and Ueda, 2003), and gene prediction
(Stanke and Waack, 2003). Of particular interest here are applications with images as
input: image classification (Ma et al., 2007), handwriting recognition, and mine detection
(Gader and Popescu, 2002; Gader et al.; Zhao et al., 2003).
2.2.3 Learning HMM Parameters
Expectation Maximization. The expectation maximization (EM) algorithm is the
most widely used algorithm to learn HMM parameters (Bilmes, 1997). The EM algorithm
aims to find maximum-likelihood (ML) estimates in settings where direct maximization is
difficult. The idea is to augment the given data into complete data for which ML
estimates are easy to obtain. The algorithm iterates over the given data, and each
iteration has two steps: an expectation (E) step followed by a maximization (M) step.
In the E step, the algorithm computes the expectation of the unknown (hidden) data
given the current model. In the M step, it computes an ML estimate over the resulting
complete data (the known data and the expectation of the hidden data). The EM
algorithm is guaranteed to converge to a local maximum (Prescher, 2004). Its
disadvantages are that it may not find the globally optimal solution and that it is
sensitive to initialization.
Discriminative Training. Discriminative training refers to a training algorithm
aimed at minimum classification error (MCE). It has been applied to estimate the HMM
parameters (Ma, 2004; Zhao et al., 2003). It has a loss function as an objective function.
The function is usually a sigmoid function of a misclassification measure. Gradient
descent is often used to optimize the objective function. We will use the MCE algorithm
in our research, so it is described in detail in the next section.
2.2.4 Minimum Classification Error (MCE) for HMM
An HMM is generally represented as a three-tuple Λ = (A, B, π). The log probability
of the observation sequence x along a given state sequence Q under a model is denoted
by g(x|Λ) and is computed as follows:

g(x|Λ) = log p(x, Q|Λ) = ∑_{t=1}^{T} [ log a_{q_{t−1} q_t} + log b_{q_t}(x_t) ] + log π_{q_0},   (2–1)

where D is the dimension of the observation x.
In this work, we use continuous HMMs with Gaussian mixture models representing
the emitting probability densities, which are therefore given by

b_j(x_t) = ∑_{k=1}^{K} c_jk b_jk(x_t),   (2–2)

where

b_jk(x_t) = (2π)^{−D/2} |R_jk|^{−1/2} exp{ −(1/2) ∑_{l=1}^{D} ((x_tl − μ_jkl) / σ_jkl)² },

and

c_jk, μ_jk, R_jk = diag(σ²_jk1, ..., σ²_jkD)

are the mixture proportion, mean, and covariance of the k-th Gaussian component,
respectively, of the Gaussian mixture density representing state j. Note that we use
diagonal covariance matrices.
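Equations 2–1 and 2–2 can be illustrated with a toy two-state continuous HMM with a single 1-D Gaussian per state (K = 1, D = 1); all parameter values below are made up for the example:

```python
import math

def log_gauss(x, mu, sigma):
    # Log density of a 1-D Gaussian (the K = 1, D = 1 case of b_jk)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) \
           - 0.5 * ((x - mu) / sigma) ** 2

pi = [0.6, 0.4]                   # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]      # transition probabilities
mus, sigmas = [0.0, 3.0], [1.0, 1.0]  # per-state emission parameters

def path_log_prob(obs, states):
    # g(x | Lambda) along a fixed state sequence Q, per Equation 2-1
    g = math.log(pi[states[0]]) \
        + log_gauss(obs[0], mus[states[0]], sigmas[states[0]])
    for t in range(1, len(obs)):
        g += math.log(A[states[t - 1]][states[t]])
        g += log_gauss(obs[t], mus[states[t]], sigmas[states[t]])
    return g

obs = [0.1, 0.2, 2.9, 3.1]
# A state path that matches the observations scores higher than one that does not
print(path_log_prob(obs, [0, 0, 1, 1]) > path_log_prob(obs, [1, 1, 0, 0]))  # True
```

In the full model, each log b term would be the log of the K-component mixture of Equation 2–2 rather than a single Gaussian; the structure of the sum is otherwise identical.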
To estimate the HMM parameters, we use the MCE method with generalized
probabilistic descent (GPD) (Ma, 2004). The goal of MCE training is to be able to
discriminate among samples of different classes correctly rather than to estimate the
distributions of each class accurately. The MCE objective function is a loss function
that depends on a misclassification measure. The misclassification measure for the
two-class problem used here is defined as follows:

d^(i)(x) = (−1)^i [ g(x|Λ^(1)) − g(x|Λ^(2)) ]  if x ∈ class i, for i = 1, 2,   (2–3)

where Λ = (Λ^(1), Λ^(2)). The loss function associated with the misclassification measure is
defined using the following sigmoid function:

l(x|Λ) = l(d(x)) = 1 / (1 + e^{−γ d(x) + θ}),   (2–4)

where γ and θ are predefined parameters.
We try to minimize the expected loss to optimize the performance of the classifier.
In standard MCE training, we seek to estimate the mixture proportions, means, and
covariances of the Gaussian mixtures, as well as the transition probabilities that
will minimize the average loss. To develop practical formulas, auxiliary variables are
explicitly and implicitly introduced:

μ_jkl = μ̃_jkl σ_jkl,    σ̃_jkl = log σ_jkl,
c_jk = e^{c̃_jk} / ∑_h e^{c̃_jh},    a_ij = e^{ã_ij} / ∑_h e^{ã_ih},   (2–5)

where the tilde denotes the transformed (auxiliary) parameter. Note that these relations
yield the following differential relations:

∂μ_jkl / ∂μ̃_jkl = σ_jkl,
∂c_js / ∂c̃_jk = c_jk(1 − c_jk) if s = k,  and  −c_jk c_js if s ≠ k,   (2–6)
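The softmax differential relation in Equation 2–6 can be verified numerically; the auxiliary values below stand in for one row of the c̃ variables and are arbitrary:

```python
import math

def softmax(z):
    # c_s = exp(z_s) / sum_h exp(z_h), computed stably
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

z = [0.2, -1.0, 0.7]   # illustrative auxiliary variables (one row of c-tilde)
c = softmax(z)
k, eps = 0, 1e-6

# Finite-difference derivative of each c_s with respect to z_k ...
zp = list(z)
zp[k] += eps
cp = softmax(zp)
numeric = [(cp[s] - c[s]) / eps for s in range(3)]
# ... versus the analytic form: c_k(1 - c_k) if s = k, else -c_k c_s
analytic = [c[k] * (1 - c[k]) if s == k else -c[k] * c[s] for s in range(3)]
print(all(abs(a - b) < 1e-5 for a, b in zip(numeric, analytic)))  # True
```

This reparameterization is what lets gradient descent update the auxiliary variables freely while the mixture proportions (and transition probabilities, analogously) remain valid, automatically normalized probabilities.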
which are used in the update formulas below. Similar differential relations hold for the
transition probabilities. Let px ≡ γl(x|Λ) (1− l(x|Λ)). Using these auxiliary variables and
relations, application of gradient descent leads to the following formulas to update HMM
model parameters:
∂l(x|Λ) / ∂μ̃_jkl = −p_x [ ∑_{t=1}^{T} δ(q_t = j) (1 / b_j(x_t)) ∂b_j(x_t)/∂μ̃_jkl ],   (2–7)

where ∂b_j(x_t)/∂μ̃_jkl = c_jk b_jk(x_t) ((x_tl − μ_jkl) / σ_jkl);

∂l(x|Λ) / ∂σ̃_jkl = −p_x [ ∑_{t=1}^{T} δ(q_t = j) (1 / b_j(x_t)) ∂b_j(x_t)/∂σ̃_jkl ],   (2–8)

where ∂b_j(x_t)/∂σ̃_jkl = c_jk b_jk(x_t) [ ((x_tl − μ_jkl) / σ_jkl)² − 1 ];

∂l(x|Λ) / ∂c̃_jk = −p_x [ ∑_{t=1}^{T} δ(q_t = j) (1 / b_j(x_t)) ∂b_j(x_t)/∂c̃_jk ],   (2–9)

where ∂b_j(x_t)/∂c̃_jk = ∑_{s=1}^{K} b_js(x_t) ∂c_js/∂c̃_jk;

∂l(x|Λ) / ∂ã_ij = −p_x [ ∑_{t=1}^{T} δ(q_{t−1} = i, q_t = j) − ∑_{t=1}^{T} δ(q_{t−1} = i) a_ij ].   (2–10)
2.3 Gibbs Sampling
Gibbs sampling (Casella and George, 1992; Sheng et al., 2005) is a Markov
chain Monte Carlo (MCMC) method (Neal, 1993) for estimating a joint distribution when
the full conditional distributions of all the random variables concerned are available.
Gibbs sampling has become a common alternative to the EM algorithm for solving
incomplete-data problems in a Bayesian context. It provides samples from which the
joint distribution of the hidden random variables and parameters can be estimated, and
hence estimates of the random variables themselves. Therefore, Gibbs sampling
may find a better solution than EM, which is prone to converging to local optima.

Given a joint density p(x_1, x_2, ..., x_K) for a set of random variables x_1, x_2, ..., x_K, it
is usually difficult to estimate or sample the marginal distributions directly, because the
marginal distribution p(x_i), for i = 1, ..., K, is computed using

p(x_i) = ∫ ... ∫ p(x_1, ..., x_K) dx_1 ... dx_{i−1} dx_{i+1} ... dx_K.   (2–11)
Instead, we sample each variable x_i from its full conditional distribution p(x_i | x_j, j ≠ i),
which is typically easy to sample from. Starting from initial values x_1^(0), ..., x_K^(0), the
Gibbs sampler draws samples of the random variables as follows:

x_1^(t+1) ∼ p(x_1 | x_2 = x_2^(t), ..., x_K = x_K^(t)),   (2–12)

x_2^(t+1) ∼ p(x_2 | x_1 = x_1^(t+1), x_3 = x_3^(t), ..., x_K = x_K^(t)),   (2–13)

...

x_i^(t+1) ∼ p(x_i | x_1 = x_1^(t+1), ..., x_{i−1} = x_{i−1}^(t+1), x_{i+1} = x_{i+1}^(t), ..., x_K = x_K^(t)),   (2–14)

...

x_K^(t+1) ∼ p(x_K | x_1 = x_1^(t+1), ..., x_{K−1} = x_{K−1}^(t+1)),   (2–15)

where t denotes the iteration index. Here X ∼ P(X|Y) denotes the process of drawing a
sample of X from the population defined by the conditional distribution P(X|Y).
It has been shown (Neal, 1993) that as t → ∞, the sample distribution of
(x_1, x_2, ..., x_K) converges to p(x_1, x_2, ..., x_K); equivalently, the distribution
of x_i^(t) converges to p(x_i) for i = 1, ..., K. Thus, by selecting some large value M,
the Gibbs sampler treats the samples x_1^(t), ..., x_K^(t) for t ≥ M as samples from
p(x_1, x_2, ..., x_K). The initial period before samples are collected is referred to as the
“burn-in” period. We can then calculate the expectation of a function f over the
distribution p(x_i) by Monte Carlo integration:

E_{p(x_i)}[f(x_i)] = ∫ f(x_i) p(x_i) dx_i ≈ (1/N) ∑_{t=M}^{M+N} f(x_i^(t)),   (2–16)

where t is the iteration index in the sampling process and N is the total number of
samples collected.
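A standard textbook illustration of this scheme is a zero-mean bivariate Gaussian with correlation ρ, where each full conditional is itself Gaussian: x_1 | x_2 ∼ N(ρ x_2, 1 − ρ²), and symmetrically for x_2. The sketch below (ρ and the run lengths are illustrative choices) uses the Monte Carlo estimate of Equation 2–16 with f(x_1, x_2) = x_1 x_2, whose expectation under the target is ρ:

```python
import math, random

random.seed(3)
rho = 0.8                       # target correlation (illustrative)
x1, x2 = 0.0, 0.0               # arbitrary starting point
burn_in, n_samples = 1000, 20000
cond_sd = math.sqrt(1 - rho ** 2)

samples = []
for t in range(burn_in + n_samples):
    x1 = random.gauss(rho * x2, cond_sd)  # draw x1 | x2
    x2 = random.gauss(rho * x1, cond_sd)  # draw x2 | x1
    if t >= burn_in:                      # keep only post-burn-in samples
        samples.append((x1, x2))

# Monte Carlo estimate of E[x1 * x2], which should approach rho
est = sum(a * b for a, b in samples) / len(samples)
print(abs(est - rho) < 0.1)  # True
```

After discarding the burn-in draws, the retained samples behave approximately as draws from the joint, so the plug-in average recovers the target correlation.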
Figure 2-1. The dashed line is the PCA projection, but the vertical dotted line represents the best projection to separate the two clusters.
Figure 2-2. The top plot has β close to zero to maximize the variation of the projections (horizontal axis) of all observations, and the bottom plot has β close to one to minimize the variation of the projections (vertical axis) of the observations in the same cluster.
CHAPTER 3
CONCEPTUAL APPROACH
3.1 Overview
The goal of this research is to devise and analyze HMM-based algorithms that
can simultaneously learn feature extraction and classification parameters. All the work
described here is performed for continuous HMMs.
Three new approaches are proposed: (1) simultaneous feature learning and
HMM training using an MCE algorithm, (2) HMM training using Gibbs sampling,
and (3) both loosely and tightly coupled feature learning and HMM training using
Gibbs sampling. Note that the second approach focuses only on estimating the HMM
parameters and not the feature parameters. Estimating the HMM parameters alone
allows for a focused analysis of the proposed novel technique. We refer to the first
two methods as McFeaLHMM and SampHMM, respectively (here FeaL is an acronym
replacing Feature Learning). The third method consists of two sub-methods, which
we refer to as TSampFeaLHMM and LSampFealHMM. Note that the feature models
used for McFeaLHMM are ordered weighted average (OWA)-based generalizations of
morphological operators, whereas those used by TSampFeaLHMM and LSampFeaLHMM
are convolutional models.
Results indicate that, while all feature learning models can achieve performance
similar to or better than that of a human, McFeaLHMM is very sensitive to initialization
and learning rates. Fortunately, TSampFeaLHMM and LSampFeaLHMM are much more
stable and can produce better solutions than McFeaLHMM in landmine detection
experiments. It may be possible to alleviate these problems for McFeaLHMM by
sampling from a posterior distribution based on the MCE loss function, but investigation
of that concept is left to future work. The feature models and training algorithms are now
described below in detail.
3.2 Feature Representation
Two feature models are described: the OWA-based generalized morphological
feature model and the convolutional feature model. The first model was inspired by the feature extraction used in state-of-the-art landmine detection. The second model is more suitable for sampling and results in faster operational models.
3.2.1 Ordered Weighted Average (OWA)-based Generalized Morphological Feature Model
OWA operators are used to parameterize both the morphological operations
and other aggregation operations. They form a family of operators that can represent
maximum, minimum, median, $\alpha$-trimmed means, and many other operators. Let $w = (w_1, w_2, \ldots, w_n)$ be a vector of real numbers constrained such that

$$\sum_{i=1}^{n} w_i = 1 \quad \text{and} \quad 0 \leq w_i \leq 1 \text{ for } i = 1, 2, \ldots, n. \tag{3–1}$$
Any weights satisfying these properties will be referred to as a set of OWA weights.
Let $f = \{f_1, f_2, \ldots, f_n\}$ be a multi-set of real numbers. The $i$-th order statistic of $f$ is $f_{(i)}$, where the parenthesized subscripts denote a permutation of the indices such that $f_{(1)} \leq f_{(2)} \leq \cdots \leq f_{(n)}$. The OWA operator on $f$ with weight vector $w$ is defined as

$$\mathrm{OWA}_w(f) = \sum_{i=1}^{n} w_i f_{(i)}. \tag{3–2}$$
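To make the definition concrete, here is a minimal OWA operator in Python; particular weight vectors recover the maximum, minimum, and median as special cases:

```python
import numpy as np

def owa(f, w):
    """OWA_w(f) = sum_i w_i * f_(i), where f_(1) <= ... <= f_(n) (Eq. 3-2)."""
    f_sorted = np.sort(np.asarray(f, dtype=float))   # order statistics of f
    w = np.asarray(w, dtype=float)
    assert w.min() >= 0 and np.isclose(w.sum(), 1.0), "w must satisfy Eq. 3-1"
    return float(np.dot(w, f_sorted))

vals = [3.0, 1.0, 2.0]
owa(vals, [0, 0, 1])   # all weight on the largest order statistic: the maximum
owa(vals, [1, 0, 0])   # all weight on the smallest: the minimum
owa(vals, [0, 1, 0])   # all weight on the middle: the median
```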
We also define OWAw as an OWA operator of size n where n is the number of
weights. OWA operators are used here to define general feature extractors. In our
context, a feature is defined to be a tuple consisting of the following:
1. A feature window size, N ×K.
2. Two sets of OWA weights $h$ and $m$ of size $NK$ and associated OWA operators that act on two-dimensional arrays $B$ as follows:

$$\mathrm{OWA}_h(B) = \sum_{i=1}^{NK} h_i b_{\sigma(i)}, \tag{3–3}$$

where $\sigma : \{1, \ldots, NK\} \to \{(n, k)\,|\,n = 1, \ldots, N;\ k = 1, \ldots, K\}$ is a bijection (a one-to-one and onto mapping) satisfying $b_{\sigma(1)} \leq b_{\sigma(2)} \leq \cdots \leq b_{\sigma(NK)}$.
3. Two N × K arrays called the hit mask and miss mask and denoted Gh and
Gm, respectively. The masks represent the geometric shapes of the features.
Consistent with standard practice in mathematical morphology, the hit mask is
a pattern that matches a foreground shape and the miss mask is a pattern that
matches a background shape of the features. The values of the arrays are either
binary, in the set $\{0, 1\}$, or non-binary, in the interval $[0, 1]$. The weight vectors $h$ and $m$ are associated with the hit and miss masks, respectively.
The OWA weights and mask values are the feature extraction parameters that
are learned in the training process. Features are computed over subwindows from the
images using neighborhood OWA operators.
A training set T consists of images from each class. The first step in feature
learning is feature initialization. Feature initialization proceeds by first collecting
subwindows from the training images. Various procedures can be used to select or
compute a small set of likely initial features from these subwindows. We will investigate
several of these procedures. The feature parameters are then updated together with the
standard HMM parameters. Several training algorithms, including MCE-based and Gibbs
sampling will be considered. Neighborhood updating methods that attempt to encourage
connected features will also be studied. Given an image, the feature extraction process
proceeds as follows. Let A be an image with pixel values in the interval [0,1] and size
larger than the window size. The feature extraction consists of two steps: first is applying
a masked, neighborhood OWA hit-miss operator to A, and second is aggregating the
result of the first step over the rows of A. More precisely, let Atk denote the N × K
subimage of $A$ with the upper left-hand corner located at row $t$ and column $k$. After we apply the mask and hit-miss operator to $A$, an image $D$ with the same size as $A$ is obtained. The value at row $t$ and column $k$ of $D$ is defined by

$$D(t, k) \equiv \min\left[\mathrm{OWA}_h\left(A_{tk} \circ G_h\right),\ \mathrm{OWA}_m\left((1 - A_{tk}) \circ G_m\right)\right], \tag{3–4}$$

where the symbol "$\circ$" denotes pointwise multiplication. Note that if the image and the masks are binary and if $h$ and $m$ correspond to minimum operations, then this operator is exactly the ordinary hit-miss operator from mathematical morphology. The final step in feature extraction is aggregating the outputs of the masked, neighborhood hit-miss operator by computing the maximum of each column of $D$:

$$x_k = \max_t\, D(t, k). \tag{3–5}$$
The result is a sequence of feature values indexed by k. A pictorial description (Zhang
et al., 2007) is shown in Figure 3-1.
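The two-step extraction of equations (3–4) and (3–5) can be sketched directly; this is an illustrative implementation (function and variable names are ours, not the dissertation's), with the OWA helper inlined:

```python
import numpy as np

def owa(v, w):
    # OWA: sort values ascending, then take the weighted sum (Eq. 3-3)
    return float(np.dot(w, np.sort(v)))

def hit_miss_features(A, Gh, Gm, h, m):
    """Masked neighborhood OWA hit-miss transform (Eq. 3-4) followed by a
    column-wise max over D (Eq. 3-5)."""
    N, K = Gh.shape
    rows, cols = A.shape[0] - N + 1, A.shape[1] - K + 1
    D = np.empty((rows, cols))
    for t in range(rows):
        for k in range(cols):
            W = A[t:t + N, k:k + K]                  # subimage A_tk
            hit = owa((W * Gh).ravel(), h)           # OWA_h(A_tk o G_h)
            miss = owa(((1.0 - W) * Gm).ravel(), m)  # OWA_m((1 - A_tk) o G_m)
            D[t, k] = min(hit, miss)
    return D.max(axis=0)                             # x_k = max_t D(t, k)
```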
3.2.2 Convolutional Feature Models
The two feature models presented here are modified slightly to fit the Gibbs approach and to produce more computationally efficient models.
Specifically, feature extraction is modeled as convolution with random perturbations or
error terms.
3.2.2.1 Feature model for loosely coupled sampling feature learning HMM (LSampFealHMM)
As shown in Figure 3-4, we are given $H$ $N \times K$ images. We transform each image to an $NK \times 1$ vector $A_i$, where $i = 1, \ldots, H$. To each image $A_i$, an $N \times K$ binary hit mask or an $N \times K$ ternary (values in $\{-1, 0, 1\}$) hit-miss mask $M_i$ is applied using convolution as follows. Let $L = NK$. We define the vector $B_i$ as

$$B_i = A_i \circ M_i + \zeta, \quad i = 1, \ldots, H, \quad \zeta \sim N\!\left(0, \sigma_\zeta^2 I_{L \times L}\right), \tag{3–6}$$

where the symbol "$\circ$" denotes pointwise multiplication, and $\zeta$ represents a zero-mean Gaussian perturbation with covariance matrix $\Sigma_\zeta = \sigma_\zeta^2 I_{L \times L}$. Note that we consider each
Mi to be a single realization of a random mask M . The random mask M can be thought
of as an $N \times K$ array of binary or ternary variables. Then the $k$-th element of $B_i$ can be denoted as

$$B_{ik} = A_{ik} M_{ik} + \zeta_k, \quad k = 1, \ldots, L, \quad \zeta_k \sim N(0, \sigma_\zeta^2). \tag{3–7}$$
Now we define $D_i$ as the sum over $B_i$ plus a zero-mean additive Gaussian perturbation with variance $\sigma_\eta^2$:

$$D_i = \sum_k B_{ik} + \eta, \quad \eta \sim N(0, \sigma_\eta^2). \tag{3–8}$$

We define one feature $x_t$ as the aggregation of the $D_i$ with additive zero-mean Gaussian noise $\varepsilon$:

$$x_t = \sum_i D_i + \varepsilon, \quad \varepsilon \sim N(0, \sigma_\varepsilon^2). \tag{3–9}$$

Now we assign the label $y_t$ to the feature $x_t$, given the threshold $\xi$, by

$$y_t = \begin{cases} 1, & \text{if } x_t > \xi \\ 0, & \text{if } x_t \leq \xi. \end{cases} \tag{3–10}$$
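Under these definitions, generating one feature and its label amounts to a few noisy sums. A hedged sketch (names are illustrative; each image and mask is assumed already flattened to length L):

```python
import numpy as np

def loose_feature(images, masks, xi, sig_zeta=0.1, sig_eta=0.1, sig_eps=0.1, seed=0):
    """Generative model of Eqs. 3-6 to 3-10: masked pixels plus Gaussian noise,
    summed per image (D_i), aggregated (x_t), then thresholded into y_t."""
    rng = np.random.default_rng(seed)
    D = []
    for A, M in zip(images, masks):                     # A, M: length-L vectors
        B = A * M + rng.normal(0.0, sig_zeta, A.size)   # Eq. 3-7
        D.append(B.sum() + rng.normal(0.0, sig_eta))    # Eq. 3-8
    x = sum(D) + rng.normal(0.0, sig_eps)               # Eq. 3-9
    return x, int(x > xi)                               # Eq. 3-10
```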
3.2.2.2 Feature model for tightly coupled sampling feature learning HMM (TSampFealHMM)
The feature model for TSampFealHMM will learn the feature and HMM parameters
in a single Bayesian probability framework. The previous feature model produced
observation sequences that were tightly coupled to columns of the input image. That
is, given input image A and associated observation sequence x1, . . . , xT , each xi
corresponds to a set of columns of A. By contrast, the feature model used in the
TSampFealHMM model produces observation sequences that can be associated with
subimages of A that can vary both horizontally and vertically within A.
As shown in Figure 3-6, when an $N_1 \times K_1$ image $A$ is given, the image is split into $T$ subimages $A_t$ of size $N_2 \times K_2$, $t = 1, \ldots, T$. These subimages are called zones. For each zone $A_t$, an $N \times K$ ternary $\{-1, 0, 1\}$ hit-miss mask $M_t$ is applied using convolution, similar to
that in LSampFealHMM. The mask is only applied at positions for which the mask fits in
the image. Assume that convolution over a zone $A_t$ produces $p$ values. Vectors $B_{ti}$ and $D_{ti}$ are defined as in LSampFealHMM. The $k$-th element of $B_{ti}$ can be denoted as

$$B_{tik} = A_{tik} M_{tk} + \zeta_k, \quad i = 1, \ldots, p, \quad \zeta_k \sim N(0, \sigma_\zeta^2). \tag{3–11}$$

The vector $D_{ti}$ can be represented as

$$D_{ti} = \sum_k B_{tik} + \eta, \quad \eta \sim N(0, \sigma_\eta^2). \tag{3–12}$$

We define one feature $x_t$ as the aggregation of the $D_{ti}$:

$$x_t = \sum_i D_{ti}. \tag{3–13}$$
To reduce computation and the number of variables to be sampled, we change the order of the equations above, as shown in Figure 3-7. We first convolve zone $A_t$ with a mask $M$, where $M_k = 1$ for each $k$. The result is an $N \times K$ matrix or, equivalently, an $NK \times 1$ vector $Z_t$, given by

$$Z_{tk} = \sum_i A_{tik} + \eta, \quad i = 1, \ldots, p, \quad \eta \sim N(0, \sigma_\eta^2). \tag{3–14}$$

We then define an array $C_t$ by

$$C_{tk} = Z_{tk} M_{tk} + \zeta_k, \quad k = 1, \ldots, NK, \quad \zeta_k \sim N(0, \sigma_\zeta^2). \tag{3–15}$$

We define one feature $x_t$ as the aggregation of the $C_{tk}$:

$$x_t = \sum_k C_{tk}. \tag{3–16}$$
Now we assume the feature $x_t$ is drawn from a Gaussian distribution $N(\mu, \sigma)$ and apply the HMM model $(A, \pi, \mu, \sigma)$ to the sequence $x_1, \ldots, x_T$. Note that the values of $A_{tk}$ should be scaled to the interval $[-1, 1]$, with the background values of the image taking the value $-1$. This scaling provides the advantage that if the hit-miss mask has a hit on the area of interest, the feature value will always be close to the maximum value.
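The reordered computation of equations (3–14) to (3–16) sums the $p$ mask-sized windows of a zone before the mask is applied, so the mask enters only once per zone. A hedged sketch, with `zone_windows` assumed to be a $p \times NK$ array of flattened windows:

```python
import numpy as np

def tight_feature(zone_windows, mask, sig_eta=0.05, sig_zeta=0.05, seed=0):
    """Reordered TSampFealHMM feature: Z_t (Eq. 3-14), C_t (Eq. 3-15), x_t (Eq. 3-16)."""
    rng = np.random.default_rng(seed)
    Z = zone_windows.sum(axis=0) + rng.normal(0.0, sig_eta, mask.size)  # Eq. 3-14
    C = Z * mask + rng.normal(0.0, sig_zeta, mask.size)                 # Eq. 3-15
    return C.sum()                                                      # Eq. 3-16
```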
3.3 Feature Initialization
As will be discussed in detail in the next section, the McFeaLHMM algorithm
is based on gradient descent, similar to past MCE algorithms for HMM (Ma, 2004).
Consequently, this algorithm is very sensitive to initialization. In fact, in our experiments,
random initialization did not lead to useful solutions. Therefore, a data-based algorithm
for initializing masks was devised for McFeaLHMM. The sampling-based algorithms
were not sensitive to initialization. Therefore random initialization was used for
TSampFeaLHMM and LSampFealHMM.
The algorithm used to initialize McFeaLHMM is based on clustering, so is referred
to as McClustInit. The McFeaLHMM algorithm uses one HMM to model each class.
For each model, a training set A1, A2, . . . , AH of Nbig × Kbig images of patterns from
the associated class is given. The OWA-based feature operators are intended to detect
sub-patterns of those patterns contained in N×K subimages. The McClustInit algorithm
therefore clusters subimages of size N × K extracted from the training data set. In
the first step, the McClustInit algorithm employs the Otsu thresholding algorithm (Otsu,
1979) to semi-threshold the training patterns. Next, all N ×K subimages with sufficient
energy are extracted from the semi-thresholded training images. We let S denote the
set of these subimages. The goal is to find shift-invariant prototypes of the patterns in
S. For example, all horizontal lines should have the same representation. Therefore,
the algorithm calculates the magnitude of the Fourier transforms of all the patterns
in $S$, producing the set $F$. The elements of $F$ are clustered using the Fuzzy C-Means (FCM) algorithm with a pre-defined value of $C$, resulting in a set $P$ of frequency-domain prototypes. These prototypes are then used to compute spatial-domain prototypes, which serve as the initial feature masks.
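The clustering stage of this pipeline can be sketched as follows. This is illustrative rather than the exact McClustInit implementation: hard k-means stands in for FCM, and Otsu semi-thresholding and energy filtering are assumed to have been applied already:

```python
import numpy as np

def cluster_prototypes(subimages, C, iters=20, seed=0):
    """Cluster |FFT| magnitudes of subwindows into C shift-invariant prototypes.
    Hard k-means is used here as a stand-in for Fuzzy C-Means."""
    rng = np.random.default_rng(seed)
    # shift-invariant representation: magnitude of the 2-D Fourier transform
    F = np.array([np.abs(np.fft.fft2(s)).ravel() for s in subimages])
    centers = F[rng.choice(len(F), size=C, replace=False)].copy()
    for _ in range(iters):
        d = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(C):
            if np.any(labels == c):
                centers[c] = F[labels == c].mean(axis=0)
    return centers   # frequency-domain prototypes (the set P)
```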
3.4 Learning Methods
3.4.1 MCE-HMM Model for Feature Learning
To derive the feature learning algorithm, we first represent the feature extraction
algorithms at the pixel level. We derive the learning algorithm with the assumption
that there is only one feature, that is, that the dimensionality of the feature vectors is
one. Multi-dimensional features can be learned by applying the same formulas to each
dimension independently of all other dimensions. The masking operation is given by

$$B^h_{tk}(i, j) = A_{tk}(i, j)\, G_h(i, j), \qquad B^m_{tk}(i, j) = A_{tk}(i, j)\, G_m(i, j), \tag{3–17}$$

where $(i, j)$ is the position at row $i$ and column $j$ of the 2-D matrix $G_h$, $G_m$, or $B_{tk}$. Following masking, the OWA hit-miss operator is applied as follows:

$$D(t, k) = \min\left(\sum_{s=1}^{NK} h_s B^h_{t,k,\sigma(s)},\ \sum_{s=1}^{NK} m_s B^m_{t,k,\sigma(s)}\right). \tag{3–18}$$

Here $B_{t,k,\sigma(s)}$ denotes the sorted value of $B_{tk}$, where $\sigma(s)$ is the one-dimensional sorting index of $B_{tk}$. The feature values are then calculated according to equation (3–5).
Here the feature learning algorithm is derived using gradient descent on the MCE objective loss function $l(x|\Lambda)$ defined in section 2.2.4. Hence, we need to compute $\frac{\partial l(x|\Lambda)}{\partial h}$, $\frac{\partial l(x|\Lambda)}{\partial m}$, $\frac{\partial l(x|\Lambda)}{\partial G_h}$, and $\frac{\partial l(x|\Lambda)}{\partial G_m}$. Also note that the expression in equation (3–18) is a min function and that

$$\frac{\partial \min(x_1, \ldots, x_n)}{\partial x_i} = \begin{cases} 1 & \text{if } i = \arg\min(x_1, \ldots, x_n) \\ 0 & \text{otherwise.} \end{cases} \tag{3–19}$$

Thus, it suffices to derive $\frac{\partial l(x|\Lambda)}{\partial h}$ and $\frac{\partial l(x|\Lambda)}{\partial G_h}$ only, where we assume without loss of generality that the hit-masked OWA operation is the minimum of the two terms in equation (3–18).
To maintain the requirements placed on OWA weights and masks, we introduce auxiliary
variables $u_p$ and $\bar{G}_h(i, j)$, such that

$$h_p = \frac{u_p^2}{\sum_k u_k^2}, \qquad G_h(i, j) = \frac{1}{1 + e^{-\gamma_h \bar{G}_h(i, j)}}, \tag{3–20}$$

where $\gamma_h$ is a user-defined parameter that sets the slope of the sigmoid function. Note that these relations yield the following differential relations:

$$\frac{\partial h_p}{\partial u_k} = \begin{cases} \dfrac{2h_k(1 - h_k)}{u_k} & \text{if } p = k \\[1.5ex] -\dfrac{2h_p h_k}{u_k} & \text{if } p \neq k, \end{cases} \qquad \frac{\partial G_h(i, j)}{\partial \bar{G}_h(i, j)} = \gamma_h\, G_h(i, j)\bigl(1 - G_h(i, j)\bigr), \tag{3–21}$$
which are used in the update formulas. We apply gradient descent to the auxiliary
variables and then update the variables used in the calculations according to equation
(3–21). Since we know that $\frac{\partial l(x|\Lambda)}{\partial u_p} = \frac{\partial l(x|\Lambda)}{\partial x}\frac{\partial x}{\partial u_p}$ and $\frac{\partial l(x|\Lambda)}{\partial \bar{G}_h(i,j)} = \frac{\partial l(x|\Lambda)}{\partial x}\frac{\partial x}{\partial \bar{G}_h(i,j)}$, the derivatives $\frac{\partial x}{\partial u_p}$ and $\frac{\partial x}{\partial \bar{G}_h(i,j)}$ are derived as follows:

$$\frac{\partial x}{\partial u_p} = \frac{\partial D_{t^{\max}_k,\, k}}{\partial u_p} = \sum_s B^h_{t^{\max}_k, k, \sigma(s)} \frac{\partial h_s}{\partial u_p}, \tag{3–22}$$

where

$$t^{\max}_k = \arg\max_t D(\cdot, k),$$

and

$$\frac{\partial x}{\partial \bar{G}_h(i, j)} = \frac{\partial D_{t^{\max}_k,\, k}}{\partial \bar{G}_h(i, j)} = \sum_s h_s \frac{\partial B^h_{t^{\max}_k, k, \sigma(s)}}{\partial \bar{G}_h(i, j)} = h_{(p)} A_{t^{\max}_k, k, p}\, \frac{\partial G_h(i, j)}{\partial \bar{G}_h(i, j)}, \tag{3–23}$$

where

$$p = (i - 1)K + j.$$
Application of gradient descent leads to the following formulas to update the feature
parameters. Let c ∈ {1, 2} denote the class label of the input observation sequence x.
Then

$$\frac{\partial l(x|\Lambda)}{\partial u_p} = (-1)^c p_x \left[\sum_{t=1}^{T} \left(\delta(q^{(1)}_t = v) \frac{1}{b^{(1)}_v(x_t)} \frac{\partial b^{(1)}_v(x_t)}{\partial x_t} - \delta(q^{(2)}_t = v) \frac{1}{b^{(2)}_v(x_t)} \frac{\partial b^{(2)}_v(x_t)}{\partial x_t}\right) \frac{\partial x_t}{\partial u_p}\right], \tag{3–24}$$

and

$$\frac{\partial l(x|\Lambda)}{\partial \bar{G}_h(i, j)} = (-1)^c p_x \left[\sum_{t=1}^{T} \left(\delta(q^{(1)}_t = v) \frac{1}{b^{(1)}_v(x_t)} \frac{\partial b^{(1)}_v(x_t)}{\partial x_t} - \delta(q^{(2)}_t = v) \frac{1}{b^{(2)}_v(x_t)} \frac{\partial b^{(2)}_v(x_t)}{\partial x_t}\right) \frac{\partial x_t}{\partial \bar{G}_h(i, j)}\right], \tag{3–25}$$

where

$$\frac{\partial b_j(x_t)}{\partial x_t} = \sum_k c_{jk} b_{jk}(x_t) \left(\frac{x_t - \mu_{jk}}{\sigma_{jk}}\right),$$

and

$$\delta(q_t = v) = \begin{cases} 1 & \text{if } q_t = v \\ 0 & \text{otherwise.} \end{cases}$$
The training algorithm is summarized in Figure 3-2.
3.4.2 Gibbs Sampling Method for Continuous HMM with Multivariate Gaussian Mixtures
Gibbs sampling methods have the advantage of generally finding better optima than traditional methods such as the expectation-maximization algorithm (Dunmur and Titterington, 1997; Ryden and Titterington, 1998). Although some MCMC-based learning methods have been proposed for HMMs, they address either discrete HMMs (standard HMMs (Bae, 2005) or infinite HMMs (Beal et al., 2002)) or uncommon variants, such as trajectory HMMs (Zen et al., 2006), nonparametric HMMs (Thrun et al., 1999), and nonstationary HMMs (Zhu et al., 2007). Here, a Gibbs approach is proposed for training continuous HMMs with multivariate Gaussian mixtures representing the states.
The joint probability of a sequence of observations and a hidden state sequence
is denoted by P (X, Q). The standard dependence assumptions allow us to derive an
expression for the joint likelihood as follows:
$$\begin{aligned} P(X, Q) &= P(X|Q)P(Q) = P(Q)\prod_t P(x_t|Q) = P(Q)\prod_t P(x_t|q_t) \\ &= P(q_1)\left[\prod_{t=2}^{T} P(q_t|q_{t-1})\right]\left[\prod_t P(x_t|q_t)\right] = P(q_1)P(x_1|q_1)\prod_{t=2}^{T} P(q_t|q_{t-1})P(x_t|q_t). \end{aligned} \tag{3–26}$$
The probability of a state sequence is computed as follows:

$$P(Q) = P(q_1)\prod_{t=2}^{T} P(q_t|q_{t-1}) = P(q_1)\prod_{r=1}^{N}\prod_{s=1}^{N} P(q_t = s|q_{t-1} = r)^{n_{rs}}, \tag{3–27}$$

where $n_{rs}$ is the number of transition pairs such that $q_{t-1} = r$ and $q_t = s$ for $t = 2, \ldots, T$. Because the transition probabilities are stationary, the values $P(q_t = s|q_{t-1} = r)$ for $t = 2, \ldots, T$ and fixed $s, r$ are equal.
Since the posterior distributions of the parameters Λ, as defined in section 2.2.1,
are not available in explicit form, we use Gibbs sampling to simulate the parameters
from the posterior distributions after defining likelihood and prior probability models
for the parameters. First, we assume the likelihood model for state transitions is the
multinomial probability distribution:
$$P(n_{r1}, \ldots, n_{rN}\,|\,a_{r1}, \ldots, a_{rN}) \propto \prod_{s=1}^{N} a_{rs}^{n_{rs}}. \tag{3–28}$$

The conjugate prior of the multinomial distribution is used as the prior probability distribution of the $a_{rs}$, so it is a Dirichlet probability distribution:

$$P(a_{r1}, \ldots, a_{rN}) = \mathrm{Dirichlet}(\vec{\alpha}_{r0}), \quad r = 1, \ldots, N, \tag{3–29}$$
where ~αr0 is the vector of prior parameters. The state probability distribution is assumed
to be a Gaussian mixture. We let θr = (cr, µr, Σr) denote the parameters of the state
density probability distributions of state $r$:

$$P(x_t|\theta_r) = \sum_k c_{rk} N(x_t; \mu_{rk}, \Sigma_{rk}). \tag{3–30}$$
We assume the probability of the components is governed by a multinomial probability distribution. Let $n_{rk}$ denote the number of occurrences of component $k$ in state $r$:

$$P(n_{r1}, \ldots, n_{rK}\,|\,c_{r1}, \ldots, c_{rK}) \propto \prod_k c_{rk}^{n_{rk}}. \tag{3–31}$$

As before, we use the conjugate prior of the multinomial, the Dirichlet probability distribution, as the prior probability distribution of the $c_{rk}$:

$$P(c_{r1}, \ldots, c_{rK}) = \mathrm{Dirichlet}(\vec{\alpha}_0), \tag{3–32}$$
where ~α0 is the hyperprior parameter.
Now we can compute the posterior conditional probabilities. The posterior conditional probability of the state transitions $a_{rs}$, $s = 1, \ldots, N$, is given by

$$a_{r1}, \ldots, a_{rN} \sim P(a_{r1}, \ldots, a_{rN}|Q) \propto P(Q|a_{r1}, \ldots, a_{rN})P(a_{r1}, \ldots, a_{rN}) \propto \mathrm{Dirichlet}(\vec{\alpha}_{rp}), \tag{3–33}$$

where

$$\vec{\alpha}_{rp} = \vec{\alpha}_{r0} + [n_{r1}, \ldots, n_{rN}]^T.$$

We denote the state sequence $Q$ excluding state $q_t$, that is $q_1, \ldots, q_{t-1}, q_{t+1}, \ldots, q_T$, by $q_{-t}$. The posterior conditional probability for the state $q_t|q_{-t}$ is

$$P(q_t = r|q_{-t}, \theta, X) \propto P(q_t = r|q_{-t})P(x_t|q_t = r, \theta_r)P(q_{t+1}|q_t = r) = a_{q_{t-1} r}\, a_{r q_{t+1}}\, P(x_t|q_t = r, \theta_r). \tag{3–34}$$
Now we compute the posterior conditional probability of the state parameters. The posterior conditional probability for the component probabilities $c$ is given by

$$c_{r1}, \ldots, c_{rK} \sim P(c_{r1}, \ldots, c_{rK}|n_{r1}, \ldots, n_{rK}) \propto P(n_{r1}, \ldots, n_{rK}|c_{r1}, \ldots, c_{rK})P(c_{r1}, \ldots, c_{rK}) \propto \mathrm{Dirichlet}(\vec{\alpha}_p), \tag{3–35}$$

where

$$\vec{\alpha}_p = \vec{\alpha}_0 + [n_{r1}, \ldots, n_{rK}]^T.$$

The posterior conditional probability that $x_t$ was produced by component $k$ of state $q_t$, given that $x_t$ was produced from state $q_t$, is

$$P(k|q_t = r, \theta, X) = P(k|\theta, x_t) \propto c_{rk} P(x_t|\theta_{rk}). \tag{3–36}$$
Now we compute the posterior conditional probabilities of the Gaussian model parameters modeling the state components. The Gaussian likelihood is given by

$$p(X|\mu, \Sigma) = \frac{1}{(2\pi)^{dT/2}|\Sigma|^{T/2}} \exp\left(-\frac{1}{2}\sum_t (x_t - \mu)^T \Sigma^{-1}(x_t - \mu)\right), \tag{3–37}$$

where $d$ is the dimension of $x_t$. We assume the prior of the mean $\mu$ is a Gaussian, $p(\mu|\mu_0, \Sigma_0) = N(\mu; \mu_0, \Sigma_0)$, where $\mu_0, \Sigma_0$ are hyperprior parameters, and the prior of the covariance $\Sigma$ is an inverse Wishart probability distribution, $p(\Sigma|\psi_s, T_s) = \mathrm{invWishart}(\Sigma|\psi_s, T_s)$, where $\psi_s, T_s$ are hyperprior parameters. Hence the posterior conditional probability distribution of the mean is

$$\mu \sim p(\mu|X) \propto P(X|\mu)p(\mu) \propto N(\mu; \mu_p, \Sigma_p), \tag{3–38}$$

where

$$\mu_p = \left(\Sigma_0^{-1} + T\Sigma^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + \Sigma^{-1}\sum_t x_t\right), \qquad \Sigma_p = \left(\Sigma_0^{-1} + T\Sigma^{-1}\right)^{-1}.$$
The posterior conditional probability distribution of the covariance is

$$\Sigma \sim p(\Sigma|X) \propto P(X|\Sigma)p(\Sigma) \propto \mathrm{invWishart}(\Sigma|\psi_p, T_p), \tag{3–39}$$

where

$$\psi_p = \psi_s + T, \qquad T_p = T_s + \sum_t (x_t - \mu)(x_t - \mu)^T.$$
The training algorithm is summarized in Figure 3-3.
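The two discrete sampling steps, equations (3–33) and (3–34), reduce to counting transitions and multiplying neighbor terms. A minimal sketch (names illustrative), with a precomputed likelihood table `like[t, r]` $= p(x_t|q_t = r)$:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_transition_rows(Q, N, alpha0):
    """Eq. 3-33: each row r of A is drawn from Dirichlet(alpha_r0 + transition counts)."""
    counts = np.zeros((N, N))
    for prev, cur in zip(Q[:-1], Q[1:]):
        counts[prev, cur] += 1
    return np.vstack([rng.dirichlet(alpha0 + counts[r]) for r in range(N)])

def sample_state(t, Q, A, like):
    """Eq. 3-34: p(q_t = r | q_-t, x) is proportional to
    a_{q_{t-1}, r} * a_{r, q_{t+1}} * p(x_t | r)."""
    p = like[t].copy()
    if t > 0:
        p *= A[Q[t - 1], :]          # incoming transition term
    if t < len(Q) - 1:
        p *= A[:, Q[t + 1]]          # outgoing transition term
    p /= p.sum()
    return int(rng.choice(A.shape[0], p=p))
```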
3.4.3 Loosely Coupled Gibbs Sampling Model for Feature Learning
The LSampFealHMM algorithm is a simplified, nonstandard HMM algorithm that
associates each state with a feature. The “probability” that the system is in state q is
proportional to a monotonic function of the feature value.
The training algorithm for LSampFealHMM consists of alternating optimization
between a Gibbs sampler and a modified Viterbi learning algorithm. That is, it is of the
form
• Initialize states
• Do until stopping criterion reached
Run a Gibbs sampler to estimate feature masks
Refine states using modified Viterbi learning
• End Do
In the next section, we first describe the Gibbs Sampler and then the initialization
and modified Viterbi learning.
3.4.3.1 Gibbs sampler for LSampFeaLHMM
To use a Gibbs sampler, we need to assume a prior distribution for the probability $p_k$ of the $k$-th element of the mask $M$. Given the hyperparameters $\alpha_k$ and $\beta_k$, the prior for the probability $p_k$ is a beta distribution:

$$p_k \sim \mathrm{Beta}(\alpha_k, \beta_k). \tag{3–40}$$
The probability of the $k$-th element of the binary hit mask $M$ follows a binomial distribution,

$$P(M_{\cdot k}|p_k) \propto p_k^{n_k}(1 - p_k)^{N_k - n_k}, \tag{3–41}$$

where

$$n_k = \sum_i \mathbb{1}(M_{ik} = 1).$$

If $M$ is the ternary hit-miss mask, the probability of mask $M$ follows a multinomial distribution,

$$P(M_{\cdot k}|p_{k_1}, p_{k_0}, p_{k_2}) \propto p_{k_1}^{n_{k_1}} p_{k_0}^{n_{k_0}} p_{k_2}^{n_{k_2}}, \tag{3–42}$$

where

$$n_{k_1} = \sum_i \mathbb{1}(M_{ik} = 1), \qquad n_{k_0} = \sum_i \mathbb{1}(M_{ik} = 0), \qquad n_{k_2} = \sum_i \mathbb{1}(M_{ik} = -1),$$

and the prior for the probabilities $(p_{k_1}, p_{k_0}, p_{k_2})$ is a Dirichlet distribution,

$$(p_{k_1}, p_{k_0}, p_{k_2}) \sim \mathrm{Dirichlet}(\alpha_k, \beta_k, \gamma_k). \tag{3–43}$$
The posterior distribution is not available in explicit form, so we use the Gibbs sampling approach to sample all the variables required for estimation.
The parameter $M$ and the variables $(B, D, x)$ must be estimated. Using the Gibbs sampler, we sample these variables from their complete conditional probability distributions:

$$\begin{aligned} p_k &\sim p(p_k|M_{\cdot k}, \alpha_k, \beta_k), \\ M_{\cdot k} &\sim P(M_{\cdot k}|p_{k_1}, p_{k_0}, p_{k_2}, B, A), \\ B_i &\sim p(B_i|D_i, A_i, M_i), \\ D &\sim p(D|x_t, B_i), \\ x_t &\sim p(x_t|y_t, D). \end{aligned} \tag{3–44}$$
Computation proceeds as follows:
1. Sample $p_k$, $k = 1, \ldots, L$, given $(M, \alpha, \beta)$. If $M$ is the binary hit mask, the sample is drawn from the beta distribution

$$p(p_k|M_{\cdot k}, \alpha_k, \beta_k) \propto P(M_{\cdot k}|p_k)\,p(p_k|\alpha_k, \beta_k) \propto \mathrm{Beta}(\alpha_k + n_k,\ \beta_k + N_k - n_k). \tag{3–45}$$

If $M$ is the ternary hit-miss mask, the sample is drawn from the Dirichlet distribution

$$p(p_{k_1}, p_{k_0}, p_{k_2}|M_{\cdot k}, \alpha_k, \beta_k, \gamma_k) \propto \mathrm{Dirichlet}(\alpha_k + n_{k_1},\ \beta_k + n_{k_0},\ \gamma_k + n_{k_2}). \tag{3–46}$$
2. Sample $M$ given $(p, B, A)$. Since every component of $M$ is assumed to be independent, it is easy to sample component-wise.

If $M$ is a binary hit mask, we sample it from a binomial distribution. We first compute $P(M_{\cdot k} = 1|p_k, B, A)$ and $P(M_{\cdot k} = 0|p_k, B, A)$:

$$P(M_{\cdot k} = 1|p_k, B, A) \propto P(M_{\cdot k} = 1|p_k)P(B|M_{\cdot k} = 1, A) \propto p_k \exp\left(-\frac{(B_{ik} - A_{ik})^2}{2\sigma_\zeta^2}\right), \tag{3–47}$$

$$P(M_{\cdot k} = 0|p_k, B, A) \propto P(M_{\cdot k} = 0|p_k)P(B|M_{\cdot k} = 0, A) \propto (1 - p_k)\exp\left(-\frac{B_{ik}^2}{2\sigma_\zeta^2}\right). \tag{3–48}$$

Then, after these two values are normalized, we sample $M$ from a binomial distribution:

$$M_{\cdot k}|p_k, B, A \sim \mathrm{binomial}\bigl(P(M_{\cdot k} = 1|p_k, B, A),\ P(M_{\cdot k} = 0|p_k, B, A)\bigr). \tag{3–49}$$
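For one binary mask element, the draw described by equations (3–47) to (3–49) is a posterior Bernoulli sample; computing it in log space avoids underflow when many training samples share column $k$. A sketch (names illustrative; `B_k` and `A_k` collect the $k$-th components across samples):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask_elem(pk, B_k, A_k, sig_zeta):
    """Posterior Bernoulli draw for one binary mask element (Eqs. 3-47 to 3-49)."""
    logp1 = np.log(pk) - ((B_k - A_k) ** 2).sum() / (2 * sig_zeta ** 2)
    logp0 = np.log(1 - pk) - (B_k ** 2).sum() / (2 * sig_zeta ** 2)
    # normalized P(M_k = 1 | ...), computed stably in log space
    prob1 = 1.0 / (1.0 + np.exp(np.clip(logp0 - logp1, -700, 700)))
    return int(rng.random() < prob1)
```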
If $M$ is a ternary hit-miss mask, we sample it from a multinomial distribution. First we compute $P(M_{\cdot k} = 1|p_{k_1}, B, A)$, $P(M_{\cdot k} = 0|p_{k_0}, B, A)$, and $P(M_{\cdot k} = -1|p_{k_2}, B, A)$, where

$$P(M_{\cdot k} = 1|p_{k_1}, B, A) \propto P(M_{\cdot k} = 1|p_{k_1})p(B|M_{\cdot k} = 1, A) \propto p_{k_1}\exp\left(-\frac{(B_{ik} - A_{ik})^2}{2\sigma_\zeta^2}\right), \tag{3–50}$$

$$P(M_{\cdot k} = 0|p_{k_0}, B, A) \propto P(M_{\cdot k} = 0|p_{k_0})P(B|M_{\cdot k} = 0, A) \propto p_{k_0}\exp\left(-\frac{B_{ik}^2}{2\sigma_\zeta^2}\right), \tag{3–51}$$

$$P(M_{\cdot k} = -1|p_{k_2}, B, A) \propto P(M_{\cdot k} = -1|p_{k_2})P(B|M_{\cdot k} = -1, A) \propto p_{k_2}\exp\left(-\frac{(B_{ik} + A_{ik})^2}{2\sigma_\zeta^2}\right). \tag{3–52}$$

After these three values are normalized, we can sample $M$ from a multinomial distribution:

$$M_{\cdot k}|p_{k_1}, p_{k_0}, p_{k_2}, B, A \sim \mathrm{multinomial}\bigl(P(M_{\cdot k} = 1|p_{k_1}, B, A),\ P(M_{\cdot k} = 0|p_{k_0}, B, A),\ P(M_{\cdot k} = -1|p_{k_2}, B, A)\bigr). \tag{3–53}$$
3. Sample the variable $B$ given $(A, M)$. We know that

$$p(B_i|A_i, M_i) \propto \exp\left(-\sum_{k=1}^{L}\frac{(B_{ik} - A_{ik}M_{ik})^2}{2\sigma_\zeta^2}\right), \tag{3–54}$$

and

$$p(D_i|B_i) \propto \exp\left(-\frac{\left(D_i - \sum_{k=1}^{L} B_{ik}\right)^2}{2\sigma_\eta^2}\right), \tag{3–55}$$
so we have

$$p(B_i|D_i, A_i, M_i) \propto p(D_i|B_i)p(B_i|A_i, M_i) \propto \exp\left(-\frac{\left(D_i - \sum_{k=1}^{L} B_{ik}\right)^2}{2\sigma_\eta^2}\right)\exp\left(-\sum_{k=1}^{L}\frac{(B_{ik} - A_{ik}M_{ik})^2}{2\sigma_\zeta^2}\right). \tag{3–56}$$

Rather than sampling $B$ as a matrix, it is better to sample component-wise from the Gaussian distribution:

$$B_{ik}|B_{-ik}, A_{ik}, M_{ik}, D_i \sim N\left(\frac{\tau_\zeta A_{ik}M_{ik} + \tau_\eta\left(D_i - \sum_q B_{iq} + B_{ik}\right)}{\tau_\zeta + \tau_\eta},\ (\tau_\zeta + \tau_\eta)^{-1/2}\right), \tag{3–57}$$

where $\tau_\varrho = 1/\sigma_\varrho^2$ denotes the precision of the Gaussian distributions for $\varrho = \zeta$ and $\eta$.
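The component-wise draw of equation (3–57) combines the mask term and the sum constraint through their precisions; a direct transcription (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_B_component(B_i, k, A_ik, M_ik, D_i, tau_zeta, tau_eta):
    """Gibbs draw for B_ik given the remaining components of B_i (Eq. 3-57)."""
    residual = D_i - B_i.sum() + B_i[k]     # D_i minus the other components of B_i
    mean = (tau_zeta * A_ik * M_ik + tau_eta * residual) / (tau_zeta + tau_eta)
    return rng.normal(mean, (tau_zeta + tau_eta) ** -0.5)
```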
4. Sample $D$ given $(x, B)$. We know that

$$p(x_t|D) \propto \exp\left(-\frac{\left(x_t - \sum_i D_i\right)^T\left(x_t - \sum_i D_i\right)}{2\sigma_\varepsilon^2}\right), \tag{3–58}$$

so we have

$$p(D|x_t, B_i) \propto p(D|B_i)p(x_t|D) \propto \exp\left(-\sum_i\frac{\left(D_i - \sum_{k=1}^{L} B_{ik}\right)^2}{2\sigma_\eta^2}\right)\exp\left(-\frac{\left(x_t - \sum_i D_i\right)^T\left(x_t - \sum_i D_i\right)}{2\sigma_\varepsilon^2}\right). \tag{3–59}$$

Similarly, we sample $D$ component-wise from the Gaussian distribution:

$$p(D_i|D_{-i}, x_t, B_i) \propto N\left(\frac{\tau_\eta\sum_{k=1}^{L} B_{ik} + \tau_\varepsilon\left(x_t - \sum_q D_q + D_i\right)}{\tau_\eta + \tau_\varepsilon},\ (\tau_\eta + \tau_\varepsilon)^{-1/2}\right). \tag{3–60}$$
5. Sample $x$ given $(y, D)$ from the truncated Gaussian distribution:

$$p(x_t|y_t = 1, D) \propto N\left(\sum_i D_i,\ \sigma_\varepsilon^2\right) \text{ restricted to } x_t > \xi, \tag{3–61}$$

$$p(x_t|y_t = 0, D) \propto N\left(\sum_i D_i,\ \sigma_\varepsilon^2\right) \text{ restricted to } x_t \leq \xi. \tag{3–62}$$
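Sampling $x_t$ consistently with its label can be done by simple rejection when the threshold $\xi$ is not far in the tail of the Gaussian; a hedged sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x_given_label(y, mean, sigma, xi, max_tries=10000):
    """Draw x ~ N(mean, sigma^2) restricted to x > xi when y = 1,
    and to x <= xi when y = 0 (rejection sampling)."""
    for _ in range(max_tries):
        x = rng.normal(mean, sigma)
        if (x > xi) == bool(y):
            return x
    raise RuntimeError("threshold too far in the tail for rejection sampling")
```

For thresholds deep in the tail, an inverse-CDF truncated-normal sampler would be preferable to rejection.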
After the burn-in period, we collect the Gibbs samples at the $s$-th iteration as $(p^{[s]}, M^{[s]}, B^{[s]}, D^{[s]}, x^{[s]})$, $s = 1, \ldots$. We can then use these samples for prediction and posterior inference. Note that at test time we use $p$, the probability of the masks, as the mask feature. In this way, $p$ can be interpreted as the expectation of a mask, and the mask values used in prediction/testing are gray-level values instead of binary or ternary values.
In LSampFealHMM, we do not use state probabilities. Instead, we use the features themselves directly, either

$$b_q(\vec{x}) = x_q \tag{3–63}$$

for binary masks or

$$b_q(\vec{x}) = \exp(x_q) \tag{3–64}$$

for ternary masks. Then, given the test observation sequence $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T$, the output of LSampFealHMM is given by the Viterbi algorithm, used for finding the best path:

$$\text{output} \equiv \sum_{t=1}^{T}\ln\bigl(b_{q_t}(x_{q_t})\bigr). \tag{3–65}$$
The training algorithm is summarized in Figure 3-5.
3.4.3.2 Initialization and modified Viterbi learning
Initialization requires specifying values for the initial probabilities $\vec{\pi}$, the transition matrix $A$, and the state emission probabilities $B$. The initial value of $\vec{\pi}$ is taken to be $(1, 0, \ldots, 0)$. The initial transition matrix is taken to be

$$\begin{pmatrix} a_{11} & a_1 & a_1 & \cdots & a_1 \\ 0 & a_{22} & a_2 & \cdots & a_2 \\ & & \ddots & & \\ 0 & 0 & 0 & \cdots & a_{QQ} \end{pmatrix},$$

where $a_{ii} > a_i$ and, of course, $a_{ii} + \sum_i a_i = 1$. Thus, LSampFeaLHMM produces a left-to-right model. Note that the Gibbs sampler is not very sensitive to initialization,
but a transition matrix of this form is a sensible initialization. Finally, since the states
are associated with the features and the features are determined by the masks as
discussed in section 3.2.2.1, initializing B is performed by randomly initializing the
masks Mi, i = 1, . . . , Q using uniform probabilities of 0 and 1 at each element of each
Mi.
In addition to the model parameters, the observations need to be associated with
each state to provide a set of training samples for each mask. The method employed is to use the first $T/Q$ of the observations for state 1, the second $T/Q$ for state 2, and so forth. An implementation detail arises if $T/Q$ is not an integer, but this is not significant for the Gibbs sampler.
Modified Viterbi learning is used to update the samples used to learn the masks
for each state. Using the parameters and observation sequences, an optimal state
sequence is found for each training sample using the Viterbi algorithm. Hence, for each
training sample, we have paired sequences

$$\begin{pmatrix} x_{h1}, \ldots, x_{hT} \\ q_{h1}, \ldots, q_{hT} \end{pmatrix} = \begin{pmatrix} \text{observation sequence for training image } A_h \\ \text{optimal state sequence for training image } A_h \end{pmatrix}. \tag{3–66}$$

Note that the second segment is the set of state labels associated with an optimal state sequence, but we may refer to it as an optimal state sequence. We write the entire set of pairs of observations obtained in this manner as

$$\rho = \{(x_{hj}, q_{hj})\,|\,h = 1, \ldots, H;\ j = 1, \ldots, T\}. \tag{3–67}$$
For each state index $r \in \{1, 2, \ldots, Q\}$, we define the set of observations associated with $r$ as

$$\chi_r \equiv \{x_{hj}\,|\,(x_{hj}, r) \in \rho\}. \tag{3–68}$$
The transition matrix is also updated using the optimal state sequences. Let $n_{ij}$ denote the number of occurrences of the consecutive subsequence $i, j$ in the set of all optimal state sequences. Let $n_d = (T - 1)H$ denote the number of two-element consecutive subsequences obtained from the training data. The transition matrix is then updated using

$$a_{ij}^{\text{new}} = a_{ij}^{\text{old}} + \eta\,\frac{n_{ij}}{n_d},$$

where $\eta$ is a user-defined learning rate.
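The count-based update can be written compactly; note that re-normalizing the rows after the additive step is an assumption here, since the text leaves it implicit:

```python
import numpy as np

def update_transitions(A_old, state_paths, eta):
    """a_ij_new = a_ij_old + eta * n_ij / n_d from the optimal state sequences."""
    counts = np.zeros_like(A_old)
    for path in state_paths:
        for i, j in zip(path[:-1], path[1:]):
            counts[i, j] += 1
    n_d = sum(len(p) - 1 for p in state_paths)   # number of consecutive pairs
    A_new = A_old + eta * counts / n_d
    return A_new / A_new.sum(axis=1, keepdims=True)   # row renormalization (assumption)
```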
3.4.4 Tightly Coupled Gibbs Sampling Model for Feature Learning
Similar to LSampFealHMM, we need to define priors over feature learning variables.
We also need priors for the HMM model variables (state transition probabilities, state
Gaussian means, and state Gaussian variances).
Given the hyperparameters $\alpha_k, \beta_k, \gamma_k$, the prior for the probabilities $(p_{k_1}, p_{k_0}, p_{k_2})$ is a Dirichlet distribution,

$$(p_{k_1}, p_{k_0}, p_{k_2}) \sim \mathrm{Dirichlet}(\alpha_k, \beta_k, \gamma_k). \tag{3–69}$$

The probability of the $k$-th element of the hit-miss mask $M$ follows a multinomial distribution,

$$P(M_{\cdot k}|p_{k_1}, p_{k_0}, p_{k_2}) \propto p_{k_1}^{n_{k_1}} p_{k_0}^{n_{k_0}} p_{k_2}^{n_{k_2}}, \tag{3–70}$$

where

$$n_{k_1} = \sum_i \mathbb{1}(M_{ik} = 1), \qquad n_{k_0} = \sum_i \mathbb{1}(M_{ik} = 0), \qquad n_{k_2} = \sum_i \mathbb{1}(M_{ik} = -1).$$
A Gibbs sampling approach is used to sample all the variables required. The feature learning parameter $M$, the variables $(Z, C, x)$, the HMM parameters $\lambda = (\mu, \sigma, A, \pi)$, and the state sequences $Q$ must be sampled. Let $\chi_r \equiv \{x_t\,|\,x_t \text{ is in state } r,\ t = 1, \ldots, T\}$, similar to equation (3–68). Using the Gibbs sampler, we sample these variables from their complete conditional probability distributions:

$$\begin{aligned} Z_t &\sim p(Z_t|A_t, C_t, M_t), \\ (p_{k_1}, p_{k_0}, p_{k_2}) &\sim p(p_{k_1}, p_{k_0}, p_{k_2}|M_{\cdot k}, \alpha_k, \beta_k, \gamma_k), \\ M_{tk} &\sim P(M_{tk}|p_{k_1}, p_{k_0}, p_{k_2}, C, Z), \\ C_t &\sim p(C_t|\mu_{q_t}, \sigma_{q_t}, Z_t, M_t), \\ \mu_r &\sim p(\mu_r|\chi_r), \\ \sigma_r &\sim p(\sigma_r|\chi_r), \\ A &\sim p(A|x, Q). \end{aligned} \tag{3–71}$$
The computation proceeds as follows:

1. Sample the variable $Z$ given $(C, A, M)$. We know that

$$p(Z_t|A_t) \propto \exp\left(-\frac{\left(Z_t - \sum_i A_{ti}\right)^T\left(Z_t - \sum_i A_{ti}\right)}{2\sigma_\eta^2}\right) \tag{3–72}$$

and

$$p(C_t|Z_t, M_t) \propto \exp\left(-\sum_{k=1}^{L}\frac{(C_{tk} - Z_{tk}M_{tk})^2}{2\sigma_\zeta^2}\right), \tag{3–73}$$

so we have

$$p(Z_t|A_t, C_t, M_t) \propto p(Z_t|A_t)p(C_t|Z_t, M_t) \propto \exp\left(-\frac{\left(Z_t - \sum_i A_{ti}\right)^T\left(Z_t - \sum_i A_{ti}\right)}{2\sigma_\eta^2}\right)\exp\left(-\sum_{k=1}^{L}\frac{(C_{tk} - Z_{tk}M_{tk})^2}{2\sigma_\zeta^2}\right). \tag{3–74}$$

We sample $Z$ component-wise from the Gaussian distribution:

$$p(Z_{tk}|Z_{-tk}, A_{tk}, C_{tk}, M_{tk}) \propto N\left(\frac{\tau_\eta\sum_{i=1}^{L} A_{ti} + \tau_\zeta C_{tk}M_{tk}}{\tau_\eta + \tau_\zeta M_{tk}^2},\ (\tau_\eta + \tau_\zeta M_{tk}^2)^{-1/2}\right). \tag{3–75}$$
2. Sample $p_k$, $k = 1, \ldots, L$, given $(M, \alpha_k, \beta_k, \gamma_k)$. The sample is drawn from the Dirichlet distribution

$$p(p_{k_1}, p_{k_0}, p_{k_2}|M_{\cdot k}, \alpha_k, \beta_k, \gamma_k) \propto \mathrm{Dirichlet}(\alpha_k + n_{k_1},\ \beta_k + n_{k_0},\ \gamma_k + n_{k_2}). \tag{3–76}$$

3. Sample $M$ given $(p, C, Z)$. Since every component of $M$ is assumed to be independent, it is easy to sample component-wise.

Because $M$ is a ternary hit-miss mask, we sample it from a multinomial distribution. First we compute $P(M_{tk} = 1|p_{k_1}, C, Z)$, $P(M_{tk} = 0|p_{k_0}, C, Z)$, and $P(M_{tk} = -1|p_{k_2}, C, Z)$:

$$P(M_{tk} = 1|p_{k_1}, C, Z) \propto P(M_{tk} = 1|p_{k_1})p(C|M_{tk} = 1, Z) \propto p_{k_1}\exp\left(-\frac{(C_{tk} - Z_{tk})^2}{2\sigma_\zeta^2}\right), \tag{3–77}$$

$$P(M_{tk} = 0|p_{k_0}, C, Z) \propto P(M_{tk} = 0|p_{k_0})p(C|M_{tk} = 0, Z) \propto p_{k_0}\exp\left(-\frac{C_{tk}^2}{2\sigma_\zeta^2}\right), \tag{3–78}$$

$$P(M_{tk} = -1|p_{k_2}, C, Z) \propto P(M_{tk} = -1|p_{k_2})p(C|M_{tk} = -1, Z) \propto p_{k_2}\exp\left(-\frac{(C_{tk} + Z_{tk})^2}{2\sigma_\zeta^2}\right). \tag{3–79}$$

After these three values are normalized, we can sample $M$ from a multinomial distribution:

$$M_{tk}|p_{k_1}, p_{k_0}, p_{k_2}, C, Z \sim \mathrm{multinomial}\bigl(P(M_{tk} = 1|p_{k_1}, C, Z),\ P(M_{tk} = 0|p_{k_0}, C, Z),\ P(M_{tk} = -1|p_{k_2}, C, Z)\bigr). \tag{3–80}$$
4. Sample $C$ given $(Z, M, \mu_{q_t}, \sigma_{q_t})$. We know that

$$p(C_t|Z_t, M_t) \propto \exp\left(-\sum_{k=1}^{L}\frac{(C_{tk} - Z_{tk}M_{tk})^2}{2\sigma_\zeta^2}\right) \tag{3–81}$$

and

$$p(\mu_{q_t}|C_t, \sigma_{q_t}) \propto \exp\left(-\frac{\left(\mu_{q_t} - \sum_{k=1}^{L} C_{tk}\right)^2}{2\sigma_{q_t}^2}\right), \tag{3–82}$$
so we have

$$p(C_t|\mu_{q_t}, \sigma_{q_t}, Z_t, M_t) \propto p(\mu_{q_t}|C_t, \sigma_{q_t})p(C_t|Z_t, M_t) \propto \exp\left(-\frac{\left(\mu_{q_t} - \sum_{k=1}^{L} C_{tk}\right)^2}{2\sigma_{q_t}^2}\right)\exp\left(-\sum_{k=1}^{L}\frac{(C_{tk} - Z_{tk}M_{tk})^2}{2\sigma_\zeta^2}\right). \tag{3–83}$$

Rather than sampling $C$ as a matrix, it is better to sample component-wise from the Gaussian distribution:

$$C_{tk}|C_{-tk}, Z_{tk}, M_{tk}, \mu_{q_t}, \sigma_{q_t} \sim N\left(\frac{\tau_\zeta Z_{tk}M_{tk} + \tau_{q_t}\left(\mu_{q_t} - \sum_j C_{tj} + C_{tk}\right)}{\tau_\zeta + \tau_{q_t}},\ (\tau_\zeta + \tau_{q_t})^{-1/2}\right), \tag{3–84}$$

where $\tau_\varrho = 1/\sigma_\varrho^2$ denotes the precision of the Gaussian distributions for $\varrho = \zeta$ and $q_t$.
5. Sample the mean $\mu_r$ and variance $\sigma_r$ for state $r$, given the state sequence $Q$. We first compute $x_t$ for $t = 1, \ldots, T$ by

$$x_t = \sum_k C_{tk}. \tag{3–85}$$

Then, similar to section 3.4.2, we can sample $\mu_r$ and $\sigma_r$ from the posterior conditional probability distributions as follows:

$$\mu_r \sim p(\mu_r|\chi_r) \propto p(\chi_r|\mu_r)p(\mu_r) \propto N(\mu_r; \mu_\phi, \sigma_\phi), \tag{3–86}$$

where

$$\mu_\phi = (\tau_0 + T\tau)^{-1}\left(\tau_0\mu_0 + \tau\sum_{x_j \in \chi_r} x_j\right), \qquad \sigma_\phi = (\tau_0 + T\tau)^{-1/2},$$

and

$$\sigma_r \sim p(\sigma_r|\chi_r) \propto p(\chi_r|\sigma_r)p(\sigma_r) \propto \mathrm{Gamma}(\sigma_r|\psi_\phi, \beta_\phi), \tag{3–87}$$

where

$$\psi_\phi = \psi_0 + T/2, \qquad \beta_\phi = \beta_0 + \sum_{x_j \in \chi_r}(x_j - \mu_r)^2.$$
6. Sample the state sequences using equation 3–34 in section 3.4.2.
7. Sample the transition matrix using equation 3–33 in section 3.4.2.
The training algorithm is summarized in Figure 3-8. Initialization in this case is
random.
Figure 3-1. Feature extraction process for feature learning
Initialize the HMM model parameters
Initialize the masks and OWA weights
Feature learning loop1
    Extract features
    Create observation sequences from the features
    HMM model training loop2
        Randomize the order of the sequences
        Loop3 over all sequences, one mine sequence followed by one nonmine sequence
            Get a sequence
            Compute the loss of this sequence under the current HMM models
            If it is correctly classified, continue to the next sequence
            Else
                Compute the gradient with respect to all parameters using
                equations (2-7, 8, 9, 10) and equations (3-24, 25)
                Accumulate the total loss
                Accumulate the gradients of the mask and OWA weights
                Update the HMM model parameters, both mine and nonmine,
                by subtracting their respective gradients
        End loop3 over all sequences
        If the total loss over all sequences is decreasing, save the HMM models
    End HMM model training loop2
    Update the mask and OWA weights
    Test the model over the validation set
    If the loss over the validation set is increasing, or the number of iterations
    exceeds a threshold, then stop feature learning
End feature learning loop1

Figure 3-2. MCE-based training process for feature learning
Loop until convergence
    – Sample state q_t, given all other states of the sequence, from the
      multinomial distribution one-by-one after computing the probabilities
      P(q_t | q_-t) using equation (3-34)
    – Sample transition probabilities a_rs from the Dirichlet distribution
      after counting the state pairs in the sequences of all observations
      using equation (3-33)
    – Sample the state model parameters
        • Sample a component label k for each observation X_t one-by-one from
          the multinomial distribution after computing the probabilities
          P(k | -t) using equation (3-36)
        • Sample the K mixture proportions c_k from the Dirichlet distribution
          after counting the labels of all observations using equation (3-35)
        • Sample the Gaussian model parameters (μ, Σ)
            – Sample the mean from the posterior Gaussian distribution
            – Sample the covariance matrix from the inverse Wishart distribution

Figure 3-3. Gibbs sampling HMM training process
[Figure: diagram relating the observation sequence X_1, X_2, ..., X_N, the sum hit-miss mask M_i, and the arrays A_i and B_i obtained by max/sum over the columns of A and D.]

Figure 3-4. Feature model for LSampFealHMM
LSampFealHMM Training Algorithm:
Set the hyperparameters and the stopping threshold
Start with the initial state sequence Q^[0]
Loop1
    Split all the image sequences into image segments according to the
    state they are associated with
    Loop2 for each state segment
        Start with initial values [p^[0], M^[0], B^[0], D^[0], x^[0]]
        Loop3 to sample the state parameters
            At iteration s
                Sample p^[s] given M^[s-1] and its hyperparameters using equations (3-45, 46)
                Sample M^[s] given (p^[s], B^[s-1], A) using equation (3-49)
                Sample B^[s] given (A, M^[s]) using equation (3-57)
                Sample D^[s] given (x^[s-1], x^[s]) using equation (3-59)
                Sample x^[s] given (y, x^[s]) using equations (3-61, 62)
        End loop3 after the required number of iterations
    End loop2
    Compute every state probability density with the sampled state
    parameters for all the image sequences
    Find the best state sequence using the Viterbi algorithm
End loop1 after a fixed number of iterations

Figure 3-5. LSampFealHMM Training Algorithm
[Figure: diagram relating the observation sequence X_1, X_2, ..., X_T, its zones, the sum hit-miss mask M_ti, and the features B_ti and A_ti obtained by max over columns and sum over zones.]

Figure 3-6. Initial feature model for TSampFealHMM
[Figure: diagram relating the observation sequence X_1, X_2, ..., X_T, the zones convolved with an all-one mask to give Z_t, the sum hit-miss mask M_t, and the resulting C_t and A_t.]

Figure 3-7. Final feature model for TSampFealHMM
TSampFealHMM Training Algorithm:
Set the hyperparameters
Initialize the state sequence Q^[0]
Initialize the variables [p^[0], M^[0], C^[0], Z^[0]]
Loop1
    At iteration s
        Sample the state q_t^[s], given all other states of the sequence, from the
        multinomial distribution one-by-one after computing the probabilities
        P(q_t | q_-t) using equation (3-34)
        Sample the transition probabilities a_rs^[s] from the Dirichlet distribution
        after counting the state pairs in the sequences of all observations
        using equation (3-33)
        Sample the state parameters
            Sample Z^[s] given (A, M^[s-1], C^[s-1]) using equation (3-65)
            Sample p^[s] given M^[s-1] and its hyperparameters using equation (3-76)
            Sample M^[s] given (p^[s], C^[s-1], Z^[s]) using equation (3-80)
            Sample C^[s] given (Z^[s], M^[s], μ^[s], σ^[s]) using equation (3-84)
            Sample μ^[s], σ^[s] given (x^[s], Q^[s]) using equations (3-86, 87)
Stop loop1 after a fixed number of iterations
Compute x with the fixed mask taken as the mean of the samples
Loop2
    Sample the state q_t, given all other states of the sequence, from the
    multinomial distribution one-by-one after computing the probabilities
    P(q_t | q_-t) using equation (3-34)
    Sample the transition probabilities a_rs from the Dirichlet distribution
    after counting the state pairs in the sequences of all observations
    using equation (3-33)
    Sample the state parameters
        Sample μ, σ given (x, Q) using equations (3-86, 87)
Stop loop2 after a fixed number of iterations

Figure 3-8. TSampFealHMM Training Algorithm
CHAPTER 4
EMPIRICAL ANALYSIS
4.1 Data Sets
Experiments were performed using both synthetic and real data sets. There were
two types of real data, GPR and handwritten digit data. There also were two synthetic
data sets. The first synthetic data set contained samples from two classes. Each sample
was a 29 × 23 image with intensity values in the interval [0, 1]. Classes consisted of
images of simulated “hyperbolas” built from line segments oriented at 45 and 135
degrees plus noise for Class 1, and at 60 and 120 degrees plus noise for Class 2. To
generate the sample images, a binary image was used as the template, then additive
Gaussian noise (0.1 mean and 0.1 standard deviation) and “salt and pepper” noise
(with a probability 0.3 of changing state) were used to corrupt the template. Ten
samples are shown in Figure 4-1. The synthetic data contained 300 images from each
class in the training set and 40 images from each class in the testing set. McFeaLHMM
was applied to this data set to show the performance of the algorithm. We refer to this
data set as SynData1.
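The corruption recipe for SynData1 (additive Gaussian noise plus salt-and-pepper state changes applied to a binary template) can be sketched as follows. The order of the two corruptions and the exact flip semantics are assumptions, and the 45-degree diagonal template here is a crude stand-in for the real hyperbola templates.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(template, p_flip=0.3, noise_mean=0.1, noise_std=0.1):
    """Additive Gaussian noise plus salt-and-pepper 'state changes' on a binary
    template; the ordering of the two corruptions is an assumption."""
    img = template.astype(float) + rng.normal(noise_mean, noise_std, template.shape)
    flip = rng.random(template.shape) < p_flip   # pixels whose state changes
    img[flip] = 1.0 - img[flip]
    return np.clip(img, 0.0, 1.0)                # keep intensities in [0, 1]

template = np.zeros((29, 23))
np.fill_diagonal(template[:23, :23], 1.0)        # crude 45-degree line segment
sample = corrupt(template)
```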
The second synthetic data set contained image sequences, 100 sequences for each
class. Each sequence had nine images; each image was a 5 × 5 image with intensity
values in the interval [0, 1]. The feature class had the sequences of simulated
“hyperbolas.” These sequences contained three groups of images, associated with
images containing line segments oriented at 45, 180, and 135 degrees, respectively.
To generate the sample images, we first used
a fixed, left-right transition matrix to generate state sequences. Then, according to the
state sequences, line images from that state were generated. A binary image was used
as the template, then additive Gaussian noise (0.1 mean and 0.1 standard deviation)
and “salt and pepper” noise (with probability 0.3 of changing state) were used to corrupt
the template. The background sequences consisted of images from corrupted blank
templates. Some sample sequences are shown in Figure 4-9. In the figure, there are
ten sequences from each class from top to bottom. Each row is a sequence of ten
images. Two adjacent images in the sequence are separated by a blank column. The
two adjacent sequence rows are separated by a blank row. LSampFeaLHMM was
applied to this data set to show the performance of the algorithm. We refer to this data
set as SynData2.
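The state-sequence generation step for SynData2, drawing from a fixed left-right transition matrix, can be sketched as below; the transition probabilities are illustrative, not the ones used to build the data set.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_state_sequence(A, length, start=0):
    """Draw a state sequence from a fixed left-right transition matrix A."""
    q = [start]
    for _ in range(length - 1):
        q.append(int(rng.choice(len(A), p=A[q[-1]])))
    return q

# Illustrative left-right matrix: each state holds or advances, never goes back.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
states = sample_state_sequence(A, length=9)
```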
There are three GPR data sets. They were all acquired using NIITEK time-domain
GPR systems well described in the literature (Lee et al., 2007). The focus here is on
a discussion of the relative performance of the new and old algorithms. The data sets
contained two classes: anti-tank (AT) mines and non-mines. The mines include both
plastic-cased and metal-cased types. In all cases, the various HMM algorithms were applied to
alarms detected by a pre-screener (Gader et al., 2004).
The first GPR data set was acquired from an arid test site. It consisted of 120 mine
encounters and 120 false alarms. Data samples were extracted from pre-screener
alarms. This set contains 80 images from each class in the training set and 40 images
from each class in the testing set. Each data sample is a 29 × 23 image with intensity
values in the interval [-1, 1]. Ten samples are shown in Figure 4-2. This author produced
all the HMM results on this data set. Comparisons of standard HMM (Ma, 2004) and
McFeaLHMM algorithms were made on this set. We refer to it as the GPRArid dataset.
The second GPR data set was acquired from a temperate test site. It consisted
of 316 mine encounters, of which 234 were plastic-cased, and 1,025 false alarms.
Similar to the GPRArid data set, data samples were extracted from pre-screener alarms.
Each data sample is a 29 × 23 image with intensity values in the interval [-1, 1].
Lane-based 10-fold cross validation was applied to this dataset. This author produced
all the HMM results from this data set. Comparisons of standard EM-HMM, SampHMM,
and LSampFeaLHMM algorithms were made on this set. We refer to it as the GPRTemp
dataset.
The third GPR data set contained measurements from the different geographical
sites, referred to as S1 and S2. The data at S1 were measured with two different NIITEK
systems, A1 and A2.
A significant point is that the HMM experiments were not run by the author on
this data set. They were run by others (P. Gader and J. Bolton, pers. comm.) and
verified by P. Gader, the adviser for this dissertation study. Algorithms SampHMM and
DTXTHMM were compared using this set, which is referred to as the GPRTwoSite
data set. Other comparisons will be made in the future, but are limited by the ability to
transfer algorithms. Furthermore, true false alarm rates are not given. False alarm rates
are given in arbitrary units, but are proportional in the sense that if algorithm A has x
false alarms and algorithm B has rx false alarms, then algorithm B has r times as many
false alarms as algorithm A. Again, the focus of this research is to evaluate relative
performance of algorithms, not GPR systems. Thus, absolute false alarm rates are not
necessary.
The handwritten data consist of images acquired from the MNIST data set (LeCun
and Cortes). The purpose of these experiments is to compare performance of the
feature learning HMM with an HMM trained using handmade features. TSampFeaLHMM
and SampHMM are compared using this set, which is referred to as the HWDR dataset.
4.2 Experiments and Results
The experiments conducted are shown in Table 4-1. The DTXTHMM represents the
baseline algorithm for mine detection using GPR data. It has been developed over years
and versions of it have demonstrated excellent performance on several GPR systems
(Frigui et al., 2005; Gader et al.; Wilson et al., 2007; Zhao et al., 2003). We compare
against it for the landmine detection experiments.
The HMM algorithm for HWDR with handmade features will be described in section
4.2.6.
4.2.1 SynData1
McFeaLHMM was investigated using SynData1. SynData1 contains 300 images
from each class in the training set and 40 images from each class in the testing set.
Two-dimensional features were used.
Since the algorithm is sensitive to initialization, we considered two different mask
initializations. One initialization used hit-miss pairs representing horizontal and vertical
line segments. The other used hit-miss pairs representing diagonal and anti-diagonal
line segments. The two pairs of initial masks are shown in Figure 4-3. Each mask is a 5
× 5 array image with values in the interval [0,1].
The OWA operators associated with each mask were randomly initialized. The
same OWA operators were used for both horizontal/vertical and diagonal/anti-diagonal
initializations. The weights are shown in Figure 4-4. The vertical axis is the value of
the weights. The horizontal axis is the index of the ordered elements of the mask. The
height of the bar at index i indicates the value of wi in equation 3–2. The first fifteen
weights were set to have very small values, and the other ten weights were sampled
from the uniform distribution on the interval [0, 1] initially. The weights were normalized
for each mask.
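The OWA initialization and the operator itself (equation 3–2) can be sketched as follows. The sort direction inside the OWA is an assumption; the split of fifteen near-zero weights plus ten uniform ones follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_owa_weights(n=25, n_small=15):
    """First n_small weights near zero, the rest uniform on [0, 1), then normalize."""
    w = np.concatenate([np.full(n_small, 1e-6), rng.random(n - n_small)])
    return w / w.sum()

def owa(values, w):
    """Ordered weighted average: sort the inputs, then take the weighted sum.
    Ascending sort is an assumption about equation 3-2's ordering convention."""
    return float(np.sort(values) @ w)

w = init_owa_weights()
```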
After McFeaLHMM feature learning, the classification rate over the test set on
synthetic data was 100%. The final masks are shown in Figure 4-5. The masks are
very similar to initial masks. This is expected since these final masks can extract the
“hyperbola” shape information of the training images. The final OWA weights are shown
in Figure 4-6. These final weights are similar to a mean OWA operator, or they favor the
high-valued elements of the masks.
The result shows that McFeaLHMM can achieve great performance with good
initialization. However, the experiment also showed that the algorithm is sensitive to
initialization. The algorithm could not converge with random initialization.
4.2.2 GPRArid
The McFeaLHMM algorithm and the standard HMM algorithm (DTXTHMM) were
compared using GPRArid. The standard HMM algorithm used human-made feature
masks, which were created by an expert on GPR data.
GPRArid contains 80 images from each class in the training set and 40 images from
each class in the testing set. The dimensionality of the feature was chosen to be four.
Two dimensions were extracted from the positive part of the image and two dimensions
were extracted from the negative part of the image. The masks and OWA operators
were initialized the same way as for SynData1.
The algorithms were trained on the training set and tested on the test set. The
plots of Probability of Detection (PD) vs. Probability of False Alarm (PFA) on the
test set for the standard HMM, McFeaLHMM-trained via feature learning initialized
with the horizontal and vertical masks, and McFeaLHMM-trained via feature learning
initialized with the diagonal and anti-diagonal features are shown in Figure 4-8. As can
be seen, the PFA of the feature learning algorithm is reduced to 80% of a standard HMM
algorithm at a PD of 90%. The features learned by the McFeaLHMM algorithm on the
training set are shown in Figure 4-7.
The final OWA weights trained on GPRArid did not appear qualitatively different
from the weights trained on SynData1. The final masks are similar to the initial masks.
This is not surprising since these mask can extract information from the “hyperbola”
shape of mine images.
The experiments have shown that this MCE feature learning method is very
sensitive to initialization. The learning rates also had to be carefully tuned; otherwise the
training algorithm could not converge to a stable point. In fact, identifying learning
algorithms that are not so sensitive was the reason that our sampling feature learning
algorithms were proposed.
4.2.3 GPRTwoSite
Comparisons of the SampHMM algorithm and the standard HMM algorithm
(DTXTHMM) were made on this dataset. Lane-based 10-fold cross validation (P.
Gader and J. Bolton, pers. comm.) was used to evaluate the algorithms.
The plots of PD vs. PFA for the SampHMM and DTXTHMM algorithms on the different
data sites are shown in Figures 4-23, 4-24, 4-25, 4-26, 4-27, and 4-28. The plots show
that the PFA of SampHMM is lower than, and in some cases half of, the PFA of the
standard HMM algorithm at PDs of 90% and 85%. We conclude that the SampHMM
algorithm outperforms the existing HMM algorithms for GPR mine detection.
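The PD-vs-PFA operating points used throughout these comparisons can be computed from per-alarm confidence scores by sweeping a threshold; a minimal sketch (with toy scores, not the GPR data) is:

```python
import numpy as np

def pd_pfa_curve(scores_mine, scores_fa):
    """PD/PFA operating points swept over all observed score thresholds."""
    thresholds = np.unique(np.concatenate([scores_mine, scores_fa]))
    return [(float(np.mean(scores_fa >= t)),    # PFA: false alarms above threshold
             float(np.mean(scores_mine >= t)))  # PD: mines above threshold
            for t in thresholds]

curve = pd_pfa_curve(np.array([0.9, 0.8, 0.7]), np.array([0.6, 0.4, 0.2]))
```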
4.2.4 SynData2
LSampFeaLHMM was tested using SynData2. This synthetic data set contains
the data samples of image sequences. There are 100 sequences for each class. All of
them were used as training samples. The purpose was to show that the feature masks
learned by the algorithm could match the templates used to generate the training data
samples. Experiments were conducted using hit masks and hit-miss masks.
We applied three-state HMM models in the training process. After Gibbs feature
learning with the hit mask only, the final masks are shown in Figure 4-10. From left to
right, there are three states shown in the figure. The bottom row shows the hit mask for
each state. The top row shows the value of features. The horizontal axis indicates the
index of images in the specified state for all sequences. The first half is from the feature
class, and the second half is from the background class. The vertical axis shows the
feature values of images after applying the mask. The final hit masks are an intuitive
result, exactly what we expected: they perfectly matched the binary templates that
we used to generate the training samples. The feature values are well separated
between feature images and background images; a large feature value indicates a
strong ‘hit’ in that image.
After Gibbs feature learning with hit-miss mask instead of hit mask only, the final
result for the state of the 135 degree image group is shown in Figure 4-11. The left part
of the figure shows the images associated with this state at the final iteration. The top
half is from the feature class, and the bottom half is from the background class. Each
row is a vector format of an image. The feature values of images in this state are shown
at the top center of the figure. The final hit-miss mask is shown at the bottom center of
the figure. The intensities of the image for the hit-miss mask are in the interval [-1, 1].
The left part of the figure shows the matrix format of images. The left two columns are
from the feature class. The right-most column is from the background class. The two
adjacent images are separated by a blank row in each column. The feature value figure
shows the feature value with the hit-miss mask is more separated than the feature-value
with hit mask only.
The individual hit mask, do-not-care mask, and miss mask are shown in Figure
4-12. The intensity values of all three images are in the interval [0, 1]. The figure shows
that the hit mask is an anti-diagonal shape. The miss mask is a negative anti-diagonal
mask. The do-not-care mask is almost blank, which fits this data set.
We also conducted experiments to test the performance of the algorithm when
images are not aligned in the same position. It would require the algorithm to shift the
image to match the feature mask in the learning process. The results are shown in
Figure 4-13. From the left part and right part of the figure, we can see that some images
in this group do not align, but the results show that the feature values are well separated
again. The final hit-miss mask still has a good anti-diagonal shape, although it is not crisp.
The individual hit mask, do-not-care mask, and miss mask are shown in Figure
4-14. The hit mask still has good shape information. The miss mask has weak values,
since it is hard to match against the off-alignment positions. The do-not-care mask gains
some intensity, since the offset positions may not contribute as much, either from the
positive part or from the negative part. The experiments show that LSampFeaLHMM
can learn intuitive features that produce improved classification.
4.2.5 GPRTemp
SampHMM, LSampFeaLHMM and standard HMM were compared using GPRTemp.
Both SampHMM and standard HMM use the human-made feature masks created by
experts. LSampFeaLHMM tried to learn feature masks in the training process. Similar
to experiments using GPRTwoSite, lane-based cross validation was used to evaluate
the algorithms. In the experiment for LSampFeaLHMM, the data was preprocessed
first. Each mine image was normalized and was scaled to the interval [0, 1]. Then the
image was semi-thresholded and skeletonized to obtain a crisp gray-level image. Then
we moved a 5 × 5 window along the x-axis to capture the image sequences. For each
data sample of a mine image, one image sequence was extracted. Twenty-five image
sequence samples are shown in Figure 4-15. Each sequence is along the row. Two
sequences are separated by a horizontal gray bar, and two adjacent images in one
sequence are separated by a vertical gray bar. It can be seen that the sequences
consist of ascending-edge and descending-edge images.
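The sliding-window sequence extraction described above can be sketched as below; the stride and the vertical placement of the 5 × 5 window are not specified in the text, so the sketch assumes stride 1 and a vertically centered band.

```python
import numpy as np

def extract_sequence(image, win=5):
    """Slide a win x win window along the x-axis of a preprocessed image,
    producing one observation sequence (stride 1, centered band assumed)."""
    h, w = image.shape
    top = (h - win) // 2
    return [image[top:top + win, x:x + win] for x in range(w - win + 1)]

seq = extract_sequence(np.zeros((29, 23)))
```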
Two hundred image sequences from the mine class were extracted from landmine
data in the training set. The final hit masks after Gibbs feature learning with a four-state
HMM setting are shown in Figure 4-16. Each 5 × 5 block is one hit mask for one state,
and two masks are separated by a vertical black bar. The second state has very few
samples associated with it, so the second hit mask can be ignored. It is clear that the
final three hit masks are capturing ascending-edge, flat-edge, and descending-edge
information, respectively.
The final hit masks after Gibbs feature learning with a three-state HMM setting are
shown in Figure 4-17. The second state has very few samples associated with it, so the
second hit mask is ignored. The first and third hit masks capture the ascending-edge
and descending-edge information, respectively.
The plots of PD vs PFA on the test site for the standard HMM, SampHMM, and
LSampFeaLHMM algorithm are shown in Figure 4-18. The figure shows that the HMM
sampling algorithm has the lowest PFA at 90% of PD. The PFA of the Gibbs feature
learning algorithm matches or exceeds the standard HMM algorithm at most PDs. The
result shows that the HMM sampling algorithm performed best on the landmine GPR
data set. Our feature learning algorithm (LSampFeaLHMM) can match or exceed HMM
algorithms with human-made feature masks, thus saving time and labor.
4.2.6 Handwritten Digits
Comparisons of the HMM algorithm with human-made feature masks and the
TSampFeaLHMM algorithm were made on handwritten digits. The HMM algorithm used here is the
sampling HMM algorithm similar to the SampHMM algorithm. It has one Gaussian
component per state. The human-made masks were used to perform convolution
over digit images to create features, as shown in Figure 3-7. In experiments on the
TSampFeaLHMM algorithm, these feature masks were estimated in the training
process. The human-made masks are shown in Figure 4-22. They are nine line
segments that were oriented 0, 20, 40, 50, 70, 90, 110, 130, and 160 degrees,
respectively. These masks were created to simulate the commonly used edge detectors
(Frigui et al., 2009). An all blank mask was added to capture the empty background
zone.
Each raw digit image is a 28 × 28 gray-level image. The intensity values were then
scaled to [0, 1]. Next, the principal transform algorithm was applied to the images, so
that the two principal directions of the images were horizontal and vertical axes. Then
the background values of the images were set to -1, and zero values were padded
to the edges of the digits. Some samples are shown in Figure 4-19. Next, each image
was split into sixteen overlapping 8 × 8 zones, as shown in Figure 4-20. Ordering the
zones along the anti-diagonal direction, from top-right to bottom-left, formed a sequence
from these sixteen zones.
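The zone decomposition and anti-diagonal ordering can be sketched as follows; since 8 × 8 windows cannot tile a 28 × 28 image with a uniform integer stride, the zone offsets (0, 7, 13, 20) are an assumption, as is the tie-breaking within each anti-diagonal.

```python
import numpy as np

def split_zones(image, win=8, grid=4):
    """Split a square image into grid x grid overlapping win x win zones and
    order them along the anti-diagonal, top-right to bottom-left."""
    n = image.shape[0]
    starts = np.round(np.linspace(0, n - win, grid)).astype(int)  # assumed: 0, 7, 13, 20
    zones = [(r, c, image[sr:sr + win, sc:sc + win])
             for r, sr in enumerate(starts) for c, sc in enumerate(starts)]
    # Anti-diagonal order: group by r + c, scanning from top-right toward bottom-left.
    zones.sort(key=lambda z: (z[0] + z[1], -z[1]))
    return [z[2] for z in zones]

seq = split_zones(np.zeros((28, 28)))
```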
Four digit classes were picked in experiments: 0, 1, 2, 4. The training set had 300
digit images for each class. The test set also had 300 digit images for each class. The
algorithms were trained on the training set and tested on the test set.
Seven-state HMM models were used in the experiment using TSampFeaLHMM
with digit zone sequences. Full transition matrices were used. First, in the training
process, the feature masks and HMM state parameters were learned simultaneously.
Sampling was performed for 10,000 iterations, a sufficient burn-in period; then the
feature masks were fixed and the HMM parameters were updated for subsequent
iterations. Training was stopped after 15,000 iterations. The second loop was to
fine-tune the HMM parameters. The same training process, without the feature learning
step, was used for SampHMM.
Classification results of TSampFeaLHMM are shown in Table 4-2 as the
confusion matrix for the digit pair 0 and 1. The row index is the true class and the
column index is the algorithm's classification. The classification error is about 2% for
these two digit classes.
Classification results of TSampFeaLHMM for digit classes 0, 1, 2, and 4 are
shown in Table 4-3. Digit class 2 was the most difficult to classify correctly. The
classification error is about 8% for these four digit classes.
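The "about 8%" figure can be checked directly from Table 4-3: the classification error is one minus the trace of the confusion matrix divided by its total count.

```python
import numpy as np

# Confusion matrix from Table 4-3 (rows: true class 0, 1, 2, 4; columns: predicted).
conf = np.array([[277,   0,  10,  13],
                 [  0, 292,   1,   6],
                 [ 12,   0, 256,  32],
                 [  2,   1,  13, 284]])

error = 1.0 - np.trace(conf) / conf.sum()   # off-diagonal fraction, roughly 0.075
```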
The final hit-miss masks and transition matrix are shown in Figure 4-21. The size of
each pixel indicates the relative value of the corresponding element of the transition
matrix. It is difficult to draw conclusive information from these hit masks.
Classification results of the HMM algorithm with human-made masks for digits
0, 1, 2, and 4 are shown in Table 4-4. The classification error is about 14% for these
digits, which is worse than our feature learning algorithm. These results show that the
feature learning HMM outperforms the HMM algorithm using human-made features.
Table 4-1. Algorithms and Datasets

             McFeaLHMM   SampHMM   LSampFeaLHMM   TSampFeaLHMM   DTXTHMM*
SynData1         X
SynData2                                X
GPRArid          X                                                   X
GPRTemp                      X          X                            X
GPRTwoSite                   X                                       X
HWDR                         X*                        X

*Using handmade features
Table 4-2. Confusion matrix for digit pair 0 and 1 for TSampFeaLHMM

Digits     0     1
0        300     0
1          8   292
Table 4-3. Confusion matrix for digits 0, 1, 2, 4 for TSampFeaLHMM

Digits     0     1     2     4
0        277     0    10    13
1          0   292     1     6
2         12     0   256    32
4          2     1    13   284
Table 4-4. Confusion matrix for digits 0, 1, 2, 4 for HMM with human-made masks

Digits     0     1     2     4
0        256     1    37     6
1          1   291     5     3
2         13     0   274    13
4         52     1    38   209
(a) Class 1: 45-degree angle images (ten randomly picked images from class 1).
(b) Class 2: 60-degree angle images (ten randomly picked images from class 2).
Figure 4-1. Ten samples from each class of dataset SynData1.
(a) Ten samples from the mine data set.
(b) Ten samples from the non-mine data set.
Figure 4-2. Samples from each class of dataset GPRArid.
(a) Horizontal and vertical pairs.
(b) Diagonal and anti-diagonal pairs.
Figure 4-3. Hit-miss pairs for initial masks.
Figure 4-4. Initial OWA weights for hit and miss. The top weights correspond to the hit mask and the bottom weights correspond to the miss mask.
(a) Horizontal and vertical pairs.
(b) Diagonal and anti-diagonal pairs.
Figure 4-5. Hit-miss masks after feature learning corresponding to the initial masks in Figure 4-3.
(a) Hit and miss weights for horizontal and vertical initialization.
(b) Hit and miss weights for diagonal and anti-diagonal initialization.
Figure 4-6. OWA weights after feature learning.
(a) Horizontal and vertical pairs.
(b) Diagonal and anti-diagonal pairs.
Figure 4-7. Final masks learned for the landmine data. Each row represents a different feature: rows 1 and 2 are positive; rows 3 and 4 are negative.
Figure 4-8. Receiver operating characteristic curves comparing McFeaLHMM with two different initializations to the standard HMM. (PD/PFA operating points from the legend — diagonal + anti-diagonal masks: 100/15.0, 95/7.5, 90/2.5; horizontal + vertical masks: 100/45.0, 95/7.5, 90/2.5; standard HMM: 100/37.5, 95/22.5, 90/12.5.)
Figure 4-9. Left: ascending-edge, flat-edge, and descending-edge sequences. Right: sequences from the noise background.
Figure 4-10. Hit masks after Gibbs feature learning.
Figure 4-11. Result for the 135-degree state after Gibbs feature learning with hit-miss masks. (Feature-value threshold shown: -1.2596.)
Figure 4-12. Hit-miss masks after Gibbs feature learning. (Panels: hit, do-not-care, miss.)
Figure 4-13. Result with shifted training images after Gibbs feature learning with hit-miss masks. (Feature-value threshold shown: 2.)
Figure 4-14. Hit-miss masks after Gibbs feature learning with shifted training images. (Panels: hit, do-not-care, miss.)
Figure 4-15. 25 sample sequences extracted from mine images from dataset GPRTemp.
Figure 4-16. Hit masks after Gibbs feature learning with the four-state HMM setting.
Figure 4-17. Hit masks after Gibbs feature learning with the three-state HMM setting.
Figure 4-18. Receiver operating characteristic curves comparing the LSampFeaLHMM and SampHMM algorithms with the standard HMM algorithm. (Axes: PD vs. FAR in FA/m²; site: a temperate test site.)
Figure 4-19. 18 samples for each digit from MNIST.
Figure 4-20. Two samples for each digit to show zone splitting. Panels: (a) digit 0, (b) digit 1, (c) digit 2, (d) digit 4.
Figure 4-21. Hit masks and transition matrices after Gibbs feature learning for digits. Panels: (a) digit 0, (b) digit 1, (c) digit 2, (d) digit 4.
Figure 4-22. Ten human-made masks.
Figure 4-23. HMMSamp vs DTXTHMM on GPR A1 at site S1 while clutter == mine (PD vs. false alarms (a.u.)).
Figure 4-24. HMMSamp vs DTXTHMM on GPR A1 at site S1 while clutter ∼= mine.
Figure 4-25. HMMSamp vs DTXTHMM on GPR A2 at site S1 while clutter ∼= mine.
Figure 4-26. HMMSamp vs DTXTHMM on GPR A2 at site S2 while clutter ∼= mine.
Figure 4-27. HMMSamp vs DTXTHMM on GPR A2 at site S1 while clutter == mine.
Figure 4-28. HMMSamp vs DTXTHMM on GPR A2 at site S2 while clutter == mine.
CHAPTER 5
CONCLUSIONS AND FUTURE WORK
The performance of feature-based learning methods such as HMMs depends not only
on the design of the classifier, but also on the features. Few studies have
investigated both as a whole system. Features that accurately and succinctly
represent the discriminating information in an image or signal are essential to any classifier.
Our approach involved developing a parameterized model of feature extraction
based on morphological or linear convolution operations. To learn the parameters of
the feature model and the HMM, two feature learning algorithms were developed that
simultaneously extract the features and learn the parameters of the HMM.
One algorithm is based on minimum classification error, and the other is based on Gibbs
sampling. The Gibbs sampling method was adopted because it is more robust to
initialization and achieves better solutions. The experiments show that this new method
can outperform the other methods in the landmine detection application.
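To make the sampling-based learning concrete, the sketch below shows one Gibbs sweep for a simple discrete-observation HMM: a state sequence is drawn by forward filtering and backward sampling, and the transition and emission rows are then resampled from their Dirichlet posteriors. This is only a minimal illustration under assumed symmetric Dirichlet priors, with illustrative function names; the algorithms developed in this work additionally sample the feature-mask parameters and use Gaussian emissions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_states(obs, A, B, pi, rng):
    """Forward-filter, backward-sample one state sequence given HMM params."""
    T, N = len(obs), len(A)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()                  # scaled filtering distribution
    states = np.zeros(T, dtype=int)
    states[-1] = rng.choice(N, p=alpha[-1])
    for t in range(T - 2, -1, -1):                  # backward sampling pass
        p = alpha[t] * A[:, states[t + 1]]
        states[t] = rng.choice(N, p=p / p.sum())
    return states

def gibbs_sweep(obs, A, B, pi, rng, prior=1.0):
    """One Gibbs sweep: sample states, then resample each row of the
    transition matrix A and emission matrix B from its Dirichlet posterior
    (conjugate update: prior pseudo-counts plus observed counts)."""
    N, M = B.shape
    s = sample_states(obs, A, B, pi, rng)
    A_counts = np.full((N, N), prior)
    B_counts = np.full((N, M), prior)
    for t in range(len(obs) - 1):
        A_counts[s[t], s[t + 1]] += 1
    for t, o in enumerate(obs):
        B_counts[s[t], o] += 1
    A_new = np.array([rng.dirichlet(row) for row in A_counts])
    B_new = np.array([rng.dirichlet(row) for row in B_counts])
    return s, A_new, B_new
```

Repeating such sweeps yields samples from the joint posterior over states and parameters, which is what makes the method less sensitive to initialization than gradient-based training.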
Additionally, a new method for learning the parameters of an HMM with
multivariate Gaussian mixture emissions has been presented. This method has been shown to
improve performance on both synthetic and real data sets compared to existing
state-of-the-art methods and to human-made features.
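One way such Gaussian mixture emission parameters can be resampled inside a Gibbs scheme is sketched below (a hedged illustration; the exact updates used in this work may differ): for the observations currently assigned to one HMM state, component labels are drawn from the responsibilities, and each component mean is then drawn from its Gaussian posterior, assuming fixed covariances and a zero-mean Gaussian prior on the means. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def resample_mixture_means(X, means, covs, weights, rng, prior_var=10.0):
    """Sample mixture-component labels for the rows of X, then draw each
    component mean from its Gaussian posterior (covariances held fixed,
    N(0, prior_var * I) prior on each mean)."""
    K, d = means.shape
    # Log responsibilities for each component (up to a shared constant).
    logp = np.stack([
        np.log(weights[k])
        - 0.5 * np.sum((X - means[k]) @ np.linalg.inv(covs[k]) * (X - means[k]), axis=1)
        - 0.5 * np.log(np.linalg.det(covs[k]))
        for k in range(K)
    ], axis=1)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(K, p=row) for row in p])
    new_means = means.copy()
    for k in range(K):
        Xk = X[labels == k]
        if len(Xk) == 0:
            continue  # no data assigned: keep the current mean
        Sinv = np.linalg.inv(covs[k])
        post_prec = len(Xk) * Sinv + np.eye(d) / prior_var
        post_cov = np.linalg.inv(post_prec)
        post_mean = post_cov @ (Sinv @ Xk.sum(axis=0))
        new_means[k] = rng.multivariate_normal(post_mean, post_cov)
    return labels, new_means
```

Interleaving this update with the state-sequence sampling gives a full sweep over states, labels, and emission parameters.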
Specifically, the following results were achieved:
• McFeaLHMM is very sensitive to initialization and learning rates.
• SampHMM was far superior to known HMMs for GPR mine detection.
• All feature learning models can achieve performance similar to or better than
human-made features in the HMM framework.
• The two sampling HMM feature learning algorithms are much more stable and can
produce better solutions than the McFeaLHMM algorithm in landmine detection
experiments.
Future work will include: applying feature learning algorithms to other datasets,
such as the GPRTwoSite; using a sigmoid model instead of a Gaussian model as the
state probability function in the HMM framework; performing discriminative training
via sampling; and investigating sampling OWA operators using Metropolis-Hastings
algorithms.
BIOGRAPHICAL SKETCH
Xuping Zhang is a Ph.D. student at the University of Florida. He earned his
bachelor’s degree at Tsinghua University, Beijing, China. His research interests include
landmine detection, artificial intelligence, machine learning, Bayesian methods, feature
learning for images/signals, data mining, and pattern recognition.