Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
AUTOMATIC FEATURE LEARNING AND PARAMETER ESTIMATION FOR HIDDENMARKOV MODELS USING MCE AND GIBBS SAMPLING
By
XUPING ZHANG
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOLOF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
c© 2009 Xuping Zhang
2
To my Parents because they taught me everything,
to my Brother because he was there when I needed help,
to my professors because they pointed me in the right direction when I was lost,
thank you
3
ACKNOWLEDGMENTS
I would like to thank Dr. Paul Gader, Dr. Joe Wilson, Dr. Gerhard Ritter, Dr. David
Wilson, and Dr. Tamer Kahveci for their patience and understanding. I would also like to
thank my co-workers at the lab, Raazia Mazhar, Alina Zare, Jeremy Bolton, Seniha Esen
Yuksel, Gyeongyong Heo, Andres Mendez-Vazquez, Arthur Barnes, Ryan Close, Ryan
Busser, Kenneth Watford, Hyo-Jin Suh, Wen-Hsiung Lee, John McElroy, Taylor Glenn,
Sean Matthews, and Ganesan Ramachandran, for their help and understanding.
4
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF SYMBOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Statement of Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.2 Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.3 Overview of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Feature Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.1.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.1.2 Feature Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.1.3 Sparsity Promotion . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.1.4 Information Theoretic Learning . . . . . . . . . . . . . . . . . . . . 222.1.5 Transformation Methods . . . . . . . . . . . . . . . . . . . . . . . . 242.1.6 Convolutional Neural Network and Shared-Weight Neural Network 292.1.7 Morphological Transform . . . . . . . . . . . . . . . . . . . . . . . . 302.1.8 Bayesian Nonparametric Latent Feature Model . . . . . . . . . . . 31
2.2 Hidden Markov Model (HMM) . . . . . . . . . . . . . . . . . . . . . . . . . 322.2.1 Definition and Basic Concepts . . . . . . . . . . . . . . . . . . . . . 322.2.2 Applications of the HMM . . . . . . . . . . . . . . . . . . . . . . . . 332.2.3 Learning HMM Parameters . . . . . . . . . . . . . . . . . . . . . . 332.2.4 Minimum Classification Error (MCE) for HMM . . . . . . . . . . . . 34
2.3 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 CONCEPTUAL APPROACH . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.2 Feature Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Ordered Weighted Average (OWA)-based Generalized MorphologicalFeature Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.2 Convolutional Feature Models . . . . . . . . . . . . . . . . . . . . . 423.2.2.1 Feature model for loosely coupled sampling feature learning
HMM (LSampFealHMM) . . . . . . . . . . . . . . . . . . 42
5
3.2.2.2 Feature model for tightly coupled sampling feature learningHMM (TSampFealHMM) . . . . . . . . . . . . . . . . . . 43
3.3 Feature Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.4 Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 MCE-HMM Model for Feature Learning . . . . . . . . . . . . . . . . 463.4.2 Gibbs Sampling Method for Continuous HMM with Multivariate
Gaussian Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.4.3 Loosely Coupled Gibbs Sampling Model for Feature Learning . . . 52
3.4.3.1 Gibbs sampler for LSampFeaLHMM . . . . . . . . . . . . 523.4.3.2 Initialization and modified Viterbi learning . . . . . . . . . 57
3.4.4 Tightly Coupled Gibbs Sampling Model for Feature Learning . . . . 59
4 EMPIRICAL ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.1 SynData1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714.2.2 GPRArid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.2.3 GPRTwoSite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734.2.4 SynData2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734.2.5 GPRTemp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.2.6 Handwritten Digits . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . 93
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6
LIST OF TABLES
Table page
4-1 Algorithms and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-2 Confusion matrix for digit pair 0 and 1 for TSampFeaLHMM . . . . . . . . . . . 78
4-3 Confusion matrix for digits 0, 1, 2, 4 for TSampFeaLHMM . . . . . . . . . . . . 78
4-4 Confusion matrix for digits 0, 1, 2, 4 for HMM with human-made masks . . . . 78
7
LIST OF FIGURES
Figure page
1-1 General classification model with diagram. . . . . . . . . . . . . . . . . . . . . . 15
2-1 The dashed line is the PCA projection, but the vertical dotted line representsthe best projection to separate two clusters . . . . . . . . . . . . . . . . . . . . 38
2-2 The top plot has β close to zero to maximize the variation of projections (horizontalaxis) of all observations, and the bottom plot has β close to one to minimizethe variation of the projections (vertical axis) of the observations in same cluster. 38
3-1 Feature extraction process for feature learning . . . . . . . . . . . . . . . . . . 63
3-2 MCE-based training process for feature learning . . . . . . . . . . . . . . . . . 64
3-3 Gibbs sampling HMM training process . . . . . . . . . . . . . . . . . . . . . . . 64
3-4 Feature model for LSampFealHMM . . . . . . . . . . . . . . . . . . . . . . . . . 65
3-5 LSampFealHMM Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 65
3-6 Initial feature model for TSampFealHMM . . . . . . . . . . . . . . . . . . . . . . 66
3-7 Final feature model for TSampFealHMM . . . . . . . . . . . . . . . . . . . . . . 66
3-8 TSampFealHMM Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 67
4-1 Ten samples from each class of dataset SynData1. . . . . . . . . . . . . . . . . 79
4-2 Samples from each class of dataset GPRArid. . . . . . . . . . . . . . . . . . . 80
4-3 Hit-miss pairs for initial masks. . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-4 Initial OWA weights for hit and miss. The top weights correspond to the hitmask and the bottom weights correspond to the miss mask. . . . . . . . . . . . 81
4-5 Hit-miss masks after feature learning corresponding to initial masks in Figure4-3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4-6 OWA weights after feature learning. . . . . . . . . . . . . . . . . . . . . . . . . 81
4-7 Final masks learned for the landmine data. Each row represents different feature.Row 1 positive, Row 2 positive, Row 3 negative, Row 4 negative. . . . . . . . . 82
4-8 Receiver operating characteristic curves comparing McFeaLHMM with twodifferent initializations to the standard HMM. . . . . . . . . . . . . . . . . . . . . 82
4-9 Left: ascending edge, flag edge, and descending edge sequences. Right: sequencesfrom noise background. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8
4-10 Hit masks after Gibbs feature learning. . . . . . . . . . . . . . . . . . . . . . . . 83
4-11 Result for 135-degree state after Gibbs feature learning with hit-miss masks. . 84
4-12 Hit-miss masks after Gibbs feature learning. . . . . . . . . . . . . . . . . . . . . 84
4-13 Result with shifted training images after Gibbs feature learning with hit-missmasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4-14 Hit-miss masks after Gibbs feature learning with shifted training images. . . . . 85
4-15 25 sample sequences extracted from mine images from dataset GPRTemp. . . 86
4-16 Hit masks after Gibbs feature learning with four-state HMM setting. . . . . . . . 86
4-17 Hit masks after Gibbs feature learning with three-state HMM setting. . . . . . . 87
4-18 Receiver operating characteristic curves comparing LSampFeaLHMM andSampHMM algorithms with the standard HMM algorithm. . . . . . . . . . . . . 87
4-19 18 samples for each digit from MNIST. . . . . . . . . . . . . . . . . . . . . . . . 88
4-20 Two samples for each digit to show zone splitting. . . . . . . . . . . . . . . . . 89
4-21 Hit masks and transition matrix after Gibbs feature learning for digits. . . . . . . 89
4-22 Ten human-made masks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4-23 HMMSamp vs DTXTHMM on GPR A1 at site S1 while clutter == mine. . . . . . 90
4-24 HMMSamp vs DTXTHMM on GPR A1 at site S1 while clutter ∼= mine. . . . . 90
4-25 HMMSamp vs DTXTHMM on GPR A2 at site S1 while clutter ∼= mine. . . . . 91
4-26 HMMSamp vs DTXTHMM on GPR A2 at site S2 while clutter ∼= mine. . . . . 91
4-27 HMMSamp vs DTXTHMM on GPR A2 at site S1 while clutter == mine. . . . . . 92
4-28 HMMSamp vs DTXTHMM on GPR A2 at site S2 while clutter == mine. . . . . . 92
9
LIST OF SYMBOLS, NOMENCLATURE, OR ABBREVIATIONS
EM expectation-maximization
GPR ground penetrating radar
HMM hidden Markov model
LDA linear discriminant analysis
LSampFealHMM loosely coupled sampling feature learning HMM
MCE minimum classification error
McFeaLHMM MCE feature learning HMM
MCMC Markov chain Monte Carlo
ML maximum-likelihood
OWA ordered weighted averaging
PD probability of detection
PFA probability of false alarm
ROC receiver operating characteristic
SampHMM sampling HMM
TSampFeaLHMM tightly coupled sampling feature learning HMM
10
Abstract of Dissertation Presented to the Graduate Schoolof the University of Florida in Partial Fulfillment of theRequirements for the Degree of Doctor of Philosophy
AUTOMATIC FEATURE LEARNING AND PARAMETER ESTIMATION FOR HIDDENMARKOV MODELS USING MCE AND GIBBS SAMPLING
By
Xuping Zhang
December 2009
Chair: Dr. Paul GaderMajor: Computer Engineering
Hidden Markov models (HMMs) are useful tools for landmine detection using
Ground Penetrating Radar (GPR), as well as many other applications. The performance
of HMMs and other feature-based methods depends not only on the design of the
classifier but also on the features. Few studies have investigated both classifiers and
feature sets as a whole system. Features that accurately and succinctly represent
discriminating information in an image or signal are very important to any classifier.
In addition, when the system that generated those original images has to change to
fit different environments, the features usually have to be modified. The process of
modification can be laborious and may require a great deal of application domain
specific knowledge. Hence, it is worthwhile to investigate methods of automatic feature
learning for purposes of automating algorithm development. Even if the discrimination
capability is unchanged, there is still value to feature learning in terms of time saved.
In this dissertation, two new methods are explored to simultaneously learn
parameters intended to extract features and learn parameters for image-based
classifiers. The notion of an image is general here. For example, a sequence of time or
frequency domain features. We have developed a generalized, parameterized model
of feature extraction based on morphological operations. More specifically, the model
includes hit-and-miss masks to extract the shape of interests in the images. In one
method, we use the minimum classification error (MCE) method with generalized
11
probabilistic descent algorithm to learn the parameters. Since our model is based
on gradient decent methods, the MCE method cannot guarantee a global optimal
solution and is very sensitive to initialization. We propose a new learning method
based on Gibbs sampling to learn the parameters. The new learning method samples
parameters from their individual conditional probability distribution instead to maximize
the probability directly. This new method is more robust to initialization, and can
generally find a better solution.
We also developed a new learning method based on Gibbs sampling to learn
parameters for continuous hidden Markov models with multivariate Gaussian mixtures.
Because hidden Markov models with multivariate Gaussian mixtures are commonly
used HMM models in applications, we propose a learning method based on Gibbs
sampling. The proposed method is empirically shown to be more robust than comparable
expectation-maximization algorithms.
We performed experiments using both synthetic and real data. The results show
that both methods work better than the standard HMM methods used in landmine
detection applications. Experiments with handwritten digits are also presented. The
results show that the HMM-model framework with the automatic learning feature
algorithm again performed better than the same framework with the man-made feature.
12
CHAPTER 1INTRODUCTION
1.1 Statement of Problem
A classifier is an algorithm that takes a feature vector as input and produces class
labels or confidences. Significant effort has been expended in the area of classifier
design and learning algorithms to determine parameters of classifiers. Less attention
has been focused, however, on the equally important problem of learning features
for classification. For most image- and signal-based classifier problems, experts use
their knowledge to find the features for the shapes of objects of interest in the images.
Despite this, these features may not be optimal, and human-based feature selection is
time-consuming, ad-hoc, and expensive. The problem addressed in this research is the
problem of automatically identifying and learning features for image-based classification.
1.2 Classification Model
The general classification problem (Figure 1-1) is difficult to investigate in its full
generality. Hence, the sub-problem involving a hidden Markov model (HMM) with
morphological features is considered since this methodology type is based on ad-hoc
methods that have shown promise in the past (Gader et al.; Zhao et al., 2003).
1.3 Overview of Research
The proposed approach involves developing a generalized, parameterized model of
feature extraction based on morphological or linear convolution operations. Two feature
learning algorithms are proposed to learn the parameters of the feature model and the
HMM. One algorithm is based on minimum classification error and the other is based on
Gibbs sampling. Additionally, a new method based on Gibbs sampling is proposed to
learn the parameters of continuous HMM with multivariate Gaussian mixtures.
The feature learning approach involves parameterizing the feature extraction
process. In landmine detection, the existing feature extraction methodology for the
HMMs proceeds by performing morphological operations on small windows of positive
13
and negative derivative images, so the degree to which a diagonal or anti-diagonal
shape fits within the window is computed. Specifically, the morphological operation
is an erosion, which can be calculated as a local minimum operation. These degrees
are aggregated using a maximum operation over a vertical column (Gader et al.). In
addition, linear convolution-based features will also be considered.
In one proposed algorithm, we define a minimum classification error (MCE)
objective function. The function depends not only on the HMM classification parameters,
but also on the feature extraction parameters. Optimizing this objective function over
both parameter sets simultaneously yields both feature extraction and classification
parameters. The feature model generalizes the morphological model using ordered
weighted average (OWA) operations. OWA operators are used to parameterize both
the morphological operations and the aggregation operations as they form a family of
operators that can represent maximum, minimum, and many other operators.
For another proposed algorithm, we apply a Bayesian framework to the feature
learning models. Rather than defining an objective function and maximizing that
function, we try to find the full probabilities distribution of parameters and data. By
sampling the parameters from the individual conditional probability distributions, we can
obtain better solutions. We use the Gibbs sampler as the tool to simulate the probability
distribution. Gibbs sampling is a common Markov Chain Monte Carlo (MCMC) sampling
method. It is a straightforward, powerful sampling method (Section 2.3).
Next, a new learning method for continuous HMM with multivariate Gaussian
mixtures is proposed. HMMs with multivariate Gaussian mixtures are widely used
models in many applications (Rabiner, 1989; Zhao et al., 2003). MCMC sampling
methods have the advantage of generally finding better optima than traditional methods,
such as expectation maximization algorithm (Dunmur and Titterington, 1997; Ryden
and Titterington, 1998). Although there are some learning methods proposed for HMM
based on MCMC sampling, these methods are either for a discrete HMM problem, or
14
for some uncommon HMM models, such as trajectory HMM models (Zen et al., 2006),
non-parametric HMM models (Thrun et al., 1999), or nonstationary HMM models (Zhu
et al., 2007). We propose a new learning method based on Gibbs sampling that focuses
on this specific HMM problem.
The rest of the dissertation is organized as follows. In Chapter 2, we review the
literature of various feature learning methods, HMM algorithms, and Gibbs sampling.
In Chapter 3, we present the three new learning algorithms. In Chapter 4, we show the
results of applying these algorithms. Conclusions and future work are in Chapter 5.
Data Features Classifier Class
Confidence
Figure 1-1. General classification model with diagram.
15
CHAPTER 2LITERATURE REVIEW
2.1 Feature Learning
What is a feature? We give the definition of features here. A signal X is a function
of N variables with real or complex or vectors of real or complex values. A feature
is a real or complex value calculated from a signal X. Through mapping function
f : X → Cn, features act as representatives of signals for later processing. Feature
learning defines or finds the map function to get the appropriate data representatives
with respect to the goal of later processing. Feature learning is the focus of research
in many applications (Belongie et al., 1998; Gader and Khabou, 1996; Guyon and
Elisseeff, 2003; Tamburino and Rizki, 1989a; Yu and Bhanu, 2006). Good features
result in better classification accuracy. The classifiers with better features are fast
to compute, and are more cost effective. In addition, better features help humans
understand the underlying process of data generation. For instance, identifying the
genes responsible for certain diseases can help humans to better understand the cause
of some cancers (Lee et al., 2003). We will give an overview of the most commonly used
algorithms in feature extraction in this section. Feature learning can involve selecting a
subset of features from a large set of candidates, learning or estimating parameters of
parameterized operations, or selecting operators from a set of candidates and learning
parameters. The first is referred to as feature selection. The focus of this research is the
second category, learning or estimating parameters of parameterized operations. We
provide reviews of all three types.
2.1.1 Feature Selection
Many applications already have tens, hundreds, or more variables ready to
represent the data itself, whether they are available directly from hardware output or
defined by the expert. One way to obtain the data representatives is to directly select
a subset of variables as features. Generally, there are three major feature selection
16
methods (Blum and Langley, 1997; Guyon and Elisseeff, 2003; Guyon et al., 2004; Liu
and Motoda, 2007): the filter, the wrapper, and the embedded method.
The filter method is independent of later processing. The simplest approach is to
rank the features via some criteria. The rank criteria can be the correlation between the
variables and the labels (Guyon and Elisseeff, 2003), or the mutual information between
variable and the labels (Globerson and Tishby, 2003). The high rank variables are
selected as the final data representatives. The algorithm is fast and easy to implement
and has been successfully used in many cases (Rogati and Yang, 2002; Stoppiglia
et al., 2003). However, some high rank features can be redundant because they can
be highly correlated. Some low rank features may help to improve the performance of
classifiers when they are included (Stoppiglia et al., 2003).
Instead of ranking variables individually, another class of filter methods involves
ranking subsets of all the variables. Some rank criteria are based on mutual information,
such as the minimal-redundancy-maximal-relevance (mRMR) criterion method (Peng
et al., 2005). One variable is chosen in each step to increase the size of the feature
subset, then the mutual information D(S, c) of the label c and the subset S with the next
new variable is computed as D = 1|S|
∑xi∈S I(xi; c), and the mutual information R(S)
of variables in the subset S is computed as R = 1|S|2
∑xi,xj∈S I(xi; xj), where I(x, y) is
mutual information of variables x and y. The mRMR criterion (maxS(D(S, c) − R(S)))
is used to do this incremental search. The result is not a globally optimal solution with
respect to maximizing mutual information among all subsets, as exhaustive evaluation
is not possible, but the method does reduce the number of redundant features and may
select helpful features that are useless by themselves.
Another set-based filter method, based on forward orthogonal search, has been
proposed (Wei and Billings, 2007). An incremental search is conducted using the
squared-correlation between the two variables as the ranking criterion. Suppose we
have a data set of N data samples, and each sample has n feature candidates. The
17
goal is to select d features from these n feature candidates. At first we have a set of
n variables ~xi = [xi(1), ..., xi(N)]T for i = 1, . . . , n, where xi(k) is the i-th feature of
the k-th sample. Now given the current subset of variables that are from the current
best m features (m ¿ n), these variables are transformed into a group of orthogonal
accessory variables. For a new variable with a new feature candidate, the residue of
the variable over the projection of those orthogonal accessory variables is computed.
Then the average value of the squared-correlation between each variable in the current
subset and the new variable’s residues are ranked. The highest one is included in the
new subset. In this method, the efficient feature subsets are selected with clear physical
interpretation. However, the algorithm assumes there is a linear relationship between
variables and sometimes this assumption may not hold.
Since a filter method is independent of later processing, it may not improve the final
performance. On the other hand, wrapper methods incorporate later processing directly.
Wrapper methods select subsets of variables based on their effect on performance of
later processing (Guyon and Elisseeff, 2003).
Since any classifier or other learning algorithm can be used in later processing, the
wrapper method is powerful when applied to the selection of features. However, a filter
method is usually faster than a wrapper method, because wrapper methods need to
perform learning algorithms for every feature subset candidate. Therefore, an efficient
search strategy is needed. Greedy search strategies are commonly used in wrapper
methods.
For supervised learning, the class label is given. It is natural to use the classification
performance to evaluate the relevance of feature sets (Najjar et al., 2003). Unsupervised
learning, which usually involves clustering, is not as straightforward. Instead of
classification errors, other criteria are used, such as maximum likelihood (ML), scatter
separability (Dy and Brodley, 2004), or a discriminant criterion (Roth and Lange, 2004).
ML criterion maximizes the likelihood of the data given the model (feature set and
18
parameters). Scatter separability criterion uses a within-class scatter matrix and a
between-class scatter matrix to measure separation of clusters. Discriminant criterion
uses linear discriminant analysis (LDA) to find the optimal solution.
Embedded methods: Embedded methods are similar to wrapper methods, except
that embedded methods perform feature selection in the process of training. They are
usually specific to a given learning algorithm such as finding features for support vector
machines (SVMs) (Weston et al., 2001).
2.1.2 Feature Weighting
Sometimes we do not explicitly select a subset of features, but assign a real-valued
number to each variable. The value represents the degree to which the variable is
relevant or important. Feature selection can be thought of as a special case of feature
weighting in which the weights are 0 or 1. However, the ability to permit the weights
to vary continuously allows for a wider variety of techniques. Perhaps the best known
algorithms for feature weights are the Winnow algorithm (Littlestone, 1987) and the
RELIEF algorithm (Kira and Rendell, 1992).
The Winnow algorithm was developed to update the weights in a multiplicative
manner. The idea of the Winnow algorithm is to update the weights by presenting the
positive and negative examples iteratively. Given an example denoted by (x1, ..., xn) and
the weights denoted by (w1, ..., wn), where n is the number of features, the algorithm
predicts 1 if w1x1 + ... + wnxn > θthreshold, otherwise it predicts 0. Then, in each iteration,
the weights are updated if the prediction of the algorithm is incorrect. If the algorithm
predicts a negative value for the positive example, the value of wi is increased by the
scale of a promotion parameter for each xi equal to 1. If the algorithm predicts a positive
value for a negative example, the value of wi is decreased by the scale of a demotion
parameter for each xi equal to 1. The promotion and demotion parameters are set by
experiments. The Winnow algorithm is not difficult to implement, and it scales well to
19
high dimensional space, but the convergence of the algorithm is only guaranteed for
linearly separable data (Golding and Roth, 1999).
The RELIEF algorithm (Dietterich, 1997) estimates feature weights iteratively
according to their ability to discriminate between neighboring patterns. In each iteration,
a pattern x is randomly selected. Then the two nearest neighbors of x are found. One is
from the same class as x (termed the nearest hit, denoted by NH(x)) and the other is
from a different class (termed the nearest miss, denoted by NM(x)). The weight of the
i-th feature is then updated as: wnewi = wold
i + |x(i) − NM(i)(x)| − |x(i) − NH(i)(x)|. It is
proven (Sun, 2007) that RELIEF is an online algorithm that solves a convex optimization
problem with a 1-Nearest Neighborhood objective function. Therefore, the RELIEF
algorithm performs better as a nonlinear classifier to search for informative features
compared with filter methods. In addition, it can be implemented very efficiently, as no
exhaustive search is applied. This makes it suitable for large-scale problems. However, it
calculates nearest neighbors in the original feature space rather than in weighted feature
space, which hurts its performance. Moreover, it is not robust with respect to outliers.
The IRELIEF algorithm (Sun, 2007) was proposed to improve the RELIEF algorithm.
It calculates the probabilities of data points in NM(x) and NH(x), and represents the
probability that a data point is an outlier as a binary random hidden variable. It updates
the weights following the principle of the expectation-maximization (EM) algorithm.
The result shows that IRELIEF improves the RELIEF algorithm because it is robust
against mislabeling noise, and is able to find useful features. Because it follows the
EM algorithm, the proper choice of tuning parameters is important to achieve good
performance.
2.1.3 Sparsity Promotion
Sparsity promotion is very important in feature learning. It can control the
complexity of learning functions and can avoid over fitting, yet achieve good generalization.
Moreover, a sparse model is easy to implement, easy to interpret, more stable, and
20
more robust to noise, but it increases the complexity of computation, and sometimes the
optimal function is analytically unsolvable.
Penalization-based or regularization methods are common sparsity promotion
techniques. These methods rely on minimizing a penalty term applied to a set of
parameters. The L2 norm is a very commonly used penalty term. Suppose we promote
the sparsity over a set of n parameters w1, ..., wn. The penalty is defined as α∑
i w2i ,
where α is a decay constant, also termed a weight decay penalty. In a linear model, this
form of weight decay penalty is equivalent to ridge regression. It is good at controlling
model complexity by shrinking all coefficients toward zero, but they all stay in the model,
since it is rare that parameters go to zero.
The L0 norm is defined as the number of nonzero parameters. The L0 norm penalty
is simple to apply, and it promotes sparsity directly (Wipf and Rao, 2005). However,
in general, solving an optimization problem with an L0 norm penalty is an NP-hard
problem; thus convex relaxation regularization by the L1 norm is often considered
(Mørup et al., 2008; Wolf and Shashua, 2003). The L1 norm is defined as∑
i |wi|. The
least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) imposes
the L1 norm constraint on the parameters of a problem. It shows that the L1 norm
constraint is equivalent to assuming the parameters have Laplace priors and the L2
norm constraint is equivalent to assuming the parameters have Gaussian priors. Since
Laplace functions quickly peak at zero, the tail of Laplace functions drops slowly, and the
L1 norm constraint would push the parameters to either zero or large values.
The Laplace prior is commonly used for sparsity promotion in the Bayesian
approach (Williams, 1995). Because there is a computational difficulty due to non-
differentiability of the Laplace function at the origin, an alternative hierarchical formulation
was proposed (Figueiredo, 2003), where it is shown that a zero-mean Gaussian prior
is equivalent to a Laplace prior when the variance of the Gaussian prior has a regular
21
exponential distribution. This model has good computability, and promotes great sparsity
(Krishnapuram, 2004; Krishnapuram et al., 2004).
There are other penalties used to promote sparsity. Instead of selecting a subset
of features, the minimum message length (MML) criterion is used to estimate a set of
real-valued (usually in [0, 1]) quantities (one for each feature), which are called feature
saliencies. An MML penalty is adopted to prevent all the feature saliencies from reaching
their maximum possible value (Law et al., 2004; Mackey, 2003). The MML criterion is given
(Figueiredo and Jain, 2000) by

− log p(θ) − log p(Y|θ) + (1/2) log |I(θ)| + (c/2)(1 + log(1/12)),

where Y is the set of data samples, θ is the set of parameters of the model, c is the dimension of
θ, and I(θ) = −E[D²_θ log p(Y|θ)] is the Fisher information matrix (the negative expected
value of the Hessian of the log-likelihood). The MML criterion encourages the saliencies
of irrelevant features to go to zero, and allows the method to prune the feature set.
However, the Fisher information matrix used in the MML criterion is very difficult to
obtain analytically. Approximate methods are usually needed to find the optimal solution
(Figueiredo and Jain, 2000).
2.1.4 Information Theoretic Learning
Information theory is a mathematical theory that originated in the study of the
communication process. Information theoretic learning (ITL) (Mackey, 2003; Principe
et al., 1998, 2000) applies its concepts to machine learning. The notions of entropy
and mutual information are needed to pose and solve optimization problems
with information theoretic criteria.
Here we review the expressions for entropy and mutual information according
to Shannon (1948) and Kullback and Leibler (1951). Shannon's entropy is defined as
H(y) = −∫ p(y) log p(y) dy, where p(y) is the PDF of the random variable y. Although
Shannon's entropy is the only one that possesses all the postulated properties for an
information measure, other forms, such as Renyi's quadratic entropy, H_R(y) = −log ∫ p²(y) dy, are
equivalent with respect to entropy maximization. The conditional entropy of a random
variable x given a random variable y is defined as H(x|y) = −∫∫ p(x, y) log p(x|y) dx dy.
The mutual information between random variables x and y is I(x, y) = H(x) − H(x|y),
which can be written as I(x, y) = ∫∫ p(x, y) log [p(x, y) / (p(x)p(y))] dy dx. The mutual information can
also be seen as the Kullback-Leibler divergence between p(x, y) and p(x)p(y).
In general, for two probability densities f(y) and g(y), this divergence is written as
K(f, g) = ∫ f(y) log [f(y) / g(y)] dy.
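These definitions can be checked numerically on a small discrete joint distribution; the 2x2 joint below is an illustrative example, not data from the cited works:

```python
import math

# Numerical sketch of the definitions above on a discrete joint p(x, y).
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (a, _), p in pxy.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in pxy.items() if b == y) for y in (0, 1)}

def H(dist):
    # Shannon entropy H = -sum p log p (natural log, discrete case)
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Mutual information as the KL divergence between p(x,y) and p(x)p(y)
I = sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in pxy.items())

# Check the identity I(x, y) = H(x) + H(y) - H(x, y)
print(abs(I - (H(px) + H(py) - H(pxy))) < 1e-12)  # True
```

The positive value of I reflects the dependence built into the joint (mass concentrated on the diagonal), and the identity ties the mutual information back to the entropy definitions.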
Mutual information is commonly used in feature selection or feature extraction
methods (Hild II et al., 2006; Leiva-Murillo and Artes-Rodriguez, 2007; Torkkola, 2003).
These methods usually train feature extractors by maximizing an approximation of
mutual information between the class labels and the output of the feature extractor.
Methods differ in the entropy form, the computational formula for mutual
information, and the maximization method used. Hild II et al. (2006) presented
a method using a nonparametric estimate of Renyi's entropy to learn features and
train the classifier. They use Parzen windows to estimate the probability density p(x)
of data X, where the density of X is estimated as a sum of spherical Gaussians, each
centered at a sample x_i. So, p(x) ≈ (1/N) ∑_{i=1}^{N} G(x − x_i, σ²I), where N is the number
of samples and G(x, σ²I) is a Gaussian kernel with diagonal, isotropic covariance.
The entropy estimator H₂(x) is given by H₂(x) ≈ −log (1/N) ∑_i G(x_i − x_{i−1}, 2σ²I).
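A minimal 1-D sketch of this kind of Parzen-based entropy estimate follows, using the consecutive-sample form quoted above; the kernel width and the synthetic data are illustrative choices, not values from the cited work:

```python
import math, random

def gauss(d, var):
    # 1-D Gaussian kernel value at difference d with variance var
    return math.exp(-d * d / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def renyi_entropy_estimate(xs, sigma):
    # Parzen-window estimate of Renyi's quadratic entropy using
    # consecutive-sample differences, as in the estimator quoted above.
    n = len(xs)
    s = sum(gauss(xs[i] - xs[i - 1], 2.0 * sigma ** 2) for i in range(1, n))
    return -math.log(s / (n - 1))

random.seed(0)
narrow = [random.gauss(0.0, 0.5) for _ in range(500)]
wide = [random.gauss(0.0, 2.0) for _ in range(500)]
# A more spread-out density should yield a larger entropy estimate
print(renyi_entropy_estimate(narrow, 0.3) < renyi_entropy_estimate(wide, 0.3))  # True
```

The estimator correctly ranks the wider distribution as higher-entropy, which is the property such ITL criteria exploit when training feature extractors.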
Torkkola (2003) presented a method for learning discriminative feature transforms.
Instead of a commonly used mutual information measure based on Kullback-Leibler
divergence, a quadratic divergence measure is used, defined as D(f, g) = ∫ (f(y) − g(y))² dy.
A linear feature extraction method for classification, based on maximization of the
mutual information between the extracted features and the classes, was proposed by
Leiva-Murillo and Artes-Rodriguez (2007). They use a Gram-Charlier expansion to
estimate the probability density of the data. The entropy h(z) is computed as
h(z) = h_G(z) − J(z), where h_G(z) ≈ log((2πe)^{1/2} σ_z) is the entropy under a Gaussian
assumption and J(z) is the negentropy, approximated as

J(z) ≈ k₁ (E{z exp(−z²/2)})² + k₂ (E{exp(−z²/2)} − √(1/2))²,

with constants k₁ and k₂.
Experimental results (Leiva-Murillo and Artes-Rodriguez, 2007) show that mutual
information-based methods can outperform existing supervised feature extraction
methods and require no prior assumptions about the data or class densities. However,
since these methods usually rely on nonparametric density estimation, such as Parzen
windows, they require large data sets and incur high computational cost. They are also
sensitive to the choice of window size and do not work well on high-dimensional data.
2.1.5 Transformation Methods
Feature extraction via linear transform uses the following methodology. A matrix
T is used to transform original vector ~x into vectors ~y = T~x. The vector ~y may be
lower-dimensional than ~x, in which case ~y is usually chosen as the feature vector. For
other transforms, ~y may have the same dimensionality as ~x. If so, another schema
is used to select elements of ~y as features. Here we review some commonly used
transformations: principal component analysis (PCA), singular value decomposition
(SVD), Fisher linear discriminant analysis (FLDA), independent component analysis (ICA),
the discrete Fourier transform (DFT), the discrete cosine transform (DCT), and the Gabor filter.
Karhunen-Loève Transform (or PCA). PCA is the best known linear transformation
method (Burges, 2004; Smith, 2002; Torkkola, 2003; Wang and Paliwal, 2003; Yang
and Yang, 2002). The PCA method uses the eigenvectors corresponding to the
largest eigenvalues of the covariance matrix of the data as the transform matrix.
After transformation, it generates mutually uncorrelated features. This transformation
is optimal in terms of minimal mean-square error between the original data and the
data reconstructed from the features. It also maximizes the mutual information between
the original data vector and its feature representation when the data follow a Gaussian
distribution, but it can be shown that it is not well suited to unsupervised clustering (Yeung
and Ruzzo, 2001), as shown in Figure 2-1.
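As a small illustrative sketch (synthetic data, pure Python), the top principal component can be found by power iteration on the sample covariance matrix, which is equivalent to taking the eigenvector of the largest eigenvalue:

```python
import random

def top_principal_component(data, iters=200):
    # Estimate the dominant eigenvector of the 2x2 sample covariance
    # of 2-D data by power iteration (PCA's leading direction).
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cxx = sum((x - mx) ** 2 for x, _ in data) / n
    cyy = sum((y - my) ** 2 for _, y in data) / n
    cxy = sum((x - mx) * (y - my) for x, y in data) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

random.seed(1)
# Synthetic data stretched along the line y = x, with small isotropic noise
pts = [(t + random.gauss(0, 0.1), t + random.gauss(0, 0.1))
       for t in [random.uniform(-3, 3) for _ in range(400)]]
v = top_principal_component(pts)
print(abs(abs(v[0]) - abs(v[1])) < 0.05)  # direction close to (1,1)/sqrt(2): True
```

The recovered direction aligns with the axis of maximum variance, which is exactly the reconstruction-optimal but not necessarily discrimination-optimal behavior discussed above.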
There are some variations of PCA to improve the performance or to fit different
learning frameworks. The probabilistic principal component analysis (PPCA) method
(Tipping and Bishop, 1998) modifies the original PCA to fit it into a Bayesian framework.
PPCA introduces a zero-mean Gaussian latent variable ~y into the regular
PCA model, such that ~x = W~y + ~µ + ~ε, where the vector ~x is the observation, ~µ is the
vector parameter representing the mean of the data, ~y ∼ N(0, I), and ~ε ∼ N(0, σ²I).
Given the model, the EM algorithm is used to find the optimal transform matrix to
maximize the likelihood of the observations. EM-PCA (Roweis, 1998) uses similar
ideas. PPCA and EM-PCA share the advantages of PCA, since they also find
informative, uncorrelated features, and they can assign low probabilities to
outliers far from most of the data. Unfortunately, they also share PCA's disadvantages:
they are not good at finding features that are optimal for classification. Another
shortcoming of these two methods is that PPCA and EM-PCA are batch algorithms
(Choi, 2004).
Informed PCA (Cohn, 2003) is another variation of PCA that incorporates the
information of labels or categories into the definition of the transformation. PCA only
penalizes according to squared distance of an observation from its projection. Informed
PCA is based on the assumption that if a set of observations Si = {x1, x2, . . . , xn} are in
the same class i, then they should share a common source. For a hyperplane H defined
by the orthogonal matrix C, which consists of the eigenvectors of the covariance matrix
of Si, the maximum likelihood source is the mean of Si’s projections onto H, denoted
by Si. If we denote xj as the projection of the jth observation by the transform matrix
C, the likelihood should be penalized not only based on the variance of observations
around their projections (∑
j ||xj − xj||2), which is same as PCA, but also on the variance
of the projections around their set means (∑
i
∑xj∈Si
||xj − Si||2). With a trade-off hyper
25
parameter β, the square error term is Eβ = (1−β)∑
j ||xj− xj||2 +β∑
i
∑xj∈Si
||xj− Si||2.The EM algorithm is used to find the optimal transform matrix C. Informed PCA uses
the label information to get better features for classification performance than PCA,
but it assumes that clusters are compact, which is not always true. The trade-off
hyperparameter has to be carefully tuned to achieve good performance, as shown in Figure
2-2.
Singular Value Decomposition (SVD). The SVD method (Wall et al., 2003) has
the same goal as PCA in that it finds projections that minimize the squared error in
reconstructing original data. It calculates the eigenvectors of the covariance matrix of
the original data by singular value decomposition (X = USV T , where U and V are the
orthogonal matrix, S is the diagonal matrix, which has nonzero diagonal elements.). It
has more efficient algorithms available than PCA to find the eigenvectors, and some
implementations find just the top N eigenvectors. However, it is still computationally
expensive in the case of high dimension data. It has the same disadvantages as PCA in
that it is not accurate in finding optimal features for classification performance.
Fisher Linear Discriminant Analysis (FLDA). The Fisher linear discriminant
transformation method is a commonly used linear transformation method for supervised
learning (Chen and Yang, 2004; Petridis and Perantonis, 2004; Yang and Yang, 2003;
Zhao et al., 2006). It is optimally discriminative for certain cases (Torkkola, 2003). FLDA
finds the eigenvectors of C = S_w⁻¹ S_b, where S_b is the between-class covariance matrix
and S_w is the sum of the within-class covariance matrices. The matrix S_w captures the
compactness of each class, and S_b represents the separation of the class means.
Eigenvectors corresponding to the largest eigenvalues of C form the columns of the
transform matrix W. New discriminative features y are derived from the original data
x by y = Wᵀx. FLDA performs best when each class has a Gaussian density and the
class means are well separated. In addition, since the Fisher criterion is not directly
related to classification accuracy, it is not optimal in terms of classification error.
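For the two-class case, the FLDA direction reduces to w = S_w⁻¹(m₁ − m₂); the sketch below illustrates this on synthetic 2-D data (classes separated along x, with large within-class spread along y), using a closed-form 2x2 inverse:

```python
import random

def mean2(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def scatter2(pts, m):
    # Within-class scatter entries (sxx, sxy, syy) around mean m
    sxx = sum((p[0] - m[0]) ** 2 for p in pts)
    syy = sum((p[1] - m[1]) ** 2 for p in pts)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
    return sxx, sxy, syy

random.seed(2)
c1 = [(random.gauss(0, 1), random.gauss(0, 3)) for _ in range(300)]
c2 = [(random.gauss(4, 1), random.gauss(0, 3)) for _ in range(300)]
m1, m2 = mean2(c1), mean2(c2)
a1, a2 = scatter2(c1, m1), scatter2(c2, m2)
sxx, sxy, syy = a1[0] + a2[0], a1[1] + a2[1], a1[2] + a2[2]
det = sxx * syy - sxy * sxy
dx, dy = m1[0] - m2[0], m1[1] - m2[1]
# w = Sw^{-1} (m1 - m2), using the closed-form inverse of a 2x2 matrix
w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
w = (w[0] / norm, w[1] / norm)
print(abs(w[0]) > 0.95)  # discriminant direction dominated by the x axis: True
```

Because the within-class scatter is much larger along y, FLDA down-weights that axis and projects almost entirely onto x, the direction that actually separates the class means.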
Independent Component Analysis (ICA). ICA is a relatively recent method. The
goal is to find a linear representation of non-Gaussian data so that the components
of the representation are statistically independent or as independent as possible.
Such a representation seems to capture the essential structure of the data in many
applications, including feature extraction and signal separation (Hyvarinen and Oja,
2000). The fundamental restriction in ICA is that the independent components must
be non-Gaussian, since the key to estimating the ICA model is non-Gaussianity.
To use non-Gaussianity in ICA estimation, there should be a quantitative measure
of non-Gaussianity of a random variable. The classical measure of non-Gaussianity is
kurtosis, or the fourth-order cumulant. Another very important measure of non-Gaussianity
is given by negentropy. Negentropy is based on the information-theoretic quantity
of differential entropy. Because a fundamental result of information theory is that a
Gaussian variable has the largest entropy among all random variables of equal variance,
entropy could be used as a measure of non-Gaussianity.
ICA can also be considered a variant of projection pursuit (Huber, 1985). Projection
pursuit is a technique developed in statistics for finding the most “interesting” projections
of multidimensional data. Some researchers (Huber, 1985; Jones and Sibson, 1987)
argued that Gaussian distribution is the least interesting one, and that the most
interesting directions are those that show the least Gaussian distribution. By computing
the non-Gaussian projection pursuit directions, the independent components can be
estimated; this is the link between projection pursuit and ICA. However, if the
non-Gaussianity assumption does not hold for the data, ICA does not work.
Discrete Fourier Transform (DFT) and Discrete Cosine Transform (DCT). One
of the most frequently used transformations in signal/image processing is DFT (Wang
et al., 2007). Given a data sample f(x), x = 0, 1, ..., N−1, the discrete Fourier
transform pair is

F(u) = (1/N) ∑_{x=0}^{N−1} f(x)[cos(2πux/N) − j sin(2πux/N)], u = 0, 1, ..., N−1,

f(x) = ∑_{u=0}^{N−1} F(u)[cos(2πux/N) + j sin(2πux/N)], x = 0, 1, ..., N−1.

It transforms the original spatial or time domain data into the frequency domain. It uses
fixed basis vectors, and thus has low computational cost. Other linear transformation
methods in image processing, such as the DCT (Saha, 2000), use cosine functions as
basis functions. The discrete cosine transform pair is

D(u) = α(u) ∑_{x=0}^{N−1} f(x) cos[(2x+1)uπ / (2N)],

f(x) = ∑_{u=0}^{N−1} α(u) D(u) cos[(2x+1)uπ / (2N)],

where α(0) = √(1/N) and α(u) = √(2/N) for 0 < u < N. The discrete wavelet transform (DWT) (Nanavati and Panigrahi, 2005; Saha,
u < N . The discrete wavelet transform (DWT) (Nanavati and Panigrahi, 2005; Saha,
2000) is another widely used transform in image processing. It uses wavelets as basis
functions. Wavelets are functions of limited duration, and have an average value of
zero. These basis functions are obtained from a single prototype function by dilations,
or contractions (scaling) and translations (shifts). The coefficients of these basis
vectors/functions are used as the representatives of original signals, such as F (u)
and D(u). Because these basis functions, such as cosine functions, have compact
energy on low frequencies, and natural images have mostly low-frequency features,
images can be represented by a small number of coefficients without much loss of
information. They have good information packing properties for signal compression
and reconstruction. One disadvantage of the DFT and DCT is that their basis functions are
periodic continuous functions, such as sinusoids, so they may not be good at generating
more localized features such as edge information. One disadvantage of the DWT
(Nanavati and Panigrahi, 2005) is the problem of selecting basis functions for a given
application, because a particular wavelet is suited only to a particular purpose.
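The DCT pair quoted above can be verified directly: with the α(u) normalization, applying the inverse transform to the forward transform recovers the original samples (the test signal below is arbitrary):

```python
import math

def dct(f):
    # Forward DCT with the alpha(u) normalization quoted in the text
    N = len(f)
    def alpha(u):
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
    return [alpha(u) * sum(f[x] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                           for x in range(N))
            for u in range(N)]

def idct(D):
    # Inverse DCT, summing alpha(u) D(u) cos((2x+1)u*pi / 2N) over u
    N = len(D)
    def alpha(u):
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
    return [sum(alpha(u) * D[u] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                for u in range(N))
            for x in range(N)]

f = [1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 3.0, 1.0]
g = idct(dct(f))
print(all(abs(a - b) < 1e-9 for a, b in zip(f, g)))  # exact round trip: True
```

This normalization also makes the transform orthonormal, so signal energy is preserved across the transform, which is why truncating small coefficients loses little information.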
Gabor Filter. Gabor (1946) formulated an oriented bandpass filter that achieves
the optimal compromise in the uncertainty relation between spatial and spatial-frequency
localization. Gabor functions g(x, y) = s(x, y) w_r(x, y) are defined as the
product of a radially symmetric Gaussian function w_r(x, y) and a complex sinusoidal
wave function s(x, y) (Movellan, 2006; Rizki et al., 1993). The complex sinusoid
is defined as s(x, y) = exp(j(2π(u₀x + v₀y) + P)), where (u₀, v₀) and P define the
spatial frequency and the phase of the sinusoid, respectively. A distinct advantage of
Gabor functions is their optimality in time and frequency, or in two dimensions, space
and spatial-frequency. They can provide the smallest possible pieces of information
about time-frequency events. Any well-behaved function can be represented as a linear
combination of Gabor functions. With properly chosen filter parameters, Gabor filters
have characteristics similar to Gabor functions (Kyrki et al., 2004). Gabor filter
features are useful for texture analysis, as they have tunable orientations, radial frequency
bandwidths, and center frequencies (Greenspan et al., 1991; Yu and Bhanu,
2006). A disadvantage of the Gabor transform is that the outputs of Gabor filter
banks are not mutually orthogonal, so the extracted features may be correlated. In
addition, Gabor filters usually require proper tuning of the filter parameters.
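A real-valued Gabor kernel can be built directly from the definition above, as a radially symmetric Gaussian envelope times the cosine (real) part of the sinusoid s(x, y); all parameter values here (sigma, u₀, v₀, P) are illustrative:

```python
import math

def gabor_kernel(size, sigma, u0, v0, P=0.0):
    # Real part of a Gabor function: Gaussian envelope * cos carrier.
    # (u0, v0) set the spatial frequency, P the phase; all values illustrative.
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * (u0 * x + v0 * y) + P)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

k = gabor_kernel(size=9, sigma=2.0, u0=0.25, v0=0.0)
# The envelope peaks at the center; the carrier makes the response
# oscillate along x, which is what gives the filter its orientation tuning.
print(k[4][4] == 1.0, k[4][4] > k[4][2])  # (True, True)
```

Convolving an image with a bank of such kernels at several orientations and frequencies yields the (generally correlated) Gabor features discussed above.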
2.1.6 Convolutional Neural Network and Shared-Weight Neural Network
Convolutional neural networks (CNN) provide an efficient method to constrain
the complexity of feedforward neural networks by weight sharing and restriction to
local connections. This network topology has been applied in particular to image
classification in order to avoid sophisticated preprocessing and to classify raw images
directly (Nebauer, 1998). It has been suggested that this topology is more similar to
biological networks based on receptive fields and improves tolerance to local distortions
(Nebauer, 1998). In addition, the number of weights and the complexity of models are
efficiently reduced by weight sharing. When images with high-dimensional input vectors
are presented directly to the network, this method can avoid explicitly defined feature
extraction and data reduction methods usually applied before classification (LeCun
et al., 1989, 1998). In other words, this method does feature extraction and classification
simultaneously. The disadvantages of this method are that the network can easily overfit,
that it is difficult to interpret the meaning of the inner nodes and the structure of the network,
and that it is not easy to incorporate prior knowledge into the network.
Convolutional networks usually have three architectural schemes: local receptive
fields, shared weights, and subsampling. Some degree of scale, shift, and distortion
invariance is accomplished by combining these three schemes. One typical
convolutional network for recognizing characters is LeNet-5 (LeCun et al., 1998).
The images of characters in the input layer are approximately size normalized and
centered first. Each node in a layer is connected to a set of nodes located in a small
neighborhood in the previous layer, and then all the weights are learned with back
propagation. Because the number of free parameters is reduced by the weight-sharing
technique, the “capacity” of the machine and the gap between test error and training
error are reduced (Chen et al., 2006; Garcia and Delakis, 2004).
A shared weight network (Gader et al., 1995; Porter et al., 2003) is another name
for a CNN and emphasizes the weight-sharing properties of networks. The network is
viewed as nonlinear combinations of linear filters.
2.1.7 Morphological Transform
Erosion (Tamburino and Rizki, 1989a,b), dilation, and hit-miss (Gader and Khabou,
1996; Zmuda and Tamburino, 1996) are commonly used morphological transforms in
image processing. Some other mathematical morphologies, known as granulometries,
are studied as well (Serra, 1983; Urbach et al., 2007).
Morphological transforms can be applied to neural networks, yielding morphological
neural networks: networks with a morphological feature extraction
layer. An example is described by Haun et al. (2000). Raw images in the input layer
are first undersampled to decrease computation intensity. Then hit/miss transforms are
used to map the pixels within the images to feature maps. Each feature map is produced
by one hit/miss weight matrix pair. The transforms are essentially the targets eroded
(hit) and the backgrounds dilated (miss). Assume both the hit matrix H and the miss matrix
M are 3×3 matrices. During the transform, a 3×3 window slides over the input
image. Given a 3×3 matrix I of pixels from the input image, where I₂₂ is the origin, a
difference matrix D is produced by D = I − H and a sum matrix S is produced by
S = I + M. The value f of the pixel at the origin in the feature map is then computed by
f = min D − max S. A regular feed-forward network operating on the feature maps is
used for the classification.
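The feature-map computation just described can be sketched as follows; H and M below are toy all-zero matrices (so the output degenerates to min − max over each window), chosen only to make the arithmetic easy to follow:

```python
def hit_miss(image, H, M):
    # Slide a 3x3 window over the image; at each position compute
    # D = I - H (erosion by the hit matrix) and S = I + M (dilation by
    # the miss matrix), and emit f = min(D) - max(S).
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            h = [H[i // 3][i % 3] for i in range(9)]
            m = [M[i // 3][i % 3] for i in range(9)]
            d = min(w - hh for w, hh in zip(window, h))
            s = max(w + mm for w, mm in zip(window, m))
            row.append(d - s)
        out.append(row)
    return out

H = [[0, 0, 0]] * 3  # toy hit matrix
M = [[0, 0, 0]] * 3  # toy miss matrix
img = [[0, 0, 0, 0],
       [0, 9, 9, 0],
       [0, 9, 9, 0],
       [0, 0, 0, 0]]
print(hit_miss(img, H, M))  # [[-9, -9], [-9, -9]]
```

With trained, nonzero H and M, the response is large (near zero) only where the window matches the target shape eroded by H against a background dilated by M.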
To improve the computation speed of morphological transforms, fast algorithms
have been studied for computing min, median, max, and other order statistic filter
transforms (Gil and Werman, 1993). An efficient, deterministic algorithm has also been
proposed to compute the one-dimensional dilation and erosion (max and min) sliding
window filters (Gil and Kimmel, 2002).
Morphological Share-weighted Neural Network (MSNN). MSNNs combine the
feature extraction capability of mathematical morphology with the function-mapping
capability of neural networks in a single trainable architecture (Gader et al., 2000;
Khabou and Gader, 2000; Khabou et al., 2000; Sahoolizadeh et al., 2008; Won and
Gader, 1995). It is a two-stage network, with a feature extraction stage followed by a
standard feed-forward classification stage. The feature extraction stage is composed of one
or more feature extraction layers, each composed of one or more feature maps.
Associated with each feature map is a pair of structuring elements, one for erosion
and one for dilation. The values of a feature map are the result of performing a hit-miss
operation with the pair of structuring elements on a map in the previous layer. The
values of the feature maps on the last layer are fed to the feed-forward classification
stage of the MSNN with gray-scale hit-miss transform (Gader and Khabou, 1996; Gader
et al., 1995; Haun et al., 2000).
2.1.8 Bayesian Nonparametric Latent Feature Model
A Bayesian nonparametric latent feature model (Ghahramani et al., 2007; Griffiths
and Ghahramani, 2005) is a flexible nonparametric approach to latent variable modeling
in which the number of latent variables is unbounded. This approach is based on a
probability distribution over equivalence classes of binary matrices with a finite number
of rows, corresponding to the data points, and an unbounded number of columns,
corresponding to the latent variables. Each data point can be associated with a subset
of the possible latent variables, which are referred to as the latent features of that data
point. The binary variables in the matrix indicate which latent feature is possessed by
which data point, and there is a potentially infinite array of features. The distribution over
unbounded binary matrices is derived by taking the limit of a distribution over N × K
binary matrices as K →∞.
2.2 Hidden Markov Model (HMM)
2.2.1 Definition and Basic Concepts
HMMs are stochastic models for complex, nonstationary doubly stochastic
processes with an underlying process (Markov chain) and another stochastic process
that produces time sequences of random observations according to the Markov chain.
At each observation time, the Markov chain may be in one of several states, and given
that the chain is in a certain state, there are probabilities of moving to other states.
These probabilities are called transition probabilities. The word “hidden” in HMM comes
from the fact that the states are hidden, or not observable. Given an observation vector
at a specific time, there are probabilities that the chain is in each state, and the
observations are described by probability density functions conditioned on the chain
being in the associated state. Thus, an HMM is characterized
by three sets of probability distributions: the transition probabilities, the state
probability density functions, and the initial probabilities.
The notation we use generally follows Rabiner (1989) and is as follows:
• T, the length of the observation sequence or state sequence, with the time instances denoted by t = 1, 2, ..., T.

• N, the total number of states in the model.

• S, the set of states, with S = {S_1, S_2, ..., S_N}.

• Q, a state sequence, with Q = {q_1, q_2, ..., q_T}.

• A, the state transition probability distribution, with A = {a_ij}.

• B, the set of emitting probability densities, with B = {b_j(x_t)}.

• π, the initial state distribution.

• x, the observation sequence, with x = x_1, x_2, ..., x_T.
An HMM is generally represented as a three-tuple Λ = (A, B, π).
2.2.2 Applications of the HMM
HMMs have been applied to many areas, such as speech recognition
(Rabiner, 1989), machine translation (Inoue and Ueda, 2003), and gene prediction
(Stanke and Waack, 2003). Of particular interest here are applications with images as
input: image classification (Ma et al., 2007), handwriting recognition, and mine detection
(Gader and Popescu, 2002; Gader et al.; Zhao et al., 2003).
2.2.3 Learning HMM Parameters
Expectation Maximization. The expectation maximization (EM) algorithm is the
most widely used algorithm to learn HMM parameters (Bilmes, 1997). The EM algorithm
aims to find maximum-likelihood (ML) estimates in settings where direct maximization is
difficult. The idea is to augment the given data into complete data for which ML
estimates are easy to obtain. The algorithm iterates over the given data, and each
iteration has two steps: an expectation (E) step followed by a maximization (M) step.
In the E step, the algorithm computes the expectation of the unknown (hidden) data
given the current model. In the M step, it computes an ML estimate over the resulting
complete data (the known data and the expectation of the hidden data). The EM
algorithm is guaranteed to converge to a local maximum (Prescher, 2004). Its
disadvantages are that it may not find the globally optimal solution and that it is
sensitive to initialization.
Discriminative Training. Discriminative training refers to a training algorithm
aimed at minimum classification error (MCE). It has been applied to estimate the HMM
parameters (Ma, 2004; Zhao et al., 2003). It has a loss function as an objective function.
The function is usually a sigmoid function of a misclassification measure. Gradient
descent is often used to optimize the objective function. We will use the MCE algorithm
in our research, so it is described in detail in the next section.
2.2.4 Minimum Classification Error (MCE) for HMM
An HMM is generally represented as a three-tuple Λ = (A, B, π). The log probability
of the observation sequence x along a given state sequence Q under a model is denoted
by g(x|Λ) and is computed as follows:

g(x|Λ) = log p(x, Q|Λ) = ∑_{t=1}^{T} [ log a_{q_{t−1} q_t} + log b_{q_t}(x_t) ] + log π_{q_0},   (2–1)

where D is the dimension of the observation x.
In this work, we use continuous HMMs with Gaussian mixture models representing
the emitting probability densities, which are therefore given by

b_j(x_t) = ∑_{k=1}^{K} c_jk b_jk(x_t),   (2–2)

where

b_jk(x_t) = (2π)^{−D/2} |R_jk|^{−1/2} exp{ −(1/2) ∑_{l=1}^{D} ((x_tl − μ_jkl) / σ_jkl)² },

and

c_jk, μ_jk, R_jk = diag(σ²_jk1, ..., σ²_jkD)

are the mixture proportion, mean, and covariance of the k-th Gaussian component,
respectively, of the Gaussian mixture density representing state j. Note that we use
diagonal covariance matrices.
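Equations 2–1 and 2–2 can be illustrated with a toy two-state continuous HMM with a single 1-D Gaussian per state (K = 1, D = 1); all parameter values below are made up for the example:

```python
import math

def log_gauss(x, mu, sigma):
    # Log density of a 1-D Gaussian (the K = 1, D = 1 case of b_jk)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) \
           - 0.5 * ((x - mu) / sigma) ** 2

pi = [0.6, 0.4]                   # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]      # transition probabilities
mus, sigmas = [0.0, 3.0], [1.0, 1.0]  # per-state emission parameters

def path_log_prob(obs, states):
    # g(x | Lambda) along a fixed state sequence Q, per Equation 2-1
    g = math.log(pi[states[0]]) \
        + log_gauss(obs[0], mus[states[0]], sigmas[states[0]])
    for t in range(1, len(obs)):
        g += math.log(A[states[t - 1]][states[t]])
        g += log_gauss(obs[t], mus[states[t]], sigmas[states[t]])
    return g

obs = [0.1, 0.2, 2.9, 3.1]
# A state path that matches the observations scores higher than one that does not
print(path_log_prob(obs, [0, 0, 1, 1]) > path_log_prob(obs, [1, 1, 0, 0]))  # True
```

In the full model, each log b term would be the log of the K-component mixture of Equation 2–2 rather than a single Gaussian; the structure of the sum is otherwise identical.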
To estimate the HMM parameters, we use the MCE method with generalized
probabilistic descent (GPD) (Ma, 2004). The goal of MCE training is to be able to
discriminate among samples of different classes correctly rather than to estimate the
distributions of each class accurately. The MCE objective function is a loss function
that depends on a misclassification measure. The misclassification measure for the
two-class problem used here is defined as follows:

d^(i)(x) = (−1)^i [ g(x|Λ^(1)) − g(x|Λ^(2)) ]  if x ∈ class i, for i = 1, 2,   (2–3)

where Λ = (Λ^(1), Λ^(2)). The loss function associated with the misclassification measure is
defined using the following sigmoid function:

l(x|Λ) = l(d(x)) = 1 / (1 + e^{−γ d(x) + θ}),   (2–4)

where γ and θ are predefined parameters.
We try to minimize the expected loss to optimize the performance of the classifier.
In standard MCE training, we seek to estimate the mixture proportions, means, and
covariances of the Gaussian mixtures, as well as the transition probabilities that
will minimize the average loss. To develop practical formulas, auxiliary variables are
explicitly and implicitly introduced:

μ_jkl = μ̃_jkl σ_jkl,    σ̃_jkl = log σ_jkl,
c_jk = e^{c̃_jk} / ∑_h e^{c̃_jh},    a_ij = e^{ã_ij} / ∑_h e^{ã_ih},   (2–5)

where the tilde denotes the transformed (auxiliary) parameter. Note that these relations
yield the following differential relations:

∂μ_jkl / ∂μ̃_jkl = σ_jkl,
∂c_js / ∂c̃_jk = c_jk(1 − c_jk) if s = k,  and  −c_jk c_js if s ≠ k,   (2–6)
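The softmax differential relation in Equation 2–6 can be verified numerically; the auxiliary values below stand in for one row of the c̃ variables and are arbitrary:

```python
import math

def softmax(z):
    # c_s = exp(z_s) / sum_h exp(z_h), computed stably
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

z = [0.2, -1.0, 0.7]   # illustrative auxiliary variables (one row of c-tilde)
c = softmax(z)
k, eps = 0, 1e-6

# Finite-difference derivative of each c_s with respect to z_k ...
zp = list(z)
zp[k] += eps
cp = softmax(zp)
numeric = [(cp[s] - c[s]) / eps for s in range(3)]
# ... versus the analytic form: c_k(1 - c_k) if s = k, else -c_k c_s
analytic = [c[k] * (1 - c[k]) if s == k else -c[k] * c[s] for s in range(3)]
print(all(abs(a - b) < 1e-5 for a, b in zip(numeric, analytic)))  # True
```

This reparameterization is what lets gradient descent update the auxiliary variables freely while the mixture proportions (and transition probabilities, analogously) remain valid, automatically normalized probabilities.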
which are used in the update formulas below. Similar differential relations hold for the
transition probabilities. Let px ≡ γl(x|Λ) (1− l(x|Λ)). Using these auxiliary variables and
relations, application of gradient descent leads to the following formulas to update HMM
model parameters:
∂l(x|Λ) / ∂μ̃_jkl = −p_x [ ∑_{t=1}^{T} δ(q_t = j) (1 / b_j(x_t)) ∂b_j(x_t)/∂μ̃_jkl ],   (2–7)

where ∂b_j(x_t)/∂μ̃_jkl = c_jk b_jk(x_t) ((x_tl − μ_jkl) / σ_jkl);

∂l(x|Λ) / ∂σ̃_jkl = −p_x [ ∑_{t=1}^{T} δ(q_t = j) (1 / b_j(x_t)) ∂b_j(x_t)/∂σ̃_jkl ],   (2–8)

where ∂b_j(x_t)/∂σ̃_jkl = c_jk b_jk(x_t) [ ((x_tl − μ_jkl) / σ_jkl)² − 1 ];

∂l(x|Λ) / ∂c̃_jk = −p_x [ ∑_{t=1}^{T} δ(q_t = j) (1 / b_j(x_t)) ∂b_j(x_t)/∂c̃_jk ],   (2–9)

where ∂b_j(x_t)/∂c̃_jk = ∑_{s=1}^{K} b_js(x_t) ∂c_js/∂c̃_jk;

∂l(x|Λ) / ∂ã_ij = −p_x [ ∑_{t=1}^{T} δ(q_{t−1} = i, q_t = j) − ∑_{t=1}^{T} δ(q_{t−1} = i) a_ij ].   (2–10)
2.3 Gibbs Sampling
Gibbs sampling (Casella and George, 1992; Sheng et al., 2005) is a Markov
chain Monte Carlo (MCMC) method (Neal, 1993) for estimating a joint distribution when
the full conditional distributions of all the random variables concerned are available.
Gibbs sampling has become a common alternative to the EM algorithm for solving
incomplete-data problems in a Bayesian context. It provides samples from which the
joint distribution of the hidden random variables and parameters can be estimated, and
hence estimates of the random variables themselves. Therefore, Gibbs sampling
may find a better solution than EM, which is prone to converging to local optima.

Given a joint density p(x_1, x_2, ..., x_K) for a set of random variables x_1, x_2, ..., x_K, it
is usually difficult to estimate or sample the marginal distributions directly, because the
marginal distribution p(x_i), for i = 1, ..., K, is computed using

p(x_i) = ∫ ... ∫ p(x_1, ..., x_K) dx_1 ... dx_{i−1} dx_{i+1} ... dx_K.   (2–11)
Instead, we sample each variable x_i from its full conditional distribution p(x_i | x_j, j ≠ i),
which is typically easy to sample from. Starting from initial values x_1^(0), ..., x_K^(0), the
Gibbs sampler draws samples of the random variables as follows:

x_1^(t+1) ∼ p(x_1 | x_2 = x_2^(t), ..., x_K = x_K^(t)),   (2–12)

x_2^(t+1) ∼ p(x_2 | x_1 = x_1^(t+1), x_3 = x_3^(t), ..., x_K = x_K^(t)),   (2–13)

...

x_i^(t+1) ∼ p(x_i | x_1 = x_1^(t+1), ..., x_{i−1} = x_{i−1}^(t+1), x_{i+1} = x_{i+1}^(t), ..., x_K = x_K^(t)),   (2–14)

...

x_K^(t+1) ∼ p(x_K | x_1 = x_1^(t+1), ..., x_{K−1} = x_{K−1}^(t+1)),   (2–15)

where t denotes the iteration index. Here X ∼ P(X|Y) denotes the process of drawing a
sample of X from the population defined by the conditional distribution P(X|Y).
It has been shown (Neal, 1993) that as t → ∞, the sample distribution of
(x_1, x_2, ..., x_K) converges to p(x_1, x_2, ..., x_K); equivalently, the distribution
of x_i^(t) converges to p(x_i) for i = 1, ..., K. Thus, by selecting some large value M,
the Gibbs sampler treats the samples x_1^(t), ..., x_K^(t) for t ≥ M as samples from
p(x_1, x_2, ..., x_K). The initial period before samples are collected is referred to as the
“burn-in” period. We can then calculate the expectation of a function f over the
distribution p(x_i) by Monte Carlo integration:

E_{p(x_i)}[f(x_i)] = ∫ f(x_i) p(x_i) dx_i ≈ (1/N) ∑_{t=M}^{M+N} f(x_i^(t)),   (2–16)

where t is the iteration index in the sampling process and N is the total number of
samples collected.
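A standard textbook illustration of this scheme is a zero-mean bivariate Gaussian with correlation ρ, where each full conditional is itself Gaussian: x_1 | x_2 ∼ N(ρ x_2, 1 − ρ²), and symmetrically for x_2. The sketch below (ρ and the run lengths are illustrative choices) uses the Monte Carlo estimate of Equation 2–16 with f(x_1, x_2) = x_1 x_2, whose expectation under the target is ρ:

```python
import math, random

random.seed(3)
rho = 0.8                       # target correlation (illustrative)
x1, x2 = 0.0, 0.0               # arbitrary starting point
burn_in, n_samples = 1000, 20000
cond_sd = math.sqrt(1 - rho ** 2)

samples = []
for t in range(burn_in + n_samples):
    x1 = random.gauss(rho * x2, cond_sd)  # draw x1 | x2
    x2 = random.gauss(rho * x1, cond_sd)  # draw x2 | x1
    if t >= burn_in:                      # keep only post-burn-in samples
        samples.append((x1, x2))

# Monte Carlo estimate of E[x1 * x2], which should approach rho
est = sum(a * b for a, b in samples) / len(samples)
print(abs(est - rho) < 0.1)  # True
```

After discarding the burn-in draws, the retained samples behave approximately as draws from the joint, so the plug-in average recovers the target correlation.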
Figure 2-1. The dashed line is the PCA projection, but the vertical dotted line represents the best projection to separate the two clusters.
Figure 2-2. The top plot has β close to zero to maximize the variation of the projections (horizontal axis) of all observations, and the bottom plot has β close to one to minimize the variation of the projections (vertical axis) of the observations in the same cluster.
CHAPTER 3
CONCEPTUAL APPROACH
3.1 Overview
The goal of this research is to devise and analyze HMM-based algorithms that
can simultaneously learn feature extraction and classification parameters. All the work
described here is performed for continuous HMMs.
Three new approaches are proposed: (1) simultaneous feature learning and
HMM training using an MCE algorithm, (2) HMM training using Gibbs sampling,
and (3) both loosely and tightly coupled feature learning and HMM training using
Gibbs sampling. Note that the second approach focuses only on estimating the HMM
parameters and not the feature parameters. Estimating the HMM parameters alone
allows for a focused analysis of the proposed novel technique. We refer to the first
two methods as McFeaLHMM and SampHMM, respectively (here FeaL is an acronym
replacing Feature Learning). The third method consists of two sub-methods, which
we refer to as TSampFeaLHMM and LSampFealHMM. Note that the feature models
used for McFeaLHMM are ordered weighted average (OWA)-based generalizations of
morphological operators, whereas those used by TSampFeaLHMM and LSampFeaLHMM
are convolutional models.
Results indicate that, while all feature learning models can achieve performance
similar to or better than that of a human, McFeaLHMM is very sensitive to initialization
and learning rates. Fortunately, TSampFeaLHMM and LSampFeaLHMM are much more
stable and can produce better solutions than McFeaLHMM in landmine detection
experiments. It may be possible to alleviate these problems for McFeaLHMM by
sampling from a posterior distribution based on the MCE loss function, but investigation
of that concept is left to future work. The feature models and training algorithms are now
described below in detail.
3.2 Feature Representation
Two feature models are described: the OWA-based generalized morphological
feature model and the convolutional feature model. The first model was inspired by the feature extraction used in state-of-the-art landmine detection. The second model is more suitable for sampling and results in faster operational models.
3.2.1 Ordered Weighted Average (OWA)-based Generalized Morphological Feature Model
OWA operators are used to parameterize both the morphological operations
and other aggregation operations. They form a family of operators that can represent
maximum, minimum, median, $\alpha$-trimmed means, and many other operators. Let $w = (w_1, w_2, \ldots, w_n)$ be a vector of real numbers constrained such that

$$\sum_{i=1}^{n} w_i = 1 \quad \text{and} \quad 0 \leq w_i \leq 1 \text{ for } i = 1, 2, \ldots, n. \tag{3–1}$$
Any weights satisfying these properties will be referred to as a set of OWA weights.
Let $f = \{f_1, f_2, \ldots, f_n\}$ be a multi-set of real numbers. The $i$-th order statistic of $f$ is $f_{(i)}$, where the parenthesized subscripts denote a permutation of the indices such that $f_{(1)} \leq f_{(2)} \leq \cdots \leq f_{(n)}$. The OWA operator on $f$ with weight vector $w$ is defined as

$$\mathrm{OWA}_w(f) = \sum_{i=1}^{n} w_i f_{(i)}. \tag{3–2}$$
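To make the definition concrete, here is a minimal OWA operator in Python; particular weight vectors recover the maximum, minimum, and median as special cases:

```python
import numpy as np

def owa(f, w):
    """OWA_w(f) = sum_i w_i * f_(i), where f_(1) <= ... <= f_(n) (Eq. 3-2)."""
    f_sorted = np.sort(np.asarray(f, dtype=float))   # order statistics of f
    w = np.asarray(w, dtype=float)
    assert w.min() >= 0 and np.isclose(w.sum(), 1.0), "w must satisfy Eq. 3-1"
    return float(np.dot(w, f_sorted))

vals = [3.0, 1.0, 2.0]
owa(vals, [0, 0, 1])   # all weight on the largest order statistic: the maximum
owa(vals, [1, 0, 0])   # all weight on the smallest: the minimum
owa(vals, [0, 1, 0])   # all weight on the middle: the median
```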
We also define OWAw as an OWA operator of size n where n is the number of
weights. OWA operators are used here to define general feature extractors. In our
context, a feature is defined to be a tuple consisting of the following:
1. A feature window size, N ×K.
2. Two sets of OWA weights $h$ and $m$ of size $NK$ and associated OWA operators that act on two-dimensional arrays $B$ as follows:

$$\mathrm{OWA}_h(B) = \sum_{i=1}^{NK} h_i b_{\sigma(i)}, \tag{3–3}$$

where $\sigma : \{1, \ldots, NK\} \to \{(n, k)\,|\,n = 1, \ldots, N;\ k = 1, \ldots, K\}$ is a bijection (a one-to-one and onto mapping) satisfying $b_{\sigma(1)} \leq b_{\sigma(2)} \leq \cdots \leq b_{\sigma(NK)}$.
3. Two N × K arrays called the hit mask and miss mask and denoted Gh and
Gm, respectively. The masks represent the geometric shapes of the features.
Consistent with standard practice in mathematical morphology, the hit mask is
a pattern that matches a foreground shape and the miss mask is a pattern that
matches a background shape of the features. The values of the arrays are either
binary, in the set $\{0, 1\}$, or non-binary, in the interval $[0, 1]$. The weight vectors $h$ and $m$ are associated with the hit and miss masks, respectively.
The OWA weights and mask values are the feature extraction parameters that
are learned in the training process. Features are computed over subwindows from the
images using neighborhood OWA operators.
A training set T consists of images from each class. The first step in feature
learning is feature initialization. Feature initialization proceeds by first collecting
subwindows from the training images. Various procedures can be used to select or
compute a small set of likely initial features from these subwindows. We will investigate
several of these procedures. The feature parameters are then updated together with the
standard HMM parameters. Several training algorithms, including MCE-based and Gibbs
sampling will be considered. Neighborhood updating methods that attempt to encourage
connected features will also be studied. Given an image, the feature extraction process
proceeds as follows. Let A be an image with pixel values in the interval [0,1] and size
larger than the window size. The feature extraction consists of two steps: first is applying
a masked, neighborhood OWA hit-miss operator to A, and second is aggregating the
result of the first step over the rows of A. More precisely, let Atk denote the N × K
subimage of $A$ with the upper left-hand corner located at row $t$ and column $k$. After we apply the mask and hit-miss operator to $A$, an image $D$ with the same size as $A$ is obtained. The value at row $t$ and column $k$ of $D$ is defined by

$$D(t, k) \equiv \min\left[\mathrm{OWA}_h\left(A_{tk} \circ G_h\right),\ \mathrm{OWA}_m\left((1 - A_{tk}) \circ G_m\right)\right], \tag{3–4}$$

where the symbol "$\circ$" denotes pointwise multiplication. Note that if the image and the masks are binary and if $h$ and $m$ correspond to minimum operations, then this operator is exactly the ordinary hit-miss operator from mathematical morphology. The final step in feature extraction is aggregating the outputs of the masked, neighborhood hit-miss operator by computing the maximum of each column of $D$:

$$x_k = \max_t\, D(t, k). \tag{3–5}$$
The result is a sequence of feature values indexed by k. A pictorial description (Zhang
et al., 2007) is shown in Figure 3-1.
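The two-step extraction of equations (3–4) and (3–5) can be sketched directly; this is an illustrative implementation (function and variable names are ours, not the dissertation's), with the OWA helper inlined:

```python
import numpy as np

def owa(v, w):
    # OWA: sort values ascending, then take the weighted sum (Eq. 3-3)
    return float(np.dot(w, np.sort(v)))

def hit_miss_features(A, Gh, Gm, h, m):
    """Masked neighborhood OWA hit-miss transform (Eq. 3-4) followed by a
    column-wise max over D (Eq. 3-5)."""
    N, K = Gh.shape
    rows, cols = A.shape[0] - N + 1, A.shape[1] - K + 1
    D = np.empty((rows, cols))
    for t in range(rows):
        for k in range(cols):
            W = A[t:t + N, k:k + K]                  # subimage A_tk
            hit = owa((W * Gh).ravel(), h)           # OWA_h(A_tk o G_h)
            miss = owa(((1.0 - W) * Gm).ravel(), m)  # OWA_m((1 - A_tk) o G_m)
            D[t, k] = min(hit, miss)
    return D.max(axis=0)                             # x_k = max_t D(t, k)
```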
3.2.2 Convolutional Feature Models
The two feature models presented here are modified slightly to fit the Gibbs approach and to produce more computationally efficient models.
Specifically, feature extraction is modeled as convolution with random perturbations or
error terms.
3.2.2.1 Feature model for loosely coupled sampling feature learning HMM (LSampFealHMM)
As shown in Figure 3-4, we are given $H$ $N \times K$ images. We transform each image to an $NK \times 1$ vector $A_i$, where $i = 1, \ldots, H$. To each image $A_i$, an $N \times K$ binary hit mask or an $N \times K$ ternary (values in $\{-1, 0, 1\}$) hit-miss mask $M_i$ is applied using convolution as follows. Let $L = NK$. We define the vector $B_i$ as

$$B_i = A_i \circ M_i + \zeta, \quad i = 1, \ldots, H, \quad \zeta \sim N\!\left(0, \sigma_\zeta^2 I_{L \times L}\right), \tag{3–6}$$

where the symbol "$\circ$" denotes pointwise multiplication, and $\zeta$ represents a zero-mean Gaussian perturbation with covariance matrix $\Sigma_\zeta = \sigma_\zeta^2 I_{L \times L}$. Note that we consider each
Mi to be a single realization of a random mask M . The random mask M can be thought
of as an $N \times K$ array of binary or ternary variables. Then the $k$-th element of $B_i$ can be denoted as

$$B_{ik} = A_{ik} M_{ik} + \zeta_k, \quad k = 1, \ldots, L, \quad \zeta_k \sim N(0, \sigma_\zeta^2). \tag{3–7}$$
Now we define $D_i$ as the sum over $B_i$ plus a zero-mean additive Gaussian perturbation with variance $\sigma_\eta^2$:

$$D_i = \sum_k B_{ik} + \eta, \quad \eta \sim N(0, \sigma_\eta^2). \tag{3–8}$$

We define one feature $x_t$ as the aggregation of the $D_i$ with additive zero-mean Gaussian noise $\varepsilon$:

$$x_t = \sum_i D_i + \varepsilon, \quad \varepsilon \sim N(0, \sigma_\varepsilon^2). \tag{3–9}$$

Now we assign the label $y_t$ to the feature $x_t$, given the threshold $\xi$, by

$$y_t = \begin{cases} 1, & \text{if } x_t > \xi \\ 0, & \text{if } x_t \leq \xi. \end{cases} \tag{3–10}$$
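Under these definitions, generating one feature and its label amounts to a few noisy sums. A hedged sketch (names are illustrative; each image and mask is assumed already flattened to length L):

```python
import numpy as np

def loose_feature(images, masks, xi, sig_zeta=0.1, sig_eta=0.1, sig_eps=0.1, seed=0):
    """Generative model of Eqs. 3-6 to 3-10: masked pixels plus Gaussian noise,
    summed per image (D_i), aggregated (x_t), then thresholded into y_t."""
    rng = np.random.default_rng(seed)
    D = []
    for A, M in zip(images, masks):                     # A, M: length-L vectors
        B = A * M + rng.normal(0.0, sig_zeta, A.size)   # Eq. 3-7
        D.append(B.sum() + rng.normal(0.0, sig_eta))    # Eq. 3-8
    x = sum(D) + rng.normal(0.0, sig_eps)               # Eq. 3-9
    return x, int(x > xi)                               # Eq. 3-10
```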
3.2.2.2 Feature model for tightly coupled sampling feature learning HMM (TSampFealHMM)
The feature model for TSampFealHMM will learn the feature and HMM parameters
in a single Bayesian probability framework. The previous feature model produced
observation sequences that were tightly coupled to columns of the input image. That
is, given input image A and associated observation sequence x1, . . . , xT , each xi
corresponds to a set of columns of A. By contrast, the feature model used in the
TSampFealHMM model produces observation sequences that can be associated with
subimages of A that can vary both horizontally and vertically within A.
As shown in Figure 3-6, when an $N_1 \times K_1$ image $A$ is given, the image is split into $T$ subimages $A_t$ of size $N_2 \times K_2$, $t = 1, \ldots, T$. These subimages are called zones. For each zone $A_t$, an $N \times K$ ternary $\{-1, 0, 1\}$ hit-miss mask $M_t$ is applied using convolution, similar to
that in LSampFealHMM. The mask is only applied at positions for which the mask fits in
the image. Assume that convolution over a zone $A_t$ produces $p$ values. Vectors $B_{ti}$ and $D_{ti}$ are defined as in LSampFealHMM. The $k$-th element of $B_{ti}$ can be denoted as

$$B_{tik} = A_{tik} M_{tk} + \zeta_k, \quad i = 1, \ldots, p, \quad \zeta_k \sim N(0, \sigma_\zeta^2). \tag{3–11}$$

The vector $D_{ti}$ can be represented as

$$D_{ti} = \sum_k B_{tik} + \eta, \quad \eta \sim N(0, \sigma_\eta^2). \tag{3–12}$$

We define one feature $x_t$ as the aggregation of the $D_{ti}$:

$$x_t = \sum_i D_{ti}. \tag{3–13}$$
To reduce computation and the number of variables to be sampled, we change the order of the equations above, as shown in Figure 3-7. We first convolve zone $A_t$ with a mask $M$, where $M_k = 1$ for each $k$. The result is an $N \times K$ matrix or, equivalently, an $NK \times 1$ vector $Z_t$, given by

$$Z_{tk} = \sum_i A_{tik} + \eta, \quad i = 1, \ldots, p, \quad \eta \sim N(0, \sigma_\eta^2). \tag{3–14}$$

We then define an array $C_t$ by

$$C_{tk} = Z_{tk} M_{tk} + \zeta_k, \quad k = 1, \ldots, NK, \quad \zeta_k \sim N(0, \sigma_\zeta^2). \tag{3–15}$$

We define one feature $x_t$ as the aggregation of the $C_{tk}$:

$$x_t = \sum_k C_{tk}. \tag{3–16}$$
Now we assume the feature $x_t$ is drawn from a Gaussian distribution $N(\mu, \sigma)$ and apply the HMM model $(A, \pi, \mu, \sigma)$ to the sequence $x_1, \ldots, x_T$. Note that the values of $A_{tk}$ should be scaled to the interval $[-1, 1]$, with the background values of the image taking the value $-1$. This scaling provides the advantage that if the hit-miss mask has a hit on the area of interest, the feature value will always be close to the maximum value.
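The reordered computation of equations (3–14) to (3–16) sums the $p$ mask-sized windows of a zone before the mask is applied, so the mask enters only once per zone. A hedged sketch, with `zone_windows` assumed to be a $p \times NK$ array of flattened windows:

```python
import numpy as np

def tight_feature(zone_windows, mask, sig_eta=0.05, sig_zeta=0.05, seed=0):
    """Reordered TSampFealHMM feature: Z_t (Eq. 3-14), C_t (Eq. 3-15), x_t (Eq. 3-16)."""
    rng = np.random.default_rng(seed)
    Z = zone_windows.sum(axis=0) + rng.normal(0.0, sig_eta, mask.size)  # Eq. 3-14
    C = Z * mask + rng.normal(0.0, sig_zeta, mask.size)                 # Eq. 3-15
    return C.sum()                                                      # Eq. 3-16
```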
3.3 Feature Initialization
As will be discussed in detail in the next section, the McFeaLHMM algorithm
is based on gradient descent, similar to past MCE algorithms for HMM (Ma, 2004).
Consequently, this algorithm is very sensitive to initialization. In fact, in our experiments,
random initialization did not lead to useful solutions. Therefore, a data-based algorithm
for initializing masks was devised for McFeaLHMM. The sampling-based algorithms
were not sensitive to initialization. Therefore random initialization was used for
TSampFeaLHMM and LSampFealHMM.
The algorithm used to initialize McFeaLHMM is based on clustering, so is referred
to as McClustInit. The McFeaLHMM algorithm uses one HMM to model each class.
For each model, a training set A1, A2, . . . , AH of Nbig × Kbig images of patterns from
the associated class is given. The OWA-based feature operators are intended to detect
sub-patterns of those patterns contained in N×K subimages. The McClustInit algorithm
therefore clusters subimages of size N × K extracted from the training data set. In
the first step, the McClustInit algorithm employs the Otsu thresholding algorithm (Otsu,
1979) to semi-threshold the training patterns. Next, all N ×K subimages with sufficient
energy are extracted from the semi-thresholded training images. We let S denote the
set of these subimages. The goal is to find shift-invariant prototypes of the patterns in
S. For example, all horizontal lines should have the same representation. Therefore,
the algorithm calculates the magnitude of the Fourier transforms of all the patterns
in $S$, producing the set $F$. The elements of $F$ are clustered using the Fuzzy C-Means (FCM) algorithm with a pre-defined value of $C$, resulting in a set $P$ of frequency-domain prototypes. These prototypes are then used to compute spatial-domain prototypes, which serve as the initial feature masks.
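The clustering stage of this pipeline can be sketched as follows. This is illustrative rather than the exact McClustInit implementation: hard k-means stands in for FCM, and Otsu semi-thresholding and energy filtering are assumed to have been applied already:

```python
import numpy as np

def cluster_prototypes(subimages, C, iters=20, seed=0):
    """Cluster |FFT| magnitudes of subwindows into C shift-invariant prototypes.
    Hard k-means is used here as a stand-in for Fuzzy C-Means."""
    rng = np.random.default_rng(seed)
    # shift-invariant representation: magnitude of the 2-D Fourier transform
    F = np.array([np.abs(np.fft.fft2(s)).ravel() for s in subimages])
    centers = F[rng.choice(len(F), size=C, replace=False)].copy()
    for _ in range(iters):
        d = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(C):
            if np.any(labels == c):
                centers[c] = F[labels == c].mean(axis=0)
    return centers   # frequency-domain prototypes (the set P)
```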
3.4 Learning Methods
3.4.1 MCE-HMM Model for Feature Learning
To derive the feature learning algorithm, we first represent the feature extraction
algorithms at the pixel level. We derive the learning algorithm with the assumption
that there is only one feature, that is, that the dimensionality of the feature vectors is
one. Multi-dimensional features can be learned by applying the same formulas to each
dimension independently of all other dimensions. The masking operation is given by

$$B^h_{tk}(i, j) = A_{tk}(i, j)\, G_h(i, j), \qquad B^m_{tk}(i, j) = A_{tk}(i, j)\, G_m(i, j), \tag{3–17}$$

where $(i, j)$ is the position at row $i$ and column $j$ of the 2-D matrix $G_h$, $G_m$, or $B_{tk}$. Following masking, the OWA hit-miss operator is applied as follows:

$$D(t, k) = \min\left(\sum_{s=1}^{NK} h_s B^h_{t,k,\sigma(s)},\ \sum_{s=1}^{NK} m_s B^m_{t,k,\sigma(s)}\right). \tag{3–18}$$

Here $B_{t,k,\sigma(s)}$ denotes the sorted value of $B_{tk}$, where $\sigma(s)$ is the one-dimensional sorting index of $B_{tk}$. The feature values are then calculated according to equation (3–5).
Here the feature learning algorithm is derived using gradient descent on the MCE objective loss function $l(x|\Lambda)$ defined in section 2.2.4. Hence, we need to compute $\frac{\partial l(x|\Lambda)}{\partial h}$, $\frac{\partial l(x|\Lambda)}{\partial m}$, $\frac{\partial l(x|\Lambda)}{\partial G_h}$, and $\frac{\partial l(x|\Lambda)}{\partial G_m}$. Also note that the expression in equation (3–18) is a min function and that

$$\frac{\partial \min(x_1, \ldots, x_n)}{\partial x_i} = \begin{cases} 1 & \text{if } i = \arg\min(x_1, \ldots, x_n) \\ 0 & \text{otherwise.} \end{cases} \tag{3–19}$$

Thus, it suffices to derive $\frac{\partial l(x|\Lambda)}{\partial h}$ and $\frac{\partial l(x|\Lambda)}{\partial G_h}$ only, where we assume without loss of generality that the hit-masked OWA operation is the minimum of the two terms in equation (3–18).
To maintain the requirements placed on OWA weights and masks, we introduce auxiliary
variables $u_p$ and $\bar{G}_h(i, j)$, such that

$$h_p = \frac{u_p^2}{\sum_k u_k^2}, \qquad G_h(i, j) = \frac{1}{1 + e^{-\gamma_h \bar{G}_h(i, j)}}, \tag{3–20}$$

where $\gamma_h$ is a user-defined parameter that sets the slope of the sigmoid function. Note that these relations yield the following differential relations:

$$\frac{\partial h_p}{\partial u_k} = \begin{cases} \dfrac{2h_k(1 - h_k)}{u_k} & \text{if } p = k \\[1.5ex] -\dfrac{2h_p h_k}{u_k} & \text{if } p \neq k, \end{cases} \qquad \frac{\partial G_h(i, j)}{\partial \bar{G}_h(i, j)} = \gamma_h\, G_h(i, j)\bigl(1 - G_h(i, j)\bigr), \tag{3–21}$$
which are used in the update formulas. We apply gradient descent to the auxiliary
variables and then update the variables used in the calculations according to equation
(3–21). Since we know that $\frac{\partial l(x|\Lambda)}{\partial u_p} = \frac{\partial l(x|\Lambda)}{\partial x}\frac{\partial x}{\partial u_p}$ and $\frac{\partial l(x|\Lambda)}{\partial \bar{G}_h(i,j)} = \frac{\partial l(x|\Lambda)}{\partial x}\frac{\partial x}{\partial \bar{G}_h(i,j)}$, the derivatives $\frac{\partial x}{\partial u_p}$ and $\frac{\partial x}{\partial \bar{G}_h(i,j)}$ are derived as follows:

$$\frac{\partial x}{\partial u_p} = \frac{\partial D_{t^{\max}_k,\, k}}{\partial u_p} = \sum_s B^h_{t^{\max}_k, k, \sigma(s)} \frac{\partial h_s}{\partial u_p}, \tag{3–22}$$

where

$$t^{\max}_k = \arg\max_t D(\cdot, k),$$

and

$$\frac{\partial x}{\partial \bar{G}_h(i, j)} = \frac{\partial D_{t^{\max}_k,\, k}}{\partial \bar{G}_h(i, j)} = \sum_s h_s \frac{\partial B^h_{t^{\max}_k, k, \sigma(s)}}{\partial \bar{G}_h(i, j)} = h_{(p)} A_{t^{\max}_k, k, p}\, \frac{\partial G_h(i, j)}{\partial \bar{G}_h(i, j)}, \tag{3–23}$$

where

$$p = (i - 1)K + j.$$
Application of gradient descent leads to the following formulas to update the feature
parameters. Let c ∈ {1, 2} denote the class label of the input observation sequence x.
Then

$$\frac{\partial l(x|\Lambda)}{\partial u_p} = (-1)^c p_x \left[\sum_{t=1}^{T} \left(\delta(q^{(1)}_t = v) \frac{1}{b^{(1)}_v(x_t)} \frac{\partial b^{(1)}_v(x_t)}{\partial x_t} - \delta(q^{(2)}_t = v) \frac{1}{b^{(2)}_v(x_t)} \frac{\partial b^{(2)}_v(x_t)}{\partial x_t}\right) \frac{\partial x_t}{\partial u_p}\right], \tag{3–24}$$

and

$$\frac{\partial l(x|\Lambda)}{\partial \bar{G}_h(i, j)} = (-1)^c p_x \left[\sum_{t=1}^{T} \left(\delta(q^{(1)}_t = v) \frac{1}{b^{(1)}_v(x_t)} \frac{\partial b^{(1)}_v(x_t)}{\partial x_t} - \delta(q^{(2)}_t = v) \frac{1}{b^{(2)}_v(x_t)} \frac{\partial b^{(2)}_v(x_t)}{\partial x_t}\right) \frac{\partial x_t}{\partial \bar{G}_h(i, j)}\right], \tag{3–25}$$

where

$$\frac{\partial b_j(x_t)}{\partial x_t} = \sum_k c_{jk} b_{jk}(x_t) \left(\frac{x_t - \mu_{jk}}{\sigma_{jk}}\right),$$

and

$$\delta(q_t = v) = \begin{cases} 1 & \text{if } q_t = v \\ 0 & \text{otherwise.} \end{cases}$$
The training algorithm is summarized in Figure 3-2.
3.4.2 Gibbs Sampling Method for Continuous HMM with Multivariate Gaussian Mixtures
Gibbs sampling methods have the advantage of generally finding better optima than traditional methods such as the expectation-maximization algorithm (Dunmur and Titterington, 1997; Ryden and Titterington, 1998). Although some MCMC-based learning methods have been proposed for HMMs, they address either discrete HMMs (standard HMMs (Bae, 2005) or infinite HMMs (Beal et al., 2002)) or uncommon variants, such as trajectory HMMs (Zen et al., 2006), nonparametric HMMs (Thrun et al., 1999), and nonstationary HMMs (Zhu et al., 2007). Here, a Gibbs approach is proposed for training continuous HMMs with multivariate Gaussian mixtures representing the states.
The joint probability of a sequence of observations and a hidden state sequence
is denoted by P (X, Q). The standard dependence assumptions allow us to derive an
expression for the joint likelihood as follows:
$$\begin{aligned} P(X, Q) &= P(X|Q)P(Q) = P(Q)\prod_t P(x_t|Q) = P(Q)\prod_t P(x_t|q_t) \\ &= P(q_1)\left[\prod_{t=2}^{T} P(q_t|q_{t-1})\right]\left[\prod_t P(x_t|q_t)\right] = P(q_1)P(x_1|q_1)\prod_{t=2}^{T} P(q_t|q_{t-1})P(x_t|q_t). \end{aligned} \tag{3–26}$$
The probability of a state sequence is computed as follows:

$$P(Q) = P(q_1)\prod_{t=2}^{T} P(q_t|q_{t-1}) = P(q_1)\prod_{r=1}^{N}\prod_{s=1}^{N} P(q_t = s|q_{t-1} = r)^{n_{rs}}, \tag{3–27}$$

where $n_{rs}$ is the number of transition pairs such that $q_{t-1} = r$ and $q_t = s$ for $t = 2, \ldots, T$. Because the transition probabilities are stationary, the values $P(q_t = s|q_{t-1} = r)$ for $t = 2, \ldots, T$ and fixed $s, r$ are equal.
Since the posterior distributions of the parameters Λ, as defined in section 2.2.1,
are not available in explicit form, we use Gibbs sampling to simulate the parameters
from the posterior distributions after defining likelihood and prior probability models
for the parameters. First, we assume the likelihood model for state transitions is the
multinomial probability distribution:
$$P(n_{r1}, \ldots, n_{rN}\,|\,a_{r1}, \ldots, a_{rN}) \propto \prod_{s=1}^{N} a_{rs}^{n_{rs}}. \tag{3–28}$$

The conjugate prior of the multinomial distribution is used as the prior probability distribution of the $a_{rs}$, so it is a Dirichlet probability distribution:

$$P(a_{r1}, \ldots, a_{rN}) = \mathrm{Dirichlet}(\vec{\alpha}_{r0}), \quad r = 1, \ldots, N, \tag{3–29}$$
where ~αr0 is the vector of prior parameters. The state probability distribution is assumed
to be a Gaussian mixture. We let θr = (cr, µr, Σr) denote the parameters of the state
density probability distributions of state $r$:

$$P(x_t|\theta_r) = \sum_k c_{rk} N(x_t; \mu_{rk}, \Sigma_{rk}). \tag{3–30}$$
We assume the probability of the components is governed by a multinomial probability distribution. Let $n_{rk}$ denote the number of occurrences of component $k$ in state $r$:

$$P(n_{r1}, \ldots, n_{rK}\,|\,c_{r1}, \ldots, c_{rK}) \propto \prod_k c_{rk}^{n_{rk}}. \tag{3–31}$$

As before, we use the conjugate prior of the multinomial, the Dirichlet probability distribution, as the prior probability distribution of the $c_{rk}$:

$$P(c_{r1}, \ldots, c_{rK}) = \mathrm{Dirichlet}(\vec{\alpha}_0), \tag{3–32}$$
where ~α0 is the hyperprior parameter.
Now we can compute the posterior conditional probabilities. The posterior conditional probability of the state transitions $a_{rs}$, $s = 1, \ldots, N$, is given by

$$a_{r1}, \ldots, a_{rN} \sim P(a_{r1}, \ldots, a_{rN}|Q) \propto P(Q|a_{r1}, \ldots, a_{rN})P(a_{r1}, \ldots, a_{rN}) \propto \mathrm{Dirichlet}(\vec{\alpha}_{rp}), \tag{3–33}$$

where

$$\vec{\alpha}_{rp} = \vec{\alpha}_{r0} + [n_{r1}, \ldots, n_{rN}]^T.$$

We denote the state sequence $Q$ excluding state $q_t$, that is $q_1, \ldots, q_{t-1}, q_{t+1}, \ldots, q_T$, by $q_{-t}$. The posterior conditional probability for the state $q_t|q_{-t}$ is

$$P(q_t = r|q_{-t}, \theta, X) \propto P(q_t = r|q_{-t})P(x_t|q_t = r, \theta_r)P(q_{t+1}|q_t = r) = a_{q_{t-1} r}\, a_{r q_{t+1}}\, P(x_t|q_t = r, \theta_r). \tag{3–34}$$
Now we compute the posterior conditional probability of the state parameters. The posterior conditional probability for the component probabilities $c$ is given by

$$c_{r1}, \ldots, c_{rK} \sim P(c_{r1}, \ldots, c_{rK}|n_{r1}, \ldots, n_{rK}) \propto P(n_{r1}, \ldots, n_{rK}|c_{r1}, \ldots, c_{rK})P(c_{r1}, \ldots, c_{rK}) \propto \mathrm{Dirichlet}(\vec{\alpha}_p), \tag{3–35}$$

where

$$\vec{\alpha}_p = \vec{\alpha}_0 + [n_{r1}, \ldots, n_{rK}]^T.$$

The posterior conditional probability that $x_t$ was produced by component $k$ of state $q_t$, given that $x_t$ was produced from state $q_t$, is

$$P(k|q_t = r, \theta, X) = P(k|\theta, x_t) \propto c_{rk} P(x_t|\theta_{rk}). \tag{3–36}$$
Now we compute the posterior conditional probabilities of the Gaussian model parameters modeling the state components. The Gaussian likelihood is given by

$$p(X|\mu, \Sigma) = \frac{1}{(2\pi)^{dT/2}|\Sigma|^{T/2}} \exp\left(-\frac{1}{2}\sum_t (x_t - \mu)^T \Sigma^{-1}(x_t - \mu)\right), \tag{3–37}$$

where $d$ is the dimension of $x_t$. We assume the prior of the mean $\mu$ is a Gaussian, $p(\mu|\mu_0, \Sigma_0) = N(\mu; \mu_0, \Sigma_0)$, where $\mu_0, \Sigma_0$ are hyperprior parameters, and the prior of the covariance $\Sigma$ is an inverse Wishart probability distribution, $p(\Sigma|\psi_s, T_s) = \mathrm{invWishart}(\Sigma|\psi_s, T_s)$, where $\psi_s, T_s$ are hyperprior parameters. Hence the posterior conditional probability distribution of the mean is

$$\mu \sim p(\mu|X) \propto P(X|\mu)p(\mu) \propto N(\mu; \mu_p, \Sigma_p), \tag{3–38}$$

where

$$\mu_p = \left(\Sigma_0^{-1} + T\Sigma^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + \Sigma^{-1}\sum_t x_t\right), \qquad \Sigma_p = \left(\Sigma_0^{-1} + T\Sigma^{-1}\right)^{-1}.$$
The posterior conditional probability distribution of the covariance is

$$\Sigma \sim p(\Sigma|X) \propto P(X|\Sigma)p(\Sigma) \propto \mathrm{invWishart}(\Sigma|\psi_p, T_p), \tag{3–39}$$

where

$$\psi_p = \psi_s + T, \qquad T_p = T_s + \sum_t (x_t - \mu)(x_t - \mu)^T.$$
The training algorithm is summarized in Figure 3-3.
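The two discrete sampling steps, equations (3–33) and (3–34), reduce to counting transitions and multiplying neighbor terms. A minimal sketch (names illustrative), with a precomputed likelihood table `like[t, r]` $= p(x_t|q_t = r)$:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_transition_rows(Q, N, alpha0):
    """Eq. 3-33: each row r of A is drawn from Dirichlet(alpha_r0 + transition counts)."""
    counts = np.zeros((N, N))
    for prev, cur in zip(Q[:-1], Q[1:]):
        counts[prev, cur] += 1
    return np.vstack([rng.dirichlet(alpha0 + counts[r]) for r in range(N)])

def sample_state(t, Q, A, like):
    """Eq. 3-34: p(q_t = r | q_-t, x) is proportional to
    a_{q_{t-1}, r} * a_{r, q_{t+1}} * p(x_t | r)."""
    p = like[t].copy()
    if t > 0:
        p *= A[Q[t - 1], :]          # incoming transition term
    if t < len(Q) - 1:
        p *= A[:, Q[t + 1]]          # outgoing transition term
    p /= p.sum()
    return int(rng.choice(A.shape[0], p=p))
```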
3.4.3 Loosely Coupled Gibbs Sampling Model for Feature Learning
The LSampFealHMM algorithm is a simplified, nonstandard HMM algorithm that
associates each state with a feature. The “probability” that the system is in state q is
proportional to a monotonic function of the feature value.
The training algorithm for LSampFealHMM consists of alternating optimization
between a Gibbs sampler and a modified Viterbi learning algorithm. That is, it is of the
form
• Initialize states
• Do until stopping criterion reached
Run a Gibbs sampler to estimate feature masks
Refine states using modified Viterbi learning
• End Do
In the next section, we first describe the Gibbs Sampler and then the initialization
and modified Viterbi learning.
3.4.3.1 Gibbs sampler for LSampFeaLHMM
To use a Gibbs sampler, we need to assume a prior distribution for the probability $p_k$ of the $k$-th element of the mask $M$. Given the hyperparameters $\alpha_k$ and $\beta_k$, the prior for the probability $p_k$ is a beta distribution:

$$p_k \sim \mathrm{Beta}(\alpha_k, \beta_k). \tag{3–40}$$
The probability of the $k$-th element of the binary hit mask $M$ follows a binomial distribution,

$$P(M_{\cdot k}|p_k) \propto p_k^{n_k}(1 - p_k)^{N_k - n_k}, \tag{3–41}$$

where

$$n_k = \sum_i \mathbb{1}(M_{ik} = 1).$$

If $M$ is the ternary hit-miss mask, the probability of mask $M$ follows a multinomial distribution,

$$P(M_{\cdot k}|p_{k_1}, p_{k_0}, p_{k_2}) \propto p_{k_1}^{n_{k_1}} p_{k_0}^{n_{k_0}} p_{k_2}^{n_{k_2}}, \tag{3–42}$$

where

$$n_{k_1} = \sum_i \mathbb{1}(M_{ik} = 1), \qquad n_{k_0} = \sum_i \mathbb{1}(M_{ik} = 0), \qquad n_{k_2} = \sum_i \mathbb{1}(M_{ik} = -1),$$

and the prior for the probabilities $(p_{k_1}, p_{k_0}, p_{k_2})$ is a Dirichlet distribution,

$$(p_{k_1}, p_{k_0}, p_{k_2}) \sim \mathrm{Dirichlet}(\alpha_k, \beta_k, \gamma_k). \tag{3–43}$$
The posterior distribution is not available in explicit form, so we use the Gibbs sampling approach to sample all the variables required for estimation.
The parameter $M$ and the variables $(B, D, x)$ must be estimated. Using the Gibbs sampler, we sample these variables from their complete conditional probability distributions:

$$\begin{aligned} p_k &\sim p(p_k|M_{\cdot k}, \alpha_k, \beta_k), \\ M_{\cdot k} &\sim P(M_{\cdot k}|p_{k_1}, p_{k_0}, p_{k_2}, B, A), \\ B_i &\sim p(B_i|D_i, A_i, M_i), \\ D &\sim p(D|x_t, B_i), \\ x_t &\sim p(x_t|y_t, D). \end{aligned} \tag{3–44}$$
Computation proceeds as follows:
1. Sample $p_k$, $k = 1, \ldots, L$, given $(M, \alpha, \beta)$. If $M$ is the binary hit mask, the sample is drawn from the beta distribution

$$p(p_k|M_{\cdot k}, \alpha_k, \beta_k) \propto P(M_{\cdot k}|p_k)\,p(p_k|\alpha_k, \beta_k) \propto \mathrm{Beta}(\alpha_k + n_k,\ \beta_k + N_k - n_k). \tag{3–45}$$

If $M$ is the ternary hit-miss mask, the sample is drawn from the Dirichlet distribution

$$p(p_{k_1}, p_{k_0}, p_{k_2}|M_{\cdot k}, \alpha_k, \beta_k, \gamma_k) \propto \mathrm{Dirichlet}(\alpha_k + n_{k_1},\ \beta_k + n_{k_0},\ \gamma_k + n_{k_2}). \tag{3–46}$$
2. Sample $M$ given $(p, B, A)$. Since every component of $M$ is assumed to be independent, it is easy to sample component-wise.

If $M$ is a binary hit mask, we sample it from a binomial distribution. We first compute $P(M_{\cdot k} = 1|p_k, B, A)$ and $P(M_{\cdot k} = 0|p_k, B, A)$:

$$P(M_{\cdot k} = 1|p_k, B, A) \propto P(M_{\cdot k} = 1|p_k)P(B|M_{\cdot k} = 1, A) \propto p_k \exp\left(-\frac{(B_{ik} - A_{ik})^2}{2\sigma_\zeta^2}\right), \tag{3–47}$$

$$P(M_{\cdot k} = 0|p_k, B, A) \propto P(M_{\cdot k} = 0|p_k)P(B|M_{\cdot k} = 0, A) \propto (1 - p_k)\exp\left(-\frac{B_{ik}^2}{2\sigma_\zeta^2}\right). \tag{3–48}$$

Then, after these two values are normalized, we sample $M$ from a binomial distribution:

$$M_{\cdot k}|p_k, B, A \sim \mathrm{binomial}\bigl(P(M_{\cdot k} = 1|p_k, B, A),\ P(M_{\cdot k} = 0|p_k, B, A)\bigr). \tag{3–49}$$
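For one binary mask element, the draw described by equations (3–47) to (3–49) is a posterior Bernoulli sample; computing it in log space avoids underflow when many training samples share column $k$. A sketch (names illustrative; `B_k` and `A_k` collect the $k$-th components across samples):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask_elem(pk, B_k, A_k, sig_zeta):
    """Posterior Bernoulli draw for one binary mask element (Eqs. 3-47 to 3-49)."""
    logp1 = np.log(pk) - ((B_k - A_k) ** 2).sum() / (2 * sig_zeta ** 2)
    logp0 = np.log(1 - pk) - (B_k ** 2).sum() / (2 * sig_zeta ** 2)
    # normalized P(M_k = 1 | ...), computed stably in log space
    prob1 = 1.0 / (1.0 + np.exp(np.clip(logp0 - logp1, -700, 700)))
    return int(rng.random() < prob1)
```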
If $M$ is a ternary hit-miss mask, we sample it from a multinomial distribution. First we compute $P(M_{\cdot k} = 1|p_{k_1}, B, A)$, $P(M_{\cdot k} = 0|p_{k_0}, B, A)$, and $P(M_{\cdot k} = -1|p_{k_2}, B, A)$, where

$$P(M_{\cdot k} = 1|p_{k_1}, B, A) \propto P(M_{\cdot k} = 1|p_{k_1})p(B|M_{\cdot k} = 1, A) \propto p_{k_1}\exp\left(-\frac{(B_{ik} - A_{ik})^2}{2\sigma_\zeta^2}\right), \tag{3–50}$$

$$P(M_{\cdot k} = 0|p_{k_0}, B, A) \propto P(M_{\cdot k} = 0|p_{k_0})P(B|M_{\cdot k} = 0, A) \propto p_{k_0}\exp\left(-\frac{B_{ik}^2}{2\sigma_\zeta^2}\right), \tag{3–51}$$

$$P(M_{\cdot k} = -1|p_{k_2}, B, A) \propto P(M_{\cdot k} = -1|p_{k_2})P(B|M_{\cdot k} = -1, A) \propto p_{k_2}\exp\left(-\frac{(B_{ik} + A_{ik})^2}{2\sigma_\zeta^2}\right). \tag{3–52}$$

After these three values are normalized, we can sample $M$ from a multinomial distribution:

$$M_{\cdot k}|p_{k_1}, p_{k_0}, p_{k_2}, B, A \sim \mathrm{multinomial}\bigl(P(M_{\cdot k} = 1|p_{k_1}, B, A),\ P(M_{\cdot k} = 0|p_{k_0}, B, A),\ P(M_{\cdot k} = -1|p_{k_2}, B, A)\bigr). \tag{3–53}$$
3. Sample the variable $B$ given $(A, M)$. We know that

$$p(B_i|A_i, M_i) \propto \exp\left(-\sum_{k=1}^{L}\frac{(B_{ik} - A_{ik}M_{ik})^2}{2\sigma_\zeta^2}\right), \tag{3–54}$$

and

$$p(D_i|B_i) \propto \exp\left(-\frac{\left(D_i - \sum_{k=1}^{L} B_{ik}\right)^2}{2\sigma_\eta^2}\right), \tag{3–55}$$
so we have

$$p(B_i|D_i, A_i, M_i) \propto p(D_i|B_i)p(B_i|A_i, M_i) \propto \exp\left(-\frac{\left(D_i - \sum_{k=1}^{L} B_{ik}\right)^2}{2\sigma_\eta^2}\right)\exp\left(-\sum_{k=1}^{L}\frac{(B_{ik} - A_{ik}M_{ik})^2}{2\sigma_\zeta^2}\right). \tag{3–56}$$

Rather than sampling $B$ as a matrix, it is better to sample component-wise from the Gaussian distribution:

$$B_{ik}|B_{-ik}, A_{ik}, M_{ik}, D_i \sim N\left(\frac{\tau_\zeta A_{ik}M_{ik} + \tau_\eta\left(D_i - \sum_q B_{iq} + B_{ik}\right)}{\tau_\zeta + \tau_\eta},\ (\tau_\zeta + \tau_\eta)^{-1/2}\right), \tag{3–57}$$

where $\tau_\varrho = 1/\sigma_\varrho^2$ denotes the precision of the Gaussian distributions for $\varrho = \zeta$ and $\eta$.
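The component-wise draw of equation (3–57) combines the mask term and the sum constraint through their precisions; a direct transcription (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_B_component(B_i, k, A_ik, M_ik, D_i, tau_zeta, tau_eta):
    """Gibbs draw for B_ik given the remaining components of B_i (Eq. 3-57)."""
    residual = D_i - B_i.sum() + B_i[k]     # D_i minus the other components of B_i
    mean = (tau_zeta * A_ik * M_ik + tau_eta * residual) / (tau_zeta + tau_eta)
    return rng.normal(mean, (tau_zeta + tau_eta) ** -0.5)
```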
4. Sample $D$ given $(x, B)$. We know that

$$p(x_t|D) \propto \exp\left(-\frac{\left(x_t - \sum_i D_i\right)^T\left(x_t - \sum_i D_i\right)}{2\sigma_\varepsilon^2}\right), \tag{3–58}$$

so we have

$$p(D|x_t, B_i) \propto p(D|B_i)p(x_t|D) \propto \exp\left(-\sum_i\frac{\left(D_i - \sum_{k=1}^{L} B_{ik}\right)^2}{2\sigma_\eta^2}\right)\exp\left(-\frac{\left(x_t - \sum_i D_i\right)^T\left(x_t - \sum_i D_i\right)}{2\sigma_\varepsilon^2}\right). \tag{3–59}$$

Similarly, we sample $D$ component-wise from the Gaussian distribution:

$$p(D_i|D_{-i}, x_t, B_i) \propto N\left(\frac{\tau_\eta\sum_{k=1}^{L} B_{ik} + \tau_\varepsilon\left(x_t - \sum_q D_q + D_i\right)}{\tau_\eta + \tau_\varepsilon},\ (\tau_\eta + \tau_\varepsilon)^{-1/2}\right). \tag{3–60}$$
5. Sample $x$ given $(y, D)$ from the truncated Gaussian distribution:

$$p(x_t|y_t = 1, D) \propto N\left(\sum_i D_i,\ \sigma_\varepsilon^2\right) \text{ restricted to } x_t > \xi, \tag{3–61}$$

$$p(x_t|y_t = 0, D) \propto N\left(\sum_i D_i,\ \sigma_\varepsilon^2\right) \text{ restricted to } x_t \leq \xi. \tag{3–62}$$
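Sampling $x_t$ consistently with its label can be done by simple rejection when the threshold $\xi$ is not far in the tail of the Gaussian; a hedged sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x_given_label(y, mean, sigma, xi, max_tries=10000):
    """Draw x ~ N(mean, sigma^2) restricted to x > xi when y = 1,
    and to x <= xi when y = 0 (rejection sampling)."""
    for _ in range(max_tries):
        x = rng.normal(mean, sigma)
        if (x > xi) == bool(y):
            return x
    raise RuntimeError("threshold too far in the tail for rejection sampling")
```

For thresholds deep in the tail, an inverse-CDF truncated-normal sampler would be preferable to rejection.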
After the burn-in period, we collect the Gibbs samples at the $s$-th iteration as $(p^{[s]}, M^{[s]}, B^{[s]}, D^{[s]}, x^{[s]})$, $s = 1, \ldots$. We can then use these samples for prediction and posterior inference. Note that at test time we use $p$, the probability of the masks, as the mask feature. In this way, $p$ can be interpreted as the expectation of a mask, and the mask values used in prediction/testing are gray-level values instead of binary or ternary values.
In LSampFealHMM, we do not use state probabilities. Instead, we use the features themselves directly, either

$$b_q(\vec{x}) = x_q \tag{3–63}$$

for binary masks or

$$b_q(\vec{x}) = \exp(x_q) \tag{3–64}$$

for ternary masks. Then, given the test observation sequence $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T$, the output of LSampFealHMM is given by the Viterbi algorithm, used for finding the best path:

$$\text{output} \equiv \sum_{t=1}^{T}\ln\bigl(b_{q_t}(x_{q_t})\bigr). \tag{3–65}$$
The training algorithm is summarized in Figure 3-5.
3.4.3.2 Initialization and modified Viterbi learning
Initialization requires specifying values for the initial probabilities $\vec{\pi}$, the transition matrix $A$, and the state emission probabilities $B$. The initial value of $\vec{\pi}$ is taken to be $(1, 0, \ldots, 0)$. The initial transition matrix is taken to be

$$\begin{pmatrix} a_{11} & a_1 & a_1 & \cdots & a_1 \\ 0 & a_{22} & a_2 & \cdots & a_2 \\ & & \ddots & & \\ 0 & 0 & 0 & \cdots & a_{QQ} \end{pmatrix},$$

where $a_{ii} > a_i$ and, of course, $a_{ii} + \sum_i a_i = 1$. Thus, LSampFeaLHMM produces a left-to-right model. Note that the Gibbs sampler is not very sensitive to initialization,
but a transition matrix of this form is a sensible initialization. Finally, since the states
are associated with the features and the features are determined by the masks as
discussed in section 3.2.2.1, initializing B is performed by randomly initializing the
masks Mi, i = 1, . . . , Q using uniform probabilities of 0 and 1 at each element of each
Mi.
In addition to the model parameters, the observations need to be associated with
each state to provide a set of training samples for each mask. The method employed is to use the first $T/Q$ of the observations for state 1, the second $T/Q$ for state 2, and so forth. An implementation detail arises if $T/Q$ is not an integer, but this is not significant for the Gibbs sampler.
Modified Viterbi learning is used to update the samples used to learn the masks
for each state. Using the parameters and observation sequences, an optimal state
sequence is found for each training sample using the Viterbi algorithm. Hence, for each
training sample, we have paired sequences

$$\begin{pmatrix} x_{h1}, \ldots, x_{hT} \\ q_{h1}, \ldots, q_{hT} \end{pmatrix} = \begin{pmatrix} \text{observation sequence for training image } A_h \\ \text{optimal state sequence for training image } A_h \end{pmatrix}. \tag{3–66}$$

Note that the second segment is the set of state labels associated with an optimal state sequence, but we may refer to it as an optimal state sequence. We write the entire set of pairs of observations obtained in this manner as

$$\rho = \{(x_{hj}, q_{hj})\,|\,h = 1, \ldots, H;\ j = 1, \ldots, T\}. \tag{3–67}$$
For each state index $r \in \{1, 2, \ldots, Q\}$, we define the set of observations associated with $r$ as

$$\chi_r \equiv \{x_{hj}\,|\,(x_{hj}, r) \in \rho\}. \tag{3–68}$$
The transition matrix is also updated using the optimal state sequences. Let $n_{ij}$ denote the number of occurrences of the consecutive subsequence $i, j$ in the set of all optimal state sequences. Let $n_d = (T - 1)H$ denote the number of two-element consecutive subsequences obtained from the training data. The transition matrix is then updated using

$$a_{ij}^{\text{new}} = a_{ij}^{\text{old}} + \eta\,\frac{n_{ij}}{n_d},$$

where $\eta$ is a user-defined learning rate.
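The count-based update can be written compactly; note that re-normalizing the rows after the additive step is an assumption here, since the text leaves it implicit:

```python
import numpy as np

def update_transitions(A_old, state_paths, eta):
    """a_ij_new = a_ij_old + eta * n_ij / n_d from the optimal state sequences."""
    counts = np.zeros_like(A_old)
    for path in state_paths:
        for i, j in zip(path[:-1], path[1:]):
            counts[i, j] += 1
    n_d = sum(len(p) - 1 for p in state_paths)   # number of consecutive pairs
    A_new = A_old + eta * counts / n_d
    return A_new / A_new.sum(axis=1, keepdims=True)   # row renormalization (assumption)
```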
3.4.4 Tightly Coupled Gibbs Sampling Model for Feature Learning
Similar to LSampFealHMM, we need to define priors over feature learning variables.
We also need priors for the HMM model variables (state transition probabilities, state
Gaussian means, and state Gaussian variances).
Given the hyperparameters $\alpha_k, \beta_k, \gamma_k$, the prior for the probabilities $(p_{k_1}, p_{k_0}, p_{k_2})$ is a Dirichlet distribution,

$$(p_{k_1}, p_{k_0}, p_{k_2}) \sim \mathrm{Dirichlet}(\alpha_k, \beta_k, \gamma_k). \tag{3–69}$$

The probability of the $k$-th element of the hit-miss mask $M$ follows a multinomial distribution,

$$P(M_{\cdot k}|p_{k_1}, p_{k_0}, p_{k_2}) \propto p_{k_1}^{n_{k_1}} p_{k_0}^{n_{k_0}} p_{k_2}^{n_{k_2}}, \tag{3–70}$$

where

$$n_{k_1} = \sum_i \mathbb{1}(M_{ik} = 1), \qquad n_{k_0} = \sum_i \mathbb{1}(M_{ik} = 0), \qquad n_{k_2} = \sum_i \mathbb{1}(M_{ik} = -1).$$
A Gibbs sampling approach is used to sample all the variables required. The feature learning parameter $M$, the variables $(Z, C, x)$, the HMM parameters $\lambda = (\mu, \sigma, A, \pi)$, and the state sequences $Q$ must be sampled. Let $\chi_r \equiv \{x_t\,|\,x_t \text{ is in state } r,\ t = 1, \ldots, T\}$, similar to equation (3–68). Using the Gibbs sampler, we sample these variables from their complete conditional probability distributions:

$$\begin{aligned} Z_t &\sim p(Z_t|A_t, C_t, M_t), \\ (p_{k_1}, p_{k_0}, p_{k_2}) &\sim p(p_{k_1}, p_{k_0}, p_{k_2}|M_{\cdot k}, \alpha_k, \beta_k, \gamma_k), \\ M_{tk} &\sim P(M_{tk}|p_{k_1}, p_{k_0}, p_{k_2}, C, Z), \\ C_t &\sim p(C_t|\mu_{q_t}, \sigma_{q_t}, Z_t, M_t), \\ \mu_r &\sim p(\mu_r|\chi_r), \\ \sigma_r &\sim p(\sigma_r|\chi_r), \\ A &\sim p(A|x, Q). \end{aligned} \tag{3–71}$$
The computation proceeds as follows:

1. Sample the variable $Z$ given $(C, A, M)$. We know that

$$p(Z_t|A_t) \propto \exp\left(-\frac{\left(Z_t - \sum_i A_{ti}\right)^T\left(Z_t - \sum_i A_{ti}\right)}{2\sigma_\eta^2}\right) \tag{3–72}$$

and

$$p(C_t|Z_t, M_t) \propto \exp\left(-\sum_{k=1}^{L}\frac{(C_{tk} - Z_{tk}M_{tk})^2}{2\sigma_\zeta^2}\right), \tag{3–73}$$

so we have

$$p(Z_t|A_t, C_t, M_t) \propto p(Z_t|A_t)p(C_t|Z_t, M_t) \propto \exp\left(-\frac{\left(Z_t - \sum_i A_{ti}\right)^T\left(Z_t - \sum_i A_{ti}\right)}{2\sigma_\eta^2}\right)\exp\left(-\sum_{k=1}^{L}\frac{(C_{tk} - Z_{tk}M_{tk})^2}{2\sigma_\zeta^2}\right). \tag{3–74}$$

We sample $Z$ component-wise from the Gaussian distribution:

$$p(Z_{tk}|Z_{-tk}, A_{tk}, C_{tk}, M_{tk}) \propto N\left(\frac{\tau_\eta\sum_{i=1}^{L} A_{ti} + \tau_\zeta C_{tk}M_{tk}}{\tau_\eta + \tau_\zeta M_{tk}^2},\ (\tau_\eta + \tau_\zeta M_{tk}^2)^{-1/2}\right). \tag{3–75}$$
2. Sample $p_k$, $k = 1, \ldots, L$, given $(M, \alpha_k, \beta_k, \gamma_k)$. The sample is drawn from the Dirichlet distribution

$$p(p_{k_1}, p_{k_0}, p_{k_2}|M_{\cdot k}, \alpha_k, \beta_k, \gamma_k) \propto \mathrm{Dirichlet}(\alpha_k + n_{k_1},\ \beta_k + n_{k_0},\ \gamma_k + n_{k_2}). \tag{3–76}$$

3. Sample $M$ given $(p, C, Z)$. Since every component of $M$ is assumed to be independent, it is easy to sample component-wise.

Because $M$ is a ternary hit-miss mask, we sample it from a multinomial distribution. First we compute $P(M_{tk} = 1|p_{k_1}, C, Z)$, $P(M_{tk} = 0|p_{k_0}, C, Z)$, and $P(M_{tk} = -1|p_{k_2}, C, Z)$:

$$P(M_{tk} = 1|p_{k_1}, C, Z) \propto P(M_{tk} = 1|p_{k_1})p(C|M_{tk} = 1, Z) \propto p_{k_1}\exp\left(-\frac{(C_{tk} - Z_{tk})^2}{2\sigma_\zeta^2}\right), \tag{3–77}$$

$$P(M_{tk} = 0|p_{k_0}, C, Z) \propto P(M_{tk} = 0|p_{k_0})p(C|M_{tk} = 0, Z) \propto p_{k_0}\exp\left(-\frac{C_{tk}^2}{2\sigma_\zeta^2}\right), \tag{3–78}$$

$$P(M_{tk} = -1|p_{k_2}, C, Z) \propto P(M_{tk} = -1|p_{k_2})p(C|M_{tk} = -1, Z) \propto p_{k_2}\exp\left(-\frac{(C_{tk} + Z_{tk})^2}{2\sigma_\zeta^2}\right). \tag{3–79}$$

After these three values are normalized, we can sample $M$ from a multinomial distribution:

$$M_{tk}|p_{k_1}, p_{k_0}, p_{k_2}, C, Z \sim \mathrm{multinomial}\bigl(P(M_{tk} = 1|p_{k_1}, C, Z),\ P(M_{tk} = 0|p_{k_0}, C, Z),\ P(M_{tk} = -1|p_{k_2}, C, Z)\bigr). \tag{3–80}$$
4. Sample $C$ given $(Z, M, \mu_{q_t}, \sigma_{q_t})$. We know that

$$p(C_t|Z_t, M_t) \propto \exp\left(-\sum_{k=1}^{L}\frac{(C_{tk} - Z_{tk}M_{tk})^2}{2\sigma_\zeta^2}\right) \tag{3–81}$$

and

$$p(\mu_{q_t}|C_t, \sigma_{q_t}) \propto \exp\left(-\frac{\left(\mu_{q_t} - \sum_{k=1}^{L} C_{tk}\right)^2}{2\sigma_{q_t}^2}\right), \tag{3–82}$$
so we have

$$p(C_t|\mu_{q_t}, \sigma_{q_t}, Z_t, M_t) \propto p(\mu_{q_t}|C_t, \sigma_{q_t})p(C_t|Z_t, M_t) \propto \exp\left(-\frac{\left(\mu_{q_t} - \sum_{k=1}^{L} C_{tk}\right)^2}{2\sigma_{q_t}^2}\right)\exp\left(-\sum_{k=1}^{L}\frac{(C_{tk} - Z_{tk}M_{tk})^2}{2\sigma_\zeta^2}\right). \tag{3–83}$$

Rather than sampling $C$ as a matrix, it is better to sample component-wise from the Gaussian distribution:

$$C_{tk}|C_{-tk}, Z_{tk}, M_{tk}, \mu_{q_t}, \sigma_{q_t} \sim N\left(\frac{\tau_\zeta Z_{tk}M_{tk} + \tau_{q_t}\left(\mu_{q_t} - \sum_j C_{tj} + C_{tk}\right)}{\tau_\zeta + \tau_{q_t}},\ (\tau_\zeta + \tau_{q_t})^{-1/2}\right), \tag{3–84}$$

where $\tau_\varrho = 1/\sigma_\varrho^2$ denotes the precision of the Gaussian distributions for $\varrho = \zeta$ and $q_t$.
5. Sample the mean $\mu_r$ and variance $\sigma_r$ for state $r$, given the state sequence $Q$. We first compute $x_t$ for $t = 1, \ldots, T$ by

$$x_t = \sum_k C_{tk}. \tag{3–85}$$

Then, similar to section 3.4.2, we can sample $\mu_r$ and $\sigma_r$ from the posterior conditional probability distributions as follows:

$$\mu_r \sim p(\mu_r|\chi_r) \propto p(\chi_r|\mu_r)p(\mu_r) \propto N(\mu_r; \mu_\phi, \sigma_\phi), \tag{3–86}$$

where

$$\mu_\phi = (\tau_0 + T\tau)^{-1}\left(\tau_0\mu_0 + \tau\sum_{x_j \in \chi_r} x_j\right), \qquad \sigma_\phi = (\tau_0 + T\tau)^{-1/2},$$

and

$$\sigma_r \sim p(\sigma_r|\chi_r) \propto p(\chi_r|\sigma_r)p(\sigma_r) \propto \mathrm{Gamma}(\sigma_r|\psi_\phi, \beta_\phi), \tag{3–87}$$

where

$$\psi_\phi = \psi_0 + T/2, \qquad \beta_\phi = \beta_0 + \sum_{x_j \in \chi_r}(x_j - \mu_r)^2.$$
6. Sample the state sequences using equation 3–34 in section 3.4.2.
7. Sample the transition matrix using equation 3–33 in section 3.4.2.
The training algorithm is summarized in Figure 3-8. Initialization in this case is
random.
Figure 3-1. Feature extraction process for feature learning
Initialize the HMM model parameters
Initialize the masks and OWA weights
Feature learning loop1
    Extract features
    Create observation sequences from the features
    HMM model training loop2
        Randomize the order of the sequences
        Loop3 over all sequences, one mine sequence followed by one nonmine sequence
            Get a sequence
            Compute the loss of this sequence under the current HMM models
            If it is correctly classified, continue to the next sequence
            Else
                Compute the gradient with respect to all parameters using
                equations (2-7, 8, 9, 10) and equations (3-24, 25)
                Accumulate the total loss
                Accumulate the gradients of the mask and OWA weights
                Update the HMM model parameters, both mine and nonmine,
                by subtracting their respective gradients
        End loop3 over all sequences
        If the total loss over all sequences is decreasing, save the HMM models
    End HMM model training loop2
    Update the mask and OWA weights
    Test the model over the validation set
    If the loss over the validation set is increasing, or the number of iterations
    exceeds a threshold, then stop feature learning
End feature learning loop1

Figure 3-2. MCE-based training process for feature learning
Loop until convergence
    – Sample state q_t, given all other states of the sequence, from the
      multinomial distribution one-by-one after computing the probabilities
      P(q_t | q_-t) using equation (3-34)
    – Sample transition probabilities a_rs from the Dirichlet distribution
      after counting the state pairs in the sequences of all observations
      using equation (3-33)
    – Sample the state model parameters
        • Sample a component label k for each observation X_t one-by-one from
          the multinomial distribution after computing the probabilities
          P(k | -t) using equation (3-36)
        • Sample the K mixture proportions c_k from the Dirichlet distribution
          after counting the labels of all observations using equation (3-35)
        • Sample the Gaussian model parameters (μ, Σ)
            – Sample the mean from the posterior Gaussian distribution
            – Sample the covariance matrix from the inverse Wishart distribution

Figure 3-3. Gibbs sampling HMM training process
[Figure: diagram relating the observation sequence X_1, X_2, ..., X_N, the sum hit-miss mask M_i, and the arrays A_i and B_i obtained by max/sum over the columns of A and D.]

Figure 3-4. Feature model for LSampFealHMM
LSampFealHMM Training Algorithm:
Set the hyperparameters and the stopping threshold
Start with the initial state sequence Q^[0]
Loop1
    Split all the image sequences into image segments according to the
    state they are associated with
    Loop2 for each state segment
        Start with initial values [p^[0], M^[0], B^[0], D^[0], x^[0]]
        Loop3 to sample the state parameters
            At iteration s
                Sample p^[s] given M^[s-1] and its hyperparameters using equations (3-45, 46)
                Sample M^[s] given (p^[s], B^[s-1], A) using equation (3-49)
                Sample B^[s] given (A, M^[s]) using equation (3-57)
                Sample D^[s] given (x^[s-1], x^[s]) using equation (3-59)
                Sample x^[s] given (y, x^[s]) using equations (3-61, 62)
        End loop3 after the required number of iterations
    End loop2
    Compute every state probability density with the sampled state
    parameters for all the image sequences
    Find the best state sequence using the Viterbi algorithm
End loop1 after a fixed number of iterations

Figure 3-5. LSampFealHMM Training Algorithm
[Figure: diagram relating the observation sequence X_1, X_2, ..., X_T, its zones, the sum hit-miss mask M_ti, and the features B_ti and A_ti obtained by max over columns and sum over zones.]

Figure 3-6. Initial feature model for TSampFealHMM
[Figure: diagram relating the observation sequence X_1, X_2, ..., X_T, the zones convolved with an all-one mask to give Z_t, the sum hit-miss mask M_t, and the resulting C_t and A_t.]

Figure 3-7. Final feature model for TSampFealHMM
TSampFealHMM Training Algorithm:
Set the hyperparameters
Initialize the state sequence Q^[0]
Initialize the variables [p^[0], M^[0], C^[0], Z^[0]]
Loop1
    At iteration s
        Sample the state q_t^[s], given all other states of the sequence, from the
        multinomial distribution one-by-one after computing the probabilities
        P(q_t | q_-t) using equation (3-34)
        Sample the transition probabilities a_rs^[s] from the Dirichlet distribution
        after counting the state pairs in the sequences of all observations
        using equation (3-33)
        Sample the state parameters
            Sample Z^[s] given (A, M^[s-1], C^[s-1]) using equation (3-65)
            Sample p^[s] given M^[s-1] and its hyperparameters using equation (3-76)
            Sample M^[s] given (p^[s], C^[s-1], Z^[s]) using equation (3-80)
            Sample C^[s] given (Z^[s], M^[s], μ^[s], σ^[s]) using equation (3-84)
            Sample μ^[s], σ^[s] given (x^[s], Q^[s]) using equations (3-86, 87)
Stop loop1 after a fixed number of iterations
Compute x with the fixed mask taken as the mean of the samples
Loop2
    Sample the state q_t, given all other states of the sequence, from the
    multinomial distribution one-by-one after computing the probabilities
    P(q_t | q_-t) using equation (3-34)
    Sample the transition probabilities a_rs from the Dirichlet distribution
    after counting the state pairs in the sequences of all observations
    using equation (3-33)
    Sample the state parameters
        Sample μ, σ given (x, Q) using equations (3-86, 87)
Stop loop2 after a fixed number of iterations

Figure 3-8. TSampFealHMM Training Algorithm
CHAPTER 4
EMPIRICAL ANALYSIS
4.1 Data Sets
Experiments were performed using both synthetic and real data sets. There were
two types of real data, GPR and handwritten digit data. There also were two synthetic
data sets. The first synthetic data set contained samples from two classes. Each sample
was a 29 × 23 image with intensity values in the interval [0, 1]. Classes consisted of
images of simulated “hyperbolas” built from line segments oriented at 45 and 135
degrees plus noise for Class 1, and at 60 and 120 degrees plus noise for Class 2. To
generate the sample images, a binary image was used as the template, then additive
Gaussian noise (0.1 mean and 0.1 standard deviation) and “salt and pepper” noise
(with a probability 0.3 of changing state) were used to corrupt the template. Ten
samples are shown in Figure 4-1. The synthetic data contained 300 images from each
class in the training set and 40 images from each class in the testing set. McFeaLHMM
was applied to this data set to show the performance of the algorithm. We refer to this
data set as SynData1.
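The corruption recipe for SynData1 (additive Gaussian noise plus salt-and-pepper state changes applied to a binary template) can be sketched as follows. The order of the two corruptions and the exact flip semantics are assumptions, and the 45-degree diagonal template here is a crude stand-in for the real hyperbola templates.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(template, p_flip=0.3, noise_mean=0.1, noise_std=0.1):
    """Additive Gaussian noise plus salt-and-pepper 'state changes' on a binary
    template; the ordering of the two corruptions is an assumption."""
    img = template.astype(float) + rng.normal(noise_mean, noise_std, template.shape)
    flip = rng.random(template.shape) < p_flip   # pixels whose state changes
    img[flip] = 1.0 - img[flip]
    return np.clip(img, 0.0, 1.0)                # keep intensities in [0, 1]

template = np.zeros((29, 23))
np.fill_diagonal(template[:23, :23], 1.0)        # crude 45-degree line segment
sample = corrupt(template)
```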
The second synthetic data set contained image sequences, 100 sequences for each
class. Each sequence had nine images; each image was a 5 × 5 image with intensity
values in the interval [0, 1]. The feature class had the sequences of simulated
“hyperbolas.” These sequences contained three groups of images, associated with
images containing line segments oriented at 45, 180, and 135 degrees, respectively.
To generate the sample images, we first used
a fixed, left-right transition matrix to generate state sequences. Then, according to the
state sequences, line images from that state were generated. A binary image was used
as the template, then additive Gaussian noise (0.1 mean and 0.1 standard deviation)
and “salt and pepper” noise (with probability 0.3 of changing state) were used to corrupt
the template. The background sequences consisted of images from corrupted blank
templates. Some sample sequences are shown in Figure 4-9. In the figure, there are
ten sequences from each class from top to bottom. Each row is a sequence of ten
images. Two adjacent images in the sequence are separated by a blank column. The
two adjacent sequence rows are separated by a blank row. LSampFeaLHMM was
applied to this data set to show the performance of the algorithm. We refer to this data
set as SynData2.
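The state-sequence generation step for SynData2, drawing from a fixed left-right transition matrix, can be sketched as below; the transition probabilities are illustrative, not the ones used to build the data set.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_state_sequence(A, length, start=0):
    """Draw a state sequence from a fixed left-right transition matrix A."""
    q = [start]
    for _ in range(length - 1):
        q.append(int(rng.choice(len(A), p=A[q[-1]])))
    return q

# Illustrative left-right matrix: each state holds or advances, never goes back.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
states = sample_state_sequence(A, length=9)
```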
There are three GPR data sets. They were all acquired using NIITEK time-domain
GPR systems well described in the literature (Lee et al., 2007). The focus here is on
a discussion of the relative performance of the new and old algorithms. The data sets
contained two classes: anti-tank (AT) mines and non-mines. The mines include both
plastic-cased and metal-cased types. In all cases, the various HMM algorithms were applied to
alarms detected by a pre-screener (Gader et al., 2004).
The first GPR data set was acquired from an arid test site. It consisted of 120 mine
encounters and 120 false alarms. Data samples were extracted from pre-screener
alarms. This set contains 80 images from each class in the training set and 40 images
from each class in the testing set. Each data sample is a 29 × 23 image with intensity
values in the interval [-1, 1]. Ten samples are shown in Figure 4-2. This author produced
all the HMM results on this data set. Comparisons of standard HMM (Ma, 2004) and
McFeaLHMM algorithms were made on this set. We refer to it as the GPRArid dataset.
The second GPR data set was acquired from a temperate test site. It consisted
of 316 mine encounters, of which 234 were plastic-cased, and 1,025 false alarms.
Similar to the GPRArid data set, data samples were extracted from pre-screener alarms.
Each data sample is a 29 × 23 image with intensity values in the interval [-1, 1].
Lane-based 10-fold cross validation was applied to this dataset. This author produced
all the HMM results from this data set. Comparisons of standard EM-HMM, SampHMM,
and LSampFeaLHMM algorithms were made on this set. We refer to it as the GPRTemp
dataset.
The third GPR data set contained measurements from the different geographical
sites, referred to as S1 and S2. The data at S1 were measured with two different NIITEK
systems, A1 and A2.
A significant point is that the HMM experiments were not run by the author on
this data set. They were run by others (P. Gader and J. Bolton, pers. comm.) and
verified by P. Gader, the adviser for this dissertation study. Algorithms SampHMM and
DTXTHMM were compared using this set, which is referred to as the GPRTwoSite
data set. Other comparisons will be made in the future, but are limited by the ability to
transfer algorithms. Furthermore, true false alarm rates are not given. False alarm rates
are given in arbitrary units, but are proportional in the sense that if algorithm A has x
false alarms and algorithm B has rx false alarms, then algorithm B has r times as many
false alarms as algorithm A. Again, the focus of this research is to evaluate relative
performance of algorithms, not GPR systems. Thus, absolute false alarm rates are not
necessary.
The handwritten data consist of images acquired from the MNIST data set (LeCun
and Cortes). The purpose of these experiments is to compare performance of the
feature learning HMM with an HMM trained using handmade features. TSampFeaLHMM
and SampHMM are compared using this set, which is referred to as the HWDR dataset.
4.2 Experiments and Results
The experiments conducted are shown in Table 4-1. The DTXTHMM represents the
baseline algorithm for mine detection using GPR data. It has been developed over years
and versions of it have demonstrated excellent performance on several GPR systems
(Frigui et al., 2005; Gader et al.; Wilson et al., 2007; Zhao et al., 2003). We compare
against it for the landmine detection experiments.
The HMM algorithm for HWDR with handmade features will be described in section
4.2.6.
4.2.1 SynData1
McFeaLHMM was investigated using SynData1. SynData1 contains 300 images
from each class in the training set and 40 images from each class in the testing set.
Two-dimensional features were used.
Since the algorithm is sensitive to initialization, we considered two different mask
initializations. One initialization used hit-miss pairs representing horizontal and vertical
line segments. The other used hit-miss pairs representing diagonal and anti-diagonal
line segments. The two pairs of initial masks are shown in Figure 4-3. Each mask is a 5
× 5 array image with values in the interval [0,1].
The OWA operators associated with each mask were randomly initialized. The
same OWA operators were used for both horizontal/vertical and diagonal/anti-diagonal
initializations. The weights are shown in Figure 4-4. The vertical axis is the value of
the weights. The horizontal axis is the index of the ordered elements of the mask. The
height of the bar at index i indicates the value of wi in equation 3–2. The first fifteen
weights were set to have very small values, and the other ten weights were sampled
from the uniform distribution on the interval [0, 1] initially. The weights were normalized
for each mask.
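The OWA initialization and the operator itself (equation 3–2) can be sketched as follows. The sort direction inside the OWA is an assumption; the split of fifteen near-zero weights plus ten uniform ones follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_owa_weights(n=25, n_small=15):
    """First n_small weights near zero, the rest uniform on [0, 1), then normalize."""
    w = np.concatenate([np.full(n_small, 1e-6), rng.random(n - n_small)])
    return w / w.sum()

def owa(values, w):
    """Ordered weighted average: sort the inputs, then take the weighted sum.
    Ascending sort is an assumption about equation 3-2's ordering convention."""
    return float(np.sort(values) @ w)

w = init_owa_weights()
```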
After McFeaLHMM feature learning, the classification rate over the test set on
synthetic data was 100%. The final masks are shown in Figure 4-5. The masks are
very similar to initial masks. This is expected since these final masks can extract the
“hyperbola” shape information of the training images. The final OWA weights are shown
in Figure 4-6. These final weights are similar to a mean OWA operator, or they favor the
high-valued elements of the masks.
The result shows that McFeaLHMM can achieve great performance with good
initialization. However, the experiment also showed that the algorithm is sensitive to
initialization. The algorithm could not converge with random initialization.
4.2.2 GPRArid
The McFeaLHMM algorithm and the standard HMM algorithm (DTXTHMM) were
compared using GPRArid. The standard HMM algorithm used human-made feature
masks, which were created by an expert on GPR data.
GPRArid contains 80 images from each class in the training set and 40 images from
each class in the testing set. The dimensionality of the feature was chosen to be four.
Two dimensions were extracted from the positive part of the image and two dimensions
were extracted from the negative part of the image. The masks and OWA operators
were initialized the same way as for SynData1.
The algorithms were trained on the training set and tested on the test set. The
plots of Probability of Detection (PD) vs. Probability of False Alarm (PFA) on the
test set for the standard HMM, McFeaLHMM-trained via feature learning initialized
with the horizontal and vertical masks, and McFeaLHMM-trained via feature learning
initialized with the diagonal and anti-diagonal features are shown in Figure 4-8. As can
be seen, the PFA of the feature learning algorithm is reduced to 80% of a standard HMM
algorithm at a PD of 90%. The features learned by the McFeaLHMM algorithm on the
training set are shown in Figure 4-7.
The final OWA weights trained on GPRArid did not appear qualitatively different
from the weights trained on SynData1. The final masks are similar to the initial masks.
This is not surprising since these mask can extract information from the “hyperbola”
shape of mine images.
The experiments have shown that this MCE feature learning method is very
sensitive to initialization. The learning rates also had to be carefully tuned; otherwise the
training algorithm could not converge to a stable point. In fact, identifying learning
algorithms that are not so sensitive was the reason that our sampling feature learning
algorithms were proposed.
4.2.3 GPRTwoSite
Comparisons of the SampHMM algorithm and the standard HMM algorithm
(DTXTHMM) were made on this dataset. Lane-based 10-fold cross validation (P.
Gader and J. Bolton, pers. comm.) was used to evaluate the algorithms.
The plots of PD vs. PFA for the SampHMM and DTXTHMM algorithms on the different
data sites are shown in Figures 4-23, 4-24, 4-25, 4-26, 4-27, and 4-28. The plots show
that the PFA of SampHMM is lower than, and in some cases half of, the PFA of the
standard HMM algorithm at PDs of 90% and 85%. We conclude that the SampHMM
algorithm outperforms the existing HMM algorithms for GPR mine detection.
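The PD-vs-PFA operating points used throughout these comparisons can be computed from per-alarm confidence scores by sweeping a threshold; a minimal sketch (with toy scores, not the GPR data) is:

```python
import numpy as np

def pd_pfa_curve(scores_mine, scores_fa):
    """PD/PFA operating points swept over all observed score thresholds."""
    thresholds = np.unique(np.concatenate([scores_mine, scores_fa]))
    return [(float(np.mean(scores_fa >= t)),    # PFA: false alarms above threshold
             float(np.mean(scores_mine >= t)))  # PD: mines above threshold
            for t in thresholds]

curve = pd_pfa_curve(np.array([0.9, 0.8, 0.7]), np.array([0.6, 0.4, 0.2]))
```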
4.2.4 SynData2
LSampFeaLHMM was tested using SynData2. This synthetic data set contains
the data samples of image sequences. There are 100 sequences for each class. All of
them were used as training samples. The purpose was to show that the feature masks
learned by the algorithm could match the templates used to generate the training data
samples. Experiments were conducted using hit masks and hit-miss masks.
We applied three-state HMM models in the training process. After Gibbs feature
learning with the hit mask only, the final masks are shown in Figure 4-10. From left to
right, there are three states shown in the figure. The bottom row shows the hit mask for
each state. The top row shows the value of features. The horizontal axis indicates the
index of images in the specified state for all sequences. The first half is from the feature
class, and the second half is from the background class. The vertical axis shows the
feature values of images after applying the mask. The final hit masks are an intuitive
result, exactly what we expected: they perfectly matched the binary templates that
we used to generate the training samples. The feature values are well separated
between feature images and background images; a large feature value indicates a
strong ‘hit’ in that image.
After Gibbs feature learning with hit-miss mask instead of hit mask only, the final
result for the state of the 135 degree image group is shown in Figure 4-11. The left part
of the figure shows the images associated with this state at the final iteration. The top
half is from the feature class, and the bottom half is from the background class. Each
row is a vector format of an image. The feature values of images in this state are shown
at the top center of the figure. The final hit-miss mask is shown at the bottom center of
the figure. The intensities of the image for the hit-miss mask are in the interval [-1, 1].
The left part of the figure shows the matrix format of images. The left two columns are
from the feature class. The right-most column is from the background class. The two
adjacent images are separated by a blank row in each column. The feature value figure
shows the feature value with the hit-miss mask is more separated than the feature-value
with hit mask only.
The individual hit mask, do-not-care mask, and miss mask are shown in Figure
4-12. The intensity values of all three images are in the interval [0, 1]. The figure shows
that the hit mask is an anti-diagonal shape. The miss mask is a negative anti-diagonal
mask. The do-not-care mask is almost blank, which fits this data set.
We also conducted experiments to test the performance of the algorithm when
images are not aligned in the same position. It would require the algorithm to shift the
image to match the feature mask in the learning process. The results are shown in
Figure 4-13. From the left part and right part of the figure, we can see that some images
in this group do not align, but the results show that the feature values are well separated
again. The final hit-miss mask still has a good anti-diagonal shape, although it is not crisp.
The individual hit mask, do-not-care mask, and miss mask are shown in Figure
4-14. The hit mask still has good shape information. The miss mask has weak values,
since it is hard to match against the off-alignment positions. The do-not-care mask gains
some intensity, since the offset positions may not contribute as much, either from the
positive part or from the negative part. The experiments show that LSampFeaLHMM
can learn intuitive features that produce improved classification.
4.2.5 GPRTemp
SampHMM, LSampFeaLHMM and standard HMM were compared using GPRTemp.
Both SampHMM and standard HMM use the human-made feature masks created by
experts. LSampFeaLHMM tried to learn feature masks in the training process. Similar
to experiments using GPRTwoSite, lane-based cross validation was used to evaluate
the algorithms. In the experiment for LSampFeaLHMM, the data was preprocessed
first. Each mine image was normalized and was scaled to the interval [0, 1]. Then the
image was semi-thresholded and skeletonized to obtain a crisp gray-level image. Then
we moved a 5 × 5 window along the x-axis to capture the image sequences. For each
data sample of a mine image, one image sequence was extracted. Twenty-five image
sequence samples are shown in Figure 4-15. Each sequence is along the row. Two
sequences are separated by a horizontal gray bar, and two adjacent images in one
sequence are separated by a vertical gray bar. It can be seen that the sequences
consist of ascending-edge and descending-edge images.
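The sliding-window sequence extraction described above can be sketched as below; the stride and the vertical placement of the 5 × 5 window are not specified in the text, so the sketch assumes stride 1 and a vertically centered band.

```python
import numpy as np

def extract_sequence(image, win=5):
    """Slide a win x win window along the x-axis of a preprocessed image,
    producing one observation sequence (stride 1, centered band assumed)."""
    h, w = image.shape
    top = (h - win) // 2
    return [image[top:top + win, x:x + win] for x in range(w - win + 1)]

seq = extract_sequence(np.zeros((29, 23)))
```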
Two hundred image sequences from the mine class were extracted from landmine
data in the training set. The final hit masks after Gibbs feature learning with a four-state
HMM setting are shown in Figure 4-16. Each 5 × 5 block is one hit mask for one state,
and two masks are separated by a vertical black bar. The second state has very few
samples associated with it, so the second hit mask can be ignored. It is clear that the
final three hit masks are capturing ascending-edge, flat-edge, and descending-edge
information, respectively.
The final hit masks after Gibbs feature learning with a three-state HMM setting are
shown in Figure 4-17. The second state has very few samples associated with it, so the
second hit mask is ignored. The first and third hit masks capture the ascending-edge
and descending-edge information, respectively.
The plots of PD vs PFA on the test site for the standard HMM, SampHMM, and
LSampFeaLHMM algorithm are shown in Figure 4-18. The figure shows that the HMM
sampling algorithm has the lowest PFA at 90% of PD. The PFA of the Gibbs feature
learning algorithm matches or exceeds the standard HMM algorithm at most PDs. The
result shows that the HMM sampling algorithm performed best on the landmine GPR
data set. Our feature learning algorithm (LSampFeaLHMM) can match or exceed HMM
algorithms with human-made feature masks, thus saving time and labor.
4.2.6 Handwritten Digits
Comparisons of the HMM algorithm with human-made feature masks and the
TSampFeaLHMM algorithm were made on handwritten digits. The HMM algorithm used here is the
sampling HMM algorithm similar to the SampHMM algorithm. It has one Gaussian
component per state. The human-made masks were used to perform convolution
over digit images to create features, as shown in Figure 3-7. In experiments on the
TSampFeaLHMM algorithm, these feature masks were estimated in the training
process. The human-made masks are shown in Figure 4-22. They are nine line
segments that were oriented 0, 20, 40, 50, 70, 90, 110, 130, and 160 degrees,
respectively. These masks were created to simulate the commonly used edge detectors
(Frigui et al., 2009). An all blank mask was added to capture the empty background
zone.
Each raw digit image is a 28 × 28 gray-level image. The intensity values were then
scaled to [0, 1]. Next, the principal transform algorithm was applied to the images, so
that the two principal directions of the images were horizontal and vertical axes. Then
the background values of the images were set to -1, and zero values were padded
to the edges of the digits. Some samples are shown in Figure 4-19. Next, each image
was split into sixteen overlapping 8 × 8 zones, as shown in Figure 4-20. Ordering the
zones along the anti-diagonal direction, from top-right to bottom-left, formed a sequence
from these sixteen zones.
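The zone decomposition and anti-diagonal ordering can be sketched as follows; since 8 × 8 windows cannot tile a 28 × 28 image with a uniform integer stride, the zone offsets (0, 7, 13, 20) are an assumption, as is the tie-breaking within each anti-diagonal.

```python
import numpy as np

def split_zones(image, win=8, grid=4):
    """Split a square image into grid x grid overlapping win x win zones and
    order them along the anti-diagonal, top-right to bottom-left."""
    n = image.shape[0]
    starts = np.round(np.linspace(0, n - win, grid)).astype(int)  # assumed: 0, 7, 13, 20
    zones = [(r, c, image[sr:sr + win, sc:sc + win])
             for r, sr in enumerate(starts) for c, sc in enumerate(starts)]
    # Anti-diagonal order: group by r + c, scanning from top-right toward bottom-left.
    zones.sort(key=lambda z: (z[0] + z[1], -z[1]))
    return [z[2] for z in zones]

seq = split_zones(np.zeros((28, 28)))
```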
Four digit classes were picked in experiments: 0, 1, 2, 4. The training set had 300
digit images for each class. The test set also had 300 digit images for each class. The
algorithms were trained on the training set and tested on the test set.
Seven-state HMM models were used in the experiment using TSampFeaLHMM
with digit zone sequences. Full transition matrices were used. First, in the training
process, the feature masks and HMM state parameters were learned simultaneously.
Sampling was performed for 10,000 iterations, a sufficient burn-in period; then the
feature masks were fixed and the HMM parameters were updated for subsequent
iterations. Training was stopped after 15,000 iterations. The second loop was to
fine-tune the HMM parameters. The same training process, without the feature learning
step, was used for SampHMM.
Classification results of TSampFeaLHMM are shown in Table 4-2 as the
confusion matrix for the digit pair 0 and 1. The row index is the true class and the
column index is the algorithm's classification. The classification error is about 2% for
these two digit classes.
Classification results of TSampFeaLHMM for digit classes 0, 1, 2, and 4 are
shown in Table 4-3. Digit class 2 was the most difficult to classify correctly. The
classification error is about 8% for these four digit classes.
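The "about 8%" figure can be checked directly from Table 4-3: the classification error is one minus the trace of the confusion matrix divided by its total count.

```python
import numpy as np

# Confusion matrix from Table 4-3 (rows: true class 0, 1, 2, 4; columns: predicted).
conf = np.array([[277,   0,  10,  13],
                 [  0, 292,   1,   6],
                 [ 12,   0, 256,  32],
                 [  2,   1,  13, 284]])

error = 1.0 - np.trace(conf) / conf.sum()   # off-diagonal fraction, roughly 0.075
```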
The final hit-miss masks and transition matrix are shown in Figure 4-21. The size of
each pixel indicates the relative value of the corresponding element of the transition
matrix. It is difficult to draw conclusive information from these hit masks.
Classification results of the HMM algorithm with human-made masks for digits
0, 1, 2, and 4 are shown in Table 4-4. The classification error is about 14% for these
digits, which is worse than our feature learning algorithm. These results show that the
feature learning HMM outperforms the HMM algorithm using human-made features.
Table 4-1. Algorithms and Datasets

             McFeaLHMM   SampHMM   LSampFeaLHMM   TSampFeaLHMM   DTXTHMM*
SynData1         X
SynData2                                X
GPRArid          X                                                   X
GPRTemp                      X          X                            X
GPRTwoSite                   X                                       X
HWDR                         X*                        X

*Using handmade features
Table 4-2. Confusion matrix for digit pair 0 and 1 for TSampFeaLHMM

Digits     0     1
0        300     0
1          8   292
Table 4-3. Confusion matrix for digits 0, 1, 2, 4 for TSampFeaLHMM

Digits     0     1     2     4
0        277     0    10    13
1          0   292     1     6
2         12     0   256    32
4          2     1    13   284
Table 4-4. Confusion matrix for digits 0, 1, 2, 4 for HMM with human-made masks

Digits     0     1     2     4
0        256     1    37     6
1          1   291     5     3
2         13     0   274    13
4         52     1    38   209
(a) Class 1: 45-degree angle images (ten randomly picked images from class 1).
(b) Class 2: 60-degree angle images (ten randomly picked images from class 2).
Figure 4-1. Ten samples from each class of dataset SynData1.
(a) Ten samples from the mine data set.
(b) Ten samples from the non-mine data set.
Figure 4-2. Samples from each class of dataset GPRArid.
(a) Horizontal and vertical pairs.
(b) Diagonal and anti-diagonal pairs.
Figure 4-3. Hit-miss pairs for initial masks.
Figure 4-4. Initial OWA weights for hit and miss. The top weights correspond to the hit mask and the bottom weights correspond to the miss mask.
(a) Horizontal and vertical pairs.
(b) Diagonal and anti-diagonal pairs.
Figure 4-5. Hit-miss masks after feature learning corresponding to the initial masks in Figure 4-3.
(a) Hit and miss weights for horizontal and vertical initialization.
(b) Hit and miss weights for diagonal and anti-diagonal initialization.
Figure 4-6. OWA weights after feature learning.
(a) Horizontal and vertical pairs.
(b) Diagonal and anti-diagonal pairs.
Figure 4-7. Final masks learned for the landmine data. Each row represents a different feature: rows 1 and 2 are positive; rows 3 and 4 are negative.
Figure 4-8. Receiver operating characteristic curves comparing McFeaLHMM with two different initializations to the standard HMM. (PD/PFA operating points from the legend — diagonal + anti-diagonal masks: 100/15.0, 95/7.5, 90/2.5; horizontal + vertical masks: 100/45.0, 95/7.5, 90/2.5; standard HMM: 100/37.5, 95/22.5, 90/12.5.)
Figure 4-9. Left: ascending-edge, flat-edge, and descending-edge sequences. Right: sequences from the noise background.
Figure 4-10. Hit masks after Gibbs feature learning.
Figure 4-11. Result for the 135-degree state after Gibbs feature learning with hit-miss masks. (Feature-value threshold shown: -1.2596.)
Figure 4-12. Hit-miss masks after Gibbs feature learning. (Panels: hit, do-not-care, miss.)
Figure 4-13. Result with shifted training images after Gibbs feature learning with hit-miss masks. (Feature-value threshold shown: 2.)
Figure 4-14. Hit-miss masks after Gibbs feature learning with shifted training images. (Panels: hit, do-not-care, miss.)
Figure 4-15. 25 sample sequences extracted from mine images from dataset GPRTemp.
Figure 4-16. Hit masks after Gibbs feature learning with the four-state HMM setting.
Figure 4-17. Hit masks after Gibbs feature learning with the three-state HMM setting.
Figure 4-18. Receiver operating characteristic curves comparing the LSampFeaLHMM and SampHMM algorithms with the standard HMM algorithm. (Axes: PD vs. FAR in FA/m²; site: a temperate test site.)
Figure 4-19. 18 samples for each digit from MNIST.
Figure 4-20. Two samples for each digit to show zone splitting. Panels: (a) digit 0, (b) digit 1, (c) digit 2, (d) digit 4.
Figure 4-21. Hit masks and transition matrices after Gibbs feature learning for digits. Panels: (a) digit 0, (b) digit 1, (c) digit 2, (d) digit 4.
Figure 4-22. Ten human-made masks.
Figure 4-23. HMMSamp vs DTXTHMM on GPR A1 at site S1 while clutter == mine (PD vs. false alarms (a.u.)).
Figure 4-24. HMMSamp vs DTXTHMM on GPR A1 at site S1 while clutter ∼= mine.
Figure 4-25. HMMSamp vs DTXTHMM on GPR A2 at site S1 while clutter ∼= mine.
Figure 4-26. HMMSamp vs DTXTHMM on GPR A2 at site S2 while clutter ∼= mine.
Figure 4-27. HMMSamp vs DTXTHMM on GPR A2 at site S1 while clutter == mine.
Figure 4-28. HMMSamp vs DTXTHMM on GPR A2 at site S2 while clutter == mine.
CHAPTER 5
CONCLUSIONS AND FUTURE WORK
The performance of feature-based learning methods such as HMMs depends not only
on the design of the classifier, but also on the features. Few studies have
investigated both as a whole system. Features that accurately and succinctly
represent the discriminating information in an image or signal are essential to any classifier.
Our approach involved developing a parameterized model of feature extraction
based on morphological or linear convolution operations. To learn the parameters of
the feature model and the HMM, two feature learning algorithms were developed that
simultaneously extract the features and learn the parameters of the HMM.
One algorithm is based on minimum classification error, and the other is based on Gibbs
sampling. The Gibbs sampling method was adopted because it is more robust to
initialization and achieves better solutions. The experiments show that this new method
can outperform the other methods in the landmine detection application.
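To make the sampling-based learning concrete, the sketch below shows one Gibbs sweep for a simple discrete-observation HMM: a state sequence is drawn by forward filtering and backward sampling, and the transition and emission rows are then resampled from their Dirichlet posteriors. This is only a minimal illustration under assumed symmetric Dirichlet priors, with illustrative function names; the algorithms developed in this work additionally sample the feature-mask parameters and use Gaussian emissions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_states(obs, A, B, pi, rng):
    """Forward-filter, backward-sample one state sequence given HMM params."""
    T, N = len(obs), len(A)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()                  # scaled filtering distribution
    states = np.zeros(T, dtype=int)
    states[-1] = rng.choice(N, p=alpha[-1])
    for t in range(T - 2, -1, -1):                  # backward sampling pass
        p = alpha[t] * A[:, states[t + 1]]
        states[t] = rng.choice(N, p=p / p.sum())
    return states

def gibbs_sweep(obs, A, B, pi, rng, prior=1.0):
    """One Gibbs sweep: sample states, then resample each row of the
    transition matrix A and emission matrix B from its Dirichlet posterior
    (conjugate update: prior pseudo-counts plus observed counts)."""
    N, M = B.shape
    s = sample_states(obs, A, B, pi, rng)
    A_counts = np.full((N, N), prior)
    B_counts = np.full((N, M), prior)
    for t in range(len(obs) - 1):
        A_counts[s[t], s[t + 1]] += 1
    for t, o in enumerate(obs):
        B_counts[s[t], o] += 1
    A_new = np.array([rng.dirichlet(row) for row in A_counts])
    B_new = np.array([rng.dirichlet(row) for row in B_counts])
    return s, A_new, B_new
```

Repeating such sweeps yields samples from the joint posterior over states and parameters, which is what makes the method less sensitive to initialization than gradient-based training.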
Additionally, a new method for learning the parameters of an HMM with
multivariate Gaussian mixture emissions has been presented. This method has been shown to
improve performance on both synthetic and real data sets compared to existing
state-of-the-art methods and to human-made features.
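One way such Gaussian mixture emission parameters can be resampled inside a Gibbs scheme is sketched below (a hedged illustration; the exact updates used in this work may differ): for the observations currently assigned to one HMM state, component labels are drawn from the responsibilities, and each component mean is then drawn from its Gaussian posterior, assuming fixed covariances and a zero-mean Gaussian prior on the means. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def resample_mixture_means(X, means, covs, weights, rng, prior_var=10.0):
    """Sample mixture-component labels for the rows of X, then draw each
    component mean from its Gaussian posterior (covariances held fixed,
    N(0, prior_var * I) prior on each mean)."""
    K, d = means.shape
    # Log responsibilities for each component (up to a shared constant).
    logp = np.stack([
        np.log(weights[k])
        - 0.5 * np.sum((X - means[k]) @ np.linalg.inv(covs[k]) * (X - means[k]), axis=1)
        - 0.5 * np.log(np.linalg.det(covs[k]))
        for k in range(K)
    ], axis=1)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(K, p=row) for row in p])
    new_means = means.copy()
    for k in range(K):
        Xk = X[labels == k]
        if len(Xk) == 0:
            continue  # no data assigned: keep the current mean
        Sinv = np.linalg.inv(covs[k])
        post_prec = len(Xk) * Sinv + np.eye(d) / prior_var
        post_cov = np.linalg.inv(post_prec)
        post_mean = post_cov @ (Sinv @ Xk.sum(axis=0))
        new_means[k] = rng.multivariate_normal(post_mean, post_cov)
    return labels, new_means
```

Interleaving this update with the state-sequence sampling gives a full sweep over states, labels, and emission parameters.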
Specifically, the following results were achieved:
• McFeaLHMM is very sensitive to initialization and learning rates.
• SampHMM was far superior to known HMMs for GPR mine detection.
• All feature learning models can achieve performance similar to or better than
human-made features in the HMM framework.
• The two sampling HMM feature learning algorithms are much more stable and can
produce better solutions than the McFeaLHMM algorithm in landmine detection
experiments.
Future work will include: applying feature learning algorithms to other datasets,
such as the GPRTwoSite; using a sigmoid model instead of a Gaussian model as the
state probability function in the HMM framework; performing discriminative training
via sampling; and investigating sampling OWA operators using Metropolis-Hastings
algorithms.
BIOGRAPHICAL SKETCH
Xuping Zhang is a Ph.D. student at the University of Florida. He earned his
bachelor’s degree at Tsinghua University, Beijing, China. His research interests include
landmine detection, artificial intelligence, machine learning, Bayesian methods, feature
learning for images/signals, data mining, and pattern recognition.