
THE CORRENTROPY MACE FILTER FOR IMAGE RECOGNITION

By

KYU-HWA JEONG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007


© 2007 Kyu-Hwa Jeong


I dedicate this to my parents and family.


ACKNOWLEDGMENTS

First of all, I thank my advisor, Dr. Jose C. Principe, for his great inspiration, support, and guidance throughout my graduate studies. I was impressed by Dr. Principe's active mind and deeply appreciated his supervision, which gave me many opportunities to explore in my research.

I am also grateful to all the members of my supervisory committee, Dr. John G. Harris, Dr. K. Clint Slatton, and Dr. Murali Rao, for their valuable time and interest in serving, as well as for their comments, which helped improve the quality of this dissertation.

I also express my appreciation to all the CNEL colleagues, especially the ITL group members Jianwu Xu, Puskal Pokharel, Seungju Han, Sudhir Rao, Antonio Paiva, and Weifeng Liu, for their help, collaboration, and valuable discussions during my PhD study.

Finally, I express my great love for my wife, Inyoung, and our two lovely sons, Hoyeon (Luke) and Seungyeon (Justin). I thank Inyoung for her love, caring, and patience, which made this study possible. I am also grateful to my parents for their great support throughout my life.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Background
   1.2 Motivation

2 FUNDAMENTAL DISTORTION INVARIANT LINEAR CORRELATION FILTERS
   2.1 Introduction
   2.2 Synthetic Discriminant Function (SDF)
   2.3 Minimum Average Correlation Energy (MACE) Filter
   2.4 Optimal Trade-off Synthetic Discriminant (OTSDF) Function

3 KERNEL-BASED CORRELATION FILTERS
   3.1 Brief Review of the Kernel Method
       3.1.1 Introduction
       3.1.2 Kernel Method
   3.2 Kernel Synthetic Discriminant Function
   3.3 Application of the Kernel SDF to Face Recognition
       3.3.1 Problem Description
       3.3.2 Simulation Results

4 A RKHS PERSPECTIVE OF THE MACE FILTER
   4.1 Introduction
   4.2 Reproducing Kernel Hilbert Space (RKHS)
   4.3 Interpretation of the MACE filter in the RKHS

5 NONLINEAR VERSION OF THE MACE IN A NEW RKHS: THE CORRENTROPY MACE (CMACE) FILTER
   5.1 Correntropy Function
       5.1.1 Definition
       5.1.2 Some Properties
   5.2 The Correntropy MACE Filter
   5.3 Implications of the CMACE Filter in the VRKHS
       5.3.1 Implication of Nonlinearity
       5.3.2 Finite Dimensional Feature Space
       5.3.3 The Kernel Correlation Filter vs. the CMACE Filter: Prewhitening in Feature Space

6 THE CORRENTROPY MACE IMPLEMENTATION
   6.1 The Output of the CMACE Filter
   6.2 Centering of the CMACE in Feature Space
   6.3 The Fast CMACE Filter
       6.3.1 The Fast Gauss Transform
       6.3.2 The Fast Correntropy MACE Filter

7 APPLICATIONS OF THE CMACE TO IMAGE RECOGNITION
   7.1 Face Recognition
       7.1.1 Problem Description
       7.1.2 Simulation Results
   7.2 Synthetic Aperture Radar (SAR) Image Recognition
       7.2.1 Problem Description
       7.2.2 Aspect Angle Distortion Case
       7.2.3 Depression Angle Distortion Case
       7.2.4 The Fast Correntropy MACE Results
       7.2.5 The Effect of Additive Noise

8 DIMENSIONALITY REDUCTION WITH RANDOM PROJECTIONS
   8.1 Introduction
   8.2 Motivation
   8.3 PCA and SVD
   8.4 Random Projections
       8.4.1 Random Matrices
       8.4.2 Orthogonality and Similarity Properties
   8.5 Simulations

9 CONCLUSIONS AND FUTURE WORK
   9.1 Conclusions
   9.2 Future Work

APPENDIX

A CONSTRAINED OPTIMIZATION WITH LAGRANGE MULTIPLIERS

B THE PROOF OF PROPERTY 5 OF CORRENTROPY

C THE PROOF OF A SHIFT-INVARIANT PROPERTY OF THE CMACE

D COMPUTATIONAL COMPLEXITY OF THE MACE AND CMACE

E THE CORRENTROPY-BASED ROBUST NONLINEAR BEAMFORMER
   E.1 Introduction
   E.2 Standard Beamforming Problem
       E.2.1 Problem
       E.2.2 Minimum Variance Beamforming
       E.2.3 Kernel-Based Beamforming
   E.3 Nonlinear Beamformer Using Correntropy
   E.4 Simulations
   E.5 Conclusions

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

6-1 Estimated computational complexity for training with N images and testing with one image. Matrix inversion and multiplication are considered.
7-1 Comparison of standard deviations of all the Monte Carlo simulation outputs.
7-2 Comparison of ROC areas with different kernel sizes.
7-3 Case A: Comparison of ROC areas with different kernel sizes.
7-4 Comparison of computation time and error for one test image between the direct method (CMACE) and the FGT method (fast CMACE) with p = 4 and kc = 4.
7-5 Comparison of computation time and error for one test image in the FGT method with different numbers of orders and clusters.
8-1 Comparison of the memory and computation time between the original CMACE and CMACE-RP.


LIST OF FIGURES

1-1 Block diagram for pattern recognition.
1-2 Block diagram for the image recognition process using a correlation filter.
2-1 Example of the correlation output plane of the SDF.
2-2 Example of the correlation output plane of the MACE.
2-3 Example of the correlation output plane of the OTSDF.
3-1 An example of the kernel method.
3-2 Sample images.
3-3 The output peak values when only 3 images are used for training.
3-4 The comparison of ROC curves with different numbers of training images.
3-5 The output values of noisy test input images with additive Gaussian noise when 25 images are used for training.
3-6 The ROC curves of noisy test input images with different SNRs when 10 images are used for training.
5-1 Contours of CIM(X, 0) in 2-D sample space.
7-1 The averaged test output peak values.
7-2 The test output peak values with additive Gaussian noise.
7-3 The comparison of ROC curves with different SNRs.
7-4 The comparison of standard deviations of 100 Monte Carlo simulation outputs of each noisy false-class test image.
7-5 Case A: Sample SAR images (64×64 pixels) of two vehicle types for a target chip (BTR60) and a confuser (T62).
7-6 Case A: Peak output responses of testing images for a target chip and a confuser.
7-7 Case A: ROC curves with different numbers of training images.
7-8 Case A: The MACE output plane vs. the CMACE output plane.
7-9 Sample images of BTR60 of size 64×64 pixels.
7-10 Case A: Output planes with a shifted true-class input image.
7-11 The ROC comparison with different kernel sizes.
7-12 Case B: Sample SAR images (64×64 pixels) of two vehicle types for a target chip (2S1) and a confuser (T62).
7-13 Case B: Peak output responses of testing images for a target chip and a confuser.
7-14 Case B: ROC curves.
7-15 Comparison of ROC curves between the direct and the FGT method in case A.
7-16 Sample SAR images (64×64 pixels) of BTR60.
7-17 ROC comparisons with noisy test images (SNR = 7 dB) in case A.
8-1 The comparison of ROC areas with different RP dimensionality.
8-2 The comparison of ROC areas with different RP dimensionality.
8-3 ROC comparison with different dimensionality reduction methods for MACE and CMACE.
8-4 ROC comparison with PCA for MACE and CMACE.
8-5 Sample images of size 16×16 after RP.
8-6 The cross-correlation vs. cross-correntropy.
8-7 Correlation output planes vs. correntropy output planes after dimension reduction with random projection.
E-1 Comparisons of the beampattern for three beamformers in Gaussian noise with 10 dB of SNR.
E-2 Comparisons of BER for three beamformers in Gaussian noise with different SNRs.
E-3 Comparisons of BER for three beamformers with different characteristic exponent α levels.
E-4 Comparisons of the beampattern for three beamformers in non-Gaussian noise.


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

THE CORRENTROPY MACE FILTER FOR IMAGE RECOGNITION

By

Kyu-Hwa Jeong

August 2007

Chair: Jose C. Principe
Major: Electrical and Computer Engineering

The major goal of my research was to develop nonlinear methods from the family of distortion invariant filters, specifically the minimum average correlation energy (MACE) filter, a well-known correlation filter for pattern recognition. My research investigated a closed-form solution of a nonlinear version of the MACE filter using the recently introduced correntropy function.

Correntropy is a positive definite function that generalizes the concept of correlation

by utilizing higher order moments of the signal statistics. Because of its positive definite

nature, correntropy induces a new reproducing kernel Hilbert space (RKHS). Taking

advantage of the linear structure of the RKHS, it is possible to formulate and solve the

MACE filter equations in the RKHS induced by correntropy. Due to the nonlinear relation

between the feature space and the input space, the correntropy MACE (CMACE) can

potentially improve upon the MACE performance while preserving the shift-invariant

property (additional computation for all shifts will be required in the CMACE).

To alleviate the computational complexity of the solution, my research also presents the fast CMACE using the Fast Gauss Transform (FGT). Both the MACE and CMACE are basically memory-based algorithms, and due to the high dimensionality of the image data, the computational cost of the CMACE filter is one of the critical issues in practical applications. Therefore, my research also used a dimensionality reduction method based on random projections (RP), which has emerged as a powerful method for dimensionality reduction in machine learning.

We applied the CMACE filter to face recognition using facial expression data and the

MSTAR public release Synthetic Aperture Radar (SAR) data set, and experimental results

show that the proposed CMACE filter indeed outperforms the traditional linear MACE

and the kernelized MACE in both generalization and rejection abilities. In addition,

simulation results in face recognition show that the CMACE filter with random projections (CMACE-RP) also outperforms the traditional linear MACE, with only a small degradation in performance relative to the full CMACE but great savings in storage and computational complexity.

CHAPTER 1
INTRODUCTION

1.1 Background

The goal of pattern recognition is to detect and assign an observation into one of

multiple classes to be recognized. The observation can be a speech signal, an image or a

higher-dimensional object. In general there are two broad classes of classification problems: open-set and closed-set. Most classification problems deal with the closed-set case, which means that we have prior information about all of the classes and classify among those given classes. In the open-set classification problem, we only have prior information about one specific class and no prior information about the out-of-class set, which can be the whole universe. The object recognition that we present in this research is an open-set problem; therefore the method that we use is based on one class versus the universe.

There are many applications of object recognition. In automatic target recognition, the goal is to quickly and automatically detect and classify objects which may be present within large amounts of data (typically imagery) with a minimum of human intervention, e.g., vehicle vs. non-vehicle, tanks vs. trucks, or one type of tank vs. another.

Another emerging pattern recognition application is biometrics, such as face, iris, and fingerprint recognition for human identification and verification. Biometric technology is rapidly being adopted in a wide variety of security applications such as computer and physical access control, electronic commerce, homeland security, and defense.

Figure 1-1 shows the block diagram of the common pattern recognition process. In the preprocessing block, denoising, normalization, edge detection, pose estimation, etc. are conducted, depending on the application.

Feature extraction involves simplifying the amount of resources required to describe a

large set of data accurately. When performing analysis of complex data one of the major

problems stems from the number of variables involved. Analysis with a large number


of variables generally requires a large amount of memory and computation power or a

classification algorithm which overfits the training sample and generalizes poorly to new

samples. Feature extraction is a general term for methods of constructing combinations of

the variables to get around these problems while still describing the data with sufficient

accuracy.

The goal of classification is to assign the features derived from the input pattern to

one of the classes. There are a variety of classifiers including statistical classifiers, artificial

neural networks, support vector machine (SVM), and so on.

Another important pattern recognition methodology is to use the training data

directly instead of extracting some features and performing classification based on those

features. While feature extraction works well in many pattern recognition applications, it

is not always easy for humans to identify what the good features may be.

Correlation filters have been applied successfully to automatic target detection and recognition (ATR) [1] for SAR images [2],[3],[4] and to biometric identification such as face, iris, and fingerprint recognition [5],[6], by virtue of their shift-invariant property [7]: if the test image contains the reference object at a shifted location, the correlation output is also shifted by exactly the same amount. Due to this property, there is no need for an additional centering step on the input image prior to recognizing it. Also, in some ATR applications it is desirable not only to recognize various targets but to locate them with some degree of accuracy, and the location can easily be found by searching for the peak of the correlation output. Another advantage of correlation filters is that they are linear, and therefore the solution can be computed analytically.

Figure 1-1. Block diagram for pattern recognition.


Figure 1-2 depicts a simple block diagram of the image recognition process using correlation filters. Object recognition can be performed by cross-correlating an input image with a synthesized template (filter) and searching the correlation output for its peak, which is used to determine whether the object of interest is present or not.
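As a minimal illustration of this pipeline (a sketch with hypothetical data, not code from the dissertation), the snippet below cross-correlates a test image with a template via the FFT and then locates and thresholds the correlation peak:

```python
# Minimal sketch of correlation-based recognition: correlate, find peak, threshold.
import numpy as np

def correlation_peak(image, template):
    """Circular cross-correlation via the FFT; returns peak value and location."""
    plane = np.real(np.fft.ifft2(np.fft.fft2(image) * np.conj(np.fft.fft2(template))))
    loc = np.unravel_index(np.argmax(plane), plane.shape)
    return plane[loc], loc

rng = np.random.default_rng(0)
h = rng.standard_normal((64, 64))          # stand-in for a synthesized filter template
test = np.roll(h, (5, 9), axis=(0, 1))     # reference object at a shifted location
peak, where = correlation_peak(test, h)    # peak appears at the shift (5, 9)
is_present = peak > 0.9 * np.sum(h * h)    # illustrative detection threshold
```

Note how the shift-invariant property appears directly: the correlation peak simply moves with the object, so no centering of the input is needed.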

It is well known that matched filters are the optimal linear filters for signal detection

under linear channel and white noise conditions [8][9]. Matched spatial filters (MSF) are

optimal in the sense that they provide the maximum output signal to noise ratio (SNR)

for the detection of a known image in the presence of white noise, under the reasonable

assumption of Gaussian statistics [10]. However, the performance of the MSF is very

sensitive to even small changes in the reference image and the MSF cannot be used for

multiclass pattern recognition since it is only optimum for a single image. Therefore

distortion invariant composite filters have been proposed in various papers [1].

Distortion invariant composite filters are a generalization of matched spatial filtering

for the detection of a single object to the detection of a class of objects, usually in the

image domain. Typically the object class is represented by a set of training exemplars.

The exemplar images represent the image class through the entire range of distortions of

a single object. The goal is to design a single filter which will recognize an object class

through the entire range of distortion. Under the design criterion, the filter is equally matched to the entire range of distortion, as opposed to a single viewpoint as in a matched filter.

Figure 1-2. Block diagram for the image recognition process using a correlation filter.

The most well known of such composite correlation filters belong to the synthetic

discriminant function (SDF) class [11] and its variations. One of the appeals of the SDF

class is that it can be computed analytically and effectively using frequency domain

techniques. In the conventional SDF approach, the filter is matched to a composite

template that is a linear combination of the training image vectors, such that the cross-correlation output at the origin has the same value for all training images. The hope

is that this composite template will correlate equally well not only with the training

images but also with distorted versions of the training images, as well as with test images

in the same class. One of the problems with the original SDF filters is that because

only the origin of the correlation plane is constrained, it is quite possible that some

other value in the correlation plane is higher than this value at the origin even when the

input is centered at the origin. Since the processing of resulting correlation outputs is

based on detecting peaks, we can expect a high probability of false peaks and resulting

misclassifications in such situations.

The minimum variance SDF (MVSDF) filter was proposed in [12] to take additive input noise into consideration. The MVSDF minimizes the output variance due to zero-mean input noise while satisfying the same linear constraints as the SDF. One of the major difficulties with the MVSDF is that the noise covariance is not known exactly; even when it is known, an inversion is required, which may be computationally demanding.

Another correlation filter that is widely used is the minimum average correlation

energy (MACE) filter [13]. The MACE minimizes the average correlation energy of the

output over the training images to produce a sharp correlation peak subject to the same

linear constraints as the MVSDF and SDF filters. In practice, the MACE filter performs

better than the MVSDF with respect to out-of-class rejection. The MACE filter, however, has been shown to have poor generalization properties; that is, images in the recognition


class but not in the training exemplar set are not recognized well. The MACE filter is

generally known to be sensitive to distortions but readily able to suppress clutter. In

general, it was observed that filters that produce broader correlation peaks (such as the

early SDFs) offer better distortion tolerance. However, they may also provide poorer

discrimination between classes since these filters tend to correlate broadly with low

frequency information in which the classes may be difficult to separate.

By minimizing the average energy in the correlation plane, we hope to keep the side

lobes low while maintaining the origin values at prespecified levels. This is an indirect method to reduce the false-peak or sidelobe problem. However, in their attempt to produce

delta-function type correlation outputs, MACE filters emphasize high frequencies and yield

low correlation outputs with images not in the training set.

Therefore, some advanced MACE approaches such as the Gaussian MACE (G-MACE)

[14], the minimum noise and correlation energy (MINACE) [15] and optimal trade-off

filters [16] have been proposed to combine the properties of various SDF’s. In the

G-MACE, the correlation outputs are made to approximate Gaussian shaped functions.

This represents a direct method to control the shape of the correlation outputs. The

MINACE and G-MACE variations have improved generalization properties with a slight

degradation in the average output plane variance and sharpness of the central peak

respectively.

In most of the previous research on SDF-type filters, linear constraints are

imposed on the training images to yield a known value at specific locations in the

correlation plane. However, placing such constraints satisfies conditions only at isolated

points in the image space but does not explicitly control the filter’s ability to generalize

over the entire domain of the training images.

A new correlation filter design based on relaxed constraints, called the maximum average correlation height (MACH) filter, was proposed in [17]. The MACH filter adopts a statistical approach in that it does not treat training images as deterministic representations of the object, but as samples of a class whose characteristic parameters should be used in encoding the filter.

should be used in encoding the filter.

The concept of relaxing the correlation constraints and utilizing the entire correlation

output for multi-class recognition was explicitly first addressed by the distance classifier

correlation filter (DCCF) [18].

1.2 Motivation

Most of the members of the distortion invariant filter family are linear filters, which

are optimal only when the underlying statistics are Gaussian. For the non-Gaussian

statistics case, we need to extract information beyond second-order statistics. This is the

fundamental motivation of this research.

A nonlinear version of correlation filters called the polynomial distance classifier

correlation filter (PDCCF) has been proposed in [19]. A nonlinear extension to the

MACE filter using neural network topology has also been proposed in [20]. Since the

MACE filter is equivalent to a cascade of a linear pre-processor followed by a linear

associative memory (LAM) [21], the LAM portion of the MACE can be replaced with a

nonlinear associative memory structure, specifically a feed-forward multi-layer perceptron

(MLP). It is well known that non-linear associative memory structures can outperform

their linear counterparts on the basis of generalization and dynamic range. However,

in general, they are more difficult to design as their parameters cannot be computed in

closed form. Results have also shown that it is not enough to simply train a MLP using

backpropagation. Careful analysis of the final solution is necessary to confirm reasonable

results. Experimental results in [22] showed better generalization and classification

performance than the linear MACE in the MSTAR ATR data set(at 80% probability of

false alarms, the probability of detection dropped from 4.37%(MACE) to 2.45% in the

nonlinear MACE).

Recently, kernel based learning algorithms have been applied to classification and

pattern recognition due to the fact that they easily produce nonlinear extensions to linear


classifiers and boost performance [23]. By transforming the data into a high-dimensional

reproducing kernel Hilbert space (RKHS) and constructing optimal linear algorithms in

that space, kernel-based learning algorithms effectively perform optimal nonlinear pattern recognition in the input space to achieve better separation, estimation, regression, etc. The nonlinear versions of a number of signal processing techniques, such as principal

component analysis (PCA) [24], Fisher discriminant analysis [25] and linear classifiers [25]

have already been defined in a kernel space. Also the kernel matched spatial filter (KMSF)

has been proposed for hyperspectral target detection in [26] and the kernel SDF has

been proposed for face recognition [27]. The kernel correlation filter (KCF), which is the kernelized MACE filter after prewhitening, was proposed in [28] for face recognition. Similar to Fisher's idea in [20], in the KCF prewhitening is performed in the input space with linear methods, which may affect the overall performance. We will later present the difference between the KCF and the proposed method, in which all the computation, including prewhitening, is conducted in the feature space.

More recently, a new generalized correlation function, called correntropy, has been introduced by our group [29]. Correntropy is a positive definite function which measures a generalized similarity between random variables (or stochastic processes) and involves higher-order statistics of the input signals; therefore it is a promising candidate for machine learning and signal processing. Correntropy defines a new reproducing kernel Hilbert space (RKHS) which has the same dimensionality as the one defined by the covariance matrix in the input space, and this simplifies the formulation of analytic solutions in this finite-dimensional RKHS. Applications to the matched filter [30] and to chaotic time series prediction have been presented in the literature.

Based on the promising properties of correntropy and the MACE filter, the main goal

of this research is to exploit the generalized nonlinear MACE filter for image recognition.

As the first step, we applied the kernel method to the SDF and obtained the kernel SDF.

Application of the kernel SDF to face recognition is presented in this research. As


the main part of the research, this dissertation establishes the mathematical foundations of

the correntropy MACE filter (called here the CMACE filter) and evaluates its performance

in face recognition and synthetic aperture radar (SAR) ATR applications.

The formulation exploits the linear structure of the RKHS induced by correntropy

to formulate the correntropy MACE filter in the same manner as the original MACE,

and solves the problem with virtually the same equations (e.g., without regularization) in the RKHS. Due to the nonlinear relation between the input space and this feature space, the CMACE corresponds to a nonlinear filter in the input space. In addition, the CMACE preserves the shift-invariant property of the linear MACE; however, it requires additional computation for each input image shift. In order to reduce the computational

complexity of the CMACE, the fast CMACE filter based on the Fast Gauss Transform

(FGT) is also proposed.

We also introduce a dimensionality reduction method based on random projections (RP) and apply it to the CMACE in order to decrease storage and match more readily available computational resources, and we show that the RP method works well with the CMACE filter for image recognition.

The CMACE formulation for image recognition can be seen as one instance of the general class of energy minimization problems. The same methodology, minimizing the correntropy energy of the output, can also be applied to the beamforming problem, whose conventional linear solution is obtained by minimizing the output power. Appendix E presents this new application of correntropy to the beamforming problem in wireless communications, with some preliminary results.

CHAPTER 2
FUNDAMENTAL DISTORTION INVARIANT LINEAR CORRELATION FILTERS

2.1 Introduction

Distortion invariant composite filters are a generalization of matched spatial filtering

for the detection of a single object to the detection of a class of objects, and they are widely used for image recognition. There are many variations of correlation filters.

The SDF and the MACE filter are fundamental correlation-based distortion invariant

filters for object recognition. Most of the correlation filters are based on them. In this

research, we present the nonlinear extensions to the SDF and MACE. The formulations

of the SDF and MACE filter are briefly introduced in this chapter. We consider a

2-dimensional image as a d × 1 column vector by lexicographically reordering the image,

where d is the number of pixels.

2.2 Synthetic Discriminant Function (SDF)

The SDF filter is matched to a composite image h, where h is a linear combination of

the training image vectors xi

h = Σ_{i=1}^{N} a_i x_i,    (2–1)

where N is the number of training images and the coefficients a_i are chosen to satisfy the following constraints:

h^T x_j = u_j,  j = 1, 2, ..., N,    (2–2)

where T denotes the transpose and u_j is the desired cross-correlation output peak value. In vector form, we define the training image data matrix X as

X = [x_1, x_2, ..., x_N],    (2–3)

where the size of the matrix X is d × N. Then the SDF is the solution to the following optimization problem:

min h^T h, subject to X^T h = u.    (2–4)


Figure 2-1. Example of the correlation output plane of the SDF [31].

It is assumed that N < d and so the problem formulation is a quadratic optimization

subject to an under-determined system of linear constraints. The optimal solution is

h = X (X^T X)^{-1} u.    (2–5)

Once h is determined, we apply an appropriate threshold to the output of the cross

correlation, which is the inner product of the test input image and the filter h and decide

on the class of the test image.
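As an illustration, here is a minimal sketch of the SDF construction of Eqs. 2–1 to 2–5 on synthetic data; the dimensions, random images, and desired peak values are assumptions for the example, not values from the text.

```python
# Minimal SDF sketch: h = X (X^T X)^{-1} u  (Eq. 2-5), synthetic data.
import numpy as np

d, N = 4096, 10                         # d pixels per image, N training images
rng = np.random.default_rng(0)
X = rng.standard_normal((d, N))         # columns: lexicographically reordered images
u = np.ones(N)                          # desired correlation peaks at the origin

# Solve the constrained quadratic problem of Eq. 2-4 in closed form.
h = X @ np.linalg.solve(X.T @ X, u)

assert np.allclose(X.T @ h, u)          # constraints h^T x_j = u_j are satisfied

z = rng.standard_normal(d)              # a test image vector
output = h @ z                          # correlation value; threshold to classify
```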

Figure 2-1 shows the general shape of the correlation output plane of the SDF using inverse synthetic aperture radar (ISAR) imagery [31]. As stated earlier, the SDF has a broad output plane response, which means that the SDF generalizes well to true-class images but discriminates more poorly between true-class and out-of-class images.


2.3 Minimum Average Correlation Energy (MACE) Filter

Let the ith image vector after reordering be denoted x_i. The conventional MACE filter is more easily formulated in the frequency domain. The discrete Fourier transform (DFT) of the column vector x_i is denoted by X_i, and we define the training image data matrix X as

X = [X_1, X_2, ..., X_N],    (2–6)

where the size of X is d × N and N is the number of training images. Let the vector h be the filter in the space domain, and let H be its Fourier transform vector. We are interested in the correlation between the input image and the filter. The correlation of the ith image sequence x_i(n) with the filter sequence h(n) can be written as

g_i(n) = x_i(n) ⊗ h(n).    (2–7)

By Parseval’s theorem, the correlation energy of the ith image can be written as a

quadratic form

E_i = H^H D_i H,    (2–8)

where D_i is a diagonal matrix of size d × d whose diagonal elements are the magnitudes squared of the associated elements of X_i, that is, the power spectrum of x_i(n), and the

superscript H denotes the Hermitian transpose. The objective of the MACE filter is

to minimize the average correlation energy over the image class while simultaneously

satisfying an intensity constraint at the origin for each image. The value of the correlation

at the origin can be written as

g_i(0) = X_i^H H = c_i,    (2–9)

for the i = 1, 2, ..., N training images, where c_i is the user-specified output correlation value at the origin for the ith image. Then the average energy over all training images is expressed as

E_avg = H^H D H,    (2–10)

where

D = (1/N) Σ_{i=1}^{N} D_i.    (2–11)

Figure 2-2. Example of the correlation output plane of the MACE [31].

The MACE design problem is to minimize E_avg while satisfying the constraint X^H H = c, where c = [c_1, c_2, ..., c_N] is an N-dimensional vector. This optimization problem can be solved using Lagrange multipliers, and the solution is

H = D^{-1} X (X^H D^{-1} X)^{-1} c.    (2–12)

It is clear that the spatial filter h can be obtained from H by an inverse DFT. Once h is

determined, we apply an appropriate threshold to the output correlation plane and decide

whether the test image belongs to the class of the template or not.
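The following is a minimal numerical sketch of this procedure (with synthetic images standing in for training data); it follows Eqs. 2–6 to 2–12 directly and checks the origin constraints of Eq. 2–9.

```python
# Minimal MACE sketch in the frequency domain (Eqs. 2-6 to 2-12), synthetic data.
import numpy as np

side, N = 32, 5
rng = np.random.default_rng(0)
imgs = rng.standard_normal((N, side, side))           # stand-ins for training images

# Columns of X are the flattened 2-D DFTs of the training images (Eq. 2-6).
X = np.stack([np.fft.fft2(im).ravel() for im in imgs], axis=1)   # d x N, d = side^2
c = np.ones(N)                                        # origin constraints (Eq. 2-9)

Dinv = 1.0 / np.mean(np.abs(X) ** 2, axis=1)          # D^{-1}: inverse average power spectrum
A = (X.conj().T * Dinv) @ X                           # X^H D^{-1} X  (N x N)
H = (Dinv[:, None] * X) @ np.linalg.solve(A, c)       # Eq. 2-12

h = np.real(np.fft.ifft2(H.reshape(side, side)))      # space-domain filter via inverse DFT
assert np.allclose(X.conj().T @ H, c)                 # constraints X^H H = c hold
```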

Figure 2-2 shows the general shape of the correlation output plane of the MACE [31]. It shows a sharp peak at the origin; as a result, the ability to locate the target and to discriminate between true-class and out-of-class images is improved. However, a sharp output plane leads to worse distortion tolerance and poorer generalization.

Figure 2-3. Example of the correlation output plane of the OTSDF [31].

2.4 Optimal Trade-off Synthetic Discriminant (OTSDF) Function

The optimal trade-off filter (OTSDF) is a well-known correlation filter designed to overcome the poor generalization of the MACE when noisy input is presented. The OTSDF trades off the MACE filter criterion against the MVSDF filter criterion.

The OTSDF filter in the frequency domain is given by

H = T^{-1} X (X^H T^{-1} X)^{-1} c,    (2–13)

where T = αD + √(1 − α²) C with 0 ≤ α ≤ 1, D is the diagonal matrix of the MACE, and C is the diagonal matrix containing the input noise power spectral density as its diagonal entries.
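For comparison with the MACE sketch above, here is a minimal OTSDF variant under the simplifying assumption of white input noise, so that C reduces to the identity; this special case is our illustration, not one treated in the text.

```python
# Minimal OTSDF sketch (Eq. 2-13), assuming white input noise so the diagonal
# of C is constant (C = I). This special case is illustrative only.
import numpy as np

def otsdf(X, c, alpha):
    """X: d x N matrix of training-image DFTs (columns); c: constraint vector."""
    D = np.mean(np.abs(X) ** 2, axis=1)          # diagonal of D (average power spectrum)
    T = alpha * D + np.sqrt(1.0 - alpha**2)      # diagonal of T, with C = I
    A = (X.conj().T * (1.0 / T)) @ X             # X^H T^{-1} X
    return ((1.0 / T)[:, None] * X) @ np.linalg.solve(A, c)

# alpha = 1 recovers the MACE filter; smaller alpha trades peak sharpness
# for noise tolerance, approaching the MVSDF criterion.
```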


The correlation output response of the OTSDF is shown in Figure 2-3 [31]. As

compared to the MACE filter response, the output peak is not nearly as sharp, but still

more localized than the SDF case.

CHAPTER 3
KERNEL-BASED CORRELATION FILTERS

3.1 Brief Review of the Kernel Method

3.1.1 Introduction

Kernel-based algorithms have been recently developed in the machine learning

community, where they were first used to solve binary classification problems with the so-called Support Vector Machine (SVM) [32]. There is now an extensive literature on the SVM [33],[34] and on the family of kernel-based algorithms [23].

A kernel-based algorithm is a nonlinear version of a linear algorithm where the data

has been previously (and most often nonlinearly) transformed to a higher dimensional

space in which we only need to be able to compute inner products (via a kernel function).

It is clear that many problems arising in signal processing are of statistical nature

and require automatic data analysis methods. Furthermore, the algorithms used in

signal processing are usually linear and their transformation for nonlinear processing is

sometimes unclear. Signal processing practitioners can benefit from a deeper understanding

of kernel methods, because they provide a different way of taking nonlinearities into account without losing the original properties of the linear method. Another aspect is dealing with the amount of available data in a space of a given dimensionality: one needs methods that can use little data and avoid the curse of dimensionality.

Aronszajn [35] and Parzen [36] were among the first to employ positive definite kernels in statistics. Later, based on statistical learning theory, the support vector machine [70] and other kernel-based learning algorithms [63], such as kernel principal component analysis [64], kernel Fisher discriminant analysis [58], and kernel independent component analysis [4], were introduced.

3.1.2 Kernel Method

Many algorithms for data analysis are based on the assumption that the data can

be represented as vectors in a finite dimensional vector space. These algorithms, such

as linear discrimination, principal component analysis, or least squares regression, make extensive use of the linear structure.

Figure 3-1. An example of the kernel method (left: input space, right: feature space).

Roughly speaking, kernels allow one to naturally

derive nonlinear versions of linear algorithms through the implicit nonlinear mapping. The

general idea is the following. Given a linear algorithm (i.e. an algorithm which works in

a vector space), one first maps the data living in a space χ (the input space) to a vector

space H (the feature space) via a nonlinear mapping Φ(·) : χ −→ H; and then runs the

algorithm on the vector representation Φ(x) of the data. In other words, one performs

nonlinear analysis of the data using a linear method. The purpose of the map Φ(·) is to

translate nonlinear structures of the data into linear ones in H.

Consider the following discrimination problem (see Figure 3-1) where the goal is

to separate two sets of points. In the input space, the problem is nonlinear, but after

applying the transformation Φ(x_1, x_2) = (x_1^2, √2 x_1 x_2, x_2^2), which maps each vector to the

three monomials of degree 2 formed by its coordinates, the separation boundary becomes

linear. We have just transformed the data and we hope that in the new representation,

linear structures will emerge.
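This equivalence can be verified numerically; the sketch below checks that the explicit degree-2 monomial map realizes the polynomial kernel (x · y)² as a plain inner product.

```python
# Check: Phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2) realizes k(x, y) = (x . y)^2.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, y = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
lhs = phi(x) @ phi(y)         # inner product after the explicit nonlinear mapping
rhs = (x @ y) ** 2            # kernel evaluated directly in the input space
assert np.isclose(lhs, rhs)   # identical: inner products need no explicit mapping
```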

Working directly with the data in the feature space may be difficult because the space

can be infinite dimensional, or the transformation only implicitly defined. The basic idea of a kernel algorithm is to transform the data x from the input space to a high-dimensional feature space of vectors Φ(x), where the inner products can be computed using a positive definite kernel function satisfying Mercer's condition [23],

κ(x, y) = <Φ(x), Φ(y)>.    (3–1)

Mercer's Theorem: Suppose κ(t, s) is a continuous symmetric non-negative function on a closed finite interval T × T. Denote by {λ_k, k = 1, 2, ...} a sequence of non-negative eigenvalues of κ(t, s) and by {φ_k(t), k = 1, 2, ...} the sequence of corresponding normalized eigenfunctions; in other words, for each k,

∫_T κ(t, s) φ_k(t) dt = λ_k φ_k(s),  s, t ∈ T,    (3–2)

∫_T φ_k(t) φ_j(t) dt = δ_{k,j},    (3–3)

where δ_{k,j} is the Kronecker delta function, i.e., equal to 1 or 0 according as k = j or k ≠ j. Then

κ(t, s) = Σ_{k=0}^{∞} λ_k φ_k(t) φ_k(s),    (3–4)

where the series above converges absolutely and uniformly on T × T [37].

This simple and elegant idea allows us to obtain nonlinear versions of any linear

algorithm expressed in terms of inner products, without even knowing the exact mapping

function Φ. A particularly interesting characteristic of the feature space is that it is a

reproducing kernel Hilbert space (RKHS), i.e., the span of functions {κ(·, x) : x ∈ χ} defines a unique functional Hilbert space [35]. The crucial property of these spaces is the reproducing property of the kernel,

f(x) = <κ(·, x), f>,  ∀f ∈ F.    (3–5)

In particular, we can define our nonlinear mapping from the input space to the RKHS as Φ(x) = κ(·, x); then we have

<Φ(x), Φ(y)> = <κ(·, x), κ(·, y)> = κ(x, y),    (3–6)


and thus Φ(x) = κ(·,x) defines the Hilbert space associated with the kernel.

In this research, we use the Gaussian kernel, which is the most widely used Mercer

kernel,

κ(x − y) = (1 / (√(2π) σ)) exp(−‖x − y‖² / (2σ²)).    (3–7)

3.2 Kernel Synthetic Discriminant Function

Based on the kernel methodology, the previous optimization problem for the SDF can

be solved in an infinite dimensional kernel feature space by transforming each element of

the matrix of exemplars X to Φ(X_ij) and h to Φ(h) with a sample-by-sample mapping, thus forming a higher-dimensional matrix Φ(X) whose (i, j)th feature vector is Φ(X_ij). Let the matrix X of N training images be

X = [x_1, x_2, ..., x_N],    (3–8)

where x_i is the ith training image vector, given by

x_i = [x_i(1), ..., x_i(d)].    (3–9)

Then we can extend the SDF optimization problem to the nonlinear feature space as

min Φ^T(h) Φ(h), subject to Φ^T(X) Φ(h) = u,    (3–10)

where the dimensions of the transformed Φ(X) and Φ(h) are ∞ × N and ∞ × 1, respectively, for the Gaussian kernel. Then the solution in kernel space becomes

Φ(h) = Φ(X) (Φ^T(X) Φ(X))^{-1} u.    (3–11)

We denote K_XX = Φ^T(X) Φ(X), which is an N × N full-rank matrix whose (i, j)th element is given by

(K_XX)_ij = Σ_{k=1}^{d} κ(x_i(k), x_j(k)),  i, j = 1, 2, ..., N.    (3–12)

Although Φ(h) is an infinite-dimensional vector, the output of this filter is an N × 1 vector, which can be easily computed using these kernels.

Let Z be the matrix of image vectors for testing, and let L be the number of testing images. We denote K_ZX = Φ^T(Z) Φ(X), which is an L × N matrix whose elements are given by

(K_ZX)_ij = Σ_{k=1}^{d} κ(z_i(k), x_j(k)),  i = 1, 2, ..., L,  j = 1, 2, ..., N.    (3–13)

Then the L × 1 output vector of the kernel SDF is given by

y = Φ^T(Z) Φ(h) = K_ZX K_XX^{-1} u.    (3–14)

We can compute K_XX off-line from the given training data. Then K_ZX and y can be computed on-line for a given test image. Given N training images and one test image, the computational complexities are O(dN²) off-line and O(dN) + O(N³) on-line. In general, N ≪ d, so the dominant part of the on-line computational complexity, the O(N³) matrix inversion in (3–14), is not a critical computational issue in the kernel SDF. Also, the required memory for K_XX, which is O(N²), is much less than for methods whose storage depends on the number of image pixels, d.
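Below is a minimal end-to-end sketch of these off-line and on-line computations on synthetic data; the data sizes are arbitrary, and the kernel size follows the 30%-50% of input standard deviation rule of thumb mentioned in Section 3.3.1.

```python
# Minimal kernel SDF sketch: y = K_ZX K_XX^{-1} u (Eq. 3-14), with the
# sample-by-sample Gaussian kernel sums of Eqs. 3-12 and 3-13.
import numpy as np

def gauss(a, b, sigma):
    return np.exp(-(a - b) ** 2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def kernel_matrix(A, B, sigma):
    """(i, j) entry: sum over the d pixels of kappa(a_i(k), b_j(k)).
    A: d x M, B: d x N -> M x N matrix of summed 1-D kernel evaluations."""
    # Broadcast to (d, M, N) pixel-wise pairs, then sum over the pixel axis.
    return gauss(A[:, :, None], B[:, None, :], sigma).sum(axis=0)

d, N, L = 1024, 8, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((d, N))          # training image vectors (columns)
Z = rng.standard_normal((d, L))          # test image vectors
u = np.ones(N)
sigma = 0.4 * X.std()                    # kernel size: ~40% of input std dev

K_XX = kernel_matrix(X, X, sigma)        # N x N, computable off-line
K_ZX = kernel_matrix(Z, X, sigma)        # L x N, computed per test image
y = K_ZX @ np.linalg.solve(K_XX, u)      # L x 1 kernel SDF outputs
```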

By applying an appropriate threshold to the output in (3–14), we can detect and recognize the testing data without explicitly generating the composite filter in the feature space. For object recognition and classification purposes, the proposed kernel SDF is simpler than the kernel matched filter.

3.3 Application of the Kernel SDF to Face Recognition

3.3.1 Problem Description

In this section, we show the performance of the proposed kernel based SDF filter for

face image recognition. In the simulations, we used the facial expression database collected

at the Advanced Multimedia Processing Lab at the Electrical and Computer Engineering

Department of Carnegie Mellon University [38]. The database consists of 13 subjects,

whose facial images were captured with 75 varying expressions. The size of each image is


64×64. Sample images are depicted in Figure 3-2. In this research, we tested the proposed

kernel SDF method with the original database images as well as with noisy images.

Sample images with additive Gaussian noise with a 10 dB SNR are shown in Figure 3-2(c).

In order to evaluate the performance of the SDF and kernel SDF filters on this data set, we examined 975 (13×75) correlation outputs. From these results and the ones reported in [5], we picked and report the results of the two most difficult cases, which produced the worst performance with the conventional SDF method. We test with all the images of each person's data set, resulting in 75 outputs per class. The simulation results were obtained by averaging (Monte Carlo approach) over 100 different training sets (each chosen randomly) to minimize performance differences due to splitting the relatively small database into training and testing sets. For this data set, it has been observed that a kernel size of around 30%-50% of the standard deviation of the input data is appropriate.

3.3.2 Simulation Results

Figure 3-2. Sample images: (a) Person A, (b) Person B, (c) Person A with additive Gaussian noise (SNR = 10 dB).

Figure 3-3. The output peak values when only 3 images are used for training (N=3). (Top): SDF, (Bottom): kernel SDF.

Figure 3-3 shows the average output peak values for image recognition when we use only N = 3 training images. The desired output peak value should be close to one when the test image belongs to the training image class. Figure 3-3 (Top) shows that the correlation output peak values of the conventional SDF in both the true and false classes not only overlap but are also close to one. As a result, the system will have great difficulty differentiating these two individuals, because they can be interpreted as belonging to the same class. Figure 3-3 (Bottom) shows the output values of the kernel SDF, and we can see that the two subjects can be recognized well even with a small number of training images. Figure 3-4 shows the ROC curves with different numbers of training images (N). For the kernel SDF with N = 3, the probability of detection at zero false alarm rate is 1. However, the conventional SDF needs at least 25 training images in order to achieve the same detection performance as the kernel SDF.

Figure 3-4. The comparison of ROC curves with different numbers of training images.

One of the major problems of the conventional SDF is that its performance can easily be degraded by additive noise in the test image, since the SDF does not have any special mechanism to account for input noise. Therefore, it has a poor rejection ability for false-class images. Figure 3-5 (Top) shows the noise effect on the conventional SDF. When the class images are seriously distorted by additive Gaussian noise with a very low SNR (-2 dB), the correlation output peaks of some test images become greater than 1, and hence incorrect recognition occurs. The results in Figure 3-5 (Bottom) are obtained with the kernel SDF, which shows much better performance even in a very low SNR environment. The comparison of ROC curves between the kernel SDF and the conventional SDF for noisy test inputs with different SNRs is shown in Figure 3-6. We can see that the kernel SDF outperforms the SDF and achieves robust pattern recognition performance in very noisy environments.

Figure 3-5. The output values of noisy test input images with additive Gaussian noise when 25 images are used for training (N=25). (Top): SDF, circle: true class with SNR=10dB, cross: false class with SNR=-2dB, diamond: false class with no noise. (Bottom): kernel SDF, circle: true class with SNR=10dB, cross: false class with SNR=-2dB.

Figure 3-6. The ROC curves of noisy test input images with different SNRs when 10 images are used for training (N=10).

CHAPTER 4
A RKHS PERSPECTIVE OF THE MACE FILTER

4.1 Introduction

This section presents the interpretation of the MACE filter in the RKHS. The original linear MACE filter was formulated in the frequency domain; however, the MACE filter can also be understood through the theory of Hilbert space representations of random functions proposed by Parzen [39]. Parzen analyzed the connection between RKHS and second-order random (or stochastic) processes by using the isometric isomorphism¹ that exists between the Hilbert space spanned by the random variables of a stochastic process and the RKHS determined by its covariance function. Here, we first present the basic theory of the RKHS, then show the interpretation of the MACE filter formulation in the RKHS.

4.2 Reproducing Kernel Hilbert Space (RKHS)

A reproducing kernel Hilbert space (RKHS) is a special Hilbert space associated with a kernel that reproduces (via an inner product) each function in the space or, equivalently, in which every point evaluation functional is bounded. Let H be a Hilbert space of functions on some set E, define an inner product 〈·, ·〉_H in H, and consider a complex-valued

¹Consider two Hilbert spaces H_1 and H_2 with inner products denoted 〈f_1, f_2〉_1 and 〈g_1, g_2〉_2, respectively. H_1 and H_2 are said to be isomorphic if there exists a one-to-one and surjective mapping ψ from H_1 to H_2 satisfying

ψ(f_1 + f_2) = ψ(f_1) + ψ(f_2) and ψ(αf) = αψ(f),    (4–1)

for all functionals in H_1 and any real number α. The mapping ψ is called an isomorphism between H_1 and H_2. The Hilbert spaces H_1 and H_2 are said to be isometric if there exists a mapping ψ that preserves inner products,

〈f_1, f_2〉_1 = 〈ψ(f_1), ψ(f_2)〉_2,    (4–2)

for all functions in H_1. A mapping ψ satisfying both properties (4–1) and (4–2) is said to be an isometric isomorphism, or congruence. The congruence maps both linear combinations of functionals and limit points from H_1 into corresponding linear combinations of functionals and limit points in H_2.

bivariate function κ(x, y) on E × E. Then the function κ(x, y) is said to be positive definite if, for any finite point set {x_1, x_2, ..., x_n} ⊂ E and any not-all-zero corresponding complex numbers {α_1, α_2, ..., α_n} ⊂ C,

Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j* κ(x_i, x_j) > 0.    (4–3)
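As a small numerical illustration of (4–3), assuming for concreteness that κ is a Gaussian kernel (our choice, not the text's), the quadratic form is positive for any nonzero coefficient vector:

```python
# Sketch checking Eq. 4-3 numerically, assuming kappa is a Gaussian kernel
# (an illustrative choice; any strictly positive definite kernel would do).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                                  # a finite point set in E
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1**2))  # Gram matrix kappa(x_i, x_j)
alpha = rng.standard_normal(8)                              # not-all-zero (real) coefficients
quad = alpha @ K @ alpha                                    # the double sum in Eq. 4-3
assert quad > 0                                             # positive definiteness
# The eigenvalues of K are likewise strictly positive for distinct points.
```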

Any positive definite bivariate function κ(x, y) is a reproducing kernel because of the

following fundamental theorem.

Moore-Aronszajn Theorem: Given any positive definite function κ(x, y), there

exists a uniquely determined (possibly finite dimensional) Hilbert space H consisting of

functions on E such that

(i) for every x ∈ E, κ(x, ·) ∈ H, and    (4–4)

(ii) for every x ∈ E and f ∈ H, f(x) = 〈f, κ(x, ·)〉_H.    (4–5)

Then H := H(κ) is said to be a reproducing kernel Hilbert space with reproducing kernel

κ. The properties (i) and (ii) are called the reproducing property of κ(x, y) in H(κ).

Parzen [39] analyzed the connection between RKHSs and orthonormal expansions for

second-order stochastic processes obtaining a general expression for the reproducing kernel

inner product in terms of the eigenvalues and eigenfunctions of a certain operator defined

on an appropriate Hilbert space. In addition, Parzen showed that there exists an isometric

isomorphism between the Hilbert space spanned by the random variables of a stochastic

process and the RKHS determined by its covariance function.

Given a zero mean second-order random vector {xi : i ∈ I} with I being an index set,

the covariance function is defined as

R(i, j) = E[x_i x_j].    (4–6)

It is well known that the covariance function R is non-negative definite, therefore it

determines a unique RKHS, H(R), according to the Moore-Aronszajn Theorem. By the

37

Page 38: The Correntropy Mace Filter for Image Recognition

Mercer’s theorem [35],

R(i, j) =∞∑

k=0

λkϕk(i)ϕk(j), (4–7)

where {λk, k = 1, 2, · · · } and {ϕk(i), k = 1, 2, · · · } are a sequence of non-negative

eigenvalues and corresponding normalized eigenfunctions of R(i, j), respectively.

H(R) has two important properties which make it a reproducing kernel Hilbert space. First, let R(i, ·) be the function on I whose value at j in I equals R(i, j); then, by the Mercer eigen-expansion (4–7) of the covariance function, we have

$$R(i, j) = \sum_{k=0}^{\infty} \lambda_k\, a_k\, \varphi_k(j), \qquad a_k = \varphi_k(i).$$  (4–8)

Therefore, R(i, ·) ∈ H(R) for each i in I. Second, for every function f(·) ∈ H(R) of the form $f(i) = \sum_{k=0}^{\infty} \lambda_k a_k \varphi_k(i)$ and every i in I,

$$\langle f, R(i, \cdot)\rangle = \sum_{k=0}^{\infty} \lambda_k\, a_k\, \varphi_k(i) = f(i).$$  (4–9)

By the Moore-Aronszajn Theorem, H(R) is a reproducing kernel Hilbert space with R(i, j) as the reproducing kernel. It follows that

$$\langle R(i, \cdot), R(j, \cdot)\rangle = \sum_{k=0}^{\infty} \lambda_k\, \varphi_k(i)\, \varphi_k(j) = R(i, j).$$  (4–10)

Thus H(R) is a representation of the random vector {xi : i ∈ I} with covariance function R(i, j).

One may define a congruence G from H(R) onto the linear space L2(xi, i ∈ I) such that

$$G(R(i, \cdot)) = x_i.$$  (4–11)

The congruence G can be explicitly represented as

$$G(f) = \sum_{k=0}^{\infty} a_k\, \xi_k,$$  (4–12)


where {ξk} is a set of orthogonal random variables belonging to L2(xi, i ∈ I) and f is any element of H(R) of the form $f(i) = \sum_{k=0}^{\infty} \lambda_k a_k \varphi_k(i)$ for every i in I.

Summary Let {xi : i ∈ I} be a continuous random function defined on a closed finite

interval I. Then the following conclusions hold:

• The covariance kernel R possesses the expansion (4–7).

• There exists a Hilbert space L2(ϕ(i), i ∈ I) of sequences which is a representation of

the random function.

• There exists a reproducing kernel Hilbert space H(R) of functions on I, which is a

representation of the random function.

4.3 Interpretation of the MACE filter in the RKHS

The original MACE filter was derived in the frequency domain for simplicity; however, it can also be derived in the space domain [40], and this helps us understand the RKHS

perspective of the MACE. Let us consider the case of one training image and construct the

following matrix

$$U = \begin{bmatrix}
x(d) & 0 & \cdots & 0 & 0 \\
x(d-1) & x(d) & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
x(1) & x(2) & \cdots & \cdots & x(d) \\
0 & x(1) & x(2) & \cdots & x(d-1) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \cdots & \cdots & \cdots & x(1)
\end{bmatrix},$$  (4–13)

where the dimension of the matrix U is (2d − 1) × d. Here we denote the ith column of the matrix U as Ui. Then the column space of U is

$$L_2(U) = \left\{\sum_{i=1}^{d} \alpha_i U_i \;\middle|\; \alpha_i \in \mathbb{R},\; i = 1, \cdots, d\right\},$$  (4–14)


which is congruent to the RKHS induced by the correlation kernel

$$R(i, j) = \langle U_i, U_j\rangle = U_i^T U_j, \qquad i, j = 1, \cdots, d,$$  (4–15)

where 〈·, ·〉 represents the inner product operation. If all the columns in U are linearly independent, R(i, j) is positive definite and the dimensionality of L2(U) is d.

If U is singular, the dimensionality is smaller than d. However, in either case, all the vectors in this space can be expressed as linear combinations of the column vectors. The optimization problem of the MACE is to find a vector $g_o = \sum_{i=1}^{d} h_i U_i$ in the L2(U) space, with coordinates h = [h1 h2 · · · hd]T, such that $g_o^T g_o$ is minimized subject to the constraint that the dth component of go (which is the correlation at zero lag) is some constant. Formulating the MACE filter from this RKHS viewpoint only provides a new perspective but no additional advantage. However, as explained next, it will help us derive a nonlinear extension to the MACE with a new similarity measure.


CHAPTER 5
NONLINEAR VERSION OF THE MACE IN A NEW RKHS:
THE CORRENTROPY MACE (CMACE) FILTER

5.1 Correntropy Function

5.1.1 Definition

Correlation is one of the fundamental operations of statistics, machine learning and

signal processing because it quantifies similarity. However, correlation only exploits second

order statistics of the random variables or random processes, which limits its optimality to

Gaussian distributed data. Correntropy was introduced in [29] as a generalized measure of

similarity. Its name stresses the connection to correlation, but also indicates the fact that

its mean value across time or dimensions is associated with entropy, more precisely to the

argument of the log in Renyi’s quadratic entropy estimated with Parzen windows, which

is called the information potential. The information potential (IP) is the argument of Renyi's quadratic entropy of a random variable X with PDF fX(x),

$$H_2(X) = -\log \int f_X^2(x)\, dx,$$  (5–1)

where $IP(X) = \int f_X^2(x)\, dx$.

A nonparametric estimator of the information potential using the Parzen window from N data samples is

$$\widehat{IP}(x) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \kappa_\sigma(x_i - x_j),$$  (5–2)

where κσ is the Gaussian kernel in (5–4) [41].

This relation to entropy shows that the correntropy contains information beyond

second order moments, and can therefore generalize correlation without requiring moment

expansions.

Definition: Cross correntropy, or simply correntropy, is a generalized similarity measure between two arbitrary vector random variables X and Y defined as

$$V(X, Y) = E[\kappa_\sigma(X - Y)],$$  (5–3)


where E is the mathematical expectation and κσ is the Gaussian kernel given by

$$\kappa_\sigma(X - Y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{\|X - Y\|^2}{2\sigma^2}\right\},$$  (5–4)

where σ is the kernel size or bandwidth.

In practice, given a finite number of data samples $\{(x_i, y_i)\}_{i=1}^{d}$, the cross correntropy is estimated by

$$\hat{V}(X, Y) = \frac{1}{d}\sum_{i=1}^{d} \kappa_\sigma(x_i - y_i).$$  (5–5)
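As a quick illustration (a sketch of ours, not part of the original text), the estimator (5–5) with the Gaussian kernel (5–4) takes only a few lines of NumPy:

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Gaussian kernel of (5-4) evaluated at the difference e."""
    return np.exp(-e ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def correntropy(x, y, sigma):
    """Sample estimator of cross correntropy, eq. (5-5)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.mean(gaussian_kernel(x - y, sigma))
```

Consistent with Property 1 below, for a kernel size much larger than the data scale the value returned is dominated by the second-order moment of X − Y.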

5.1.2 Some Properties

Correntropy has very nice properties that make it useful for machine learning and

nonlinear signal processing. First and foremost, it is a positive definite function, so it defines an RKHS; but unlike the RKHS defined by the covariance function of the random variable (process), it contains higher order statistical information. This new function quantifies

the average angular separation in the kernel feature space between the dimensions of the

random variable (or between temporal lags of the random process). Therefore, correntropy

can serve as a metric for similarity measurements in feature space. Several properties of

correntropy and their proofs are presented in [29][42][43]. Here we present, without proofs,

only the properties that are relevant to this dissertation.

Property 1: Correntropy is a similarity measure between X and Y incorporating

higher order moments of the random variable X − Y [29].

Applying the Taylor series expansion of the Gaussian kernel, we can rewrite the correntropy function in (5–3) as

$$V(X, Y) = \frac{1}{\sqrt{2\pi}\,\sigma}\sum_{k=0}^{\infty} \frac{(-1)^k}{(2\sigma^2)^k\, k!}\, E\left[(X - Y)^{2k}\right],$$  (5–6)

which contains all the even-order moments of the random variable X − Y. The kernel

which contains all the even-order moments of the random variable X − Y . The kernel

size controls the emphasis of the higher order moments with respect to the second, since

the higher order terms of the expansion decay faster for larger σ. As σ increases, the


high-order moments decay and the second order moment tends to dominate. In fact, for

kernel size larger than 10 times the one chosen from density estimation considerations (e.g.

Silverman’s rule [44]), correntropy starts to approach correlation. The kernel size has to be

chosen according to the application, but here this issue will not be further addressed and

Silverman’s rule will be used by default.

Property 2: Let {xi, i ∈ T} be a random vector (process) with T being an index set. The auto-correntropy function of the random vector (process), V(i, j) = E[κσ(xi − xj)], is a symmetric and positive definite function; therefore it defines a new RKHS, called the VRKHS [29].

Since κσ(xi − xj) is symmetric, it is obvious that V(i, j) is also symmetric. Also, since κσ(xi − xj) is positive definite, for any set of n points {x1, · · · , xn} and nonzero real numbers {α1, · · · , αn}, we have

$$\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, \kappa_\sigma(x_i - x_j) > 0.$$  (5–7)

It is true that for any strictly positive function g(·, ·) of two random variables x and y, E[g(x, y)] > 0. Thus we have

$$E\left[\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, \kappa_\sigma(x_i - x_j)\right] > 0,$$

which equals

$$\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, E[\kappa_\sigma(x_i - x_j)] = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, V(i, j) > 0.$$  (5–8)

Thus V (i, j) is both symmetric and positive definite. Now, the Moore-Aronszajn theorem

[35] proves that for every real symmetric positive definite function k, there exists a unique

RKHS with k as its reproducing kernel. Hence V (i, j) is a reproducing kernel.

As shown in property 1, VRKHS contains higher order statistical information, unlike

the RKHS defined by the covariance function of random processes.

Property 3: Assume the samples $\{(x_i, y_i)\}_{i=1}^{d}$ are drawn from the joint PDF fX,Y(x, y) and $\hat{f}_{X,Y,\sigma}(x, y)$ is its Parzen estimator with kernel size σ. The correntropy estimator with kernel size $\sigma' = \sqrt{2}\,\sigma$ is the integral of $\hat{f}_{X,Y,\sigma}(x, y)$ along the line x = y [42],

$$V_{\sqrt{2}\sigma}(X, Y) = \int_{-\infty}^{+\infty} \hat{f}_{X,Y,\sigma}(x, y)\Big|_{x=y=u}\, du.$$  (5–9)

[Figure 5-1. Contours of CIM(X, 0) in 2D sample space (kernel size is set to 1).]

Property 4: Correntropy, as a sample estimator, induces a metric in the sample space. Given two vectors X = [x1, x2, · · · , xN]T and Y = [y1, y2, · · · , yN]T in the sample space, the function $\mathrm{CIM}(X, Y) = (\kappa_\sigma(0) - \hat{V}(X, Y))^{1/2}$, where κσ is the Gaussian kernel in (5–4) with $\kappa_\sigma(0) = 1/(\sqrt{2\pi}\,\sigma)$, defines a metric in the sample space and is named the Correntropy Induced Metric (CIM) [42]. Therefore, correntropy can serve as a metric for similarity measurement in feature space.

Figure 5-1 shows the contours of distance from X to the origin in a two dimensional

space. The interesting observation from the figure is as follows: when X is close to zero,


CIM behaves like an L2 norm¹, which is clear from the Taylor expansion in (5–6); further out, CIM behaves like an L1 norm; eventually, as X departs from the origin, the metric saturates and becomes insensitive to distance (approaching an L0 norm²). Large differences thus saturate, so the metric is less sensitive to large deviations, which makes it more robust. This property inspired us to investigate the inherent robustness of the CIM to outliers. The kernel size controls this very interesting behavior of the metric across neighborhoods. A small kernel size leads to a tight linear (Euclidean) region and a large L0 region, while a larger kernel size will enlarge the linear region. In this dissertation, we mathematically prove that 1) when the kernel size goes to infinity, the CIM norm is equivalent to the L2 norm, and 2) when the kernel size goes to zero (from the positive side), the CIM is equivalent to the L0 norm.

Let us define E = X − Y = [e1, e2, ..., eN]T; then

$$\mathrm{CIM}(E) = \left[\kappa_\sigma(0) - \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(e_i)\right]^{1/2} = \left\{\frac{1}{2\sqrt{2\pi}\,N\sigma^3}\left[2\sigma^2 \sum_{i=1}^{N}\left(1 - \exp(-e_i^2/2\sigma^2)\right)\right]\right\}^{1/2}.$$  (5–10)

First, let us take a look at the following limit (with t := 1/2σ²):

$$\lim_{\sigma\to\infty} 2\sigma^2\left(1 - \exp(-e_i^2/2\sigma^2)\right) = \lim_{t\to 0^+} \frac{1 - \exp(-t e_i^2)}{t} = \lim_{t\to 0^+} \frac{e_i^2 \exp(-t e_i^2)}{1} = e_i^2,$$  (5–11)

where the last step uses L'Hospital's rule.

¹ Given a vector X, the Lp norm of X is defined by $\|X\|_p = \left(\sum_{i=1}^{N} |x_i|^p\right)^{1/p}$, where p is a real number with p ≥ 1.

² The L0 norm of X is defined as $\lim_{p\to 0} \|X\|_p^p$; that is, the zero norm of X is simply the number of non-zero elements of X. Despite its name, the zero norm is not a true norm; in particular, it is not positively homogeneous.


Therefore,

$$\lim_{\sigma\to\infty} \left(2\sqrt{2\pi}\,N\sigma^3\right)^{1/2} \mathrm{CIM}(E) = \|E\|_2.$$  (5–12)

Second, look at the following limit:

$$\lim_{\sigma\to 0^+}\left(1 - \exp(-e_i^2/2\sigma^2)\right) = \begin{cases} 0 & \text{if } e_i = 0 \\ 1 & \text{if } e_i \neq 0 \end{cases}$$  (5–13)

Therefore,

$$\lim_{\sigma\to 0^+} \sqrt{2\pi}\,N\sigma\, [\mathrm{CIM}(E)]^2 = \|E\|_0.$$  (5–14)
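The two limits are easy to check numerically; the short script below is our own sanity check of (5–12) and (5–14), not part of the original development.

```python
import numpy as np

def cim(e, sigma):
    """CIM(E) of (5-10) for an error vector e."""
    k0 = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    v = k0 * np.mean(np.exp(-e ** 2 / (2 * sigma ** 2)))
    return np.sqrt(k0 - v)

e = np.array([0.0, 0.3, -1.2, 4.0])
N = len(e)

s = 1e3   # large kernel size: scaled CIM approaches the L2 norm, (5-12)
print(np.sqrt(2 * np.sqrt(2 * np.pi) * N * s ** 3) * cim(e, s),
      np.linalg.norm(e))          # both close to ||e||_2

s = 1e-3  # small kernel size: scaled CIM^2 approaches the L0 norm, (5-14)
print(np.sqrt(2 * np.pi) * N * s * cim(e, s) ** 2,
      np.count_nonzero(e))        # both close to ||e||_0 = 3
```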

Property 5: Given data samples $\{x_i\}_{i=1}^{d}$, the correntropy kernel creates another data set $\{f(x_i)\}_{i=1}^{d}$ preserving the similarity measure as

$$V(i, j) = E[\kappa_\sigma(x_i - x_j)] = E[f(x_i)\, f(x_j)].$$  (5–15)

The proof of property 5 is in Appendix B.

According to property 5, there exists a scalar nonlinear mapping f which makes the correntropy of xi the correlation of f(xi). Equation (5–15) allows the computation of the correlation in feature space by the correntropy function in the input space [45][46].

5.2 The Correntropy MACE Filter

According to the RKHS perspective of the MACE filter in chapter 4, we can extend it immediately to the VRKHS³. Applying the correntropy concept to the MACE formulation of chapter 4, the definition of the correlation in (4–15) shall be substituted by

$$V(i, j) = \frac{1}{2d-1}\sum_{n=1}^{2d-1} \kappa_\sigma(U_{in} - U_{jn}), \qquad i, j = 1, \cdots, d,$$  (5–16)

where Uin is the nth element of the ith column Ui of the matrix in (4–13). This function is positive definite and thus induces the VRKHS. According to Mercer's theorem [35], there is a basis {ηi, i = 1, · · · , d} in this VRKHS such that

$$\langle \eta_i, \eta_j\rangle = V(i, j), \qquad i, j = 1, \cdots, d.$$  (5–17)

³ In this dissertation, we call the RKHS induced by correntropy the VRKHS.

Since this is a d-dimensional Hilbert space, it is isomorphic to any d-dimensional real vector space equipped with the standard inner product structure. After an appropriate choice of this basis {ηi, i = 1, · · · , d}, which is nonlinearly related to the input space, a nonlinear extension of the MACE filter can be readily constructed in this VRKHS: namely, finding a vector $v_0 = \sum_{i=1}^{d} f_{h_i}\eta_i$ with $f_h = [f_{h_1} \cdots f_{h_d}]^T$ as coordinates such that $v_0^T v_0$ is minimized subject to the constraint that the dth component of v0 is some pre-specified constant.

Let the ith image vector be xi = [xi(1) xi(2) · · · xi(d)]T and the filter be h = [h(1) h(2) · · · h(d)]T, where T denotes transpose. From property 5, the CMACE filter can be formulated in feature space by applying a nonlinear mapping function f to the data as well as to the filter. We denote the transformed training image matrix and filter vector, whose sizes are d × N and d × 1 respectively, by

$$F_X = [f_{x_1}, f_{x_2}, \cdots, f_{x_N}],$$  (5–18)

$$f_h = [f(h(1))\; f(h(2))\; \cdots\; f(h(d))]^T,$$  (5–19)

where $f_{x_i} = [f(x_i(1))\; f(x_i(2))\; \cdots\; f(x_i(d))]^T$ for i = 1, · · · , N. Given data samples, the

cross correntropy between the ith training image vector and the filter can be estimated as

$$v_{oi}[m] = \frac{1}{d}\sum_{n=1}^{d} f(h(n))\, f(x_i(n - m)),$$  (5–20)

for all the lags m = −d+1, · · · , d−1. Then the cross correntropy vector voi can be formed including all the lags of voi[m], denoted by

$$v_{oi} = S_i f_h,$$  (5–21)

where Si is the matrix of size (2d − 1) × d given by


$$S_i = \begin{bmatrix}
f(x_i(d)) & 0 & \cdots & 0 & 0 \\
f(x_i(d-1)) & f(x_i(d)) & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
f(x_i(1)) & f(x_i(2)) & \cdots & \cdots & f(x_i(d)) \\
0 & f(x_i(1)) & f(x_i(2)) & \cdots & f(x_i(d-1)) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & f(x_i(1)) & f(x_i(2)) \\
0 & 0 & \cdots & 0 & f(x_i(1))
\end{bmatrix}.$$  (5–22)

Since the scale factor 1/d has no influence on the solution, it will be ignored throughout the dissertation. The correntropy energy of the ith image is given by

$$E_i = v_{oi}^T v_{oi} = f_h^T S_i^T S_i f_h.$$  (5–23)

Denoting $V_i = S_i^T S_i$ and using the definition of correntropy in (5–15), the d × d correntropy matrix Vi is

$$V_i = \begin{bmatrix}
v_i(0) & v_i(1) & \cdots & v_i(d-1) \\
v_i(1) & v_i(0) & \cdots & v_i(d-2) \\
\vdots & \vdots & \ddots & \vdots \\
v_i(d-1) & \cdots & v_i(1) & v_i(0)
\end{bmatrix},$$  (5–24)

where each element of the matrix is computed, without explicit knowledge of the mapping function f, by

$$v_i(l) = \sum_{n=1}^{d} \kappa_\sigma(x_i(n) - x_i(n + l)),$$  (5–25)

for l = 0, · · · , d − 1. The average correntropy energy over all the training data can be written as

$$E_{av} = \frac{1}{N}\sum_{i=1}^{N} E_i = f_h^T V_X f_h,$$  (5–26)


where

$$V_X = \frac{1}{N}\sum_{i=1}^{N} V_i.$$  (5–27)
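In sample form, Vi and VX can be computed directly from the training images. The sketch below is our own reading of (5–24)–(5–27); note that we truncate the sum in (5–25) at n = d − l so that x(n + l) stays inside the image vector, an assumption since the text leaves the boundary handling implicit.

```python
import numpy as np

def correntropy_matrix(x, sigma):
    """V_i of (5-24): a d x d symmetric Toeplitz matrix whose l-th
    diagonal holds v_i(l) of (5-25), with the sum truncated at n = d - l."""
    x = np.asarray(x, float)
    d = len(x)
    norm = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    v = np.array([norm * np.exp(-(x[:d - l] - x[l:]) ** 2
                                / (2 * sigma ** 2)).sum()
                  for l in range(d)])
    lag = np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    return v[lag]

def average_correntropy_matrix(X, sigma):
    """V_X of (5-27): the average of V_i over the N training images,
    stored as the columns of the d x N matrix X."""
    return np.mean([correntropy_matrix(X[:, i], sigma)
                    for i in range(X.shape[1])], axis=0)
```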

Since our objective is to minimize the average correntropy energy in feature space, the optimization problem is formulated as

$$\min_{f_h}\; f_h^T V_X f_h \quad \text{subject to} \quad F_X^T f_h = c,$$  (5–28)

where c is the desired vector for all the training images. The constraint in (5–28) means that we specify the correntropy values between the training inputs and the filter as desired constants. Since the correntropy matrix VX is positive definite, there exists an analytic solution to the optimization problem using the method of Lagrange multipliers in the new finite dimensional VRKHS. The CMACE filter in feature space then becomes

$$f_h = V_X^{-1} F_X \left(F_X^T V_X^{-1} F_X\right)^{-1} c.$$  (5–29)
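For completeness, (5–29) follows from a standard Lagrange-multiplier argument (our added derivation, assuming VX is invertible):

```latex
\begin{align*}
L(f_h, \lambda) &= f_h^T V_X f_h - 2\lambda^T\!\left(F_X^T f_h - c\right), \\
\frac{\partial L}{\partial f_h} = 2 V_X f_h - 2 F_X \lambda = 0
  &\;\Longrightarrow\; f_h = V_X^{-1} F_X \lambda, \\
F_X^T f_h = c
  &\;\Longrightarrow\; \lambda = \left(F_X^T V_X^{-1} F_X\right)^{-1} c
  \;\Longrightarrow\; f_h = V_X^{-1} F_X \left(F_X^T V_X^{-1} F_X\right)^{-1} c.
\end{align*}
```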

Unlike the KSDF in (3–11), which has an ∞ × 1 dimensionality in the RKHS obtained by the conventional kernel method, the CMACE filter is defined in the finite dimensional VRKHS, which has the same dimensionality as the input space, with size d × 1. In general, the kernel method creates an infinite dimensional feature space, so the solution often needs regularization to remain bounded. Therefore, the KSDF may need additional regularization terms for better performance. Regarding the computational complexity of the CMACE compared to the KSDF, an additional O(d³) operation and O(d²) storage are needed for $V_X^{-1}$. This motivates a fast version of the CMACE as well as a dimensionality reduction method for practical applications.


5.3 Implications of the CMACE Filter in the VRKHS

5.3.1 Implication of Nonlinearity

From (5–17) and (5–22), we can say that the RKHS induced by correntropy (VRKHS) is a Hilbert space spanned by the basis $\{\eta_i\}_{i=1}^{d}$ of size (2d − 1) × 1 given by

$$\eta_i = [0, \cdots, 0, f(x(d)), \cdots, f(x(1)), 0, \cdots, 0]^T,$$  (5–30)

where f(·) is a nonlinear scalar function and f(x(d)) is located at the ith element. It

is obvious that, unlike the RKHS induced by correlation, the VRKHS for the CMACE filter is nonlinearly related to the original input space. This statement can be simply exemplified with the CIM metric. Consider the vector x = [x1, 0, · · · , 0]T and the origin of the input space, y = [0, · · · , 0]T. The Euclidean distance in the input space is given by ED(x, y) = |x1|, while the distance in the VRKHS becomes $\mathrm{CIM}(x, y) = \left(\frac{1}{N}(\kappa_\sigma(0) - \kappa_\sigma(x_1))\right)^{1/2}$, where N is the dimension of the vectors. Now, scaling both vectors by α, we obtain new vectors x̄ = αx and ȳ = αy. The Euclidean distance between x̄ and ȳ becomes ED(x̄, ȳ) = α ED(x, y), whereas $\mathrm{CIM}(\bar{x}, \bar{y}) = \left(\frac{1}{N}(\kappa_\sigma(0) - \kappa_\sigma(\alpha x_1))\right)^{1/2} \neq \alpha\, \mathrm{CIM}(x, y)$.

Also, as α goes to infinity, the Euclidean distance in the input space linearly increases

too. However, the CIM distance saturates. This shows that distances in VRKHS are

not linearly related to distances in the input space. This argument can also be observed

directly from the CIM contour in Fig. 5-1.

In addition, since correntropy is different from correlation in the sense that it involves

high-order statistics of input signals, inner products in the RKHS induced by correntropy

are no longer equivalent to statistical inference on Gaussian processes. The transformation

from the input space to VRKHS is nonlinear and the inner product structure of VRKHS

provides the possibility of obtaining closed form optimal nonlinear filter solutions by

utilizing second and high-order statistics.


5.3.2 Finite Dimensional Feature Space

Another important difference compared with existing machine learning methods based on the conventional kernel method, which normally yields an infinite dimensional feature space, is that the VRKHS has the same dimension as the input space. In the conventional MACE, the template h has d degrees of freedom and all the image data are in the d-dimensional Euclidean space. As derived above, all the transformed images belong to a different d-dimensional vector space equipped with the inner product structure defined by the correntropy. The goal of this new algorithm is to find a template fh in this VRKHS such that the cost function is minimized subject to the constraint. Therefore, the number of degrees of freedom of this optimization problem is still d, so regularization, which would be needed in traditional kernel methods, is not necessary here. Further work needs to be

done regarding this point, but we hypothesize that in our methodology, regularization

is automatically achieved by the kernel through the expected value operator (which

corresponds to a density matching step utilized to evaluate correntropy). The fixed

dimensionality also carries disadvantages because the user has no control of the VRKHS

dimensionality. Therefore, the quality of the nonlinear solution depends solely on the

nonlinear transformation between the input space and VRKHS. The theoretical advantage

of using this feature space is justified by the CIM metric, which is very suitable to

quantify similarity in feature spaces and should improve the robustness to outliers of the

conventional MACE.

5.3.3 The Kernel Correlation Filter vs. the CMACE Filter: Prewhitening in Feature Space

One of the kernel methods applied to correlation filters is the kernel class-dependent feature analysis (KCFA) [28]. The KCFA is the kernelized version of the linear MACE filter, obtained by applying the kernel trick after a prewhitening preprocessing step. The correlation output of the MACE filter h for an input image vector z can be expressed as

$$y_{mace} = \tilde{Z}\tilde{X}\left(\tilde{X}^T \tilde{X}\right)^{-1} c,$$  (5–31)


where $\tilde{Z} = D^{-1/2} Z$ and $\tilde{X} = D^{-1/2} X$ indicate the prewhitened versions of Z and X in the frequency domain. Then (5–31) is equivalent to the linear SDF with prewhitened data, and applying the kernel trick yields the KCFA as follows [27]:

$$y_{KCF} = K_{ZX} K_{XX}^{-1} c,$$  (5–32)

where the (i, j)th elements of the matrices KXX and KZX are computed by

$$(K_{XX})_{ij} = \sum_{k=1}^{d} \kappa_\sigma(\tilde{x}_{ki} - \tilde{x}_{kj}), \qquad i, j = 1, 2, \cdots, N,$$  (5–33)

$$(K_{ZX})_{ij} = \sum_{k=1}^{d} \kappa_\sigma(\tilde{z}_{ki} - \tilde{x}_{kj}), \qquad i = 1, 2, \cdots, L,\; j = 1, 2, \cdots, N,$$  (5–34)

where N is the number of training images and L is the number of test input images.

In the CMACE, if we denote $\tilde{F}_X = V_X^{-1/2} F_X$, we can decompose fh as

$$f_h = V_X^{-1/2} V_X^{-1/2} F_X \left(F_X^T V_X^{-1/2} V_X^{-1/2} F_X\right)^{-1} c = V_X^{-1/2} \tilde{F}_X \left(\tilde{F}_X^T \tilde{F}_X\right)^{-1} c.$$  (5–35)

The main difference between the CMACE and the KCFA is the prewhitening process. In the KCFA, prewhitening is conducted in the input space using D; in the CMACE, on the other hand, (5–35) implies that the image is implicitly whitened in the feature space by the correntropy matrix VX. In the space domain MACE filter, the autocorrelation matrix can be used as a preprocessor for prewhitening. Since the CMACE filter uses the same formulation in the feature space, we can also expect that the correntropy matrix can be used for prewhitening. However, in practice, we cannot obtain the whitened data explicitly, since the mapping function is not explicitly known. In addition, the solution of the KCFA is defined in an infinite dimensional feature space like the KSDF; therefore an additional regularization term may be needed for better performance.


CHAPTER 6
THE CORRENTROPY MACE IMPLEMENTATION

6.1 The Output of the CMACE Filter

Since the nonlinear mapping function f is not explicitly known, it is impossible to directly use the CMACE filter fh in the feature space. However, the correntropy output can be obtained by the inner product between the transformed input image and the CMACE filter in the VRKHS. In order to test this filter, let Z be the matrix of L test image vectors and FZ be the transformed matrix of Z; then the L × 1 output vector is given by

$$y = F_Z^T V_X^{-1} F_X \left(F_X^T V_X^{-1} F_X\right)^{-1} c.$$  (6–1)

Here, we denote $T_{ZX} = F_Z^T V_X^{-1} F_X$ and $T_{XX} = F_X^T V_X^{-1} F_X$. Then the output becomes

$$y = T_{ZX} \left(T_{XX}\right)^{-1} c,$$  (6–2)

where TXX is an N × N symmetric matrix and TZX is an L × N matrix whose (i, j)th elements are expressed by

$$(T_{XX})_{ij} = \sum_{l=1}^{d}\sum_{k=1}^{d} w_{lk}\, f(x_i(k))\, f(x_j(l)) \cong \sum_{l=1}^{d}\sum_{k=1}^{d} w_{lk}\, \kappa_\sigma(x_i(k) - x_j(l)), \qquad i, j = 1, \cdots, N,$$  (6–3)

$$(T_{ZX})_{ij} = \sum_{l=1}^{d}\sum_{k=1}^{d} w_{lk}\, f(z_i(k))\, f(x_j(l)) \cong \sum_{l=1}^{d}\sum_{k=1}^{d} w_{lk}\, \kappa_\sigma(z_i(k) - x_j(l)), \qquad i = 1, \cdots, L,\; j = 1, \cdots, N,$$  (6–4)

where wlk is the (l, k)th element of $V_X^{-1}$.
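The direct evaluation of (6–2)–(6–4) can be sketched as follows (our own illustration, with hypothetical function names); each matrix entry costs O(d²), so this is only practical for small images — the fast method of section 6.3 addresses this.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    """K[k, l] = kappa_sigma(a(k) - b(l)), the Gaussian kernel of (5-4)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def cmace_output(X, Z, VX, c, sigma):
    """Test outputs y of (6-2), with T_XX and T_ZX built per (6-3)/(6-4).
    X: d x N training images, Z: d x L test images, c: length-N constraints."""
    W = np.linalg.inv(VX)                  # weights w_lk of (6-3) and (6-4)
    N, L = X.shape[1], Z.shape[1]
    # sum_{l,k} w_lk * kappa(a(k) - b(l)) equals the elementwise sum of W.T * K
    TXX = np.array([[np.sum(W.T * gaussian_gram(X[:, i], X[:, j], sigma))
                     for j in range(N)] for i in range(N)])
    TZX = np.array([[np.sum(W.T * gaussian_gram(Z[:, i], X[:, j], sigma))
                     for j in range(N)] for i in range(L)])
    return TZX @ np.linalg.solve(TXX, c)   # y = T_ZX T_XX^{-1} c
```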

The final output expressions in (6–3) and (6–4) are obtained by approximating

f(xi(k))f(xj(l)) and f(zi(k))f(xj(l)) by κσ(xi(k) − xj(l)) and κσ(zi(k) − xj(l)),


respectively, which is similar to the kernel trick and holds on average because of property

5. Unfortunately, (6–3) and (6–4) involve weighted versions of the functionals; therefore, the error in the approximation requires further theoretical investigation.

The CMACE is formulated in the linear VRKHS but has a nonlinear behavior since

the VRKHS is nonlinearly related to the input space. However, the CMACE preserves

the shift-invariant property of the linear MACE. The proof of the shift-invariant property

is given in Appendix C. Although the output of the CMACE gives us only one value, it

is possible to construct the whole output plane by shifting the test input image and as a

result, the shift invariance property of the correlation filters can be utilized at the expense

of more computation. Applying an appropriate threshold to the output of (6–1), one can

detect and recognize the testing data without generating the composite filter in feature

space. As will be shown in the simulation results section, even with this approximation,

the CMACE outperforms the conventional MACE.

6.2 Centering of the CMACE in Feature Space

With the Gaussian kernel, the correntropy value is always positive, which brings the

need to subtract the mean of the transformed data in feature space in order to suppress

the effect of the output DC bias. This centering of the correntropy should not be confused

with the spatial centering of the input images.

Given d data samples $\{x(i)\}_{i=1}^{d}$, let us denote the mean of the transformed data in feature space as E[f(x(i))] = mf. Then the centered correntropy, which can properly be called the generalized covariance function, is given by

$$V_c(i, j) = E[\{f(x(i)) - m_f\}\{f(x(j)) - m_f\}] = E[f(x(i))\, f(x(j))] - m_f^2 = V(i, j) - m_f^2.$$  (6–5)


The square of the mean of the transformed data f(·) coincides with the estimate of the information potential of the original data, that is,

$$m_f^2 = \frac{1}{d^2}\sum_{i=1}^{d}\sum_{j=1}^{d} \kappa_\sigma(x(i) - x(j)).$$  (6–6)

In order to show the validity of (6–6), let us consider the sample estimation of correntropy (ignoring the scale factor 1/d); then we have

$$\sum_{i=1}^{d} f(x(i))\, f(x(i + t)) = \sum_{i=1}^{d} \kappa_\sigma(x(i) - x(i + t)).$$  (6–7)

We arrange the double summation in (6–6) as an array and sum along the diagonal direction, which yields exactly the autocorrelation function of the transformed data at different lags; thus the correntropy function of the input data at different lags can be written as

$$\frac{1}{d^2}\sum_{i=1}^{d}\sum_{j=1}^{d} f(x(i))\, f(x(j)) = \frac{1}{d^2}\left\{\sum_{t=0}^{d-1}\sum_{i=1}^{d-t} f(x(i))\, f(x(i+t)) + \sum_{t=1}^{d-1}\sum_{i=1+t}^{d} f(x(i))\, f(x(i-t))\right\}$$
$$\approx \frac{1}{d^2}\left(\sum_{t=0}^{d-1}\sum_{i=1}^{d-t} \kappa_\sigma(x(i) - x(i+t)) + \sum_{t=1}^{d-1}\sum_{i=1+t}^{d} \kappa_\sigma(x(i) - x(i-t))\right) = \frac{1}{d^2}\sum_{i=1}^{d}\sum_{j=1}^{d} \kappa_\sigma(x(i) - x(j)).$$  (6–8)

As we see in (6–8), when the summation is far from the main diagonal, smaller and smaller data sizes are involved, which leads to a poor approximation. Notice that this is exactly the same problem encountered when the autocorrelation function is estimated from windowed data. However, when d is large, the approximation improves. Therefore, in the CMACE output equation (6–1), we can use the centered correntropy matrix VXC obtained by subtracting the information potential from the correntropy matrix VX as

$$V_{XC} = V_X - m_{f_{avg}}^2 \cdot \mathbf{1}_{d\times d},$$  (6–9)

where $m_{f_{avg}}^2$ is the average estimated information potential over the N training images and $\mathbf{1}_{d\times d}$ is a d × d matrix with all entries equal to 1. Using the centered correntropy matrix VXC, a better rejection ability for out-of-class images is achieved, since the offset of the output can be removed except for the center value in the case of training images.
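A sketch of the centering step, following our reading of (6–6) and (6–9):

```python
import numpy as np

def centered_correntropy_matrix(X, VX, sigma):
    """V_XC of (6-9): subtract the information-potential estimate (6-6),
    averaged over the N training images (columns of X), from V_X."""
    d, N = X.shape
    norm = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    m2 = 0.0
    for i in range(N):
        diff = X[:, i][:, None] - X[:, i][None, :]
        m2 += norm * np.exp(-diff ** 2 / (2 * sigma ** 2)).mean()
    m2 /= N                  # average information potential, the m_f^2 estimate
    return VX - m2 * np.ones((d, d))
```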

6.3 The Fast CMACE Filter

In practice, the drawback of the proposed CMACE filter is its computational complexity. In the MACE, the correlation output can be obtained by multiplication in the frequency domain, and the computation time can be drastically reduced by the FFT. In the CMACE, however, the output of the filter is obtained by computing the two matrices in (6–3) and (6–4), whose sizes depend on the image size and the number of training images. Each element involves a double summation of weighted kernel functions. Therefore, each element of the matrices requires O(d²) computations,

where d is the number of image pixels. When the number of training images is N , the

total computational complexity for one test output is $O(Nd^2 + N^2)$. A similar argument shows that the computation needed for training is $O(d^2(N^2 + 1) + N^2)$. On the other hand, the MACE only requires $O(4(d(2N^2 + N + 2) + N^2) + Nd\log_2(d))$ for training and $O(4d + d\log_2(d))$ for testing one input image. Table 6-1 shows the computational

complexity of the MACE and CMACE. More details about the required computation costs

are given in Appendix D. Constructing the whole output plane would significantly increase

the computational complexity of the CMACE. This quickly becomes too demanding

in practical settings. Therefore a method to simplify the computation is necessary for

practical implementations.

Here the Fast Gauss Transform (FGT) [47] is proposed to reduce the computation

time with a very small approximation error. The FGT is one of a class of very interesting

and important families of fast evaluation algorithms that have been developed over the

past decades to enable rapid calculation of weighted sums of Gaussian functions with

arbitrary accuracy. In nonparametric probability density estimation with a Gaussian kernel, the FGT can reduce the complexity from O(dM) to O(d + M) for M evaluations with d

sources.


6.3.1 The Fast Gauss Transform

In many problems in mathematics and engineering, the function of interest can be

decomposed into sums of pairwise interactions among a set of sources. In particular, this

type of problem is found in nonparametric probability density estimation as

$$G(z) = \sum_{j=1}^{d} q_j\, \kappa_\sigma(z - x(j)),$$  (6–10)

where κσ is a kernel function centered at the source points x(j) and the qj are scalar weighting coefficients. With the Gaussian kernel, (6–10) can be interpreted as a "Gaussian" potential field due to sources of strengths qj at the points x(j), evaluated at the target point z. Suppose that we have M evaluation target points; then the computation of (6–10) requires O(dM) calculations, which constrains the computation for large data sets d and M in real-world applications. As noted above, the FGT reduces this complexity to O(d + M). The basic idea is to cluster the sources and target points using appropriate data structures and the Hermite expansion, and then reduce the number of summations for a given level of precision.

6.3.2 The Fast Correntropy MACE Filter

The major part of the computation burden in the correntropy MACE filter is given by

$$T = \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij}\, e^{-(z(i) - x(j))^2 / 2\sigma^2}.$$  (6–11)

This is very similar to the density estimation problem that evaluates at d targets z(i) given d source samples x(j). However, the weighting factors wij in (6–11) depend on both target and source, which is different from the original FGT applications, where the weight vector is always the same at every evaluation target point. In our case, the weight vector wi = [wi1, · · · , wid]T varies with every evaluation point z(i). We can say


that (6–11) is a more general expression than the original FGT formulation, and it can be written as

$$T = \sum_{i=1}^{d} G_i(z),$$  (6–12)

where

$$G_i(z) = \sum_{j=1}^{d} w_{ij}\, \kappa_\sigma(z(i) - x(j)).$$  (6–13)

This means that clustering and the Hermite expansion should be performed at every

target z(i) with a different weight vector wi, which causes an extra computation for

clustering. However, since the sources are clustered in the FGT, if one expresses the

clustered sources about its center into the Hermite expansion, then there is no need to do

clustering and the Hermite expansion at every evaluation. The only thing that is necessary

is to use different weight vectors at every evaluation point. This process does not require

additional complexity compared to the original FGT formulation except that more storage

is required to keep the weight vectors. By using the Hermite expansion around the target

s, the Gaussian centered at x(j) evaluated at z(i) can be obtained by

exp

{−(z(i)− x(j))2

2σ2

}=

p−1∑n=0

1

n!

(x(j)− s√

)n

hn

(z(i)− s√

)+ ε(p), (6–14)

where the Hermite function hn(x) is defined by

$$h_n(x) = (-1)^n \frac{d^n}{dx^n}\left(e^{-x^2}\right).$$  (6–15)
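The expansion (6–14) is straightforward to implement. The sketch below (ours, with hypothetical names) evaluates the Hermite functions by the standard recurrence $h_{n+1}(t) = 2t\,h_n(t) - 2n\,h_{n-1}(t)$, which follows from (6–15).

```python
import numpy as np

def hermite_functions(t, p):
    """h_0(t), ..., h_{p-1}(t) of (6-15) via the standard recurrence."""
    h = [np.exp(-t ** 2), 2.0 * t * np.exp(-t ** 2)]
    for n in range(1, p - 1):
        h.append(2.0 * t * h[n] - 2.0 * n * h[n - 1])
    return h[:p]

def gaussian_via_hermite(z, x, s, sigma, p):
    """Truncated expansion (6-14) of exp(-(z - x)^2 / (2 sigma^2))
    about the center s, with p terms."""
    t = (z - s) / (np.sqrt(2.0) * sigma)
    u = (x - s) / (np.sqrt(2.0) * sigma)
    h = hermite_functions(t, p)
    total, factorial = 0.0, 1.0
    for n in range(p):
        if n > 0:
            factorial *= n
        total += (u ** n / factorial) * h[n]
    return total

# The truncation error epsilon(p) shrinks rapidly when |x - s| is small,
# i.e., when the sources are clustered tightly around s.
exact = np.exp(-(1.3 - 1.1) ** 2 / (2 * 0.5 ** 2))
print(gaussian_via_hermite(1.3, 1.1, 1.0, 0.5, p=8), exact)
```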

Also, in this research, we use a simple greedy algorithm for clustering [48], which computes

a data partition with a maximum radius at most twice the optimum. This clustering

method and the Hermite expansion with order p require O(pd) operations. In the case of (6–3) and (6–4), since the numbers of sources and targets are the same, they can be interchanged;

that is, the test image can be the source so that the clustering and Hermite expansion can


Table 6-1. Estimated computational complexity for training with N images and testing with one image. Matrix inversion and multiplication are considered (in this simulation, d = 4096, N = 60, p = 4, kc = 4).

              Training (off line)                               Testing (on line)
MACE          O(4(d(2N² + N + 2) + N²) + Nd log₂(d))            O(4d + d log₂(d))
CMACE         O(d²(N² + 1) + N²)                                O(d²N + N²)
Fast CMACE    O(N²pd(kc + 1) + d² + N²)                         O(pd(kc + 1)N + N²)

be done only one time per test. Thus T in (6–11) can be approximated by

$$T \approx \sum_{i=1}^{d}\sum_{B}\sum_{n=0}^{p-1} \frac{1}{n!}\, h_n\!\left(\frac{x(i) - s_B}{\sqrt{2}\,\sigma}\right) C_n(B),$$  (6–16)

where B represents a cluster with center sB and Cn(B) is given by

$$C_n(B) = \sum_{z(j) \in B} w_{ij}\left(\frac{z(j) - s_B}{\sqrt{2}\,\sigma}\right)^n.$$  (6–17)

From (6–16), we can see that evaluation at kc expansions at all the evaluation points

costs O(pkcd), so the total number of operations is O(pd(kc + 1)) per computation of

each element in (6–3) and (6–4). The final aim is to obtain the output of the CMACE

filter with N training images and L test images. In order to compute the output of one

test image, the original direct method requires O(d²N(N + 1)) operations to obtain TXX and TZX, and the operation count reduces to O(pd(kc + 1)N(N + 1)) by applying this enhanced FGT. Typically p and kc are around 4 while d and N are 4,096

and around 100 respectively in our application, which results in a computational savings

of roughly 100 times. Additionally, clustering with the test image is performed only once

per test, which reduces the computation time even more. However, from Table 6-1,

we see that the computational complexity of the CMACE for the testing still depends on

the number of training images, resulting in more computations than the MACE. More

work is necessary to reduce even further the computation time of the CMACE and its

memory storage requirements, but the proposed approach enables practical applications

with present day computers.


CHAPTER 7
APPLICATIONS OF THE CMACE TO IMAGE RECOGNITION

7.1 Face Recognition

7.1.1 Problem Description

In this section, we show the performance of the proposed correntropy MACE filter for face image recognition. In the simulations, we used the same facial expression database used in chapter 3. We used only 5 images per person to compose the template (filter), since the MACE filter shows reasonable recognition results with a small number of training images in this database [5]. We picked and report the results for the two most difficult cases, which produced the worst performance with the conventional MACE method. We test with all the images of each person's data set, resulting in 75 outputs for each class. The simulation results have been obtained by averaging (Monte-Carlo approach) over 100 different training sets (each consisting of 5 randomly chosen images) to minimize the problem of performance differences due to splitting the relatively small database into training and testing sets. The kernel size, σ, is chosen to be 10 for the correntropy matrix during training and 30 for the test output. In this data set, it has been observed that a kernel size around 30%–50% of the standard deviation of the input data is appropriate. Moreover, we can control the performance by choosing a different kernel size during training for prewhitening.

7.1.2 Simulation Results

Figure 7-1 shows the average test output peak values for image recognition. The desired output peak value should be close to one when the test image belongs to the training image class (true class); otherwise it should be close to zero. Figure 7-1 (Top) shows that the correlation output peak values of the conventional MACE for false classes are close to zero, which means that the MACE has a good ability to reject the false class. However, some outputs in the test image set, even in the true class, are not recognized as the true class. Figure 7-1 (Bottom) shows the output values of the proposed correntropy MACE.


[Figure 7-1. The averaged test output peak values (100 Monte-Carlo simulations with N=5). (Top): MACE, (Bottom): CMACE.]

[Figure 7-2. The test output peak values with additive Gaussian noise (N=5). (Top): MACE, circle: true class with SNR=10dB, cross: false class with SNR=2dB. (Bottom): CMACE, circle: true class with SNR=10dB, cross: false class with SNR=2dB.]


[Figure 7-3. The comparison of ROC curves with different SNRs: MACE (no noise), MACE (SNR: 2dB), MACE (SNR: 0dB), and the proposed method (no noise, SNR: 2dB, 0dB).]

We can see that the generalization and rejection performance are improved. As a result, the two cases can be recognized well even with a small number of training images. One of the problems of the conventional MACE is that its performance can easily be degraded by additive noise in the test image, since the MACE has no special mechanism to account for input noise. Therefore, it has a poor rejection ability for a false class image when noise is added to the false class. Figure 7-2 (Top) shows the noise effect on the conventional MACE. When the class images are seriously distorted by additive Gaussian noise (SNR = 2dB), the correlation output peaks of some test images from the false class become greater than those of the true class, and hence wrong recognitions happen. The results in Figure 7-2 (Bottom) are obtained by the proposed method. The correntropy MACE shows much better performance, especially for rejection, even in a very low SNR environment. Figure 7-3 shows the comparison of ROC curves with different SNRs. In the conventional MACE, we can see that the false alarm rate increases as the additive noise power is increased. However, in the proposed method, the probability of detection with a zero false alarm rate is 1. The correntropy MACE shows much better recognition performance than the conventional MACE.


Table 7-1. Comparison of standard deviations of all the Monte-Carlo simulation outputs (100×75 outputs).

          True (no noise)   False (no noise)   True (SNR: 0dB)   False (SNR: 0dB)
MACE      0.0498            0.0086             0.0527            0.0245
CMACE     0.0488            0.0051             0.0485            0.0038

One advantage of the proposed method is that it is more robust than the conventional MACE; that is, the variation of the test output peak value due to a different training set is smaller than that of the MACE. Figure 7-4 shows the standard deviations of 100 Monte-Carlo outputs per test input when the test inputs are noisy false class images. Table 7-1 shows the comparison of the standard deviations of 7,500 outputs (100 Monte-Carlo outputs for 75 inputs) for each class. From Table 7-1, we can see that the variations of the correntropy MACE outputs due to different training sets are much smaller than those of the conventional MACE, which tells us that our proposed nonlinear version of the MACE outperforms the conventional MACE and achieves a robust performance for distortion-tolerant pattern recognition.

Table 7-2 shows the area under the ROC for different kernel sizes in the case of no additive noise. In this simulation, kernel sizes in the range from 0.1 to 15 provide perfect ROC performance. The kernel size obtained by Silverman's rule of thumb [44], which is given by $\sigma_i = 1.06\,\hat{\sigma}_i d^{-1/5}$, where $\hat{\sigma}_i$ is the standard deviation of the ith training data and d is the number of samples, is 9.63, and it also results in the best performance. As expected from the properties of correntropy, it is noticed that correntropy approaches correlation for large kernel sizes (the ROC area of the MACE is about 0.96).


[Figure 7-4. The comparison of standard deviations of 100 Monte-Carlo simulation outputs for each noisy false class test image: MACE vs. proposed method.]

Table 7-2. Comparison of ROC areas with different kernel sizes.

Kernel size   ROC area   Kernel size   ROC area
0.1           1          20            0.9901
0.5           1          50            0.9804
1.0           1          100           0.9810
9.6           1          200           0.9820
10.0          1          500           0.9796
15.0          1          1000          0.9518

7.2 Synthetic Aperture Radar (SAR) Image Recognition

7.2.1 Problem Description

In this section, we show the performance of the proposed correntropy-based nonlinear MACE filter on the SAR image recognition problem using the MSTAR/IU public release data set [49]. The MSTAR (Moving and Stationary Target Acquisition and Recognition) data is a standard dataset in the SAR ATR community, allowing researchers to test and compare their ATR algorithms. The database consists of X-band SAR images with 1 foot by 1 foot resolution at 15, 17, 30 and 45 degree depression angles. The data was collected by Sandia National Laboratory (SNL) using the STARLOS sensor. The original dataset consists of different military vehicles, where the poses (aspect angles) of the vehicles lie between 0 and 359 degrees, and the target image sizes are 128×128 pixels or more. Since the MACE and the CMACE have a constraint at the origin of the output plane, we centered all images and cropped the centered images to the size of 64×64 pixels (in practice, with uncentered images, one needs to compute the whole output plane and search for the peak). The selected area contains the target, its shadow, and background clutter. In this simulation we use the images whose aspect angles lie between 0 and 179 degrees. The original SAR image is composed of magnitude and phase information, but only the magnitude data is used here.

This dissertation compares the recognition performance of the proposed CMACE filter

against the conventional MACE considering two distortion factors. The first distortion

case is due to a different aspect angle between training and testing, and the second case

is a different depression angle between test and training data. In the simulations, the

performance is measured by observing the test output peak value and creating the ROC

(Receiver Operating Characteristic) curve. The kernel size, σ, is chosen to be 0.1 for the

estimation of correntropy in the training images and 0.5 for test output in (6–3) and (6–4).

The value of 0.1 for the kernel size corresponds to the standard deviation of the training

data which is consistent with the Silverman’s rule. Experimentally it was verified that a

larger kernel size for testing provided better results.

7.2.2 Aspect Angle Distortion Case

In the first simulation, we selected the BTR60 (armored personnel carrier) as the target (true class) and the T62 (tank) as a confuser (false class). Both were taken at a 17 degree depression angle. The goal is to design a filter which will recognize the BTR60 with minimal confusion from the T62. Figure 7-5 (a) shows the training images, which are used to compose the MACE and the CMACE filters. In order to evaluate the effect of the aspect angle distortion, training images were selected at every 3 index numbers from a total of 120 exemplar images for each vehicle (most of the index numbers have a 2◦ difference in aspect angle and some have a 1◦ difference).


[Figure 7-5. Case A: Sample SAR images (64×64 pixels) of two vehicle types for a target chip (BTR60) and a confuser (T62). (a) Training images (BTR60) of aspect angles 0, 35, 124, 159 degrees. (b) Test images from BTR60 of aspect angles 3, 53, 104, 137 degrees. (c) Test images from the confuser (T62) of aspect angles 2, 41, 103, 137 degrees.]

That means that the total number of training images used to construct a filter is 40 (N=40). Figure 7-5 (b) shows test images for the recognition class and (c) represents confuser vehicle images. Testing is conducted with all 120 exemplar images for each vehicle. We are interested only in the center of

the output plane, since the images are already centered. The peak output responses over

all exemplars in the test set are shown in Figure 7-6. In the simulation, the constraint

value for the MACE as well as the CMACE filter is one during training; therefore, the

desired output peak value should be close to one when the test image belongs to the target

class and should be close to zero otherwise. Figure 7-6 (Top) shows the correlation output

peak value of the MACE and Figure 7-6 (Bottom) shows the output peak values of the

CMACE filter for both a target and a confuser.

Figure 7-6 illustrates that results are perfect for both the MACE and the CMACE

within the training images. However, in the MACE filter, most of the peak output values

on test images are less than 0.5. This shows that the MACE output generalizes poorly


[Figure 7-6. Case A: Peak output responses of testing images for a target chip (circle) and a confuser (cross): (Top) MACE, (Bottom) CMACE.]

[Figure 7-7. Case A: ROC curves with different numbers of training images: MACE (N=40, 60), correntropy MACE (N=40, 60), and KCFA (N=40, 60).]


for the images of the same class not used in training, which is one of the known drawbacks of the conventional MACE. For the confuser test images, most of the output values are near zero, but some are higher than those of the target images, creating false alarms. On the other hand, for the CMACE, most of the peak output values of the test images are above 0.5, which means that the CMACE generalizes better than the MACE. Also, the rejection performance for a confuser is better than that of the MACE. As a result, the recognition performance between the two vehicles is improved by the CMACE, as best quantified in the ROC curves of Figure 7-7. From the ROC curves we can see that the detection ability of the proposed method is much better than both the MACE and the KCFA. For the KCFA, prewhitened images were obtained by multiplying by $D^{-1/2}$ in the frequency domain, and the kernel trick was applied to the prewhitened images to compute the output in (5–32). A Gaussian kernel with kernel size 5 is used for the KCFA. From the ROC curves in Figure 7-7, we can also see that the CMACE outperforms the nonlinear kernel correlation filter, in particular for high detection probability.

Figure 7-8 (a) shows the MACE filter output plane and (b) shows the CMACE filter output plane for a test image in the target class not present in the training set. Figure 7-8 (c) and (d) show the case of a confuser (false class) test input. In Figure 7-8 (a) and (b), we can see that both the MACE and the CMACE produce a sharp peak in the output plane. However, the peak value at the origin of the CMACE is higher (closer to the desired value) than that of the MACE. Moreover, the CMACE has fewer sidelobes, and the values of the sidelobes around the origin are lower than those of the MACE. These points tell us that the detection ability of the proposed method is better than that of the MACE. On the other hand, for the confuser test input in Figure 7-8 (c) and (d), the output values around the origin of the CMACE are lower than those of the MACE, which means that the CMACE has a better rejection ability than the MACE.

In order to demonstrate the shift-invariant property of the CMACE, we apply the

images of Figure 7-9. The test image was cropped for the object to be shifted 13 pixels


[Figure 7-8. Case A: The MACE output plane vs. the CMACE output plane. (a) True class in the MACE (0.74, 0.48). (b) True class in the CMACE (0.98, 0.44). (c) False class in the MACE (peak: 0.87, center: 0.22). (d) False class in the CMACE (peak: 0.62, center: 0.006).]

in both x and y pixel positions. Figure 7-10 shows the output planes of the MACE and

CMACE when the shifted image is used as the test input while all the training images

are centered. In Figure 7-10, the maximum peak value should happen at the position

of (77,77) in the output plane since the object is shifted by 13 pixels in both x and y

directions. In the CMACE output plane, the maximum peak occurs at (77,77) and its value is 0.9585. However, in the MACE, the maximum peak occurs at (74,93) with value 0.9442, and the value at position (77,77) is 0.93. In this test, the CMACE shows a better shift-invariance property than the MACE.


[Figure 7-9. Sample images of BTR60 of size 64×64 pixels. (a) Original (x, y): the image cropped to 64×64 pixels at the center of the original 128×128 image. (b) Shifted to (x−13, y−13): the image cropped to 64×64 pixels at (x−13, y−13) of the original 128×128 image.]

[Figure 7-10. Case A: Output planes with shifted true-class input image. (a) The MACE output plane (maximum at X: 74, Y: 93, Z: 0.9442). (b) The CMACE output plane (maximum at X: 77, Y: 77, Z: 0.9585).]


The CMACE performance sensitivity to the kernel size is studied next. In order to find an appropriate kernel size for the CMACE, the easiest approach is to apply Silverman's rule of thumb developed for the kernel density estimation problem, which is given by $\sigma_i = 1.06\,\hat{\sigma}_i d^{-1/5}$, where $\hat{\sigma}_i$ is the standard deviation of the ith training data and d is the number of samples [44]. A more principled alternative is to apply cross validation to find the best kernel size. For cross validation, we use one image of the training set which is not included in the filter design. Since we are considering images as 1-dimensional vectors, we have N different training data sets. Therefore, we obtain one overall kernel size, σ, by averaging the N different kernel sizes: $\sigma = \frac{1}{N}\sum_{i=1}^{N}\sigma_i$. In this simulation, when N = 60, the kernel size given by Silverman's rule is 0.0185 and the best one from cross validation is 0.1. Figure 7-11 shows the ROC curves for the kernel size obtained by Silverman's rule and the one obtained by cross validation. We see that the ROC performance from Silverman's rule is very close to that of the optimal kernel size from cross validation. Also, when we increase the kernel size to 10, the performance is similar to that of the MACE. As expected from the properties of correntropy, it is noticed that correntropy approaches correlation with large kernel size.

Table 7-3 shows the area under the ROC for different kernel sizes, and we conclude

that kernel sizes between 0.01 and 1 provide little change in detectability. This may be

surprising when contrasted with the problem of finding the optimal kernel size in density

estimation, but in correntropy the kernel size enters in the argument of an expected value

and plays a different role in the final solution, namely it controls the balance between the

effect of second order moments versus the higher order moments (see property 1).

7.2.3 Depression Angle Distortion Case

In the second simulation, we selected the vehicle 2S1 (rocket launcher) as the target and the T62 as a confuser. These two kinds of images look very similar in shape; therefore, they represent a difficult object recognition case, useful to test the performance improvement of the proposed method. In order to show the effect of the depression angle


[Figure 7-11. The ROC comparison with different kernel sizes: CMACE (σ=0.0185, Silverman's rule), CMACE (σ=0.1, the best from cross validation), MACE, and CMACE (σ=10).]

Table 7-3. Case A: Comparison of ROC areas with different kernel sizes.

Kernel size   ROC area   Kernel size   ROC area
0.01          0.9623     0.6           0.9806
0.0185        0.9686     0.7           0.9771
0.05          0.9631     0.8           0.9754
0.1           0.9847     0.9           0.9749
0.2           0.9865     1.0           0.9602
0.3           0.9797     2.0           0.9397
0.4           0.9797     5.0           0.9256
0.5           0.9808     10.0          0.9033

distortion, training data are selected from target images which were collected at 30 degree

depression angle and the MACE and CMACE are tested with data taken at 17 degree

depression angle.

Figure 7-12 depicts some sample images. As we can see in Figure 7-12 (a) and (b),

due to the big change in depression angle (a 13 degree change in depression is considered a huge distortion), the test images have more shadows and the apparent size of the vehicles also changes,

making detection more difficult. In this simulation, we use all the images (120 images


[Figure 7-12. Case B: Sample SAR images (64×64 pixels) of two vehicle types for a target chip (2S1) and a confuser (T62). (a) Training images (2S1) of aspect angles 0, 35, 124, 159 degrees. (b) Test images (2S1) of aspect angles 3, 53, 104, 137 degrees. (c) Test images from the confuser (T62) of aspect angles 2, 41, 103, 137 degrees.]

covering 180 degrees of pose) at the 30 degree depression angle for training, and we test with all 120 exemplar images at the 17 degree depression angle.

Figure 7-13 (Top) shows the correlation output peak value of the MACE and

(Bottom) shows the output peak values of the CMACE filter for target and confuser test data. We see that the conventional MACE is very poor in this case, either under- or

overshooting the peak value of 1 for the target class, but the CMACE can improve the

recognition performance because of its better generalization. Figure 7-14 depicts the ROC

curve and summarizes the CMACE advantage over the MACE in this large depression

angle distortion case. More interestingly, the KCFA performance is close to that of the linear MACE, due to its input-space whitening, which is unable to cope with the large distortion.


[Figure 7-13. Case B: Peak output responses of testing images for a target chip (circle) and a confuser (cross): (Top) MACE, (Bottom) CMACE.]

Figure 7-14. Case B: ROC curves (probability of detection vs. probability of false alarm) for the MACE, correntropy MACE, and KCF.


Table 7-4. Comparison of computation time and error for one test image between the direct method (CMACE) and the FGT method (fast CMACE) with p = 4 and kc = 4

                       Direct (sec)   FGT (sec)   Error
Train: K_XX            7622.8         68.31       9.9668e-06
       K_ZX            122.8          1.15        8.7575e-06
Test (true) output                                2.8225e-03
       K_ZX            128.6          1.18        3.8844e-05
Test (false) output                               8.4377e-03

7.2.4 The Fast Correntropy MACE Results

This section shows both the computation speed improvement and the effect on accuracy of the fast CMACE filter in the aspect angle distortion case with N = 60 training images. Computation time was clocked with MATLAB version 7.0 on a 2.8 GHz Pentium 4 processor with 2 GBytes of RAM running Windows XP.

Table 7-4 shows the comparison of computation time for (6–3) and (6–4) between the direct implementation of the CMACE filter and the fast method with a Hermite approximation order of p = 4 and kc = 4 clusters. The computation time and absolute errors for one test image were obtained by averaging over 120 test images. This simulation shows that the FGT method is about 100 times faster than the direct method with a reasonable error precision. Figure 7-15 presents the comparison in terms of ROC curves of the MACE, the CMACE and the fast CMACE. From the ROC curves we can observe that the approximation with p = 4 and kc = 4 is very close to the original ROC. Table 7-5 shows the effect of different orders (p) and clusters (kc) on the computation time and accuracy of the fast CMACE filter. We conclude that the computation time increases roughly proportionally to p and kc, while the absolute error decreases rapidly (roughly exponentially in the table).
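To make the role of the order p tangible, here is a minimal one-dimensional, single-cluster sketch of the Hermite expansion that underlies the fast Gauss transform; the function names and toy data are ours, and a real FGT adds the kc-cluster partition and multivariate expansion indices.

```python
import math
import numpy as np

def gauss_direct(x, y, sigma):
    """Direct O(N*M) Gauss transform G(y_j) = sum_i exp(-(y_j - x_i)^2 / sigma^2)."""
    return np.exp(-(y[:, None] - x[None, :])**2 / sigma**2).sum(axis=1)

def gauss_fgt_one_cluster(x, y, sigma, p=4):
    """Truncated Hermite expansion about the source centroid (one cluster,
    kc = 1): exp(-(t - u)^2) = sum_n u^n/n! * h_n(t), where the Hermite
    functions satisfy h_0(t) = exp(-t^2), h_1(t) = 2t exp(-t^2) and
    h_{n+1}(t) = 2t h_n(t) - 2n h_{n-1}(t)."""
    c = x.mean()
    u = (x - c) / sigma                    # scaled source offsets
    t = (y - c) / sigma                    # scaled target offsets
    M = [np.sum(u**n) / math.factorial(n) for n in range(p)]   # source moments
    h_prev = np.exp(-t**2)                 # h_0
    G = M[0] * h_prev
    if p > 1:
        h_curr = 2 * t * h_prev            # h_1
        G = G + M[1] * h_curr
        for n in range(1, p - 1):
            h_prev, h_curr = h_curr, 2 * t * h_curr - 2 * n * h_prev
            G = G + M[n + 1] * h_curr
    return G

rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.2, 500)              # sources clustered near zero
y = np.linspace(-1.0, 1.0, 200)
exact = gauss_direct(x, y, sigma=1.0)
for p in (2, 4, 8, 12):
    print(p, np.max(np.abs(exact - gauss_fgt_one_cluster(x, y, 1.0, p))))
```

With sources concentrated around the cluster center, the maximum error falls off rapidly as p grows, mirroring the behavior reported in Table 7-5.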

7.2.5 The Effect of Additive Noise

This section presents the effect of additive noise on the recognition performance of both the MACE and CMACE. For this simulation, we design the template with training data selected at every third index number from a total of 120 exemplar images of the BTR60 without noise, and test with all 120 images distorted by additive noise for


Figure 7-15. Comparison of ROC curves among the MACE, the correntropy MACE, and the fast correntropy MACE (direct vs. FGT method) in case A.

Table 7-5. Comparison of computation time and error for one test image in the FGT method with a different number of orders and clusters

Order   Time (sec)   Error       Cluster   Time (sec)   Error
2       0.8116       1.48e-02    2         0.7181       5.61e-02
6       1.5140       8.23e-04    6         1.6693       3.87e-04
10      2.2119       8.58e-06    10        2.5595       4.71e-05
14      2.8533       4.16e-07    14        3.5660       6.93e-06
20      3.8097       1.25e-09    20        5.3067       1.14e-06

each vehicle. Figure 7-16 shows sample images of the original and the noisy image with a signal-to-noise ratio (SNR) of 7 dB. Also in this simulation, we compare the CMACE with the optimal trade-off filter (OTSDF), a well-known correlation filter designed to overcome the poor generalization of the MACE when noisy input is presented. The OTSDF filter is given by

H = T^{-1} X (X^H T^{-1} X)^{-1} c,   (7–1)

where T = αD + (1 − α²)^{1/2} C with 0 ≤ α ≤ 1, D is the diagonal matrix in the MACE, and C is the diagonal matrix containing the input noise power spectral density as its diagonal entries.
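As one concrete reading of (7–1), the following sketch builds an OTSDF template in the frequency domain; the function name and interface are ours, the constraint vector is taken as all ones, white noise (C = I) is assumed when no noise power spectral density is supplied, and D is taken as the average power spectrum of the training images, as in the MACE. It is an illustration, not the author's implementation.

```python
import numpy as np

def otsdf_template(training_images, alpha=0.7, noise_psd=None):
    """Sketch of (7-1): returns the OTSDF filter H as a flattened
    frequency-domain vector for a list of equally sized 2D images."""
    X = np.stack([np.fft.fft2(img).ravel() for img in training_images],
                 axis=1)                                  # d x N
    d, N = X.shape
    Ddiag = np.mean(np.abs(X)**2, axis=1)                 # MACE's D (assumed)
    Cdiag = np.ones(d) if noise_psd is None else np.ravel(noise_psd)
    Tdiag = alpha * Ddiag + np.sqrt(1 - alpha**2) * Cdiag
    c = np.ones(N)                                        # unit output constraints
    TinvX = X / Tdiag[:, None]                            # T^{-1} X (T diagonal)
    A = X.conj().T @ TinvX                                # X^H T^{-1} X  (N x N)
    return TinvX @ np.linalg.solve(A, c)                  # (7-1)
```

Setting alpha = 1 recovers the MACE template, while alpha = 0 emphasizes noise tolerance; alpha = 0.7 matches the operating point used in the comparison below.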


Figure 7-16. Sample SAR images (64 × 64 pixels) of the BTR60: (a) original, (b) noisy with SNR = 7 dB.

Figure 7-17. ROC comparisons with noisy test images (SNR = 7 dB) in case A (N = 40): MACE and OTSDF at SNR = 7 dB, CMACE at SNR = 7 dB, and the no-noise MACE and CMACE baselines.


Figure 7-17 shows the comparison of ROC curves of the MACE, OTSDF and CMACE when white Gaussian noise with a signal-to-noise ratio (SNR) of 7 dB is added to all test images. We see that the MACE performance is degraded by the additive input noise, while the OTSDF with α = 0.7 shows almost the same performance as the MACE without noise. However, the performance of the CMACE with noisy test data is almost the same as in the no-noise case. Although the CMACE does not explicitly take the noise into consideration like the OTSDF does, the CMACE is robust to the input noise. In practice, the additive noise information is unknown; therefore, the OTSDF is impractical.


CHAPTER 8
DIMENSIONALITY REDUCTION WITH RANDOM PROJECTIONS

8.1 Introduction

In many pattern recognition and image processing applications, the high dimensionality of the observed data makes many otherwise efficient statistical algorithms impractical. Therefore, a variety of data compression and dimensionality reduction methods have been proposed to overcome the curse of dimensionality [50]. Dimensionality reduction provides the compression and coding necessary to avoid excessive memory usage and computation.

Principal Component Analysis (PCA) is the most widely known way of reducing dimension, and it is optimal in the mean square error sense. PCA determines the basis vectors by finding the directions of maximum variance in the data, and it minimizes the error between the original data and the data reconstructed from its low dimensional representation. PCA has been very popular in face recognition [51] and many pattern recognition applications [52]. Finding the principal components is a well established numerical procedure through eigendecomposition of the data covariance matrix, although it is still expensive to compute. There are other less expensive methods [51], based on recursive algorithms [53], for finding only a few eigenvectors and eigenvalues of a large matrix, but the computational complexity is still a burden. Moreover, subspace projections by PCA do not preserve discrimination [54], so there may be a loss of performance. Variants of the Singular Value Decomposition (SVD) have been considered for image compression utilizing the Karhunen-Loeve Transformation (KLT). Like PCA, the SVD method is also expensive to compute.

The Discrete Cosine Transform (DCT) [50] is a widely used method for image compression, and it can also be used for dimensionality reduction of image data. The DCT is computationally less burdensome than PCA, and its performance approaches that of PCA. The DCT is well matched to the human eye: the distortions introduced occur at the highest frequencies only, and the human eye tends to neglect these as noise. The image is transformed to the DCT domain


and dimensionality reduction is done in the inverse transform by discarding the transform

coefficients corresponding to the highest frequencies.

8.2 Motivation

Both the conventional MACE and the CMACE are memory-based algorithms; therefore, in practice, the drawbacks of this class of algorithms are both the storage requirements and the high computational demand. The output of the CMACE filter is obtained by computing the product of two matrices defined by the image size and the number of training images, and each element of the matrix requires O(d²) computations, where d is the number of image pixels. This quickly becomes too complex in practical settings, even for relatively small images. The fast correntropy MACE filter using the fast Gauss transform (FGT) has been presented to increase the computational speed of the CMACE filter, but the storage is still high. When the number of training images is N, the total computational complexity of one test output of the CMACE is O(d²N(N + 1)), and this can be reduced to O(pcdN(N + 1)), where p is the order of the Hermite approximation and c is the number of clusters utilized in the FGT (p, c ≪ d). In general, images have high dimensionality, and applications using large images need a large memory capacity. The main goal of this chapter is to find a simple but powerful dimensionality reduction method for image recognition with the CMACE filter.

Recently, random projection (RP) has emerged as an alternative dimensionality reduction method in machine learning and image compression [55],[56],[57],[58],[59],[60] due to its low computational complexity and good performance. Many experiments in the literature show that RP is computationally simple while preserving similarity to a high degree. In random projection, the original high dimensional data is projected onto a lower dimensional subspace using a random matrix, with only a small distortion of the distances between the points, thus preserving similarity information. Even though the randomly projected data includes key information of the original data, we need to extract that information properly. Since correntropy has the ability to extract higher order moments of


the data, it can be a promising tool for random projection applications.

In this chapter we present a dimensionality reduction pre-processor based on random projections (RP) to decrease the storage and fit more readily available computational resources, and we show that the RP method works well with the CMACE filter for image recognition.

8.3 PCA and SVD

Principal Component Analysis (PCA) is the best linear dimensionality reduction technique in the mean-square error sense. Being based on the covariance matrix of the random variables, it is a second-order method. In various fields, it is also known as the singular value decomposition (SVD), the Karhunen-Loeve Transformation (KLT), the empirical orthogonal function (EOF) method, and so on.

PCA seeks to reduce the dimensionality of the data by finding a few orthogonal linear combinations of the original variables with the largest variance.

Let us suppose that we are given a data matrix X in R^d, whose size is d × N, where N is the number of vectors in the d-dimensional space. The goal is to find a k-dimensional subspace (k < d) such that the projection of X on that subspace minimizes the expected squared error.

The projection of the original data onto the lower k-dimensional subspace is then obtained by

X_pca = P_pca X,   (8–1)

where P_pca is k × d and contains the k eigenvectors corresponding to the k largest eigenvalues.
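A minimal sketch of (8–1), with our own function name; it assumes the columns of X are the data vectors and uses a dense eigendecomposition, which is exactly the expensive step discussed above.

```python
import numpy as np

def pca_projection(X, k):
    """Sketch of (8-1): project the d x N data matrix X onto the k
    leading principal directions of its (centered) covariance matrix."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center each dimension
    cov = Xc @ Xc.T / (X.shape[1] - 1)          # d x d covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    P_pca = eigvecs[:, -k:].T                   # k x d, top-k eigenvectors
    return P_pca @ X                            # k x N projected data
```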

8.4 Random Projections

Random projection (RP) is a simple yet powerful dimensionality reduction technique that uses random projection matrices to project data into a low dimensional subspace. In RP, the original high dimensional space is projected onto a low dimensional subspace


using a random matrix whose columns have unit length. In contrast to other methods, such as PCA, that use data-driven optimization criteria, RP does not use such criteria; therefore, RP is data independent. Moreover, RP is computationally simple and preserves the structure of the data without introducing significant distortion. RP theory is far from complete, so it has to be used with caution. The following lemma from Johnson and Lindenstrauss (JL) provides theoretical support for RP.

JL lemma. For any 0 < ε < 1 and any integer N, let k be a positive integer such that

k ≥ 4 (ε²/2 − ε³/3)^{-1} ln N.   (8–2)

Then for any set V of N points in R^d, there is a map f : R^d → R^k such that for all u, v ∈ V,

(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||².   (8–3)

Furthermore, this map can be found in polynomial time.

The JL lemma states that any N-point set in d-dimensional Euclidean space can be mapped down onto a k ≥ O(log N/ε²) dimensional subspace without distorting the distance between any pair of points by more than a factor of (1 ± ε), for any 0 < ε < 1; with this choice of k, the failure probability for each pair decays as O(1/N²). A proof of this lemma, as well as tighter bounds on ε and k, are given in [61].
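The bound (8–2) is easy to evaluate numerically; the example values below are our own and not taken from the experiments in this chapter.

```python
import math

def jl_dimension(n_points, eps):
    """Smallest integer k satisfying (8-2): k >= 4 ln N / (eps^2/2 - eps^3/3)."""
    return math.ceil(4.0 * math.log(n_points) / (eps**2 / 2.0 - eps**3 / 3.0))

# Hypothetical example: 120 image chips, 50% distance distortion tolerated.
print(jl_dimension(120, 0.5))   # -> 230, far below the ambient d = 4096
```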

Let us suppose that we are given a data matrix X, whose size is d × N, where N is the number of vectors in the d-dimensional space. Then the projection of the original data onto a lower k-dimensional subspace can be obtained by

X_rp = P X,   (8–4)

where P is k × d and is called the random projection matrix.

The complexity of RP is very low compared to other dimension reduction methods: RP needs only O(kdN) operations for projecting a d × N data matrix into k dimensions. The


computational complexity of constructing the random matrix, O(kd), is negligible when compared with that of PCA, O(Nd²) + O(d³) [62].

8.4.1 Random Matrices

The choice of the random matrix P is one of the issues of RP. There are some simple methods satisfying the JL lemma in the literature [58],[63]. Here we present three such methods.

• The Gaussian ensemble: the entries p_ij of the k × d random matrix P are identically and independently sampled from a normal distribution with zero mean and unit variance: p_ij := (1/√k) r_ij, where the r_ij are i.i.d. N(0, 1).

• The binary ensemble: the entries p_ij of the k × d random matrix P are identically and independently sampled from a symmetric Bernoulli distribution: p_ij := (1/√k) r_ij, where the r_ij are i.i.d. with P(r_ij = ±1) = 1/2.

• The related ensemble: p_ij := (1/√k) r_ij, where

r_ij := √3 × { +1 with probability 1/6;  0 with probability 2/3;  −1 with probability 1/6 }.

In most applications, the Gaussian ensemble satisfies the JL lemma well. The other two methods yield significant computational savings [64].
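The three ensembles are straightforward to generate; the following sketch (our own function name and interface) includes the 1/√k scaling discussed in the next section.

```python
import numpy as np

def random_projection_matrix(k, d, ensemble="gaussian", rng=None):
    """Sketch of the three ensembles listed above; the 1/sqrt(k) scaling
    preserves distances on average (see section 8.4.2)."""
    rng = np.random.default_rng(rng)
    if ensemble == "gaussian":
        R = rng.standard_normal((k, d))
    elif ensemble == "binary":
        R = rng.choice([-1.0, 1.0], size=(k, d))
    elif ensemble == "related":          # the sparse +/-sqrt(3) ensemble
        R = np.sqrt(3.0) * rng.choice([1.0, 0.0, -1.0], size=(k, d),
                                      p=[1.0 / 6, 2.0 / 3, 1.0 / 6])
    else:
        raise ValueError(ensemble)
    return R / np.sqrt(k)

# Projecting a d x N image matrix as in (8-4):  X_rp = P @ X
```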

8.4.2 Orthogonality and Similarity Properties

In fact, to preserve the similarity between the original data and the transformed data, a projection matrix should be orthogonal. However, in a high enough dimensional space, it is possible to use a non-orthogonal random projection matrix, due to the fact that the data is sparse and that, in a high dimensional space, there exists a much larger number of almost orthogonal directions [65]. Thus, vectors having random directions in a high dimensional


space are linearly independent, and these might be sufficiently close to orthogonal to provide an approximation of a basis.

The inner product of two vectors x and y that have been obtained by random projection of the vectors u and v with the random matrix R can be expressed as

x^T y = u^T R^T R v.   (8–5)

The matrix R^T R can be decomposed into two terms,

R^T R = I + ε,   (8–6)

where ε_ij = r_i^T r_j for i ≠ j and ε_ii = 0 for all i.

If all the entries in ε were equal to zero, i.e., if the vectors r_i and r_j were orthogonal, the matrix R^T R would be equal to I and the similarity between the original data and the projected data would be preserved exactly in the random mapping. In practice, the entries in ε will be small but not equal to zero.

Let us consider the case in which the entries of the random matrix are identically and independently sampled from a normal distribution with zero mean and unit variance, and thereafter the length of each r_i is normalized. Then it is evident that ε_ij is an estimate of the correlation coefficient between two i.i.d. normally distributed random variables, and if the dimensionality k of the reduced space is large, ε_ij is approximately normally distributed with zero mean, and its variance σ²_ε can be approximated by

σ²_ε ≈ 1/k.   (8–7)

That is, the distortion of the inner product produced by the random projection is zero on average, and its variance is at most the inverse of the dimensionality of the reduced space. This result motivates the scaling factor 1/√k in the random projection matrix examples above, which preserves the distances. (In applications that are not concerned with distances, we do not need to scale the projection matrix by 1/√k.) Moreover, the error


becomes much smaller when the data is sparse, and this result suggests the relevance of random projection in compressive sampling for sparse signal recovery [66]. However, the methodology for building the random projection matrix may affect the subsequent algorithms used for processing, and this is an area that is much less studied.
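A quick empirical check of (8–6) and (8–7) under the stated Gaussian assumption; the sizes below are arbitrary choices of ours.

```python
import numpy as np

# With normalized Gaussian columns r_i, the off-diagonal entries of
# R^T R (the epsilon of (8-6)) have variance close to 1/k, as in (8-7).
rng = np.random.default_rng(0)
k, d = 128, 1024
R = rng.standard_normal((k, d))
R /= np.linalg.norm(R, axis=0)              # normalize each column r_i
eps = R.T @ R - np.eye(d)
off_diag = eps[~np.eye(d, dtype=bool)]
print(off_diag.var(), 1.0 / k)              # both close to 1/128 ~ 0.0078
```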

8.5 Simulations

In this section, we show the performance of the CMACE filter with RP dimensionality reduction in the face recognition example of chapter 7. We project the original data into a lower dimensional space with random projection and apply the CMACE filter to the reduced dimensional data. In this simulation, we use the Gaussian ensemble method to generate the random projection matrix. In order to compare the performance of the CMACE after preprocessing with random projection, we compute the area under the ROC curve.

Figure 8-1 shows the ROC area values for different reduced dimensions obtained by RP.

Figure 8-1. Comparison of ROC areas with different RP dimensionalities k (50 trials with different training images and RP matrices), showing the maximum, minimum, and average; at k = 144 the ROC area is 0.9684.


Figure 8-2. Comparison of ROC areas with different RP dimensionalities k (50 trials with different RP matrices, but fixed training images), showing the maximum, minimum, and average; at k = 256 the ROC area is 0.9502.

Since RP is in fact a random function, we present results in terms of the mean, maximum, and minimum performance obtained in 50 Monte Carlo simulations. At every trial we use randomly chosen training images (N = 5) and different random projection matrices. When the MACE is applied to the original data, the average ROC area is 0.96 (the best case is 0.99 and the worst case is 0.8868). In Figure 8-1, the performance of the CMACE with reduced dimensionality k ≥ 144 is always better than that of the MACE filter with the original data. The range of performance between the best and the worst cases is due to both the effect of different training images and different RP matrices.

In order to monitor only the effect of RP, we fixed the training images and ran 50 Monte Carlo simulations with different RP matrices. The results on ROC areas are shown in Figure 8-2. We can see that the variation in performance due to different RP matrices is substantially smaller, and that the CMACE obtains consistent performance with RP when the image size is above 16 × 16 (dimensionality k = 256).


Figure 8-3. ROC comparison with different dimensionality reduction methods (pixel averaging, RP, bilinear interpolation, and subsampling) for the MACE and CMACE (reduced image size is 16 × 16).

The comparison among the four dimensionality reduction methods (subsampling, pixel averaging, bilinear interpolation, and Gaussian RP) for images of size 16 × 16 (from 64 × 64) is shown in Figure 8-3.

For the CMACE, the Gaussian method (RP) and the pixel-averaging method work very well, with subsampling the worst, but still with robust performance. Subsampling is the simplest technique, but it can also lose important detail information. In the MACE case, the Gaussian method is the worst, with the pixel-averaging method still providing some discrimination, but at a much reduced rate (compare with Figure 7-3). It is surprising that local pixel averaging, the simplest method of dimensionality reduction, provides such robust performance in this application for both the MACE and CMACE. It indicates that coarse features are sufficient for discrimination up to a certain level of performance. However, notice that pixel averaging loses with respect to CMACE-RP when the operating point in the ROC is close to 100%, as can be expected (finer detail is needed to discriminate between classes).


We have also applied PCA to the MACE and CMACE. There are different ways to apply PCA to this task. One method for dimensionality reduction with PCA uses training images from the whole data set, and then projects all the images onto the subspace spanned by the principal components. With this method, when we choose 10 images (5 from the true class and 5 from the false class) and project all true and false class images onto this subspace, the performance of both the MACE and CMACE is perfect. However, the training data must be sufficient to find principal directions that cover the whole test data, and a large computation is required. Moreover, in practice, it is impossible to use out-of-class images as a training set for a MACE filter, which is designed only with data from one class. In this more realistic case, the test image class does not belong to the training set for PCA, and the discrimination performance will be very poor. Figure 8-4 shows the ROC curves when only true class images are used for PCA. Even for this case, we had to use all the true class images (75) to find 75 principal directions, project all the images, and then choose the 5 projected images to compose the MACE and CMACE filters. For testing, we also project the test image onto the subspace obtained from the training set. Since the false class test images are not used to determine the PCA subspace, the projected data of the false class is not guaranteed to preserve the information of the original images; therefore, the rejection performance becomes very poor. The ROC area values for the MACE and CMACE are 0.4015 and 0.7283, respectively.

We could not obtain reasonable results with the MACE and the RP method, as shown in Figure 8-3. We explain this MACE behavior in terms of the Gaussian dimensionality reduction procedure, but the argument partially applies to the other methods as well. Although RP preserves similarity in the reduced projections, it changes the statistics of the original data classes. After random projection with the Gaussian ensemble method, all the projected images display statistics very close to white noise with similar variance. This result is shown in Figure 8-5, where sample images of the two classes, of size 16 × 16 after applying RP to all images, are depicted. The first row shows the training image set, while the second


Figure 8-4. ROC comparison (probability of detection vs. probability of false alarm) with PCA for the MACE and CMACE (reduced dimensionality is k = 75).

row displays the in-class test set and the third row the out-of-class test images. We see that the projected images in the true class and the false class, although slightly different in detail, seem to have very similar statistics.

The MACE, which extracts only second order information, is unable to distinguish between the projected image sets; however, the CMACE succeeds in this task. In order to explain the effectiveness of the correntropy function, we compare correlation and correntropy in the projected space. This result is shown in Figure 8-6. We consider the 2D images as long 1D vectors. In Figure 8-6 (a) we show the autocorrelation of one original image vector in the true class; (b) depicts the autocorrelation of one of the training images after RP, which leads us to conclude that the projected image has been whitened (the only peak occurs at zero lag); and (c) shows that the cross correlation between the reduced training image vector and a test image vector in the false class after RP is practically the same as the autocorrelation of the reduced training image vector after RP. Therefore, the covariance information of the images after RP is totally destroyed. Since the conventional


Figure 8-5. Sample images of size 16 × 16 after RP: (a) training images, (b) true class images, (c) false class images.

MACE filter utilizes only second order information, it is unable to discriminate between in-class and out-of-class images. However, in (e) and (f), we can see that the cross correntropy between in-class and out-of-class images is still preserved after RP, due to the fact that correntropy has the ability to extract higher order information from the reduced dimensional data. Therefore, the CMACE filter seems very well suited to work with reduced dimensional images obtained by random projection, for this and other applications.

We can also see the overall detection and recognition performance of the CMACE-RP through a further analysis of the output plane. Figure 8-7 shows correlation output planes for the MACE and correntropy output planes (CMACE) after dimensionality reduction with k = 64 random projections. Figure 8-7 (a) shows the desirable correlation output plane of the MACE filter given the true class test image; however, (b) shows its poor rejection ability for the false class test image. On the other hand, for the CMACE filter, the true and false class output planes in Figure 8-7 (c) and (d) show the expected responses even with such small-dimensional images.


Figure 8-6. Cross correlation vs. cross correntropy. (a) Autocorrelation of one of the original training image vectors. (b) Autocorrelation of one of the reduced training image vectors after RP. (c) Cross correlation between one of the reduced training image vectors and a test image vector in the false class after RP. (d) Autocorrentropy of one of the original training image vectors. (e) Autocorrentropy of one of the reduced training image vectors after RP. (f) Cross correntropy between one of the reduced training image vectors and a test image vector in the false class after RP.

The initial idea of using a preprocessor based on random projections was to alleviate the storage and computational complexity of the CMACE. Table 8-1 presents comparisons between the original CMACE and the CMACE with RP. The dominant component of the storage is the correntropy matrix (V_X). In single precision (32 bit), 64 MBytes are needed to store V_X for 64 × 64 pixel images, but only 256 KBytes for 16 × 16 pixel images after RP. We need an additional 4 MBytes to perform random projection with the Gaussian ensemble method; in the binary ensemble case, no additional storage for RP is needed. The table also presents the computational complexity of (6–1) for one test image, given N = 5 training images, clocked with MATLAB version 7.0 on a 2.8 GHz Pentium 4 processor with 2 GBytes of RAM.


Figure 8-7. Correlation output planes vs. correntropy output planes after dimension reduction with random projection (reduced image size is 8 × 8): (a) with a true class test image in the MACE, (b) with a false class test image in the MACE, (c) with a true class test image in the CMACE, (d) with a false class test image in the CMACE.

Table 8-1. Comparison of the memory and computation time between the original CMACE (image size 64 × 64) and CMACE-RP (16 × 16, Gaussian ensemble method) for one test image with N = 5

                               CMACE (d = 4096)       CMACE-RP (k = 256)
Memory (bytes,                 O(4d²)                 O(4k(k + d))
single precision)              = 64 MB + α            = 4.2 MB + β
Complexity                     O(d²(N² + N + 1))      O(k²(N² + N + 1) + kdN)
                               = O(5.2 × 10⁸)         = O(7.3 × 10⁶)
Time (sec)                     58.584                 0.4297


CHAPTER 9
CONCLUSIONS AND FUTURE WORK

9.1 Conclusions

In this research, we have evaluated the correntropy based nonlinear MACE filter for image recognition. We presented experimental results for face recognition using CMU's facial expression data and for SAR image recognition using the MSTAR public release data.

Correntropy induces a new RKHS that has the same dimensionality as the input space but is nonlinearly related to it. Therefore, it is different from the conventional kernel methods, in both scope and detail. Here we illustrated that the optimal MACE filter formulation can be directly solved in the VRKHS. This CMACE overcomes the main shortcoming of the MACE, which is poor generalization. We believe this is due to the utilization of higher order statistical information of the target class in the matching. The CMACE also shows good rejection performance, as well as robust results with additive noise. This is due to the prewhitening effect in feature space and the new metric created by correntropy, which reduces the effect of outliers. Simulation results show that the detection and recognition performance of the CMACE exhibits better distortion tolerance than the MACE for several kinds of distortions (in face recognition, different facial expressions; in SAR, aspect angle as well as depression angle). Also, the CMACE outperforms the nonlinear kernel correlation filter, which is the kernelized SDF with prewhitened data in the input space, especially in the large distortion case. Moreover, the CMACE preserves the shift-invariance property well.

The sensitivity of the CMACE performance to the kernel size is experimentally demonstrated to be small, but a full understanding of this parameter requires further investigation. In addition, there is still an approximation in (6–3) and (6–4) to compute the products of the projected data functionals by a kernel evaluation, which holds on average. For large images, this approximation seems to be good, but its error needs to be understood and quantified to obtain the best performance from the CMACE filter.


In practice, the drawbacks of the proposed CMACE filter are the required storage and its computational complexity. Since one does not have direct access to the filter weights in the VRKHS, the computations for the test set must involve the training set data, so the total computational complexity of one test output is O(d²N(N + 1)), and the storage depends on the image dimension, O(d²). The MACE easily produces the entire correlation output plane by FFT processing. The CMACE can also construct the whole output plane by shifting the test input image; as a result, there is no need to center all images, provided that the input image is appropriately shifted. However, computing the whole output plane is a big burden for the CMACE. For this reason, this research also proposed the fast CMACE to save computation time by using the fast Gauss transform (FGT) algorithm, which results in computational savings of about 100 fold for 64 × 64 pixel images. With the fast Gauss transform, we were able to reduce the computation to O(pcdN(N + 1)), where p, c ≪ d.

However, this still requires huge storage and is not very competitive with other methods for object recognition. The random projection (RP) method may make the CMACE useful for practical applications using standard computing hardware. RP is a preprocessor that extracts features of the data, but unlike PCA it is very easy to compute, in O(kd). Reducing the data into features has the double effect of addressing both the storage and computation requirements. For instance, instead of 64 MBytes for 64 × 64 pixel images, the storage for images reduced with RP to 16 × 16 pixels is 4.2 MBytes (256 KBytes in the binary ensemble case). Computational speed improves by more than 100 times. The method of random projections and its impact on subsequent pattern recognition algorithms is still poorly understood. Here we verified that the MACE is incompatible with the Gaussian method of random projections, since it destroys the second order statistics that make the MACE work. The pixel-averaging method seems to preserve second order statistics to a certain degree. However, the CMACE combined with RP is a better alternative, and it is less sensitive to the method of data reduction. This can be understood if we remember

Page 95: The Correntropy Mace Filter for Image Recognition

that the CMACE is preserving higher order statistics of the data, unlike the MACE filter.

The performance of the CMACE-RP of 16×16 is better than that of the MACE of 64×64.

But further work is necessary to quantify extensively the performance of the CMACE-RP

versus other algorithms.

These tests with the CMACE and data reduction clearly showed a new application

domain for correntropy in signal processing. The conventional data reduction methods

average locally or globally data and tend to destroy mean and variance, but apparently

they preserve some of the higher order information contained in the data that can still be

utilized by correntropy. Therefore, in applications where data reduction at the front-end

is a necessity, correntropy may provide still usable algorithms, in cases where second order

methods fail. This argument is also very relevant in compressive sampling (CS), where

convex optimization needs to be utilized to minimize the l1 norm, since the l2 norm creates

a lot of artifacts in reconstruction. We think that the correntropy induced metric (called

CIM in [42]) can be a candidate to simplify the reconstruction in CS. We have however

to fully understand why correntropy is able to still distinguish between images or signals

that have been heavily distorted, because we can perhaps even propose new data reduction

procedures that preserve the discriminability of correntropy.

9.2 Future Work

The correntropy MACE filter was obtained by solving the constrained optimization problem in the RKHS induced by correntropy, where the dimension of the RKHS is the same as the input dimension. The data points in this new RKHS are nonlinearly related to the original data; therefore, we can still find a closed form solution to the nonlinear MACE filter that outperforms the linear MACE filter. However, there are still several issues to be investigated.

First, the proposed correntropy MACE filter has hard constraints on the center of the output plane. As with the traditional SDF-type filters, linear constraints are imposed on the training images to yield a known value at specific locations in the output


plane. However, placing such constraints satisfies conditions only at isolated points in the image space and does not explicitly control the filter's ability to generalize over the entire domain of the training images. Unlike the general classification problem, the goal of this research is to find an appropriate template for a specific object that we want to identify, without any information on out-of-class data. We have to suppress the response to all images except the true target image. Therefore, we think that constraining only one location is not the best solution. Finding new constraints that give good generalization as well as rejection ability is one line of future work. One idea is to use randomly mixed images of the true target as out-of-class images. These generated out-of-class images are totally different from the true class images but have the same statistical information as the true class images. Therefore, we can expect that this idea may help improve performance.

Second, the computation of the correntropy MACE output requires an approximation. Unfortunately, (6–3) and (6–4) involve weighted versions of the functionals; therefore, the error in the approximation should be addressed, and further investigation is required for a good approximation.

Finally, this research presented simulation results on applications to face recognition and SAR image recognition. In addition to face recognition, the proposed algorithm can be applied to biometric verification such as iris and fingerprint recognition. Also, in the SAR application, there is a three-class (BMP2, BTR70 and T72) object classification problem in the MSTAR/IU public release data set [67]. Most of the literature applies algorithms to this three-class problem to compare performance. Therefore, in order to convince other researchers, we need to evaluate our algorithm on the three-class problem as well.

Summarizing the future work:

• Find new constraints that provide better generalization as well as rejection ability.

• Observe the approximation errors due to the weighted values and find a good approximation.


• Applications to biometric verification, and a more detailed comparison on the three-class SAR classification.


APPENDIX A
CONSTRAINED OPTIMIZATION WITH LAGRANGE MULTIPLIERS

If y is a weighted sum of variables, y = a^T x, then dy/dx = a. The general quadratic form for y in matrix notation is y = x^T A x, where A = {a_ij} is an N × N matrix of weights. Here, we assume that x is a real vector; then the vector of partial derivatives of y with respect to each variable is dy/dx = (A + A^T) x. If A is symmetric, then dy/dx = 2Ax.

The method of Lagrange multipliers is useful for minimizing a quadratic function subject to a set of linear constraints. Suppose that B = [b_1, b_2, ..., b_M] is an N × M matrix with the vectors b_i of length N as its columns, and c = [c_1, c_2, ..., c_M]^T is a vector of M constants. We want to find the vector x that minimizes the quadratic term y = x^T A x while satisfying the linear equations B^T x = c. If A is positive semi-definite, then y is convex and there is at least one solution. We form the cost function

J = x^T A x − 2λ_1(b_1^T x − c_1) − 2λ_2(b_2^T x − c_2) − · · · − 2λ_M(b_M^T x − c_M),   (A–1)

where the scalar parameters λ_1, λ_2, ..., λ_M are known as the Lagrange multipliers.

Setting the gradient of J with respect to x to zero yields

2Ax − 2(λ_1 b_1 + λ_2 b_2 + · · · + λ_M b_M) = 0.   (A–2)

Defining m = [λ_1, λ_2, ..., λ_M]^T, (A–2) can be expressed as

Ax − Bm = 0,   (A–3)

or

x = A^{-1} B m.   (A–4)

Substituting (A–4) for x into the constraint B^T x = c yields

B^T A^{-1} B m = c.   (A–5)


The Lagrange multiplier vector m can be obtained as

m = (B^T A^{-1} B)^{-1} c.   (A–6)

Using (A–4) and (A–6), we obtain the following solution to the constrained optimization problem:

x = A^{-1} B (B^T A^{-1} B)^{-1} c.   (A–7)
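A quick numerical sanity check of (A–7) with randomly generated A, B, and c (toy data of our own):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 3
A = rng.standard_normal((N, N))
A = A @ A.T + N * np.eye(N)                      # symmetric positive definite
B = rng.standard_normal((N, M))
c = rng.standard_normal(M)

Ainv_B = np.linalg.solve(A, B)                   # A^{-1} B
x = Ainv_B @ np.linalg.solve(B.T @ Ainv_B, c)    # closed form (A-7)
print(np.allclose(B.T @ x, c))                   # constraints hold: True

# Optimality: any feasible x + z (with B^T z = 0) has no smaller cost.
Z = np.linalg.svd(B.T)[2][M:].T                  # null-space basis of B^T
z = Z @ rng.standard_normal(N - M)
print(x @ A @ x <= (x + z) @ A @ (x + z))        # True
```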


APPENDIX B
THE PROOF OF PROPERTY 5 OF CORRENTROPY

Let p_ij(x, y) be the joint PDF of (x_i, x_j) such that

p_ij(x, y) = Σ_{i=1}^{∞} α_i φ_i(x) φ_i(y),   (B–1)

where the φ_i(·) and α_i are the eigenfunctions and eigenvalues of p_ij(x, y), respectively. Here,

E[f(x)f(y)] = ∫∫ p_ij(x, y) f(x) f(y) dx dy
            = Σ_i α_i ( ∫ φ_i(x) f(x) dx ) ( ∫ φ_i(y) f(y) dy )
            = Σ_i α_i β_i²,   (B–2)

where β_i = ∫ φ_i(x) f(x) dx.

Now, let ψ_i(·) and λ_i be the eigenfunctions and eigenvalues of the kernel k; then

E[k(x, y)] = ∫∫ p_ij(x, y) k(x, y) dx dy
           = ∫∫ Σ_j Σ_i α_i φ_i(x) φ_i(y) λ_j ψ_j(x) ψ_j(y) dx dy
           = Σ_j Σ_i α_i λ_j γ_ij.   (B–3)

Observing (5–15) and (B–1), we can construct f such that β_i = sqrt(Σ_j λ_j γ_ij). Then there exists f(x) = Σ_i β_i φ_i(x) satisfying (5–15).


APPENDIX C
THE PROOF OF A SHIFT-INVARIANT PROPERTY OF THE CMACE

A shift invariant system is one for which a shift or delay of the input sequence causes a corresponding shift in the output sequence. All the components of the CMACE output are determined by the kernel function, so by proving that the Gaussian kernel is shift-invariant, we can conclude that the CMACE is shift-invariant.

Let the output of the Gaussian kernel be

y(n) = κ_σ(x_1(n) − x_2(n)) = exp( −(x_1(n) − x_2(n))² / (2σ²) ).   (C–1)

Start with a shift of the inputs, x_{1s}(n) = x_1(n − s_o) and x_{2s}(n) = x_2(n − s_o); then the response y_1(n) to the shifted input is

y_1(n) = κ_σ(x_{1s}(n) − x_{2s}(n)) = exp( −(x_1(n − s_o) − x_2(n − s_o))² / (2σ²) ).   (C–2)

Now the shift of the output, defined as y(n − s_o), becomes

y(n − s_o) = exp( −(x_1(n − s_o) − x_2(n − s_o))² / (2σ²) ).   (C–3)

Then clearly y_1(n) = y(n − s_o); therefore, the Gaussian kernel is shift-invariant.


APPENDIX D
COMPUTATIONAL COMPLEXITY OF THE MACE AND CMACE

Here, only the computational complexity of the matrix inversions and multiplications in both the MACE and CMACE is considered. Let us assume that all the elements of the matrices are real.

In order to construct the MACE template in the frequency domain with a given training image set, O(d) multiplications are needed for the inversion of the diagonal matrix D of size d × d and O(N²) for the inversion of the Toeplitz matrix (X^H D^{-1} X) of size N × N. The numbers of multiplications are O(N²) for D^{-1}X, O(dN² + d) for (X^H D^{-1} X), and O(dN²) for D^{-1}X(X^H D^{-1}X)^{-1}. In addition, the FFT needs O((N + 1) d log₂(d)) multiplications with N training images. In reality, the elements of the matrices are complex valued; therefore, the MACE requires a total of O(4(d(2N² + N + 2) + N²) + Nd log₂(d)) multiplications to compose the template for the true class in the frequency domain. For testing one input image after building the template, the MACE requires only O(4d + d log₂(d)) multiplications.

The CMACE needs O(d²) and O(N²) multiplications for the inversion of the Toeplitz matrix V_X of size d × d and of T_XX of size N × N, respectively, and O((Nd)²) to compute T_XX; therefore, the total number of multiplications in off-line mode with the given training image set is O(d²(N² + 1) + N²). For testing one image, O(N²) multiplications for the output and O((Nd)²) operations for obtaining T_ZX are needed; therefore, the total computational complexity of the CMACE for one test image is O(d²N + N²) multiplications.

The fast CMACE with the FGT reduces the computational complexity to O(N²pd(kc + 1) + d² + N²) for the training set and to O(pd(kc + 1)N + N²) for one testing image.


APPENDIX E
THE CORRENTROPY-BASED ROBUST NONLINEAR BEAMFORMER

E.1 Introduction

Beamforming is often used with an array of radar antennas in order to transmit or receive signals in different directions without having to mechanically steer the array [68],[69], and it has found numerous applications in radar, sonar, seismology, radio astronomy, medical imaging, speech processing, and wireless communications. The classical approach to beamforming is a natural extension of Fourier-based spectral analysis to spatio-temporally sampled data, called the conventional Bartlett beamformer [70]. This algorithm maximizes the energy of the beamforming output for a given input signal. Because it is independent of the signal characteristics and only depends on a certain direction, its major difficulties are low spatial resolution and high sidelobes. In an attempt to alleviate the limitations of the conventional beamformer, the Capon beamformer was introduced [71],[72].

A Capon beamformer attempts to minimize the output energy contributed by interference coming from directions other than the "look direction". Moreover, it maintains a fixed constant gain in the look direction (normalized to one) in order not to risk the loss of the signal containing the information. This Capon beamformer is sensitive to mismatch between the assumed and actual array steering vectors, which occurs often in practice. Recently, a robust beamformer was proposed by extending the Capon beamformer to the case of uncertain array steering vectors [73],[74].

From a statistical point of view, most of these techniques are based on linear models, which make use of only the first and second order moment information (e.g., the mean and the variance) of the data. Therefore, they are not an appropriate choice for non-Gaussian distributed data, such as impulsive noise scenarios. In order to deal with more realistic situations, further research into signal modeling has led to the realization that many natural phenomena can be better represented by distributions of a more impulsive nature.


One type of distribution that exhibits heavier tails than the Gaussian is the class of stable distributions introduced by Nikias and Shao [75]. Alpha-stable distributions have been used to model diverse phenomena such as random fluctuations of gravitational fields, economic market indexes [76], and radar clutter [77].

To overcome the limitation of the linear model in the non-Gaussian case, a nonlinear beamformer was proposed in [78], but most nonlinear beamforming methods involve complicated weight vector computations. Recently, kernel based learning algorithms have been heavily researched, due to the fact that linear algorithms can be easily extended to nonlinear versions through kernel methods [23]. Some kernel based methods have been presented in [79],[80],[26] for beamforming and target detection problems.

The correntropy MACE (CMACE) filter [46],[81], which is the nonlinear version of the correlation filter, has been shown to possess good generalization and rejection performance in image recognition applications.

In this appendix, we apply correntropy to the beamforming problem and exploit the linear structure of the RKHS induced by correntropy to formulate the correntropy beamformer. Because it involves higher-order statistics through the nonlinear relation between the input and feature spaces, the correntropy beamformer shows better performance than the Capon and kernel methods and is robust in impulsive noise scenarios.

E.2 Standard Beamforming Problem

E.2.1 Problem

Consider the standard beamforming model. Let a uniformly spaced linear array of M sensors receive signals x_k generated by a narrow-band source s_k arriving from direction θ. Using a complex envelope representation, the M × 1 vector of received signals at the kth snapshot can be expressed as

x_k = a(θ) s_k + n_k,   (E–1)


where a(θ) ∈ C^{M×1} is the steering vector of the array toward direction θ,

a(θ) = [1, e^{j(2π/λ) d cos θ}, ..., e^{j(2π(M−1)/λ) d cos θ}]^T,   (E–2)

and n_k is the M × 1 vector of additive white noise. The beamformer output is given by

y_k = w^H x_k = w^H a(θ) s_k + w^H n_k,   (E–3)

where w ∈ C^{M×1} is a vector of weights and H denotes the conjugate transpose. The goal is to satisfy w^H a(θ) = 1 and minimize the effect of the noise (w^H n_k), in which case y_k recovers s_k.

Besides, we also assume that each element of n_k follows a symmetric α-stable (SαS) distribution described by the following characteristic function:

φ(w) = exp(jδw − γ|w|^α),   (E–4)

where α is the characteristic exponent restricted to the values 0 < α ≤ 2, δ (−∞ < δ < ∞) is the location parameter, and γ (γ > 0) is the dispersion of the distribution. The value of α is related to the degree of impulsiveness of the distribution. Smaller values of α correspond to heavier tailed distributions and hence to more impulsive behavior, while as α increases, the tails are lighter and the behavior is less impulsive. The special case α = 2 corresponds to the Gaussian distribution (N(δ, 2γ)), while α = 1 corresponds to the Cauchy distribution.
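SαS samples with δ = 0 can be generated with the Chambers–Mallows–Stuck construction for the symmetric case; the sketch below is our implementation (function name and interface are ours), not the generator used in these simulations.

```python
import numpy as np

def sas_noise(alpha, gamma=1.0, size=1, rng=None):
    """Symmetric alpha-stable samples via Chambers-Mallows-Stuck (beta = 0).
    alpha = 2 reduces to the Gaussian N(0, 2*gamma); alpha = 1 is Cauchy."""
    rng = np.random.default_rng(rng)
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)    # uniform angle
    W = rng.exponential(1.0, size)                  # unit-mean exponential
    if np.isclose(alpha, 1.0):
        X = np.tan(V)                               # Cauchy case
    else:
        X = (np.sin(alpha * V) / np.cos(V)**(1.0 / alpha)
             * (np.cos((1.0 - alpha) * V) / W)**((1.0 - alpha) / alpha))
    return gamma**(1.0 / alpha) * X                 # dispersion scaling

# Sanity check: at alpha = 2 the sample variance should be near 2*gamma.
print(sas_noise(2.0, gamma=1.0, size=100_000, rng=0).var())
```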

E.2.2 Minimum Variance Beamforming

Since the look-direction frequency response is fixed by the constraint, minimization of the non-look-direction noise energy is the same as minimization of the total output energy. The energy of the beamformer output y_k is minimized subject to the constraint of a distortionless response in the direction of the desired signal, as given by

min_w E[y_k²]   subject to   w^H a(θ) = 1.   (E–5)


The constraint w^H a(θ) = 1 prevents the gain in the direction of the signal from being reduced. This is commonly referred to as Capon's method [71],[72]. Equation (E–5) has an analytical solution given by

w_capon = R_x^{-1} a(θ) / (a(θ)^H R_x^{-1} a(θ)),   (E–6)

where R_x denotes the covariance matrix of the array output vector. In practical applications, R_x is replaced by the sample covariance matrix R̂_x, where

R̂_x = (1/N) Σ_{k=1}^{N} x_k x_k^H,   (E–7)

with N denoting the number of snapshots. Substituting w_capon into equation (E–3), the constrained least squares estimate of the look-direction output is

y_{capon,k} = w_capon^H x_k = a(θ)^H R_x^{-1} x_k / (a(θ)^H R_x^{-1} a(θ)).   (E–8)

E.2.3 Kernel-Based Beamforming

The basic idea of kernel algorithms is to transform the data x_i from the input space to a high dimensional feature space of vectors Φ(x_i), where the inner products can be computed using a positive definite kernel function satisfying Mercer's conditions [23]: κ(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩. This simple and elegant idea allows us to obtain nonlinear versions of any linear algorithm expressed in terms of inner products, without even knowing the exact mapping Φ.

Using the constrained least-squares approach explained in the previous section, it can easily be shown that the equivalent solution w_kernel in the feature space is given by

w_kernel = R_{Φ(x)}^{-1} Φ(a(θ)) / (Φ(a(θ))^H R_{Φ(x)}^{-1} Φ(a(θ))),   (E–9)

where R_{Φ(x)} is the correlation matrix in the feature space. The estimated correlation matrix is given by

R̂_{Φ(x)} = (1/N) X_Φ X_Φ^H,   (E–10)


assuming the sample mean has already been removed from each sample (centered), where X_Φ = [Φ(x_1), Φ(x_2), ..., Φ(x_N)] is a full rank matrix whose columns are the mapped input reference data in the feature space. Its output is given by

y_{kernel,k} = w_kernel^H Φ(x_k) = Φ(a(θ))^H R_{Φ(x)}^{-1} Φ(x_k) / (Φ(a(θ))^H R_{Φ(x)}^{-1} Φ(a(θ))).   (E–11)

Due to the high dimensionality of the feature space, equation (E–11) cannot be directly implemented in the feature space. It needs to be converted in terms of the kernel functions by the eigenvector decomposition procedure of kernel PCA [26]. The kernelized version of the beamformer output is given by

y_{kernel,k} = K_{a(θ)}^H K^{-1} K_{x_k} / (K_{a(θ)}^H K^{-1} K_{a(θ)}),   (E–12)

where

K_{a(θ)}^T = Φ(a(θ))^T X_Φ = [κ(a(θ), x_1), κ(a(θ), x_2), ..., κ(a(θ), x_N)],   (E–13)

K_{x_k}^T = Φ(x_k)^T X_Φ = [κ(x_k, x_1), κ(x_k, x_2), ..., κ(x_k, x_N)],   (E–14)

and K = X_Φ^H X_Φ is an N × N Gram matrix whose entries are the dot products κ(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩.

E.3 Nonlinear Beamformer Using Correntropy

The correntropy beamformer is formulated in the RKHS induced by correntropy, and the solution is obtained by solving the constrained optimization problem of minimizing the average correntropy output energy. We denote the transformed received data matrix and filter vector, whose sizes are M × N and M × 1, respectively, by

F_X = [f_{x_1}, f_{x_2}, · · · , f_{x_N}],   (E–15)

f_w = [f(w(1)), f(w(2)), · · · , f(w(M))]^H,   (E–16)


where

f_{x_k} = [f(x_k(1)), f(x_k(2)), · · · , f(x_k(M))]^H   (E–17)

for k = 1, 2, ..., N. Given data samples, the cross correntropy between the received signal at the kth snapshot and the filter can be estimated as

v_{o_k}[m] = (1/M) Σ_{n=1}^{M} f(w(n)) f(x_k(n − m)),   (E–18)

for all lags m = −M + 1, ..., M − 1.

The correntropy energy of the kth received signal output is given by

E_k = v_{o_k}^T v_{o_k} = f_w^H V_{x_k} f_w,   (E–19)

and the M × M correntropy matrix V_{x_k} is the Toeplitz matrix

V_{x_k} = [ v_k(0)     v_k(1)    · · ·   v_k(M−1)
            v_k(1)     v_k(0)    · · ·   v_k(M−2)
            ...        ...       . . .   ...
            v_k(M−1)   · · ·     v_k(1)  v_k(0)  ],   (E–20)

where each element of the matrix is computed, without explicit knowledge of the mapping function f, by

v_k(l) = Σ_{n=1}^{M} κ_σ(x_k(n) − x_k(n + l)),   (E–21)

for l = 0, ..., M − 1.

The average correntropy energy over all the received data can be written as

E_av = (1/N) Σ_{k=1}^{N} E_k = f_w^T V_X f_w,   (E–22)

where V_X = (1/N) Σ_{k=1}^{N} V_{x_k}. Since our objective is to minimize the average correntropy energy in the linear feature space, we can formulate the optimization problem as

min f_w^H V_X f_w,   subject to   f_w^H f_{a(θ)} = 1,   (E–23)


where f_{a(θ)}, of size M × 1, is the transformed steering vector. Then the solution in feature space becomes

f_w = V_X^{-1} f_{a(θ)} / (f_{a(θ)}^H V_X^{-1} f_{a(θ)}).   (E–24)

The output is then given by

y_{correntropy,k} = f_{a(θ)}^H V_X^{-1} f_{x_k} / (f_{a(θ)}^H V_X^{-1} f_{a(θ)}) = T_{ax} / T_a,   (E–25)

where

T_a = Σ_{i=1}^{M} Σ_{j=1}^{M} w_ij f(a(j)) f(a(i)) ≅ Σ_{i=1}^{M} Σ_{j=1}^{M} w_ij κ_σ(a(j) − a(i)),   (E–26)

T_{ax} = Σ_{i=1}^{M} Σ_{j=1}^{M} w_ij f(x_k(j)) f(a(i)) ≅ Σ_{i=1}^{M} Σ_{j=1}^{M} w_ij κ_σ(x_k(j) − a(i)),   (E–27)

where w_ij is the (i, j)th element of V_X^{-1}, x_k(i) is the ith element of the received signal at the kth snapshot, and a(i) is the ith element of the steering vector.

The final output expressions in (E–26) and (E–27) are obtained by approximating f(a(j))f(a(i)) and f(x_k(j))f(a(i)) by κ_σ(a(j) − a(i)) and κ_σ(x_k(j) − a(i)), respectively, which is similar to the kernel trick and holds on average because of property 5.
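The following is a minimal numerical sketch of (E–20)–(E–27) for one look direction; the names are ours, the sum in (E–21) is truncated at the valid lags (a choice the formulation leaves open), and real-valued snapshots are assumed for simplicity.

```python
import numpy as np

def gauss_kernel(u, v, sigma):
    """kappa_sigma over all pairs: K[i, j] = exp(-|v[j] - u[i]|^2 / (2 sigma^2))."""
    return np.exp(-np.abs(v[None, :] - u[:, None])**2 / (2 * sigma**2))

def correntropy_matrix(x, sigma):
    """V_{x_k} of (E-20)-(E-21): Toeplitz matrix built from the lag values v_k(l)."""
    M = len(x)
    v = np.array([np.sum(np.exp(-np.abs(x[:M - l] - x[l:])**2
                                / (2 * sigma**2))) for l in range(M)])
    idx = np.arange(M)
    return v[np.abs(idx[:, None] - idx[None, :])]

def correntropy_beamformer_output(X, a, sigma):
    """Sketch of (E-25)-(E-27): X is the M x N snapshot matrix, a the
    steering vector; returns y_k = T_ax / T_a for every snapshot."""
    M, N = X.shape
    V = sum(correntropy_matrix(X[:, k], sigma) for k in range(N)) / N
    W = np.linalg.inv(V)                   # w_ij = (i, j)th element of V_X^{-1}
    Ta = np.sum(W * gauss_kernel(a, a, sigma))              # (E-26)
    return np.array([np.sum(W * gauss_kernel(a, X[:, k], sigma)) / Ta
                     for k in range(N)])                    # (E-27) / (E-26)
```

Note that only the M × M matrix V_X is inverted, which is the computational advantage over the kernel method discussed in section E.5.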

E.4 Simulations

In this simulation, we present comparison results among the Capon, kernel, and correntropy beamformers in wireless communications with multiple receiving antennas. In all of the experiments, we assume a uniform linear array with M = 25 sensor elements and half-wavelength array spacing. Note that as the number of elements increases, the side lobes become smaller; also, as the total width of the array increases, the central beam becomes narrower. For the source scenario, we assume that narrow band signals arrive from the far field and that the target of interest is located at angle θ = 45°. We use BPSK (Binary Phase Shift Keying) signalling, which has unity power and is uncorrelated. In order to make the result independent of the input and noise, we perform Monte Carlo simulations with 100 different inputs and noise realizations.


In the first experiment, we investigate the effect of the number of snapshots (N) in a spatially white Gaussian noise case. Figure E-1 shows the beampatterns of the Capon, kernel, and correntropy beamformers with N = 100 and 1000 for the case in which the signal-to-noise ratio (SNR) is 10 dB. The Capon beamformer has poor performance, i.e., higher side lobes, for small N, while the kernel method and correntropy beamformer show a good beampattern even for a small N. It is well known that the standard Capon beamformer performs poorly with a small amount of training data. In Figure E-2, we show the BER performance with N = 100 and 1000 for SNRs between 5 and 15 dB. Figure E-2(a) shows that for N = 100 the Capon beamformer exhibits a high BER floor, but the proposed beamformer has a much better BER performance than the Capon and kernel beamformers. For N = 1000 in Figure E-2(b), when the SNR is under 9 dB the Capon beamformer shows better BER performance than the other two methods, but as the SNR increases, the BER of the correntropy beamformer becomes the best.

Next, we test the robustness of the Capon, kernel, and correntropy beamformers to impulsive noise with N = 1000. We select γ such that the SNR is 10 dB for α-stable noise with α = 2 and δ = 0 (Gaussian noise). Figure E-3 shows the BER performance of the three beamformers at different α levels. The correntropy beamformer displays superior performance for decreasing α, that is, for an increasing strength of impulsiveness. From this result, we can say that the proposed method is robust in terms of BER in impulsive noise environments for wireless communications.

Figures E-4(a) and (b) show the beampatterns of the three beamformers at α = 1.5 and α = 1.0, respectively. When α = 1.5, in Figure E-4(a), the beampattern of the Capon is similar to that of the kernel method, and the gain of its side lobe is higher than that of the correntropy beamformer by 2 dB. As α decreases, the gap in side lobe gain between the Capon and correntropy beamformers increases, as shown in Figure E-4(b).


One interesting result concerning the kernel method is that its BER performance is poor in both the Gaussian and impulsive noise cases, even though it produces a nice beampattern. The output values of the kernel method are far from the original transmitted symbols ±1, which causes the poor BER performance. The kernel method presented in this dissertation uses a constraint, but the solution of the optimization problem lives in an infinite-dimensional feature space, so additional regularization to keep the output close to the original signal may be needed. One important difference from conventional kernel methods, which normally induce an infinite-dimensional feature space, is that the RKHS induced by correntropy (which we call the VRKHS) has the same dimension as the input space. In the beamforming problem, the weight vector w has M degrees of freedom, and all the received data lie in M-dimensional Euclidean space. As derived above, all the transformed data belong to a different M-dimensional vector space equipped with the inner-product structure defined by correntropy. The goal of the proposed beamformer is to find a template f_w in this VRKHS such that the cost function is minimized subject to the constraint. Therefore, the number of degrees of freedom of this optimization problem is still M, so the regularization that would be needed in traditional kernel methods is not necessary here. Further work remains to be done on this point.

E.5 Conclusions

In this research, we have presented a correntropy-based nonlinear beamformer and compared it with the Capon beamformer, a widely used linear technique, and a kernel-based beamformer, one of the nonlinear alternatives. Simulation results for BPSK wireless communications show that the correntropy beamformer significantly outperforms the Capon and kernel beamformers in terms of side-lobe suppression in the beampattern and reduced bit error rate. The correntropy beamformer also has a clear advantage over the Capon beamformer in cases where only small data sets are available for training and where non-Gaussian noise is present. Compared to the kernel beamformer, the correntropy beamformer is computationally much simpler.


In terms of computational complexity, the kernel method must compute the inverse of an N × N Gram matrix, whereas the correntropy beamformer only requires the inverse of an M × M correntropy matrix, where M ≪ N (in this simulation, M = 25 and N = 1000, so the difference, roughly O(N³) versus O(M³) operations, is substantial). In addition, we hypothesize that in our methodology regularization is achieved automatically by the kernel through the expected-value operator (which corresponds to the density-matching step used to evaluate correntropy).


[Figure E-1 placeholder: Array Beampattern (dB) versus Degree (θ); panels (a) N = 100 and (b) N = 1000; curves: Capon, Kernel, Correntropy.]

Figure E-1. Comparisons of the beampattern for the three beamformers in Gaussian noise with 10 dB of SNR.


[Figure E-2 placeholder: BER (log scale) versus SNR (dB); panels (a) N = 100 and (b) N = 1000; curves: Capon, Kernel, Correntropy.]

Figure E-2. Comparisons of BER for the three beamformers in Gaussian noise with different SNRs.


[Figure E-3 placeholder: BER (log scale) versus the characteristic exponent α; curves: Capon, Kernel, Correntropy.]

Figure E-3. Comparisons of BER for the three beamformers at different characteristic exponent (α) levels.


[Figure E-4 placeholder: Array Beampattern (dB) versus Degree (θ); panels (a) α = 1.5 and (b) α = 1.0; curves: Capon, Kernel, Correntropy.]

Figure E-4. Comparisons of the beampattern for the three beamformers in non-Gaussian noise.



BIOGRAPHICAL SKETCH

Kyu-Hwa Jeong was born in June 1972 in Korea and received the M.S. degree in electronics engineering from Yonsei University, Seoul, Korea, in 1997, where he focused on adaptive filter theory and its applications to acoustic echo cancellation. From 1997 to 2003, he was a senior research engineer with the Digital Media Research Lab of LG Electronics, Seoul, Korea, where he was a member of the optical storage group and participated mainly in CD/DVD recorder projects. Since 2003, he has been pursuing the Ph.D. degree with the Computational NeuroEngineering Lab in electrical and computer engineering at the University of Florida, Gainesville, FL. His research interests are in the fields of signal processing, machine learning, and their applications to image pattern recognition.
