Page 1:

Regularized Double Nearest Neighbor Feature Extraction for Hyperspectral Image Classification

Hsiao-Yun Huang

Department of Statistics and Information Science, Fu-Jen University

Page 2:

Hyperspectral Image Introduction 1

(image credit: AFRL)

Page 3:

Hyperspectral Image Introduction 2

(image credit: AFRL)

Page 4:

Applications of Hyperspectral Images

- Military: military equipment detection.
- Commercial: mineral exploration, agriculture, and forest production.
- Ecology: chlorophyll, leaf water, cellulose, lignin.
- Agriculture: identifying the illness or type of plants.

Page 5:

Classification of Hyperspectral Image Pixels

How to distinguish different land cover types precisely and automatically in hyperspectral images is an interesting and important research problem.

Generally, each pixel in a hyperspectral image consists of hundreds or even thousands of bands. This makes the discrimination among pixels a high-dimensional classification problem.

Page 6:

High-Dimensional Data Analysis

“We can say with complete confidence that in the coming century, high-dimensional data analysis will be a very significant activity, and completely new methods of high-dimensional data analysis will be developed; …” (David L. Donoho, lecture “Math Challenges of the 21st Century” to the American Mathematical Society, August 8, 2000)

Page 7:

Blessing: The Power of Increasing Dimensionality

[Figure: class-conditional densities of two classes along each of the features x1, x2, and x3, together with the 3-D scatter views, illustrating how class separability can increase as dimensions are added.]

Page 8:

Curse: Hughes Phenomenon

[Figure: Hughes phenomenon. Mean recognition accuracy (0.50 to 0.75) versus measurement complexity n (total discrete values), with one curve per training sample size m = 2, 5, 10, 20, 50, 100, 200, 500, 1000; for finite m, accuracy peaks and then declines as n grows.]

Page 9:

The Curse of Dimensionality

In statistics, it refers to the situation in which the convergence of any estimator to the true value of a smooth function defined on a high-dimensional space is very slow; that is, an extremely large number of observations is needed. (Bellman, 1961) http://www.stat.ucla.edu/~sabatti/statarray/textr/node5.html

Page 10:

The Challenge

Unfortunately, in hyperspectral image classification, the p > N case is the usual situation, because access to training samples (ground-truth data) can be very difficult and expensive.

This large-dimension, few-samples problem can make the accuracy of hyperspectral image classification unsatisfactory.

Page 11:

Dimensionality Reduction

One common way to deal with the curse of dimensionality is to reduce the number of dimensions.

Two major reduction ideas:
- Feature Selection
- Feature Extraction

Page 12:

[Figure: schematic mapping of the measurements x_1, ..., x_p to features f_1, f_2 under each approach.]

Feature selection: select l out of the p measurements.

Feature extraction: map the p measurements to l new measurements.
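A minimal numpy sketch of this contrast, using toy data (the band indices and the matrix A are arbitrary assumptions for illustration, not part of any method here):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 pixels, p = 5 bands (toy data)
l = 2

# Feature selection: keep l of the original p measurements.
selected = X[:, [0, 3]]         # e.g. choose bands 0 and 3

# Feature extraction: map all p measurements to l new features.
A = rng.normal(size=(5, l))     # p-by-l transformation matrix
extracted = X @ A               # each new feature mixes every band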

Page 13:

Feature Extraction vs. Feature Selection

[Figure: histograms of the two classes along a selected original feature and along an extracted feature, together with the 2-D scatter plot showing the selection and extraction directions; the extracted direction separates the classes better.]

Page 14:

Basic Ideas of Feature Extraction

Feature extraction consists of choosing those features which are most effective for preserving class separability.

Class Separability depends not only on the class distributions but also on the classifier to be used.

We seek the minimum feature set with reference to the Bayes classifier; this will result in the minimum error for the given distributions. Therefore, the Bayes error is the optimum measure of feature effectiveness.

Page 15:

One Consideration

A major disadvantage of the Bayes error as a criterion is that an explicit mathematical expression is not available except for a very few special cases; therefore, we cannot expect a great deal of theoretical development.

Page 16:

Practical Alternatives

Two types of criteria that have explicit mathematical expressions and are frequently used in practice:

- Functions of scatter matrices (not related to the Bayes error): conceptually simple and give systematic algorithms.
- Bhattacharyya-distance-type criteria (give upper bounds on the Bayes error): only for two-class problems, and based on a normality assumption.

Page 17:

Discriminant Analysis Feature Extraction (DAFE or Fisher’s LDA)

The feature transformation matrix of DAFE is composed of the eigenvectors of (S_w^{DA})^{-1} S_b^{DA}, where

S_b^{DA} = \sum_{i=1}^{L} P_i (m_i - m_0)(m_i - m_0)^T = \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} P_i P_j (m_i - m_j)(m_i - m_j)^T   (S_b in pairwise structure, with m_0 = \sum_i P_i m_i the global mean),

S_w^{DA} = \sum_{i=1}^{L} P_i \Sigma_i   (within-class scatter),

and L is the number of classes.

Note: the number of extracted features is min{p, L-1}, where p is the dimension of the mean vector.
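A minimal numpy sketch of DAFE under the definitions above (a toy implementation; it assumes priors are estimated from class frequencies and uses the pairwise S_b):

import numpy as np

def dafe(X, y, n_features):
    """Toy DAFE: leading eigenvectors of inv(Sw) @ Sb with pairwise Sb."""
    classes = np.unique(y)
    L, p = len(classes), X.shape[1]
    P = np.array([np.mean(y == c) for c in classes])             # estimated priors
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    Sw = sum(P[i] * np.cov(X[y == classes[i]].T) for i in range(L))
    Sb = np.zeros((p, p))
    for i in range(L - 1):
        for j in range(i + 1, L):
            d = means[i] - means[j]
            Sb += P[i] * P[j] * np.outer(d, d)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)
    return vecs[:, order[:n_features]].real                      # p x n_features

# At most min(p, L - 1) eigenvalues are nonzero, matching the note above.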

Page 18:

DAFE vs. PCA

[Figure: projections of the same data found by PCA and by DAFE.]

Page 19:

Drawbacks of Fisher's LDA (1)

In some situations, J_{LDA} = tr((S_w^{DA})^{-1} S_b^{DA}) is not a good measure of class separability:

- Classes share the same mean: there is no scatter of M1 and M2 around M0, so the between-class scatter vanishes.
- Multimodal classes: more than L-1 features are needed.

[Figure: three cases: unimodal classes sharing the same mean; multimodal classes sharing the same mean; multimodal classes.]

Page 20:

Drawbacks of Fisher's LDA (2)

The unbiased estimate S (pooled covariance estimate) of the within-class scatter matrix is adopted in LDA. If it is singular, the performance will be poor.

When dim >> n, S loses its full rank as a growing number of eigenvalues become zero; S is then not positive definite and cannot be inverted.

[Figure: eigenvalue spectrum for dim/n = 10; solid line: true eigenvalues; dashed line: eigenvalues of Sw, many of which collapse to zero as the dimension grows.]

Page 21:

Feature Extraction Methods with Other Measures of Separability

- Nonparametric Discriminant Analysis (NDA; Fukunaga and Mantock, 1983)
- Nonparametric Weighted Feature Extraction (NWFE; Bor-Chen Kuo and Landgrebe, 2004)
- Regularized Double Nearest Proportion Feature Extraction (RDNP; Hsiao-Yun Huang and Bor-Chen Kuo, submitted)

Page 22:

The Idea of Nonparametric Discriminant Analysis (NDA; Fukunaga and Mantock, 1983)

[Figure: classes i and j with means Mi and Mj. Instead of separating the means, as LDA does, NDA tries to separate the boundary between the two classes.]

Page 23:

Nearest Neighbor Structure

[Figure: a point x_k^{(i)} of class i together with its k nearest neighbors in class i and its k nearest neighbors in class j.]

Page 24:

Pairwise Between-Class Scatter Matrix

[Figure: two class-i points. The point x_k^{(i)} near class j, with local mean M_j(x_k^{(i)}), receives a large weight; the point x_h^{(i)} deep inside class i, with local mean M_j(x_h^{(i)}), receives a small weight.]

w_l^{(i,j)} = \frac{\min\{ d(x_l^{(i)}, x_{kNN}^{(i)}), \; d(x_l^{(i)}, x_{kNN}^{(j)}) \}}{d(x_l^{(i)}, x_{kNN}^{(i)}) + d(x_l^{(i)}, x_{kNN}^{(j)})}

Page 25:

NDA

S_b^{NDA} = \sum_{i=1}^{L} \frac{P_i}{n_i} \sum_{j=1, j \ne i}^{L} \sum_{l=1}^{n_i} w_l^{(i,j)} \, (x_l^{(i)} - M_j(x_l^{(i)}))(x_l^{(i)} - M_j(x_l^{(i)}))^T,

where the weight is

w_l^{(i,j)} = \frac{\min\{ d^{\alpha}(x_l^{(i)}, x_{kNN}^{(i)}), \; d^{\alpha}(x_l^{(i)}, x_{kNN}^{(j)}) \}}{d^{\alpha}(x_l^{(i)}, x_{kNN}^{(i)}) + d^{\alpha}(x_l^{(i)}, x_{kNN}^{(j)})},

\alpha is a control parameter between zero and infinity, and d(x_l^{(i)}, x_{kNN}^{(j)}) is the distance from x_l^{(i)} to its kNN point in class j.

The feature transformation matrix of NDA is composed of the eigenvectors of (S_w^{DA})^{-1} S_b^{NDA}.
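A sketch of the class-i contribution to S_b^{NDA} for one ordered class pair, following these formulas (numpy; k and alpha are assumed tuning values, distances are Euclidean, and rows are assumed distinct):

import numpy as np

def knn_mean_and_dist(x, Xc, k):
    """Local kNN mean of x within sample Xc, and distance to its kth NN."""
    d = np.linalg.norm(Xc - x, axis=1)
    idx = np.argsort(d)[:k]
    return Xc[idx].mean(axis=0), d[idx[-1]]

def nda_Sb_pair(Xi, Xj, k=3, alpha=1.0):
    """Class-i contribution to S_b^{NDA} against class j."""
    p = Xi.shape[1]
    Sb = np.zeros((p, p))
    for x in Xi:
        others = Xi[~np.all(Xi == x, axis=1)]     # class i without x itself
        _,  di = knn_mean_and_dist(x, others, k)  # d(x, x_kNN^{(i)})
        Mj, dj = knn_mean_and_dist(x, Xj, k)      # local mean and d(x, x_kNN^{(j)})
        w = min(di**alpha, dj**alpha) / (di**alpha + dj**alpha)   # boundary weight
        e = x - Mj
        Sb += w * np.outer(e, e)
    return Sb / len(Xi)   # the full S_b sums P_i times this over ordered pairs (i, j)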

Page 26:

The Properties of NDA

The between-class scatter matrix S_b is usually of full rank, so the restriction that only min(#classes - 1, dim) features can be extracted is lifted.

Since the parametric S_b is replaced by the nonparametric S_b, which preserves important boundary information for classification, NDA is more robust.

Page 27:

Some Considerations about NDA When Overlap Occurs (1)

Based on NDA's definition of the boundary (the focused portion of the distribution), points whose distances to the two considered groups are similar are regarded as boundary points.

This definition of the boundary fails when overlap occurs, because the points around and within the overlap region all tend to receive the same weight.

Page 28:

The Boundary of NDA When Overlap Occurs

[Figure: two overlapping classes; the points x_l^{(i)} marked as boundary points (+) lie around and inside the overlap region, so the projection direction implied by them is ambiguous.]

Page 29:

Some Considerations about NDA When Overlap Occurs (2)

In NDA, kNN is adopted to measure the "local" between-class scatter, so the selected k is a very small integer, as is usual in kNN methods (all the experiments shown in Fukunaga's paper and book use either k = 1 or k = 3).

Such a small k can make a data point and its local mean very similar (close). Consequently, the entries of that point's contribution to S_b will be very close to zero, which cancels out the effect of the weight, or gives that contribution even less influence in the overall S_b.

Page 30:

Some Considerations about NDA When Overlap Occurs (3)

Also, in the S_b of NDA, only one data point represents one group, while the kNN mean represents the other local group.

As a result, S_b may not measure the scatter between "groups" very well, and it is easily influenced by outliers.

S_b^{NDA} = \sum_{i=1}^{L} \frac{P_i}{n_i} \sum_{j=1, j \ne i}^{L} \sum_{l=1}^{n_i} w_l^{(i,j)} \, (x_l^{(i)} - M_j(x_l^{(i)}))(x_l^{(i)} - M_j(x_l^{(i)}))^T

Page 31:

Another Consideration

In NDA, the boundary is estimated from the sample. Even when the sample distributions do not overlap, the estimated boundary can lie too close to the edge of a class, since k is small and only one point x_l^{(i)} represents one of the groups in S_b.

As with the hard-margin SVM, a boundary (support vectors) estimated too sharply from the sample can perform poorly because of overfitting.

Page 32:

The Singularity Problem

In NDA, the unbiased covariance estimate S is still adopted; thus, the singularity problem still exists in NDA.

Page 33:

Nonparametric Weighted Feature Extraction (NWFE)

[Figure: classes i and j; each point x_l^{(i)} has weighted local means M_i(x_l^{(i)}) and M_j(x_l^{(i)}); a point close to the other class (small x_l^{(i)} - M_j(x_l^{(i)})) receives a large weight, a distant point a light weight.]

\lambda_l^{(i,j)} = \frac{dist(x_l^{(i)}, M_j(x_l^{(i)}))^{-1}}{\sum_{k=1}^{n_i} dist(x_k^{(i)}, M_j(x_k^{(i)}))^{-1}}

Page 34:

Nonparametric Weighted Feature Extraction (NWFE; Kuo & Landgrebe, 2002, 2004)

S_b^{NW} = \sum_{i=1}^{L} P_i \sum_{j=1, j \ne i}^{L} \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,j)}}{n_i} (x_k^{(i)} - M_j(x_k^{(i)}))(x_k^{(i)} - M_j(x_k^{(i)}))^T,

S_w^{NW} = \sum_{i=1}^{L} P_i \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,i)}}{n_i} (x_k^{(i)} - M_i(x_k^{(i)}))(x_k^{(i)} - M_i(x_k^{(i)}))^T,

M_j(x_k^{(i)}) = \sum_{l=1}^{n_j} w_{kl}^{(i,j)} x_l^{(j)}, where n_j is the number of training samples of class j,

\lambda_k^{(i,j)} = \frac{dist(x_k^{(i)}, M_j(x_k^{(i)}))^{-1}}{\sum_{l=1}^{n_i} dist(x_l^{(i)}, M_j(x_l^{(i)}))^{-1}},

w_{kl}^{(i,j)} = \frac{dist(x_k^{(i)}, x_l^{(j)})^{-1}}{\sum_{l=1}^{n_j} dist(x_k^{(i)}, x_l^{(j)})^{-1}}.

The feature transformation matrix of NWFE is composed of the eigenvectors of [0.5 S_w^{NW} + 0.5 diag(S_w^{NW})]^{-1} S_b^{NW}.
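A numpy sketch of these NWFE quantities for one class pair (a toy version; it assumes all points are distinct so no pairwise distance is zero):

import numpy as np

def nwfe_local_means(Xi, Xj):
    """M_j(x_k^{(i)}): inverse-distance-weighted means of class j."""
    D = np.linalg.norm(Xi[:, None, :] - Xj[None, :, :], axis=2)  # pairwise distances
    W = 1.0 / D                                # w_{kl}^{(i,j)} before normalizing
    W /= W.sum(axis=1, keepdims=True)
    return W @ Xj

def nwfe_Sb_pair(Xi, Xj):
    """Class-i part of S_b^{NW} toward class j."""
    M = nwfe_local_means(Xi, Xj)
    lam = 1.0 / np.linalg.norm(Xi - M, axis=1)  # lambda_k^{(i,j)} before normalizing
    lam /= lam.sum()
    return sum(l * np.outer(e, e) for l, e in zip(lam, Xi - M))

# S_w^{NW} uses the same recipe with j = i, and NWFE then regularizes:
# Sw_reg = 0.5 * Sw + 0.5 * np.diag(np.diag(Sw))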

Page 35:

Double Nearest Proportion Structure

[Figure: for a point x_l^{(i)}, the self-class nearest proportion in class i (mean M_i^{(i)}) and the other-class nearest proportion in class j (mean M_j^{(i)}); the two proportion means serve as the weight reference.]

Page 36:

Robust Against the Overlap

[Figure: overlapping classes i and j; points x_l^{(i)} and x_t^{(i)} with their self-class means M_i(x_l^{(i)}), M_i(x_t^{(i)}) and their other-class nearest proportions ONP(x_l^{(i)}), ONP(x_t^{(i)}) with means M_j(x_l^{(i)}), M_j(x_t^{(i)}); the point nearer the overlap receives the larger weight, the other a smaller weight.]

Page 37:

The Improvement of the Estimation of Sw (1)

Regularized Discriminant Analysis (RDA; Friedman, 1989), an extension of LDA, also proposed an improved version of the S_w used in LDA. The generalized version of that estimate is

\hat{\Sigma}_{reg} = \lambda \hat{\Sigma} + (1 - \lambda) \hat{\sigma}^2 I, where \lambda is between 0 and 1.

The question is how to choose \lambda. (Friedman suggested using cross-validation.)
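A hedged sketch of one way to pick \lambda by validation (my illustration, not Friedman's exact procedure: it scores each candidate \lambda on a held-out set by Gaussian log-likelihood):

import numpy as np
from scipy.stats import multivariate_normal

def regularized_cov(S, lam):
    """Shrink the sample covariance toward a scaled identity target."""
    p = S.shape[0]
    sigma2 = np.trace(S) / p                     # average variance
    return lam * S + (1 - lam) * sigma2 * np.eye(p)

def choose_lambda(X_train, X_val, grid=np.linspace(0.05, 0.95, 19)):
    """Return the lambda whose regularized covariance best fits held-out data."""
    mu, S = X_train.mean(axis=0), np.cov(X_train.T)
    scores = [multivariate_normal(mu, regularized_cov(S, lam)).logpdf(X_val).sum()
              for lam in grid]
    return grid[int(np.argmax(scores))]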

Page 38:

The Improvement of the Estimation of Sw (2)

In NWFE, different ways to obtain the local means and weights of NDA were proposed. However, the most influential factor in the performance improvement is its proposed estimate of S_w:

S_w^{reg} = 0.5 \, S_w^{NW} + 0.5 \, diag(S_w^{NW})

Why 0.5?

Page 39:

The Shrinkage Estimation of Sw

Let Ψ denote the parameters of the unrestricted high-dimensional model, and Θ the matching parameters of a lower-dimensional restricted submodel. Also, let U be the estimate of Ψ and T the estimate of Θ. Then the shrinkage (regularized) estimate is

U* = \lambda T + (1 - \lambda) U, where \lambda is between 0 and 1.

\lambda can be determined analytically by the Ledoit and Wolf lemma (2003): once the target T is specified, \lambda can be calculated.
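A minimal sketch of analytic shrinkage toward a diagonal target, in the Schäfer-Strimmer form that also appears in the RDNP formulas below (\lambda* = \sum_{g \ne h} Var-hat(s_gh) / \sum_{g \ne h} s_gh^2; assumes rows of X are observations, n > 1):

import numpy as np

def shrink_to_diagonal(X):
    """U* = lam*T + (1-lam)*U with U the sample covariance, T = diag(U)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    U = (Xc.T @ Xc) / (n - 1)                         # unrestricted estimate U
    W = np.einsum('ng,nh->ngh', Xc, Xc)               # per-sample outer products
    var_U = W.var(axis=0, ddof=1) * n / (n - 1) ** 2  # Var-hat of each entry s_gh
    off = ~np.eye(p, dtype=bool)
    lam = min(1.0, max(0.0, var_U[off].sum() / (U[off] ** 2).sum()))
    return lam * np.diag(np.diag(U)) + (1 - lam) * U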

Page 40:

Some Targets

J. Schafer and K. Strimmer (2005) proposed six targets for the shrinkage estimate of the Sw.

Page 41:

RDNP Feature Extraction

The feature transformation matrix of RDNP is composed of the eigenvectors of (S_w^{RDNP})^{-1} S_b^{RDNP}, where

S_b^{RDNP} = \sum_{i=1}^{L} \frac{P_i}{N_i} \sum_{j=1, j \ne i}^{L} \sum_{l=1}^{N_i} d_l^{(i,j)} \, (M_i(x_l^{(i)}) - M_j(x_l^{(i)}))(M_i(x_l^{(i)}) - M_j(x_l^{(i)}))^T,

S_w^{RDNP} = \sum_{i=1}^{L} P_i \sum_{l=1}^{N_i} P_l^{(i)} \tilde{S}_l^{(i)}, with the shrinkage estimate \tilde{S}_l^{(i)} = \lambda_l^{(i)} T_l^{(i)} + (1 - \lambda_l^{(i)}) S_l^{(i)},

d_l^{(i,j)} = \frac{dist(M_i(x_l^{(i)}), M_j(x_l^{(i)}))^{-1}}{\sum_{t=1}^{N_i} dist(M_i(x_t^{(i)}), M_j(x_t^{(i)}))^{-1}},

\lambda_l^{(i)} = \frac{\sum_{g \ne h} \widehat{Var}(s_{gh}^{(i,l)})}{\sum_{g \ne h} (s_{gh}^{(i,l)})^2}.
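A numpy sketch of the double-nearest-proportion pieces for one class pair (the NP sizes are tuning parameters I assume here; the local shrinkage estimate \tilde{S}_l^{(i)} can reuse shrink_to_diagonal from the earlier sketch):

import numpy as np

def np_mean(x, Xc, NP):
    """Mean of the NP points in Xc nearest to x (a nearest proportion)."""
    idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:NP]
    return Xc[idx].mean(axis=0)

def rdnp_Sb_pair(Xi, Xj, NPi=5, NPj=5):
    """Class-i contribution to S_b^{RDNP} against class j."""
    Mi = np.array([np_mean(x, Xi, NPi) for x in Xi])  # self-class NP means
    Mj = np.array([np_mean(x, Xj, NPj) for x in Xi])  # other-class NP means
    d = 1.0 / np.linalg.norm(Mi - Mj, axis=1)         # closer means -> larger weight
    d /= d.sum()                                      # weights d_l^{(i,j)}
    return sum(w * np.outer(e, e) for w, e in zip(d, Mi - Mj))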

Page 42:

The Properties of RDNP (1)

RDNP is more likely to find the boundary when overlap occurs.

Because a proportion mean is used for each group, the between-group scatter is measured more properly: the entries of S_b are not so close to zero, the influence of outliers is reduced, and the estimated boundary is not too close to the edge.

Page 43:

The Properties of RDNP (2)

When NPi = Ni and NPj = Nj, it can easily be shown that the features extracted by RDNP are exactly the same as those extracted by Fisher's LDA. That is, LDA is a special case of RDNP.

Page 44:

Washington DC Mall Image

Page 45:

Indian Pine Site Image

Page 46:

Experiment Result 1 (Washington DC Mall, Classifier: 1NN, 6 features)

# of training samples   LDA      NDA      RDA      NWFE     RDNP
20                      0.5771   0.5825   0.8564   0.8851   0.9217
40                      0.8122   0.8160   0.8840   0.9231   0.9420
100                     0.8897   0.8979   0.9206   0.9347   0.9688

Page 47:

Experiment Result 2 (Washington DC Mall, Classifier: SVM, 6 features)

# of training samples   LDA      NDA      RDA      NWFE     RDNP
20                      0.5809   0.5990   0.8441   0.8933   0.9266
40                      0.8244   0.8067   0.8799   0.9243   0.9385
100                     0.8902   0.8922   0.9302   0.9330   0.9701

Page 48:

[Figure: classification maps for a portion of the DC data set: a color IR image, 1NN-NS (191 bands), RDA with 1NN, NWFE with 1NN, and RDNP with 1NN.]

Page 49:

Experiment Result 3 (Indian Pine Site, Classifier: 1NN, 8 features)

# of training samples   LDA      NDA      RDA      NWFE     RDNP
20                      0.5512   0.5825   0.7662   0.8012   0.8377
40                      0.5729   0.6060   0.7911   0.8331   0.8503
100                     0.6345   0.6495   0.8180   0.8452   0.8910

Page 50:

Experiment Result 4 (Indian Pine Site, Classifier: SVM, 8 features)

# of training samples   LDA      NDA      RDA      NWFE     RDNP
20                      0.5512   0.5825   0.7662   0.8012   0.8377
40                      0.5729   0.6060   0.7911   0.8331   0.8503
100                     0.6345   0.6495   0.8180   0.8452   0.8910

Page 51:
Page 52:

Other Applications

- Microarray Data Discrimination
- Quality Control
- EEG Signal Classification

Page 53:

The End

Thank you for listening.