
2012 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 9-11 October 2012



Action Classification in Polarimetric Infrared Imagery via Diffusion Maps

Wesam Sakla Sensors Directorate, Layered Sensing Exploitation Division

Air Force Research Laboratory WPAFB, OH, U.S.A.

Abstract—This work explores the application of a nonlinear dimensionality reduction technique known as diffusion maps for performing action classification in polarimetric infrared video sequences. The diffusion maps algorithm has been used successfully in a variety of applications involving the extraction of low-dimensional embeddings from high-dimensional data. Our dataset is composed of eight subjects each performing three basic actions: walking, walking while carrying an object in one hand, and running. The actions were captured with a polarized microgrid sensor operating in the longwave portion of the electromagnetic (EM) spectrum with a temporal resolution of 24 Hz, yielding the Stokes traditional intensity (S0) and linearly polarized (S1, S2) components of data. Our work includes the use of diffusion maps as an unsupervised dimensionality reduction step prior to action classification with three conventional classifiers: the linear perceptron algorithm, the k nearest neighbors (KNN) algorithm, and the kernel-based support vector machine (SVM). We present classification results using both the low-dimensional principal components via PCA and the low-dimensional diffusion map embedding coordinates of the data for each class. Results indicate that the diffusion map lower-dimensional embeddings provide a salient feature space for action classification, yielding an increase of overall classification accuracy by ~40% compared to PCA. Additionally, we examine the utility that the polarimetric sensor may provide by concurrently performing these analyses in the polarimetric feature spaces.

Keywords—dimensionality reduction; classification; polarimetric; infrared; manifold learning; diffusion maps

I. INTRODUCTION

Efficient dimensionality reduction techniques are necessary to relieve the computational burdens associated with processing and exploiting high-dimensional video. To motivate this, consider a dataset X = {x_1, x_2, …, x_N} where each sample x_i corresponds to a single frame of a video feed from an infrared sensor that generates high spatial resolution images of size 640 × 480. Each pixel in these images is a variable, so that x_i ∈ ℝ^307200, providing an extremely high-dimensional representation of the samples in the dataset as an unfortunate consequence of the sensor. In the context of the Air Force Layered Sensing [1] paradigm, the problem is further exacerbated when a suite of sensors S = {s_1, s_2, …, s_M} used for acquiring coincident multimodal data yields a collection of high-dimensional datasets Ω = {X_1, X_2, …, X_M}. This poses massive processing constraints on algorithms for video exploitation applications.

Dimensionality reduction algorithms can also strengthen the classification and discrimination performance of supervised machine learning algorithms that are often plagued by Hughes' phenomenon, or more generally, the curse of dimensionality. In essence, the curse of dimensionality refers to the notion that in any supervised pattern recognition application involving high-dimensional data, there is a particular number of features above which the classifier performance will not improve [2, 3]. Furthermore, global similarity measures such as the standard Euclidean distance are not robust and tend to perform poorly in such high-dimensional feature spaces. For example, consider the scenario of the canonical handwritten digit recognition problem where an algorithm is presented with two images of the same number at different rotation angles. The Euclidean distance between the two images would be quite large, falsely leading one to believe that there is little similarity between them. In reality, the two are identical, separated by one degree of freedom pertaining to the extent of the rotation [4, 5].

In the most basic sense, dimensionality reduction involves shrinking the number of variables in an N × D dataset consisting of N multidimensional samples belonging to ℝ^D, where D is the dimensionality of each sample. In the context of machine learning and statistical pattern recognition, dimensionality reduction techniques can be divided into two categories: feature selection and feature extraction [6]. Feature selection techniques choose an optimal subset of features according to an objective function; hence, the reduced set of features retains its physical meaning. Alternatively, feature extraction techniques apply linear or nonlinear transformations that map the original high-dimensional data onto a lower-dimensional subspace. Classical global linear techniques
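The selection/extraction distinction can be made concrete with a small NumPy sketch (the chosen column indices and the number of retained components are arbitrary illustrative choices, not values from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, D = 5 features

# Feature selection: keep a subset of the original columns, so the
# retained features keep their physical meaning.
selected = X[:, [0, 2]]

# Feature extraction: project onto the top-2 principal components (PCA);
# the new coordinates are linear mixtures with no direct physical meaning.
Xc = X - X.mean(axis=0)                # center before PCA
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
extracted = Xc @ Vt[:2].T              # coordinates in the PC basis

assert selected.shape == extracted.shape == (100, 2)
```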

Figure 1. Helix curve in ℝ³ (left). Embedding using the graph Laplacian manifold learning algorithm (right) [4].


include Principal Component Analysis (PCA) [7] and Multidimensional Scaling (MDS) [8]. However, for reducing complex high-dimensional datasets that exhibit local correlations, these methods often fail to provide a salient low-dimensional representation [9].

Hence, over the past decade, researchers have been investigating nonlinear dimensionality reduction methods, otherwise known as manifold learning. The assumption in manifold learning is that the relevant geometric structure of the points in a high-dimensional dataset is embedded in a low-dimensional subspace on a nonlinear manifold which captures the intrinsic dimensionality of the data. This is illustrated using the canonical helix dataset as shown in Figure 1 [4]. The intrinsic dimensionality d is the minimum number of variables, or degrees of freedom, needed to account for perceptually meaningful structure within the data, and often, d ≪ D [9, 10]. Intuitively, it answers the question, "how many dimensions are required for understanding the structure and variation that is being exhibited in the data?"

The diffusion maps technique, originally developed by Ronald Coifman and Stéphane Lafon [4], is one such manifold learning method that we believe can efficiently and effectively exploit high-dimensional infrared video sequences for action classification applications. This mathematical technique, based on spectral graph theory [11], represents the data in a multiscale fashion, according to the parameters of the underlying local geometry. The diffusion maps technique has been used successfully in several applications, including word-document clustering [12], clustering of the canonical iris data set [13], low-dimensional representation of images [14], shape matching [15], and radar data analysis [16].

Polarimetric imaging utilizes the fact that radiation emitted or reflected by man-made objects will have some polarized component [17]. Natural background such as vegetation typically does not have a polarized component in its reflected or emitted radiation, and this feature may be exploited to detect targets by suppressing the natural background. The literature is abundant with applications of polarimetric imagery for the exploitation of man-made targets, but the utility of polarimetry for discriminating between the actions of individuals remains to be seen. This work provides a step in that direction.

The aim of this effort is the exploitation of linearly polarized long-wave infrared video for automated dismount action classification. The actions that we focus on are walking, walking while carrying a bag, and running. Additionally, we will investigate the unsupervised dimensionality reduction capabilities provided by the diffusion maps manifold learning technique for capturing structure useful to robust action classification. Section II reviews the diffusion maps technique. Section III briefly describes linearly polarized imagery. Section IV outlines the experiments carried out for this work, including a description of the data, classifiers utilized, and results and analysis. Section V provides conclusions and future work.

II. DIFFUSION MAPS

The diffusion map approach is based on graph theory. A graph G = (V, E) is composed of a set of nodes V that are connected by a set of edges E. Assume we are given a dataset X = {x_1, x_2, …, x_N} where x_i ∈ ℝ^D and we wish to construct a diffusion map using the data points in X as the nodes. The edges reflect the strength of the connections between any two nodes x and z. Any pair of points x, z ∈ X are said to be adjacent nodes of the graph G. A weighted graph has an associated weight function, or kernel, k : X × X → ℝ. The weight function must satisfy the following conditions:

• symmetry: k(x, z) = k(z, x) for any x, z ∈ X.

• non-negativity: k(x, z) ≥ 0 for any x, z ∈ X.

The degree of a node x ∈ X, denoted by d_x, is defined by the following:

d_x = Σ_{z ∈ X} k(x, z).   (1)

Similar to construction of the normalized graph Laplacian in spectral graph theory [11], we then define a transition probability p(x, y) as given by the following:

p(x, y) = k(x, y) / d_x   for any x, y ∈ X.   (2)

Since p(x, y) ≥ 0 and Σ_{y ∈ X} p(x, y) = 1, the expression in (2) can be interpreted as a Markov random walk from point x to y. The transition probability allows the construction of a diffusion matrix P of transition probabilities, where P_ij = p(x_i, x_j) is the probability for a single step taken from point x_i to x_j. By taking powers of the diffusion matrix P, we can increase the number of steps taken between x_i and x_j. As shown in [5] for illustration, consider the 2 × 2 diffusion matrix of two points:

P = | p_11  p_12 |
    | p_21  p_22 |.   (3)

Each element p_ij is the probability of moving from point i to point j. Raising P to the second power yields the following:

P² = | p_11·p_11 + p_12·p_21   p_11·p_12 + p_12·p_22 |
     | p_21·p_11 + p_22·p_21   p_21·p_12 + p_22·p_22 |.   (4)

Hence, P² provides all the possible paths from point i to point j when making two steps. In general, P^t will sum all paths of length t between points i and j. Thus, the parameter t allows for a multi-scale analysis of the data. By increasing t, the local influence of each node on its nearest neighbors is incorporated, thus increasing the probability of following a path along the underlying geometric structure of the data [4, 9, 13].
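As an illustrative sketch of eqs. (1)-(4) (not the paper's implementation; the helper name and toy points are ours), the kernel, node degrees, and Markov matrix take only a few lines of NumPy:

```python
import numpy as np

def transition_matrix(X, eps):
    """Row-normalized Markov matrix P built from a Gaussian kernel."""
    # Pairwise squared Euclidean distances between all samples.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / eps)            # symmetric, non-negative kernel (eq. 5)
    d = K.sum(axis=1)                # node degrees (eq. 1)
    return K / d[:, None]            # p(x, y) = k(x, y) / d_x (eq. 2)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])  # toy dataset
P = transition_matrix(X, eps=1.0)
assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution
P2 = P @ P                               # two-step transition probabilities (eq. 4)
assert np.allclose(P2.sum(axis=1), 1.0)  # powers of P stay row-stochastic
```

Note that the nearby pair of points exchanges far more probability mass than the distant third point, which is exactly the local behavior the random walk exploits.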

A fundamental property of the diffusion map approach is its use of a local similarity measure for the weight function k. The


kernel k(x, z) reflects the similarity between x and z within a local neighborhood. As such, it should yield values that are proportional to the "closeness" of x and z within that neighborhood; outside the neighborhood, k(x, z) ≈ 0. For the diffusion map technique, the standard weight function for connecting points is given by the Gaussian kernel:

k(x, z) = exp(−‖x − z‖² / ε),   (5)

where ‖·‖ is the standard Euclidean distance and ε > 0 is a kernel scale parameter. Notice that the kernel includes the standard Euclidean distance between points x and z; however, it is valid only in an area in which we trust this measure to be accurate [5, 9]. An advantage of techniques such as diffusion maps that rely on local information is that their only requirement is the definition of a local notion of similarity between two points. This is a less burdensome task than manifold learning methods such as Isomap that require the definition of a global distance between all pairs of points [18]. By the same token, a challenge for the analysis provided by the diffusion map technique is that it significantly relies on the choice of ε in (5), since this indirectly determines the size of the neighborhood.

Researchers have shown that an eigendecomposition (i.e., the eigenvectors and eigenvalues) of P is useful in dimensionality reduction and clustering applications [13, 14, 19, 20]. Specifically, in the machine learning and manifold learning communities, the first few eigenvectors corresponding to the largest eigenvalues of P have been used for dimensionality reduction. In the context of high-dimensional sensor data that we aim to efficiently exploit, a dataset X = {x_1, x_2, …, x_N} will be used to create a weighted graph with kernel k. The goal involves representing each x_i ∈ ℝ^D as a point y_i ∈ ℝ^d, where d ≪ D, while preserving the intrinsic geometric information of the original dataset in the reduced dataset Y = {y_1, y_2, …, y_N}.

If the graph is connected, then as t → ∞ the random walk is governed by a unique stationary distribution φ_0, so that for all x and y,

lim_{t→∞} p_t(x, y) = φ_0(y).   (6)

The vector φ_0 is the top left eigenvector of P. For 1 ≤ t < ∞, the following eigendecomposition holds [10]:

p_t(x, y) = Σ_{i≥0} λ_i^t ψ_i(x) φ_i(y),   (7)

where {λ_i} is the sequence of eigenvalues of P in descending order (i.e., |λ_0| ≥ |λ_1| ≥ …) and {φ_i} and {ψ_i} are the corresponding biorthogonal left and right eigenvectors.

The diffusion distance [4, 9, 13] between two points x and y is given by the following:

D_t²(x, y) = Σ_{z ∈ X} (p_t(x, z) − p_t(y, z))² / φ_0(z).   (8)

From (8), it is clear that the diffusion distance will be small if the connectivity between both x and y and the other nodes of the graph is similar. Unlike Isomap, which approximates a single global geodesic distance between points using a shortest path algorithm, the diffusion distance is robust to noise as it sums over all possible paths of length t between points, and this has been empirically demonstrated [5, 9, 12]. It is computationally expensive to explicitly calculate the diffusion distances over all the nodes in a graph. As shown in detail in [9], there is a connection between the diffusion distance and the eigenvectors of P:

D_t²(x, y) = Σ_{i≥1} λ_i^{2t} (ψ_i(x) − ψ_i(y))².   (9)

The identity in (9) allows one to use the right eigenvectors of P to compute the diffusion distance. Moreover, because of the rapid decay of the eigenvalue spectrum, only a small fraction of the terms are needed to achieve a desired accuracy in the sum in (9), yielding powerful nonlinear dimensionality reduction capabilities [4, 9]. In what follows, let d denote the desired number of retained dimensions corresponding to the d dominant eigenvectors. We explicitly define the diffusion map Ψ_t, which maps data points into a Euclidean space according to the diffusion metric:

Ψ_t : x → (λ_1^t ψ_1(x), λ_2^t ψ_2(x), …, λ_d^t ψ_d(x)).   (10)

The diffusion map in (10) embeds the original high-dimensional data points belonging to ℝ^D into a lower-dimensional Euclidean space ℝ^d, where d ≪ D. The diffusion map technique generalizes to any type of data since it represents the dataset as a graph that captures the local connectivity between samples at multiple scales. Additionally, the reduced dimensionality is a function of the properties of the random walk on the data rather than the original dimensionality [9]. It is computationally attractive for processing high-dimensional datasets since the computational complexity scales with the number of samples N rather than the dimensionality D of the samples.
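The full pipeline, kernel through the embedding of eq. (10), can be sketched densely in NumPy. This is our illustrative reconstruction (the function name `diffusion_map` and the toy two-clump data are assumptions), not the authors' code, and it ignores the sparsity and scaling tricks a realistically large dataset would require:

```python
import numpy as np

def diffusion_map(X, eps=1.0, t=1, d=2):
    """Embed rows of X into R^d via diffusion map coordinates (eq. 10).
    Dense sketch: builds the full kernel and does a full eigendecomposition."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / eps)                    # Gaussian kernel (eq. 5)
    D = K.sum(axis=1)                        # degrees (eq. 1)
    P = K / D[:, None]                       # Markov matrix (eq. 2)
    # Conjugate to the symmetric matrix D^{-1/2} K D^{-1/2} so a symmetric
    # eigensolver applies; its eigenvectors recover the right eigenvectors of P.
    A = K / (np.sqrt(D)[:, None] * np.sqrt(D)[None, :])
    lam, V = np.linalg.eigh(A)               # eigenvalues in ascending order
    idx = np.argsort(-lam)                   # reorder to descending
    lam, V = lam[idx], V[:, idx]
    psi = V / np.sqrt(D)[:, None]            # right eigenvectors psi_i of P
    # Drop psi_0 (constant) and weight by lambda_i^t (eq. 10).
    return (lam[1:d + 1] ** t) * psi[:, 1:d + 1]

# Two well-separated clumps: the leading diffusion coordinate splits them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 3)), rng.normal(2, 0.1, (10, 3))])
Y = diffusion_map(X, eps=1.0, t=1, d=2)
s = np.sign(Y[:, 0])
assert abs(s[:10].sum()) == 10 and abs(s[10:].sum()) == 10 and s[0] != s[10]
```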

III. POLARIMETRIC IMAGERY

Radiating EM waves may exhibit orientation, depending upon the source or medium of propagation. This orientation is referred to as polarization and can be described in a number of equivalent ways. In 1852, Sir George Gabriel Stokes wrote a paper [21] in which he proposed a convenient means to describe incoherent polarized light in the form of a vector of parameters denoted by S. This idea has since gained wide popularity, with the components of these Stokes vectors interpreted as a measure of the preference of the observed energy toward a certain direction of polarization. For fully polarized energy, the Stokes vector is a set of four elements S = [S_0, S_1, S_2, S_3]ᵀ. In general, however, naturally occurring EM waves are only linearly polarized, so that S_3 ≈ 0. In terms of the measurable linear states of polarization, the Stokes vector for linearly polarized light is given by the following:

S = [S_0, S_1, S_2]ᵀ = [I_H + I_V,  I_H − I_V,  I_+45 − I_−45]ᵀ   (11)

where I_H, I_V, I_+45, and I_−45 indicate linear horizontal preference, linear vertical preference, positive 45° preference, and negative 45° preference, respectively.


A key quantity that is derived from the components of the Stokes vector is the degree of linear polarization (DoLP), given by the following:

DoLP = √(S_1² + S_2²) / S_0.   (12)

The DoLP quantifies the fraction of an EM wave that is linearly polarized. A completely unpolarized wave has a DoLP value of 0, while a perfectly linearly polarized wave has a DoLP value of 1.
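Equations (11) and (12) amount to a few arithmetic operations per pixel; a small illustrative helper (our naming, not from the paper) makes the two limiting cases concrete:

```python
import numpy as np

def stokes_dolp(i_h, i_v, i_p45, i_m45):
    """Linear Stokes components (eq. 11) and DoLP (eq. 12) from the four
    measured linear polarization intensities."""
    s0 = i_h + i_v            # total intensity
    s1 = i_h - i_v            # horizontal vs. vertical preference
    s2 = i_p45 - i_m45        # +45 deg vs. -45 deg preference
    dolp = np.sqrt(s1**2 + s2**2) / s0
    return s0, s1, s2, dolp

# Perfectly horizontally polarized light: all energy passes the H analyzer.
s0, s1, s2, dolp = stokes_dolp(1.0, 0.0, 0.5, 0.5)
assert dolp == 1.0            # fully linearly polarized
# Unpolarized light: equal energy in every analyzer orientation.
_, _, _, d2 = stokes_dolp(0.5, 0.5, 0.5, 0.5)
assert d2 == 0.0              # no linear polarization preference
```

Applied elementwise to the S_0, S_1, S_2 image channels, the same arithmetic yields the DoLP channel used later in the experiments.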

IV. EXPERIMENTS

The objectives of this work are twofold: (1) determine the utility provided by the polarimetric infrared modality in the context of automated dismount action classification, and (2) investigate the diffusion maps manifold learning technique as a means for providing a salient lower-dimensional feature space as compared to a traditional global linear technique such as PCA. A description of the datasets used for exploitation is provided in section A. Section B provides a brief description of the classifiers used for assessing performance, while section C provides the results of the experiments.

A. Data

The data utilized in this work is the output of a Fourier domain post-processing method applied to data acquired by a microgrid polarimetric sensor [22]. The resulting processed video sequence contains three channels corresponding to the linearly polarized Stokes components as shown in (11). Each frame has a spatial resolution of 631 × 471 pixels, and the video is acquired at approximately 24 frames per second. The polarimetric infrared sensor operates in the longwave portion of the EM spectrum where thermal emission is dominant [23].

The dataset of actions is comprised of the Stokes video components of eight individuals each performing the walking, carrying, and running activities along a single direction in the imagery, yielding a total of 24 unique video sequences. Scene information not pertinent to the area of action within the imagery was discarded by cropping all image frames to a spatial resolution of 70 × 275 pixels. Hence, each image frame is treated as a vector x_i residing in an ambient high-dimensional space of dimensionality D = 19250. Let the total number of image frames be denoted by N = 3760, where 1430, 1520, and 810 correspond to the walking, carrying, and running frames of all individuals, respectively. Additionally, the DoLP polarimetric feature (eq. 12) is computed from the Stokes video, so that the dataset contains four channels of data: S_0, S_1, S_2, and DoLP.

B. Classifiers

We have chosen to use the classification rate of three conventional classifiers as our metric of performance. The classifiers used are a combination of linear and nonlinear nonparametric algorithms. Doing so (1) helps ensure that performance is not biased by a particular classifier, and (2) allows us to examine the utility of the nonlinear classifiers as compared to a linear classifier in the context of both the PCA and diffusion map feature spaces.

The first classifier belongs to the family of linear discriminant functions (also known as the perceptron) [6]. The simplest way to distinguish between two classes in a D-dimensional Euclidean vector space is by constructing a hyperplane, a linear subspace of dimension D − 1. The parameters defining the hyperplane are a weight vector w ∈ ℝ^D and a bias b. The weight vector provides the orientation of the hyperplane, while b specifies the distance from the origin to the hyperplane along the direction specified by w. Given a set of labeled training data {(x_i, y_i)}, the task of the perceptron is to properly classify all samples by finding w and b such that y_i(w · x_i + b) > 0 for all i. The perceptron algorithm will never converge on data that is not linearly separable.
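A minimal sketch of the classic perceptron update rule described above (the helper name and toy data are ours; labels are taken in {−1, +1}):

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Classic perceptron updates; y in {-1, +1}. Converges only if the
    classes are linearly separable, as noted in the text."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:    # misclassified (or on the boundary)
                w += yi * xi               # rotate the hyperplane toward xi
                b += yi
                errors += 1
        if errors == 0:                    # all samples satisfy y(w.x + b) > 0
            break
    return w, b

# Two linearly separable point clouds in 2-D.
X = np.array([[0, 0], [0, 1], [3, 3], [4, 3]], dtype=float)
y = np.array([-1, -1, 1, 1])
w, b = perceptron_train(X, y)
assert all(yi * (xi @ w + b) > 0 for xi, yi in zip(X, y))
```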

The k nearest neighbors (KNN) algorithm is a nonparametric learning algorithm that does not assume that the data is linearly separable for acceptable performance [6]. It is nonparametric since it does not impose assumptions on the distribution (e.g., Gaussianity) of the data. The training phase of the KNN algorithm is fast since it simply amounts to storing the training data. For the testing stage of a traditional KNN, all training samples are used to determine the k points from the training set to which an unseen data point is "closest" in the Euclidean sense; the class label assigned to the unseen point is the label shared by the majority of its k nearest neighbors.
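The KNN decision rule can be sketched directly (illustrative helper and toy data, not the paper's implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k Euclidean-nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to every sample
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest
    return Counter(nearest.tolist()).most_common(1)[0][0]

X_train = np.array([[0.0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
assert knn_predict(X_train, y_train, np.array([0.05, 0.05]), k=3) == 0
assert knn_predict(X_train, y_train, np.array([5.05, 5.05]), k=3) == 1
```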

The support vector machine (SVM) is a robust statistical learning approach that solves the classification problem by seeking the optimal hyperplane that separates classes with the largest margin [24, 25]. A kernel formulation of the SVM is often implemented to further increase inter-class separation between samples by implicitly projecting them into a higher-dimensional feature space. The kernel SVM formulation uses the well-known kernel trick whereby dot products in the original SVM formulation are replaced by a suitable kernel function for providing nonlinear boundaries between classes. In our work, we use the SVM with a radial basis (Gaussian) kernel.

C. Results and Analysis

In the experiments that follow, we apply the classifiers mentioned in the previous section to evaluate classification performance for both the PCA and diffusion maps dimensionality reduction techniques. Moreover, the classifiers are applied to all four channels of the data independently. We have chosen to partition the entire dataset into training/testing samples by randomly allocating 15% of each action's samples for training (565 samples) and the remainder for testing (3195 samples). Oftentimes, an unfortunate split in the train/test partition will lead to unstable estimates of classification performance. To prevent this, we use a 10-fold random subsampling technique for reporting classification performances [ref].

We first apply the classifiers mentioned in the previous section to gain an understanding of the classification performance on the original high-dimensional action dataset. Table 1 summarizes these classification results.


Table 1. Classification performance in high-dimensional ambient space.

CLASSIFIER    S0        S1        S2        DOLP      AVG
PERCEPTRON    41.31%    33.43%    33.64%    32.24%    35.15%
KNN (K=3)     73.27%    95.19%    97.79%    99.04%    91.32%
KERNEL SVM    98.92%    99.71%    99.79%    99.97%    99.60%
AVG           71.16%    76.11%    77.07%    77.08%    75.36%

As the table shows, the nonlinear classifiers discriminate the actions quite well for the traditional and polarized infrared data, while the linear perceptron suffers across all channels. Also, it is of interest to note that for the KNN and SVM classifiers, the classification performance in the polarized channels is higher than in the S0 channel.

Using the results above as a benchmark, we now apply the dimensionality reduction techniques and assess classification performance. Given that the images of the video sequences satisfy x_i ∈ ℝ^19250, our goal is to explore the effect of dimensionality on classification performance in the reduced-dimensionality spaces. The PCA and diffusion maps algorithms were applied to the original action dataset and the first 10% of the embedding coordinates (1925) were retained, yielding the corresponding reduced datasets. For the diffusion maps technique, we set t = 1 and ε = 3.

The 3-D scatter plots of the S0 and DoLP channels of the data are shown for both methods in Figure 2 and Figure 3. In the scatter plots, the blue crosses are samples from the walking class, red asterisks are samples from the carrying class, and magenta circles are samples from the running class. The scatter plots reveal that PCA is ineffective at providing any tangible separation between samples of the action classes, regardless of channel. The diffusion maps algorithm, however, reveals a more evident structure with some inter-class separation; also, the structure varies considerably between the unpolarized and polarized channels.

We now provide plots showing classification performance as a function of dimensionality for both PCA and diffusion maps. Figure 4, Figure 5, and Figure 6 depict the rate of correct classification for the linear perceptron classifier, KNN classifier, and SVM classifier, respectively. Furthermore, in each figure, the performances are depicted according to the particular channel of data. For PCA (top image in each figure), the rates are plotted as a function of the 100 most significant principal components. For diffusion maps, the rates are plotted as a function of the 100 most significant embedding coordinates.

In Figure 4, it is evident that, regardless of channel, PCA fails to provide a linearly separable feature space for the perceptron, leading to classification rates that are at best chance for a 3-class classification problem. The diffusion map technique, however, does provide a linearly separable feature space, as the bottom plot of Figure 4 shows that both unpolarized and polarized channels can achieve near-perfect classification rates with ~40 of the most significant embedding coordinates.

The KNN classifier, as shown in the top plot of Figure 5, certainly performs better than the perceptron in the PCA space, with a maximum classification rate of ~81% in the S1 channel using ~10 principal components. However, once again, the diffusion maps technique provides a more salient feature space for KNN, with near-perfect classification rates using only ~20 of the most significant embedding coordinates. In the top plot of Figure 6, the kernel-based SVM, compared to the KNN classifier, performs better in the PCA space across all the channels, with a maximum classification rate of ~83% in the S1 channel using ~35 principal components. In the diffusion map feature space, the SVM achieves near-perfect classification rates using only ~10 of the most significant embedding coordinates across all channels.

Figure 3. 3-D scatter plots for action dataset reduced via diffusion maps for S0 (left) and DoLP (right).

Figure 2. 3-D scatter plots for action dataset reduced via PCA for S0 (left) and DoLP (right).


Table 2 - Table 4 are confusion matrices highlighting the differences in classification between PCA and diffusion maps in various channels for the perceptron, KNN, and SVM algorithms, respectively. In these tables, the top row of cells for each action corresponds to PCA, while the bottom row corresponds to diffusion maps. The overall classification results for both PCA and diffusion maps are summarized in Table 5. Each entry in the table shows the number of dimensions that maximizes classification performance and the corresponding rate of correct classification. For this dataset, Table 5 shows that the nonlinear diffusion maps algorithm is superior to a traditional linear technique such as PCA, yielding an improvement of ~40% in overall classification rate across all classifiers and channels.

Table 2. Confusion matrices for perceptron classification in the S0 channel for PCA (top row of each action) and diffusion maps (bottom row).

ACTION          WALK      CARRY     RUN
WALK   (PCA)    32.94%    42.52%    24.54%
       (DM)     98.96%     0.99%     0.05%
CARRY  (PCA)    34.64%    35.24%    30.12%
       (DM)      0.92%    98.41%     0.67%
RUN    (PCA)    15.90%    38.37%    45.73%
       (DM)      1.38%     4.87%    93.75%

Table 3. Confusion matrices for KNN classification in the S1 channel for PCA (top row of each action) and diffusion maps (bottom row).

ACTION          WALK      CARRY     RUN
WALK   (PCA)    88.20%     4.39%     7.41%
       (DM)      100%        0%        0%
CARRY  (PCA)     2.34%    87.31%    10.35%
       (DM)        0%     99.95%     0.05%
RUN    (PCA)     4.17%    27.91%    67.92%
       (DM)        0%      0.19%    99.81%

Table 4. Confusion matrices for SVM classification in the DoLP channel for PCA (top row of each action) and diffusion maps (bottom row).

ACTION          WALK      CARRY     RUN
WALK   (PCA)    93.39%     5.22%     1.39%
       (DM)     99.19%     0.81%       0%
CARRY  (PCA)     5.36%    85.85%     8.79%
       (DM)      0.15%    99.50%     0.35%
RUN    (PCA)     5.48%    57.09%    37.43%
       (DM)      0.31%     2.95%    96.74%

Figure 4. Perceptron classification rates as a function of dimensionality for PCA (top) and diffusion maps (bottom).


Table 5. Maximum classification rates, with the number of dimensions at which they occur in parentheses, as a function of channel and classifier for PCA (top row of each classifier) and diffusion maps (bottom row). The final column and final rows give the mean rates.

CLASSIFIER      S0             S1             S2             DOLP           MEAN
PERCEPTRON      38.07% (360)   36.87% (1)     34.43% (8)     35.66% (3)     36.25%
                97.10% (70)    98.58% (85)    98.51% (80)    99.02% (60)    98.30%
KNN (K=3)       59.38% (9)     81.14% (10)    56.07% (10)    71.84% (3)     67.10%
                99.74% (19)    99.92% (25)    99.94% (35)    99.95% (35)    99.88%
KERNEL SVM      75.75% (1025)  82.73% (35)    65.87% (40)    72.22% (3)     74.14%
                99.53% (9)     98.56% (10)    97.73% (15)    98.48% (12)    98.57%
MEAN            57.73%         66.91%         52.12%         59.90%         59.16%
                98.79%         99.02%         98.72%         99.15%         98.92%
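As a sanity check, the margins of Table 5 can be reproduced from its cells. The sketch below transcribes the table entries (small last-digit discrepancies against the printed margins are due to rounding) and confirms the roughly 40-percentage-point gap between PCA and diffusion maps.

```python
import numpy as np

# Maximum classification rates from Table 5 (rows: perceptron, KNN, kernel
# SVM; columns: S0, S1, S2, DoLP), for PCA and for diffusion maps (DM).
pca = np.array([[38.07, 36.87, 34.43, 35.66],
                [59.38, 81.14, 56.07, 71.84],
                [75.75, 82.73, 65.87, 72.22]])
dm = np.array([[97.10, 98.58, 98.51, 99.02],
               [99.74, 99.92, 99.94, 99.95],
               [99.53, 98.56, 97.73, 98.48]])

# Per-channel means reproduce the table's bottom rows (to rounding), and
# the difference of overall means is the reported improvement.
channel_means_pca = pca.mean(axis=0)
channel_means_dm = dm.mean(axis=0)
improvement = dm.mean() - pca.mean()
```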

V. CONCLUSIONS & FUTURE WORK

This work involves the exploitation of linearly polarized longwave infrared video sequences for automated action classification. We have shown that a linear unsupervised dimensionality reduction technique such as PCA is inadequate for this application, and we have applied a nonlinear manifold learning technique known as diffusion maps in its place. Results show that the diffusion maps algorithm yields two simultaneous benefits: (1) highly sparse nonlinear embedding coefficients, and (2) a salient feature space in which to perform classification using conventional linear classifiers such as the perceptron as well as nonlinear classifiers such as the KNN and kernel-based SVM algorithms. Furthermore, we have shown empirically that linearly polarized infrared data serves as a useful complementary modality for classification with conventional classifiers.

Future work in automatic dismount action classification will investigate the use of the diffusion maps technique for generating the lower-dimensional manifolds corresponding to each action of interest and applying out-of-sample extension methods for generalization to unseen data.
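One standard family of out-of-sample extension methods is the Nystrom extension, which embeds an unseen sample using only the kernel row against the training set and the precomputed eigenpairs, without redoing the eigendecomposition. The sketch below is a hedged illustration of that idea under assumed names; it is not necessarily the method the future work will adopt.

```python
import numpy as np

def nystrom_extend(x_new, X, psi, vals, eps):
    """Embed an unseen sample into an existing diffusion map via the
    Nystrom extension: psi_k(x) = (1 / lambda_k) * sum_j p(x, x_j) psi_k(x_j).

    X holds the training samples, psi the (n, k) matrix of retained
    diffusion map eigenvectors, vals their eigenvalues, and eps the
    Gaussian kernel bandwidth used to build the original embedding.
    """
    w = np.exp(-np.sum((X - x_new) ** 2, axis=1) / eps)  # kernel row vs. training set
    p = w / w.sum()                                      # transition probabilities
    return (p @ psi) / vals                              # extended eigenvector coordinates
```

Multiplying the result by vals**t recovers the diffusion coordinates at diffusion time t; by the eigenvector identity, extending a training sample reproduces its original embedding exactly.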

ACKNOWLEDGMENT

The author would like to acknowledge Dan LeMaster of the AFRL Sensors Directorate for fruitful discussions regarding the polarimetry phenomenology and providing the processed Stokes polarimetric data. The author would also like to acknowledge Olga Mendoza-Schrock of the AFRL Sensors Directorate, Juan Ramirez of Duke University, and Scott Clouse of North Carolina State University for fruitful discussions regarding diffusion maps. This paper is approved for public release via 88ABW-2102-5234.

Figure 5. KNN classification rates as a function of dimensionality for PCA (top) and diffusion maps (bottom).


REFERENCES

[1] M. Bryant, P. Johnson, B. M. Kent, M. Nowak, and S. Rogers, "Layered Sensing: Its Definition, Attributes, and Guiding Principles for AFRL Strategic Technology Development," available at http://www.wpafb.af.mil/shared/media/document/AFD-080820-005.pdf, 2008.

[2] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, pp. 55-63, 1968.

[3] D. W. Scott, Multivariate Density Estimation, Wiley, New York, 1992.

[4] S. Lafon, Diffusion Maps and Geometric Harmonics, Ph.D. dissertation, Yale University, 2004.

[5] J. de la Porte, B. M. Herbst, W. Hereman, and S. J. van der Walt, "An Introduction to Diffusion Maps," 2008.

[6] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, 1990.

[7] J. Shlens, "A Tutorial on Principal Component Analysis," available at http://www.snl.salk.edu/~shlens/pca.pdf, 2009.

[8] W. S. Torgerson, "Multidimensional Scaling I: Theory and Method," Psychometrika, vol. 17, pp. 401-419, 1952.

[9] R. R. Coifman, Y. Keller, and S. Lafon, "Data Fusion and Multicue Data Matching by Diffusion Maps," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1784-1797, 2006.

[10] L. van der Maaten, E. Postma, and J. van den Herik, "Dimensionality Reduction: A Comparative Review," technical report, 2007.

[11] F. Chung, Spectral Graph Theory, American Mathematical Society, 1997.

[12] S. Lafon and A. B. Lee, "Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1393-1403, 2006.

[13] B. Nadler, S. Lafon, R. R. Coifman, and I. G. Kevrekidis, "Diffusion Maps, Spectral Clustering and Reaction Coordinates of Dynamical Systems," Applied and Computational Harmonic Analysis, vol. 21, pp. 113-127, 2006.

[14] B. Bah, Diffusion Maps: Analysis and Applications, M.S. dissertation, University of Oxford, 2008.

[15] N. M. Rajpoot, M. Arif, and A. H. Bhalerao, "Unsupervised learning of shape manifolds," Proceedings of the British Machine Vision Conference, 2007.

[16] Y. S. Bhat and G. Arnold, "Diffusion maps and radar data analysis," Proceedings of SPIE, vol. 6568, 65680X, 2007.

[17] E. Collett, Field Guide to Polarization, SPIE Press, 2005.

[18] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.

[19] M. Belkin, Problems of Learning on Manifolds, Ph.D. dissertation, University of Chicago, 2003.

[20] M. Belkin and P. Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.

[21] G. G. Stokes, "On the composition and resolution of streams of polarized light from different sources," Transactions of the Cambridge Philosophical Society, vol. 9, pp. 399-416, 1852.

[22] D. A. LeMaster and S. C. Cain, "Multichannel blind deconvolution of polarimetric imagery," Journal of the Optical Society of America A, vol. 25, no. 9, pp. 2170-2176, 2008.

[23] J. L. Miller, Principles of Infrared Technology: A Practical Guide to the State of the Art, Van Nostrand Reinhold, 1994.

[24] V. N. Vapnik, Statistical Learning Theory, Wiley, 1998.

[25] B. Scholkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.

Figure 6. SVM classification rates as a function of dimensionality for PCA (top) and diffusion maps (bottom).