A Comparative Study between ICA and PCA
A dissertation
Submitted to the Department of Statistics, University of Rajshahi, Bangladesh, for
Partial Fulfillment of the Requirements for the Degree of Master of Science.
Examination Roll No. 08054718
Examination Year 2012
Registration No. 1550
Session 2007-2008
Department of Statistics, University of Rajshahi
Rajshahi-6205, Bangladesh
November 27, 2013
Abstract
This thesis studies ICA and compares it with PCA for the detection of inherent structure, cluster analysis and outlier detection in multivariate data analysis. It presents the basic theory and applications of ICA, together with recent work on the subject, and examines the data principles underlying independent component analysis. It then discusses the most popular algorithms used in ICA, especially the FastICA algorithm, which is efficient and fast.
It considers the problem of finding the latent structure of three types of datasets generated from linear mixtures of several independent super- and sub-gaussian distributions. The first dataset consists of 10 variables, each generated from a uniform (sub-gaussian) distribution; the second consists of 10 variables: 5 Laplace (super-gaussian), 3 binomial and 2 multinomial; and the third is a mixture of five independent distributions (uniform, Laplace, binomial, multinomial and normal). It is assumed that the observed data are generated by unknown latent variables and their interactions. The task is to find these latent variables and the way they interact, given only the observed data, using PCA and ICA. PCA cannot detect the source variables from the mixtures, whereas ICA is almost always successful in identifying the source variables.
This thesis also presents a clustering approach for multivariate datasets using the last two independent components after ordering the components by kurtosis. Conventionally, the first two principal components are used to visualize clusters in multivariate data. One simulated and three real datasets are used for the clustering approach: the Australian crabs, Fisher Iris and Italian olive oils datasets. ICA consistently performs better than PCA for clustering. Many researchers use the last two and the first two PCs to visualize outliers in multivariate datasets. Four real datasets are used for outlier detection: the Epilepsy, Stackloss, Education expenditure and Scottish hill racing datasets. For outlier detection, too, ICA is more fruitful than PCA.
We recommend using ICA in place of PCA for detecting clusters as well as outliers. Furthermore, we suggest that if the subject domain supports the assumption of independent non-gaussian source variables, ICA, not PCA, be used to identify the latent structure.
Acknowledgment
Completing a thesis in a new and very challenging subject is a journey along a long and winding road, on which one has to tame oneself more than the phenomena under study. Luckily, I was not alone on this trip. The following people were my companions on this journey, and I would like to say a big thanks, as this work might not have been possible without them.
Primarily, I would like to thank my supervisor, Prof. Dr. Mohammed Nasser, for his close supervision and very fruitful collaboration on, and beyond, my thesis.
Thanks to the Department of Statistics, University of Rajshahi, Bangladesh, for giving me a powerful personal computer in a beautiful lab. Special thanks to my American friend Mark Booth, who continuously encouraged and supported me in doing this thesis.
I would like to thank my honourable teachers Professor Dr. Md. Golam Hossain, Professor Dr. A. H. M. Rahmatullah Imon and Professor Dr. M. Rezaul Karim, Department of Statistics, University of Rajshahi. I would also like to thank the teachers and staff of the Department of Statistics, University of Rajshahi, for supplying important information for this study. I would like to thank my elder brothers Ahshanul Haque Apel, Data Entry Officer, ICDDR,B; Mizan Alam, Senior Statistical Programmer, Shafi Consultancy Ltd.; Faisal Ahmed, Research Statistician; and Al Mehedi Hassan, Assistant Professor, Rajshahi University of Engineering and Technology (RUET), for their continual encouragement.
I am also grateful to all of my friends and younger brothers for their inspiration. I cannot express the role of my parents and elder brother in words; without their encouragement and other support I could not have finished my study.
Last but not least, a very special acknowledgment goes to Google, without which this job would have been almost impossible.
Notations Used in the Thesis
k scalar constant
d dimensionality of x before dimensionality reduction
f scalar-valued function of a scalar variable
g scalar-valued function of a scalar variable
i index of xi
j index of sj or wj
k dimensionality of x after random projection
n number of latent components; dimensionality of s or y
p probability density function
s independent latent component
x component of observed vector x
y latent component
ε scalar constant
E expectation operator
P probability
p column vector of probabilities
s column vector of independent latent components
w column vector of a projection direction
x observed column vector
y column vector of latent components
x^T vector x transposed (applicable to any vector)
A mixing matrix; topic matrix
D matrix of eigenvalues
E matrix of eigenvectors
W unmixing matrix
Abbreviations Used in the Thesis
BSS blind signal separation
IC independent component
ICA independent component analysis
LDA linear discriminant analysis
ML maximum likelihood
MLP multilayer perceptron
MPCA multinomial principal component analysis
M.Sc. master of science
MSE mean squared error
NMF nonnegative matrix factorization
PC principal component
PCA principal component analysis
SOM self-organizing map
SOFM self-organizing feature map
SSE sum of squared errors
SVD singular value decomposition
Contents

1 Introduction
  1.1 Introduction
  1.2 Historical Background
  1.3 Motivation of the Study
  1.4 Objective of the Study
  1.5 Scope and Limitation of the Study
  1.6 Organization of the Subsequent Chapter
2 Methods and Materials
  2.1 Introduction
  2.2 Principal Component Analysis
    2.2.1 PCA by variance maximization
    2.2.2 PCA by minimum mean-square error compression
    2.2.3 PCA by singular value decomposition
  2.3 Independent Component Analysis
    2.3.1 Assumptions of ICA
    2.3.2 Ambiguities of ICA
    2.3.3 Gaussian variable is forbidden for ICA
    2.3.4 Key of ICA estimation
    2.3.5 Measure of non-Gaussianity
    2.3.6 ICA and Projection Pursuit
    2.3.7 Data preprocessing
    2.3.8 FastICA algorithm
    2.3.9 Infomax learning algorithm
  2.4 Data Description
  2.5 PCA vs ICA
  2.6 Computer Program Used in the Analysis
  2.7 Summary
3 Latent Structure Detection
  3.1 Introduction
  3.2 Experimental Setup
    3.2.1 Simulated data set-1
    3.2.2 Simulated data set-2
    3.2.3 Simulated data set-3
  3.3 Summary
4 Visualization of Clusters
  4.1 Introduction
  4.2 Experimental Setup
    4.2.1 Simulated dataset 1
    4.2.2 Australian crabs dataset
    4.2.3 Iris data set
    4.2.4 Italian olive oils data set
  4.3 Summary
5 Outlier Detection
  5.1 Introduction
  5.2 Experimental Setup
    5.2.1 Epilepsy dataset
    5.2.2 Education expenditure data
    5.2.3 Stackloss data
    5.2.4 Scottish hill racing data
  5.3 Summary and Outlook
6 Special Application of ICA
  6.1 Introduction
  6.2 ICA in Audio source separation
  6.3 ICA in Biomedical Application
    6.3.1 ICA of Electroencephalographic Data
    6.3.2 ICA of Functional Magnetic Resonance Imaging Analysis (fMRI)
7 Summary, Conclusions and Future Research
  7.1 Summary
    7.1.1 Conclusions
    7.1.2 Future Research
A Bibliography
List of Figures

2.1 (a) Cocktail party problem. (b) A linear superposition of the speakers is recorded at each microphone. This can be written as the mixing model equation x(t) = As(t), with speaker voices s(t) and activity x(t) at the microphones.
2.2 Independent component structure.
2.3 Joint distribution of two independent normal (Gaussian) sources.
2.4 Joint distribution of two independent uniform (sub-Gaussian) sources.
2.5 Joint distribution of two independent Laplace (super-Gaussian) sources.
2.6 (a) The joint distribution of the observed mixture of two uniform (sub-gaussian) variables. (b) The joint distribution of whitened mixtures of uniformly distributed independent components. (c) The joint distribution of the observed mixture of two Laplacian distributions. (d) The joint distribution of two whitened mixtures of Laplacian distributions.
2.7 Entropy as a function of probability; entropy is maximal at p = 0.5.
2.8 Mutual information between two variables X and Y.
2.9 An illustration of projection pursuit and the "interestingness" of nongaussian projections. The data in this figure are clearly divided into two clusters. However, the principal component, i.e. the direction of maximum variance, would be vertical, providing no separation between the clusters. In contrast, the strongly nongaussian projection pursuit direction is horizontal, providing optimal separation of the clusters.
2.10 Flowchart of the FastICA algorithm.
2.11 Flowchart of the infomax learning algorithm.
2.12 (a) Vectors IC1 and IC2 show the directions determined by the relative actions of the two component processes. The data lie independently along these two component vectors. Vectors PC1 and PC2 show the two perpendicular principal component directions indicating maximum variance in the data. (b) IC1 and IC2 can be determined indirectly by finding a linear transformation matrix W which results in a rectangular distribution. The sigmoid transformation g(Wx) makes the distribution more uniform, and the ICA algorithm of Bell and Sejnowski (1995) further adjusts IC1 and IC2 to maximize the entropy of the distribution.
3.1 Matrix plot of the original sources: 10 uniform (sub-gaussian) distributions.
3.2 Matrix plot of the observed mixture of 10 uniform (sub-gaussian) distributions.
3.3 Matrix plot of the 10 principal components.
3.4 Matrix plot of the 10 independent components.
3.5 Matrix plot of the original sources: 5 Laplace (super-gaussian), 3 binomial and 2 multinomial distributions.
3.6 Matrix plot of the observed mixture of 5 Laplace (super-gaussian), 3 binomial and 2 multinomial distributions.
3.7 Matrix plot of the principal components of 5 Laplace (super-gaussian), 3 binomial and 2 multinomial distributions.
3.8 Matrix plot of the independent components of 5 Laplace (super-gaussian), 3 binomial and 2 multinomial distributions after applying ICA.
3.9 Matrix plot of 5 original source variables from uniform (sub-gaussian), Laplace (super-gaussian), binomial, multinomial and normal distributions.
3.10 Matrix plot of the observed mixture of the 5 variables.
3.11 Matrix plot of all principal components.
3.12 Matrix plot of all independent components.
4.1 Density plots of various distributions and their kurtosis.
4.2 Scatter plot of two variables.
4.3 (Left) Scatter plot of the first principal component. (Right) Scatter plot of the last independent component.
4.4 (a) Matrix plot of the Australian crabs data set. (b) Matrix plot of all principal components of the Australian crabs data set.
4.5 (a) Scatter plot of the first two principal components. (b) Scatter plot of the last two independent components of the Australian crabs data.
4.6 (a) Matrix plot of the Fisher Iris data set. (b) Matrix plot of the principal components of the Iris data.
4.7 (a) Scatter plot of the first two principal components. (b) Scatter plot of the last two independent components of the Iris data.
4.8 (a) Matrix plot of the Italian olive oils data set. (b) Matrix plot of all principal components.
4.9 (a) Scatter plot of the first two principal components. (b) Scatter plot of the last two independent components of the olive oils data.
5.1 (a) Text plot of the two largest PCs. (b) Text plot of the two smallest PCs.
5.2 (a) Text plot of the two largest ICs. (b) Text plot of the smallest ICs.
5.3 (a) Text plot of the first two principal components. (b) Text plot of the last two principal components of the Education expenditure data set.
5.4 (a) Text plot of the first two independent components. (b) Text plot of the last two independent components of the Education expenditure data set.
5.5 (a) Text plot of the first two principal components. (b) Text plot of the last two independent components of the Stackloss data set.
5.6 (a) Text plot of the first two independent components. (b) Text plot of the first two independent components of the Stackloss data set.
5.7 (a) Scatter text plot of the first two principal components. (b) Scatter text plot of the first two independent components of the Stackloss data set.
5.8 (a) Scatter text plot of the first two principal components. (b) Scatter text plot of the first two independent components of the Stackloss data set.
6.1 Blind source separation of two speech signals. (Top row) Time courses of the two speech signals. (Middle row) The two observed mixtures. (Bottom row) The two signals after separation.
6.2 Scatter plot of two mixed audio signals.
6.3 Scatter plot of two principal components of the audio signals.
6.4 Scatter plot of two independent components of the audio signals.
6.5 (a) Graphical output of co-registration of EEG data, showing (upper panel) cortex (blue), inner skull (red) and scalp (black) meshes, electrode locations (green), MRI/Polhemus fiducials (cyan/magenta) and headshape (red dots). (b) List of all EEG channels for this dataset.
6.6 Flowchart of EEG data analysis using ICA.
6.7 A 5 s portion of the EEG time series with prominent alpha rhythms (8-21 Hz).
6.8 The 32 ICA components extracted from the EEG data in figure 6.7.
6.9 Scalp map projections of all 32 channels.
6.10 Properties of the second independent component.
6.11 Cortex (blue), inner skull (red), outer skull (orange) and scalp (pink) meshes with transverse slices of the subject's MRI.
6.12 Comparison of brain networks obtained using ICA independently on fMRI data.
Chapter 1
Introduction
1.1 Introduction
A fundamental problem in neural network research, as well as in many other disci-
plines, is finding a suitable representation of multivariate data, i.e. random vectors.
For reasons of computational and conceptual simplicity, the representation is often
sought as a linear transformation of the original data. In other words, each component
of the representation is a linear combination of the original variables. Well-known
linear transformation methods include principal component analysis, factor analysis,
and projection pursuit. Independent component analysis (ICA)[5, 25] is a recently
developed method in which the goal is to find a linear representation of nongaussian
data so that the components are statistically independent, or as independent as pos-
sible.
ICA has recently become an important tool for modelling and understanding empirical datasets, as it offers an elegant and practical methodology for blind source separation and deconvolution. It is seldom possible to observe a pure, unadulterated signal. When two or more signals are mixed together, ICA may be applied to recover them; this is the blind source separation (BSS) problem of the signal processing community.
Finding a natural coordinate system is an essential first step in the analysis of em-
pirical data. Principal Component Analysis (PCA) has, for many years, been used
to find a set of basis vectors which are determined by the data set itself. The prin-
cipal components are orthogonal and projections of the data onto them are linearly
decorrelated, properties which can be ensured by considering only the second order
statistical characteristics of the data. ICA aims at a loftier goal: it seeks a trans-
formation to coordinates in which the data are maximally statistically independent,
not merely decorrelated. The stronger condition allows one to remove the rotational
invariance of PCA, i.e. ICA provides a meaningful unique bilinear decomposition of
two-way data that can be considered as a linear mixture of a number of independent
source signals. The discipline of multilinear algebra offers some means to solve the
ICA problem.
Perhaps the most famous illustration of ICA is the cocktail party problem, in which a listener is faced with separating the independent voices chattering at a cocktail party. Humans perform this separation effortlessly, but it is a hard computational problem. Recently, ICA has received attention because of its potential applications in signal processing, such as speech recognition, telecommunication and medical signal processing; feature extraction, such as face recognition; clustering; time series analysis; modelling of the hippocampus and visual cortex; compression and redundancy reduction; watermarking; scientific data mining; etc.
1.2 Historical Background
The technique of ICA was first introduced by Christian Jutten and Jeanny Herault in "Space or Time Adaptive Signal Processing by Neural Network Models" (1986) [25]. They presented a recurrent neural network model and a learning algorithm, based on a version of the Hebb learning rule, that, they claimed, was able to blindly separate mixtures of independent signals. They demonstrated the separation of two mixed signals and also mentioned the possibility of unmixing stereoscopic visual signals from four mixtures. This approach was further developed by Jutten and Herault (1991), Karhunen and Joutsensalo (1994), Cichocki et al. (1994) and Comon (1994) [25, 47, 3, 73].
Unsupervised learning rules based on information theory for studying blind source separation were proposed in parallel by Linsker (1992) [95]. The goal was to maximize the mutual information between the input and output of a neural network. This approach is related to the principle of redundancy reduction suggested by Barlow (1961) as a coding strategy for neurons: each neuron should encode features that are as statistically independent as possible from other neurons over a natural ensemble of inputs, with decorrelation as a strategy for visual processing.
Independent component estimation using maximum likelihood was first introduced by Gaeta and Lacoume (1990) and elaborated by Pham et al. (1992) [96, 31]. The original infomax learning rule for blind source separation was first presented by Bell and Sejnowski (1995) [17]. Their algorithm is suitable for super-gaussian sources, but fails to separate sub-gaussian ones. An extension of the infomax learning algorithm of Bell and Sejnowski is presented in Lee et al. (1998b) [88] that can blindly separate mixed signals with sub- and super-gaussian source distributions. This is achieved by using a simple type of learning rule first derived by Girolami (1997b) [98] by choosing negentropy as a projection pursuit index. Parameterized probability distributions with sub- and super-gaussian regimes were used to derive a general learning rule that preserves the simple architecture proposed by Bell and Sejnowski (1995) [17], is optimized using the natural gradient of Amari (1998) [80], and uses the stability analysis of Cardoso and Laheld (1996) [40] to switch between the sub- and super-gaussian regimes.
Two useful properties in ICA estimation are the natural gradient and robustness against parameter mismatch. The natural gradient (Amari, 1998) [80], or equivalently the relative gradient (Cardoso and Laheld, 1996) [40], gives fast convergence.
Extensive simulations have been performed to demonstrate the power of these learning algorithms. However, instantaneous mixing and unmixing simulations are toy problems; the real challenges lie in dealing with real-world data. Makeig et al. (1996) [81] applied the original infomax algorithm to EEG and ERP data, showing that the algorithm can extract EEG activations and isolate artifacts. Jung et al. [101] showed that the infomax learning algorithm is able to linearly decompose EEG artifacts such as line noise, eye blinks and cardiac noise into independent components with sub- and super-gaussian distributions. McKeown et al. (1998b) [68] used the extended ICA algorithm to investigate task-related human brain activity in fMRI data.
The multichannel blind source separation problem has been addressed by Yellin and Weinstein (1994) [108], Nguyen-Thi and Jutten (1995) [109] and others, based on fourth-order cumulant criteria. An extension to time delays and convolved sources from the infomax viewpoint, using a feedback architecture, was developed by Torkkola (1996a) [110]. A full feedback system and a full feedforward system for the blind source separation problem were developed by Lee et al. (1997a) [85]. The feedforward architecture allows the inversion of non-minimum-phase systems. In addition, the rules have been extended using polynomial filter matrix algebra in the frequency domain (Lambert, 1996) [111]. The proposed method can successfully separate voice and music recorded in a real environment. Lee et al. (1997b) [86] showed that the recognition rate of an automatic speech recognition system increased after separating the speech signals.
Since ICA is restricted and relies on several assumptions, researchers have started to tackle some of its limitations. One obvious but non-trivial extension is the nonlinear mixing model. Nonlinear independent components have been extracted using Self-Organizing Feature Maps (SOFM) [112, 113]. Other researchers (Burel, 1992; Lee et al., 1997c; Taleb and Jutten, 1997; Yang et al., 1997; Hochreiter and Schmidhuber, 1998) [114, 87, 100] have used a more direct extension of the previously presented ICA models. They include certain flexible nonlinearities in the mixing model, and the goal is to invert the nonlinear mixing as well as the linear mixing matrix. Hochreiter and Schmidhuber (1998) [100] proposed low-complexity coding and decoding approaches for nonlinear ICA. Another limitation is the underdetermined problem in ICA, i.e. having fewer receivers than sources. Lee et al. (1998c) [88] demonstrated that an overcomplete representation of the data can be used to learn a non-square mixing matrix and to infer more sources than receivers. The overcomplete framework also allows additive noise in the ICA model and can therefore be used to separate noisy mixtures.
There is now a substantial amount of literature on ICA and BSS. Reviews of the different theories can be found in Cardoso and Comon (1996) [40], Cardoso (1997) [42], Lee et al. (1998a) [88] and Nadal and Parga (1997) [53]. Several neural network learning rules are reviewed and discussed by Karhunen (1996), Cichocki and Unbehauen (1996) and Karhunen et al. (1997a) [2, 49].
ICA is a fairly new method that is gradually being applied to several challenges in signal processing. It raises a diversity of theoretical questions and opens a variety of potential applications. Successful results in EEG, fMRI, speech recognition and face recognition systems indicate the power of, and the optimistic expectations for, the new paradigm.
1.3 Motivation of the Study
In multivariate statistics, cluster analysis, outlier detection and pattern recognition using PCA are very old techniques [38, 121, 122]. Very little work has been done in these areas using ICA [118, 116, 117, 119, 120], and comparative studies between ICA and PCA for clustering, outlier detection and shape study are hardly seen in the literature. This motivated us to study this field of ICA.
1.4 Objective of the Study
The main objectives of this study are:
- Study the algorithms of Independent Component Analysis (ICA).
- Apply ICA to pattern recognition, cluster analysis and outlier detection.
- Compare its performance with that of PCA.
1.5 Scope and Limitation of the Study
Although ICA is a recently developed technique, it has extensive applications in multivariate statistics. Since its invention ICA has been used for source separation, but it is now also used for cluster analysis and outlier detection. This thesis is submitted in partial fulfilment of the requirements for the degree of M.Sc., with only three months allotted after finishing the M.Sc. theory part. Despite the time and monetary constraints that are unavoidable in preparing an M.Sc. thesis, it is expected that the study will be helpful to those who want to apply ICA and PCA in the areas of pattern recognition, cluster analysis and outlier detection.
1.6 Organization of the Subsequent Chapter
This thesis is partitioned into the theory and applications of ICA, with some simulated and real data sets. The subsequent chapters are organized as follows.
Chapter 1 gives an introduction to ICA with its historical background. The motivation and objectives of the study are also discussed, and future challenges in ICA research are mentioned later in the chapter.
Chapter 2 states the methods and methodologies of PCA and ICA, with the basic PCA and ICA models. The most popular ICA algorithms are also discussed here. Descriptions of the data and software used in the analyses of this thesis are given later in the chapter.
Chapter 3 presents the detection of latent structure in simulated data sets using PCA and ICA.
Chapter 4 presents the visualization of clusters using ICA. Some simulated and real data sets are analyzed in this chapter.
Chapter 5 presents the outlier detection application of ICA. Extensive applications to real data sets are discussed in this chapter.
Chapter 6 presents some special applications of ICA, including blind source separation of audio signals and biomedical applications.
Chapter 7 gives conclusions by summarizing the main results of this thesis.
Chapter 2
Methods and Materials
2.1 Introduction
PCA [38] and ICA [15] both are projection pursuit technique in multivariate analy-
sis. ICA depends on higher order statistics whereas PCA depends on second order
statistics. In this chapter we discuss the mathematical formulation of PCA and ICA.
In addition most popular algorithm also discussed. Data description and computer
software that used in the analysis are discussed later in this chapter.
2.2 Principal Component Analysis
PCA constructs a set of uncorrelated variables, called principal components (PCs), from a set of correlated variables by an orthogonal transformation. The PCs are formed in such a way that the first PC has the largest variance and the last PC has the smallest, i.e. the principal components are ordered [38, 60].
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance under any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
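The construction above can be sketched in a few lines of numpy. This is an illustrative example only (the toy data and all names are our own): the PCs are obtained from the eigendecomposition of the sample covariance matrix, and their variances equal the eigenvalues, in decreasing order.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data: 200 samples of 3 variables
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.5, 1.0, 0.3],
                                          [0.0, 0.3, 0.5]])
Xc = X - X.mean(axis=0)                 # centre the data
C = np.cov(Xc, rowvar=False)            # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]       # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pcs = Xc @ eigvecs                      # the ordered principal components
# The PC variances equal the eigenvalues, and the PCs are uncorrelated
print(np.allclose(np.var(pcs, axis=0, ddof=1), eigvals))  # True
```

The same decomposition underlies the whitening and dimensionality-reduction steps discussed next.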
PCA, an observed vector x first centered by removing its mean (in practice, the
mean is estimated as the average value of the vector in a sample). Then the vector
is transformed by a linear transformation into a new vector, possibly of lower dimen-
sion, whose elements are uncorrelated with each other. The linear transformation is
found by computing the eigenvalue decomposition of the covariance matrix, which for
zero-mean vectors is the correlation matrix E{xxT} of the data. The eigenvectors
of form a new coordinate system in which the data are presented. The decorrelating
process is called whitening or sphering if also the variances of each element of the new
data vector are set to unity. This can be accomplished by scaling the vector elements
by the inverses of the eigenvalues of the correlation matrix. In all, the whitened data
have the form
z = D^{-1/2} E^T x    (2.1)
where, z is the whitened data vector, D is a diagonal matrix containing the eigen-
values of the correlation matrix and E contains the corresponding eigenvectors of
the correlation matrix as its columns. In practice, the expectation in the correlation
matrix is computed as the sample mean. Subsequent ICA estimation is done on z
instead of x. For whitened data it is enough to find an orthogonal demixing matrix
if the independent components are also assumed white.
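As a sanity check, the whitening transform of Eq. (2.1) can be sketched in a few lines of numpy. The data matrix and the mixing used below are invented purely for illustration; they are not the datasets analysed in this thesis.

```python
import numpy as np

# Sketch of the whitening transform z = D^{-1/2} E^T x of Eq. (2.1).
# The data matrix X (rows = observations) and the mixing are illustrative.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.7]])
X = rng.normal(size=(5000, 3)) @ A.T
X -= X.mean(axis=0)                       # centering

C = X.T @ X / len(X)                      # sample version of E{xx^T}
d, E = np.linalg.eigh(C)                  # eigenvalues (D) and eigenvectors (E)
Z = X @ E @ np.diag(d ** -0.5)            # row-wise z = D^{-1/2} E^T x

Cz = Z.T @ Z / len(Z)                     # covariance of z: the identity matrix
```

Because the whitening matrix is built from the same sample covariance it is applied to, the covariance of z equals the identity up to floating-point error.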
Dimensionality reduction is performed by PCA simply by choosing the number of
retained dimensions, m, and projecting the n-dimensional observed vector x to a
lower dimensional space spanned by the m(m < n) dominant eigenvectors (that is,
eigenvectors corresponding to the largest eigenvalues) of the correlation matrix. Now
the matrix E in Formula (2.1) has only m columns instead of n, and similarly D is
of size m×m instead of n× n, if whitening is desired.
There is no clear way to choose the number of retained dimensions in practice. In
theory, the rank of x^T x is equal to the rank of s^T s in the noiseless case, so it is enough
to count the non-zero eigenvalues of x^T x. The problem is discussed in,
e.g., [15]. One often chooses the number of largest eigenvalues so that the chosen
eigenvectors explain the data well enough, for example, 90 percent of the total vari-
ance in the data. As PCA pre-processing for ICA always involves the risk that the
true independent components are not in the space spanned by the dominant eigen-
vectors, it is often advisable to estimate fewer independent components than what is
the dimensionality of the data after PCA. Trial and error are often needed in deter-
mining both the number of eigenvectors and the number of independent components
estimated.
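The explained-variance heuristic mentioned above can be sketched as follows. The synthetic data and the 90% threshold are illustrative choices, not prescriptions from this thesis.

```python
import numpy as np

# Illustrative heuristic for choosing the number of retained dimensions m:
# keep the smallest m whose eigenvalues explain at least 90% of total variance.
rng = np.random.default_rng(1)
scales = np.array([5.0, 3.0, 2.0, 0.3, 0.2, 0.1])
X = rng.normal(size=(2000, 6)) * scales        # 6 variables, 3 dominant
X -= X.mean(axis=0)

eigvals = np.linalg.eigvalsh(X.T @ X / len(X))[::-1]   # descending order
explained = np.cumsum(eigvals) / eigvals.sum()
m = int(np.searchsorted(explained, 0.90) + 1)          # smallest m with >= 90%
```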
PCA is a convenient method for estimating the structure of the data, assuming that the distribution of the data is roughly symmetric and unimodal. PCA finds the orthogonal directions in which the data have maximal variance, and it is an optimal method of dimensionality reduction in the mean-square sense: the data points projected onto the lower-dimensional PCA subspace are as close as possible to the original high-dimensional points, i.e. the mean of

||x(t) − z(t)||^2    (2.2)

is minimized, where x(t) denotes the t-th original observation vector and z(t) its projection onto the PCA subspace.
2.2.1 PCA by variance maximization
In mathematical terms, consider a linear combination

y_1 = Σ_{k=1}^{n} w_{k1} x_k = w_1^T x    (2.3)

of the elements x_1, ..., x_n of the vector x. The w_{11}, ..., w_{n1} are scalar coefficients or weights, elements of an n-dimensional vector w_1, and w_1^T denotes the transpose of w_1.
The factor y_1 is called the first principal component of x if the variance of y_1 is maximally large. Because the variance depends on both the norm and the orientation of the weight vector w_1 and grows without limit as the norm grows, we impose the constraint that the norm of w_1 is constant, in practice equal to 1. Thus we look for a weight vector w_1 maximizing the PCA criterion

J_1^{PCA}(w_1) = E{y_1^2} = E{(w_1^T x)^2} = w_1^T E{xx^T} w_1 = w_1^T C_x w_1    (2.4)

subject to

||w_1|| = 1    (2.5)

Here E{·} is the expectation over the unknown density of the input vector x, and the norm of w_1 is the usual Euclidean norm

||w_1|| = (w_1^T w_1)^{1/2} = (Σ_{k=1}^{n} w_{k1}^2)^{1/2}
The matrix C_x in Eq. (2.4) is the n × n covariance matrix of x, which for zero-mean x equals the correlation matrix

C_x = E{xx^T}    (2.6)

It is well known from basic linear algebra that the solution to the PCA problem is given in terms of the unit-length eigenvectors e_1, ..., e_n of the matrix C_x [102]. The ordering of the eigenvectors is such that the corresponding eigenvalues d_1, ..., d_n satisfy d_1 ≥ d_2 ≥ ... ≥ d_n. The solution maximizing (2.4) is

w_1 = e_1

Thus the first principal component of x is y_1 = e_1^T x. The criterion J_1^{PCA} in Eq. (2.4)
can be generalized to m principal components, with m any number between 1 and n. Denoting the m-th (1 ≤ m ≤ n) principal component by y_m = w_m^T x, with w_m the corresponding unit-norm weight vector, the variance of y_m is now maximized under the constraint that y_m is uncorrelated with all the previously found principal components:

E{y_m y_k} = 0,  k < m    (2.7)
Note that the principal components y_m have zero mean, because

E{y_m} = w_m^T E{x} = 0

The condition (2.7) yields

E{y_m y_k} = E{(w_m^T x)(w_k^T x)} = w_m^T C_x w_k = 0    (2.8)

For the second principal component, this gives the condition

w_2^T C_x w_1 = d_1 w_2^T e_1 = 0    (2.9)

because w_1 = e_1. Thus we look for maximal variance E{y_2^2} = E{(w_2^T x)^2} in the subspace orthogonal to the first eigenvector of C_x. The solution is given by

w_2 = e_2

More generally, the k-th principal component is y_k = e_k^T x with w_k = e_k.
2.2.2 PCA by minimum mean-square error compression
In the preceding subsection, the principal components were defined as weighted sums
of the elements of x with maximal variance, under the constraints that the weights
are normalized and the principal components are uncorrelated with each other. It
turns out that this is strongly related to minimum mean-square error compression
of x, which is another way to pose the PCA problem. Let us search for a set of m orthonormal basis vectors, spanning an m-dimensional subspace, such that the mean-square error between x and its projection on the subspace is minimal. Denoting again the basis vectors by w_1, ..., w_m, for which we assume

w_i^T w_j = δ_ij

the projection of x on the subspace spanned by them is Σ_{i=1}^{m} (w_i^T x) w_i. The mean-square error (MSE) criterion, to be minimized over the orthonormal basis w_1, ..., w_m, becomes

J_MSE^{PCA} = E{ ||x − Σ_{i=1}^{m} (w_i^T x) w_i||^2 }    (2.10)
It is easy to show that, due to the orthonormality of the vectors w_i, this criterion can be further written as

J_MSE^{PCA} = E{||x||^2} − Σ_{j=1}^{m} E{(w_j^T x)^2}    (2.11)

= trace(C_x) − Σ_{j=1}^{m} w_j^T C_x w_j    (2.12)
It can be shown that the minimum of (2.12) under the orthonormality condition on
the wi is given by any orthonormal basis of the PCA subspace spanned by the m first
eigenvectors e1, ..., em [102]. However, the criterion does not specify the basis of this
subspace at all. Any orthonormal basis of the subspace will give the same optimal
compression. While this ambiguity can be seen as a disadvantage, it should be noted
that there may be some other criteria by which a certain basis in the PCA subspace
is to be preferred over others. Independent component analysis is a prime example
of methods in which PCA is a useful preprocessing step, but once the vector x has
been expressed in terms of the first m eigenvectors, a further rotation brings out the
much more useful independent components. It can also be shown [102] that the value
of the minimum mean-square error in (2.10) is

J_MSE^{PCA} = Σ_{i=m+1}^{n} d_i    (2.13)
the sum of the eigenvalues corresponding to the discarded eigenvectors em+1, ..., en.
If the orthonormality constraint is relaxed to

w_j^T w_k = α_k δ_jk    (2.14)

where the scalars α_k are positive and mutually different, then the mean-square error problem has a unique solution given by suitably scaled eigenvectors [103].
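Equation (2.13) can be verified numerically on synthetic data: the mean-square error of projecting onto the m dominant eigenvectors should equal the sum of the discarded eigenvalues. The data below are invented for illustration.

```python
import numpy as np

# Numerical check of Eq. (2.13): the minimum mean-square error of projecting
# onto the m dominant eigenvectors equals the sum of the discarded eigenvalues.
rng = np.random.default_rng(2)
X = rng.normal(size=(4000, 5)) @ rng.normal(size=(5, 5))
X -= X.mean(axis=0)

C = X.T @ X / len(X)
d, E = np.linalg.eigh(C)
d, E = d[::-1], E[:, ::-1]                # eigenpairs in descending order

m = 2
W = E[:, :m]                              # orthonormal basis of the PCA subspace
residual = X - X @ W @ W.T                # x minus its projection
mse = np.mean(np.sum(residual ** 2, axis=1))
```

Since the projection and the eigenvalues are computed from the same sample covariance, the identity holds exactly up to floating-point error.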
2.2.3 PCA by singular value decomposition
The singular value decomposition (SVD) can be viewed as the extension of the eigenvalue decomposition to nonsquare matrices. It shows that any real matrix can be diagonalized by using two orthogonal matrices. The eigenvalue decomposition, instead, works only on square matrices and uses a single matrix (and its inverse) to achieve diagonalization.
Theorem 2.1. Consider an m × n matrix X with singular value decomposition X = UΛV^T. The best approximation in Frobenius norm to X by a matrix of rank k ≤ min(m, n) is given by

X̂ = U diag(λ_1, ..., λ_k, 0, ..., 0) V^T,  ||X̂||^2 = Σ_{i=1}^{k} λ_i^2,  ||X − X̂||^2 = Σ_{i=k+1}^{min(m,n)} λ_i^2

This is also the best approximation by a projection onto a subspace of dimension at most k, namely the projection onto the space spanned by the first k columns of U, and it maximizes the Frobenius norm of a projection of X onto a subspace of dimension at most k.

Proof. Writing Λ_k = diag(λ_1, ..., λ_k, 0, ..., 0), we have

||X − X̂||^2 = tr[(UΛV^T − UΛ_k V^T)^T (UΛV^T − UΛ_k V^T)]
= tr[V(Λ − Λ_k)^T U^T U (Λ − Λ_k) V^T]
= tr[V^T V (Λ − Λ_k)^T (Λ − Λ_k)]
= tr[(Λ − Λ_k)^T (Λ − Λ_k)]
= Σ_{i=k+1}^{min(m,n)} λ_i^2
X̂ corresponds to a projection onto the space spanned by the first k columns of U, say U_k, since the projection gives

U_k (U_k^T U_k)^{-1} U_k^T X = U_k U_k^T U Λ V^T    [since U_k^T U_k = I_k]
= U_k Λ_k V^T
= U Λ_k V^T
Consider any approximation Y of rank at most k. This can be written as Y = AB, where A is m × k and B is k × n. Now consider the best approximation of the form AC for any k × n matrix C. Since the squared Frobenius norm is the sum of the squared lengths of the columns, this is solved by regressing each column of X in turn on A; the optimal choice is C = (A^T A)^{-1} A^T X, and

||X − Y||^2 ≥ ||X − AC||^2 = ||(I − P_A)X||^2 = ||X||^2 − ||P_A X||^2

where P_A = A(A^T A)^{-1} A^T is the projection matrix onto span(A). Now we choose P_A to maximize ||P_A X||^2:

||P_A X||^2 = ||P_A U Λ V^T||^2
= tr[(P_A U Λ V^T)(P_A U Λ V^T)^T]
= tr[P_A U Λ V^T V Λ^T U^T P_A^T]
= tr[(P_A U Λ)(P_A U Λ)^T]    [since V^T V = I]
= ||P_A U Λ||^2
= Σ_{j=1}^{min(m,n)} λ_j^2 ||P_A u_j||^2
= Σ_{j=1}^{min(m,n)} λ_j^2 p_j^2

with p_j ≤ 1 (P_A u_j is the projection of a unit-length vector) and Σ_j p_j^2 = ||P_A U||^2 = ||P_A||^2 = k. It is then obvious that the maximum is attained if and only if the first k
p_j's are one and the rest are zero, so

||X − Y||^2 ≥ ||X||^2 − ||P_A X||^2
≥ ||X||^2 − Σ_{i=1}^{k} λ_i^2
= Σ_{i=1}^{min(m,n)} λ_i^2 − Σ_{i=1}^{k} λ_i^2
= Σ_{i=k+1}^{min(m,n)} λ_i^2
= ||X − X̂||^2
Any projection of X into a subspace of k dimensions has rank at most k.
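Theorem 2.1 is easy to verify numerically on a random matrix: the squared Frobenius error of the rank-k SVD truncation equals the sum of the discarded squared singular values.

```python
import numpy as np

# Numerical check of Theorem 2.1: truncating the SVD after k singular values
# gives a rank-k approximation whose squared Frobenius error equals the sum
# of the discarded squared singular values.
rng = np.random.default_rng(3)
X = rng.normal(size=(8, 5))

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Xk = U[:, :k] @ np.diag(lam[:k]) @ Vt[:k, :]       # rank-k truncation

err = np.sum((X - Xk) ** 2)                        # squared Frobenius norm
```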
Theorem 2.2. Consider m n-variate observations forming a matrix X. Then the projection of Theorem 2.1: (a) minimizes the sum of squared distances from the points to their projections onto any subspace of dimension at most k, (b) maximizes the trace of the variance matrix of the variables projected onto any subspace of dimension at most k, and (c) maximizes the sum of squared inter-point distances of the projections onto any subspace of dimension at most k.
Proof. Without loss of generality we can centre the observations, so each variable has mean zero. Part (a) follows from the squared Frobenius norm of (X − P_A X) being the sum of the squared lengths of its rows.
For part (b), the squared Frobenius norm of P_A X is the sum of squares of the projected variables, that is, m − 1 times the sum of the variances of the variables, which is the trace of the variance matrix.
For (c), consider any projection P_A X. Let d_rs be the distance between observations r and s, and d̃_rs the corresponding distance after projection. Let y_r be the r-th projected observation, written as a row vector. Then

Σ_rs d̃_rs^2 = Σ_rs ||y_r − y_s||^2
= Σ_rs (||y_r||^2 + ||y_s||^2 − 2 y_r y_s^T)
= 2m Σ_r ||y_r||^2    [since Σ_rs y_r y_s^T = P_A (Σ_r x_r)(Σ_s x_s^T) P_A^T = 0 for centred data]
= 2m ||P_A X||^2

which is maximized according to Theorem 2.1.

We can use the SVD to perform PCA. We decompose X using the SVD,

X = UΛV^T    (2.15)

and find that we can write the covariance matrix as

C = (1/(n−1)) X^T X = (1/(n−1)) V Λ^2 V^T    (2.16)

where the columns of V are the eigenvectors of X^T X. The transformed data (the principal component scores) can thus be written as

Y = XV
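The SVD route to PCA can be checked numerically: the squared singular values should equal the eigenvalues of X^T X, and the scores XV should equal UΛ. The data below are synthetic and purely illustrative.

```python
import numpy as np

# PCA via the SVD X = U Λ V^T: the squared singular values are the eigenvalues
# of X^T X, and the principal component scores are Y = XV = UΛ.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X -= X.mean(axis=0)

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt.T                               # scores via V

d = np.linalg.eigvalsh(X.T @ X)[::-1]      # eigenvalues of X^T X, descending
```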
Theorem 2.3. The principal components are given, in order, by the columns of V. The first k principal components span a subspace with the properties of Theorem 2.2.

Proof. Consider a linear combination y = Xa with ||a|| = 1. Then

var(y) = a^T var(x) a = (1/(n−1)) a^T X^T X a = (1/(n−1)) a^T V Λ^2 V^T a = (1/(n−1)) Σ_i λ_i^2 a_i'^2

where a' = V^T a also has unit length (this corresponds to rotating to a new basis of the variables). It is clear that the maximum occurs when a' is the first coordinate vector, i.e. when a is the first column of V. Now consider the second principal component Xb. It must be uncorrelated with the first, so

0 = [Xa]^T [Xb] = [UΛa']^T [UΛb'] = λ_1^2 b_1'

and it is obvious that the maximum variance under this constraint is given by taking b' as the second coordinate vector. An inductive argument gives the remaining principal components. Using the principal component variables, XV = UΛ, so it is clear that the subspace spanned by the first k columns is the approximating subspace of Theorems 2.1 and 2.2.
Theorem 2.4. Consider an orthogonal change XB to k new variables. The first k principal components have maximal variance, both in the sense of the trace and of the determinant of the variance matrix. Similarly, the last k principal components have minimal variance.

Proof. Consider the SVD of XB, and let its singular values be µ_1, ..., µ_k. We will show µ_j ≤ λ_j, j = 1, ..., k, which suffices, as the trace of the variance matrix is proportional to the sum of the squared singular values, and the determinant is proportional to their product.

Consider a variable Xa which is a unit-length linear combination of the first j principal components of the B set, but is orthogonal to the first j − 1 original principal components. (A dimension argument shows that such a variable exists. Since B is orthogonal, it is also a unit-length combination of the original variables and of their principal components.) This has variance at least µ_j^2 and at most λ_j^2, so µ_j ≤ λ_j. The result on minimality is proved by showing µ_j ≥ λ_{p−k+j}, j = 1, ..., k, taking a unit-length linear combination of the last j original principal components orthogonal to the last j − 1 principal components of the B set.
2.3 Independent Component Analysis
ICA is a method for finding latent factors or components from multivariate (multidimensional) statistical data; the components are not only statistically independent but also come from non-gaussian distributions. ICA is a step forward from principal component analysis (PCA): the data are first standardized to be uncorrelated (PCA) and then rotated so that independent factors can be found. ICA is closely related to the cocktail-party problem, where the main objective is to blindly separate the source signals.
The ICA model is

x = f(a, s)    (2.17)

where x = (x_1, ..., x_m) is an observed vector and f is a general unknown function with parameters a that operates on the statistically independent latent variables listed in the vector s = (s_1, ..., s_n). A special case of (2.17) is obtained when the function is linear, and we can write

x = As    (2.18)

or componentwise

x_i = a_{i1} s_1 + a_{i2} s_2 + ... + a_{in} s_n

where the a_{ij}, i, j = 1, ..., n, are real mixing coefficients. Denoting the columns of A by a_i, the model can also be written as

x = Σ_{i=1}^{n} a_i s_i
Figure 2.1: (a) The cocktail-party problem. (b) A linear superposition of the speakers is recorded at each microphone. This can be written as the mixing model x(t) = As(t), with speaker signals s(t) and microphone recordings x(t).
Figure 2.2: Independent Component Structure.
Throughout this thesis, matrices are denoted by uppercase boldface letters, vectors by lowercase boldface letters, and scalars by lowercase letters. The entry (i, j) of a matrix is denoted A(i, j). Sometimes we write A_{m×n} to indicate that A is an m × n matrix. The entries of a vector are denoted by the same letter as the vector itself, as shown after Formula (2.1); for example, y_i is an element of y. All vectors are column vectors.
2.3.1 Assumptions of ICA
The independent components are assumed to be statistically independent.
This is the most fundamental assumption of ICA, and the reason ICA has become such a powerful technique in recent decades. Independence and uncorrelatedness are not the same thing: independence implies uncorrelatedness, but the converse need not hold. To see this, recall that two random variables X and Y are uncorrelated when their correlation coefficient is zero:

ρ(X, Y) = 0
where

ρ(X, Y) = cov(X, Y) / √(v(X) v(Y))

so being uncorrelated is the same as having zero covariance. Since

cov(X, Y) = E(XY) − E(X)E(Y)

having zero covariance, and so being uncorrelated, is the same as

E(XY) = E(X)E(Y)

Two random variables are independent when their joint probability density is the product of their marginal densities: for all x and y,

p_{X,Y}(x, y) = p_X(x) p_Y(y)
If X and Y are independent, then they are also uncorrelated. To see this, write the expectation of the product:

E[XY] = ∫∫ xy p_{X,Y}(x, y) dx dy = ∫∫ xy p_X(x) p_Y(y) dx dy = ∫ x p_X(x) dx ∫ y p_Y(y) dy = E[X]E[Y]

However, if X and Y are uncorrelated, they can still be dependent. To see an extreme example of this, let X be uniformly distributed on the interval [−1, 1] and let Y = |X|: if X ≤ 0, then Y = −X, while if X is positive, then Y = X. Then E[XY] = 0 = E[X]E[Y], so the variables are uncorrelated, yet Y is completely determined by X.
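The uncorrelated-but-dependent phenomenon can be checked numerically. As an illustration, take X uniform on [−1, 1] and Y = |X|; the sampled covariance is near zero, while a higher-order moment identity that independence would require clearly fails.

```python
import numpy as np

# Uncorrelated but dependent: X uniform on [-1, 1] and Y = |X|.
# E[XY] = 0 by symmetry, yet Y is a deterministic function of X.
rng = np.random.default_rng(5)
X = rng.uniform(-1.0, 1.0, size=200_000)
Y = np.abs(X)

cov = np.mean(X * Y) - X.mean() * Y.mean()     # ~0: uncorrelated
# dependence shows up in higher moments: E[X^2 Y] != E[X^2] E[Y]
lhs = np.mean(X ** 2 * Y)                      # = E|X|^3 = 1/4
rhs = np.mean(X ** 2) * np.mean(Y)             # = (1/3)(1/2) = 1/6
```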
At most one of the independent components is gaussian.
ICA relies on higher-order cumulants, and these higher-order cumulants are zero for the gaussian distribution. Thus, ICA is essentially impossible if the observed variables have gaussian distributions. If some of the components are gaussian and others are non-gaussian, we can estimate all the non-gaussian components, but the gaussian components cannot be separated from each other.
We assume that the unknown mixing matrix is square.
This assumption states that the number of independent components equals the number of observed mixtures. The assumption is made only for simplicity; much current research addresses the case where the mixing matrix is not square. In blind source separation (BSS), if there are fewer receivers than sources, the problem is referred to as underdetermined [37] or overcomplete ICA, and it is more difficult to solve.
2.3.2 Ambiguities of ICA
We cannot determine the variances (energies) of the independent components.
With both s and A unknown, any scalar multiplier in one of the sources s_i can always be cancelled by dividing the corresponding column a_i of A by the same scalar: for any scalars k_i,

x = Σ_i (1/k_i · a_i)(s_i k_i)
We cannot determine the order of the independent components.
Because both s and A are unknown, the order of the terms can be freely changed, so any of the ICs can be called the first one. Formally, a permutation matrix P and its inverse can be substituted in the model to give x = (AP^{-1})(Ps). The elements of Ps are the original independent variables s_j, but in another order, and AP^{-1} is just a new unknown mixing matrix, to be solved by the ICA algorithms.
2.3.3 Gaussian variable is forbidden for ICA
Let us consider two gaussian ICs s_1 and s_2 with joint density

f(s_1, s_2) = (1/2π) exp(−(s_1^2 + s_2^2)/2) = (1/2π) exp(−||s||^2/2)    (2.19)

Now assume that the mixing matrix A is orthogonal; for example, we could assume that this is so because the data have been whitened. Using the classic formula for transforming pdfs, and noting that for an orthogonal matrix A^{-1} = A^T holds, the joint density of the mixtures x_1 and x_2 is

p(x_1, x_2) = (1/2π) exp(−||A^T x||^2/2) |det A^T|    (2.20)

Since A is orthogonal, ||A^T x||^2 = ||x||^2 and |det A| = 1. Thus we have

p(x_1, x_2) = (1/2π) exp(−||x||^2/2)    (2.21)
Figure 2.3: Joint distribution of two independent normal (gaussian) sources.
From the equations above, the density of the mixtures is identical to the density of the sources, so there is no way to infer the mixing matrix from the mixtures. If we try to estimate the ICA model and more than one of the components are gaussian, we can estimate all the non-gaussian components, but the gaussian components cannot be separated from each other: they are entangled and form a single gaussian component. With only one gaussian component, however, the ICA model can still be estimated, because the single gaussian component has no other gaussian components to mix with.
2.3.4 Key of ICA estimation
Non-gaussianity is the heart of ICA estimation; without it, estimation is not possible at all. In most classical statistical theory, random variables are assumed to have gaussian distributions, which is the main obstruction for ICA and probably a main reason for the late resurgence of ICA research. According to the central limit theorem (CLT), the distribution of a sum (or average, or linear combination) of n independent components tends to a gaussian distribution as n → ∞, under certain conditions (for the Cauchy distribution the central limit theorem does not hold).
According to the ICA model (Eq. 2.18), the mixture data vector is a linear combination of independent sources. Thus a linear combination y = Σ_i b_i x_i of the observed variables x_i (which in turn are linear combinations of the independent components s_j) will be maximally non-gaussian if it equals one of the independent components s_j. This is seen by a contrapositive argument: if y does not equal one of the s_j but is a mixture of two or more of them, then, in the spirit of the central limit theorem, y is more gaussian than each of the s_j. Thus the task is to find vectors w_j such that the distribution of y_j = w_j^T x is as far from gaussian as possible.
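This CLT intuition can be illustrated numerically: a normalized mixture of independent non-gaussian sources has smaller excess kurtosis in magnitude (i.e., is closer to gaussian) than the sources themselves. The sources below are synthetic, for illustration only.

```python
import numpy as np

# CLT intuition behind ICA: a normalized mixture of independent non-gaussian
# sources is closer to gaussian than the sources, as measured by the magnitude
# of the excess kurtosis (which is 0 for a gaussian).
rng = np.random.default_rng(6)
n = 500_000
s1 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)   # unit-variance uniform
s2 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)
y = (s1 + s2) / np.sqrt(2.0)                       # unit-variance mixture

def kurt(x):
    x = (x - x.mean()) / x.std()
    return np.mean(x ** 4) - 3.0

k_source, k_mix = kurt(s1), kurt(y)    # about -1.2 and -0.6
```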
Figure 2.4: Joint distribution of two independent sources with uniform (sub-gaussian) distribution.

Figure 2.5: Joint distribution of two independent sources with Laplace (super-gaussian) distribution.
2.3.5 Measure of non-Gaussianity
Non-gaussianity is the heart of ICA estimation. There are several measures of non-gaussianity; some of them are discussed below.

Kurtosis

One of the simplest methods for measuring non-gaussianity is kurtosis, which measures the degree of peakedness and tailedness of a distribution (DeCarlo, 1997). Typically, non-gaussianity is measured by the absolute value of the kurtosis; the square of the kurtosis can also be used. Since each column of s is a vector of latent variables, the classical measure of univariate (excess) kurtosis is often used:

β_2(s) = E[s − E(s)]^4 / [var(s)]^2 − 3 = μ_4/σ^4 − 3

which for a zero-mean, unit-variance variable reduces to E(s^4) − 3[E(s^2)]^2.
where E(·) is the expectation operator, σ is the standard deviation, and μ_4 is the fourth moment about the mean. For a gaussian random variable the fourth moment is E(s^4) = 3[E(s^2)]^2, so the gaussian distribution has zero (excess) kurtosis. We can distinguish three cases:

β_2 = 0  gaussian
β_2 > 0  super-gaussian
β_2 < 0  sub-gaussian

If x_1 and x_2 are two independent random variables, then k(x_1 + x_2) = k(x_1) + k(x_2) and k(αx) = α^4 k(x), where α is a constant and k(·) is the kurtosis operator.
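The two kurtosis properties just stated can be checked empirically. The check uses the unnormalized fourth cumulant k(x) = E(x^4) − 3[E(x^2)]^2 of a centred variable, for which additivity over independent variables holds; the sources are synthetic.

```python
import numpy as np

# Empirical check of the kurtosis properties, using the (unnormalized) fourth
# cumulant k(x) = E(x^4) - 3[E(x^2)]^2 of the centred variable x.
rng = np.random.default_rng(7)
n = 1_000_000
x1 = rng.laplace(size=n)                  # super-gaussian
x2 = rng.uniform(-1.0, 1.0, size=n)       # sub-gaussian, independent of x1

def kurt(x):
    x = x - x.mean()
    return np.mean(x ** 4) - 3.0 * np.mean(x ** 2) ** 2

add_gap = kurt(x1 + x2) - (kurt(x1) + kurt(x2))    # additivity: ~0
scale_ratio = kurt(2.0 * x1) / kurt(x1)            # scaling: 2^4 = 16
```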
Figure 2.6: (a) The joint distribution of the observed mixtures of two uniform (sub-gaussian) variables. (b) The joint distribution of whitened mixtures of uniformly distributed independent components. (c) The joint distribution of the observed mixtures of two Laplace variables. (d) The joint distribution of whitened mixtures of Laplace variables.
Negentropy
Entropy in information theory measures the unpredictability of an information content; in general, information reduces uncertainty. Let us consider a discrete random variable y that takes values from a finite set y_1, ..., y_n with probabilities p_1, ..., p_n. We look for a measure of how much choice is involved in the selection of the event, or how uncertain we are of the outcome. Shannon argued that such a measure H(p_1, ..., p_n) should obey the following properties:

1. H should be continuous in the p_i.
2. If all p_i are equal, then H should be monotonically increasing in n.
3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

The entropy H(y) of a discrete random variable y is defined by

H(y) = −Σ_{i=1}^{n} p_i log p_i    (2.22)
Figure 2.7: Entropy of a binary variable as a function of the probability p; the entropy is maximal at p = 0.5.
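For a binary (Bernoulli) variable, Eq. (2.22) reduces to H(p) = −p log p − (1 − p) log(1 − p), which can be evaluated directly to confirm the peak at p = 0.5:

```python
import numpy as np

# Entropy of a binary (Bernoulli) variable from Eq. (2.22), in bits:
# H(p) = -p log2 p - (1 - p) log2(1 - p); it peaks at p = 0.5.
def entropy(p):
    p = np.asarray(p, dtype=float)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

ps = np.linspace(0.01, 0.99, 99)
H = entropy(ps)
p_max = ps[np.argmax(H)]       # 0.5
```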
A fundamental result of information theory is that a gaussian variable has the largest entropy among all random variables of equal variance [84, 20]. This means that entropy can be used for measuring non-gaussianity: a high entropy can be associated with a high degree of gaussianity. Negentropy is a measure derived from entropy that is designed to be always non-negative and to equal zero exactly when the distribution is gaussian. It is defined in terms of entropy as

J(y) = H(y_gauss) − H(y)    (2.23)

where y_gauss is a gaussian random variable with the same covariance matrix as y. As J(y) is always greater than zero unless y is gaussian, it is a good measure of non-gaussianity.
This result generalizes from random variables to random vectors such as y = [y_1, ..., y_m]^T: we want to find a matrix W so that y = Wx has maximum negentropy J(y) = H(y_G) − H(y), i.e., so that y is maximally non-gaussian. However, the exact J(y) is difficult to obtain, as its calculation requires the specific probability density function p(y).
The negentropy can be approximated by

J(y) ≈ (1/12) E{y^3}^2 + (1/48) kurt(y)^2    (2.24)

However, this approximation suffers from the non-robustness of the kurtosis function. A better approximation is

J(y) ≈ Σ_{i=1}^{p} k_i [E{G_i(y)} − E{G_i(g)}]^2    (2.25)

where the k_i are positive constants, y is assumed to have zero mean and unit variance, and g is a gaussian variable also with zero mean and unit variance. The G_i are non-quadratic functions such as
G_1(y) = (1/a) log cosh(ay),  G_2(y) = −exp(−y^2/2)

where 1 ≤ a ≤ 2 is a suitable constant. Although this approximation may not be accurate, it is always greater than zero except when y is gaussian.
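A one-term sketch of the approximation (2.25), with G(y) = log cosh(y) (G_1 for a = 1) and the constant k_1 set to 1, shows the intended behaviour: near zero for a gaussian sample, clearly positive for a Laplace sample. Only the relative sizes matter in this illustration.

```python
import numpy as np

# One-term sketch of the negentropy approximation (2.25), with G = log cosh
# (G_1 for a = 1) and k_1 = 1; only the relative ordering matters here.
rng = np.random.default_rng(8)
n = 400_000
g = rng.normal(size=n)                         # reference gaussian sample

def neg_approx(y):
    y = (y - y.mean()) / y.std()               # zero mean, unit variance
    return (np.mean(np.log(np.cosh(y))) - np.mean(np.log(np.cosh(g)))) ** 2

J_gauss = neg_approx(rng.normal(size=n))       # near zero
J_laplace = neg_approx(rng.laplace(size=n))    # clearly positive
```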
Minimization of Mutual Information
Mutual information is a non-parametric measure of the dependence between two variables; Shannon's information theory provides a suitable formalism for quantifying these concepts. The mutual information I(X, Y) of two random variables X and Y can be defined as

I(X, Y) = ∫∫ p(X, Y) log [ p(X, Y) / (p(X)p(Y)) ] dX dY    (2.26)

where p(X) and p(Y) are the probability density functions of X and Y, and p(X, Y) is their joint probability density function. Mutual information measures the extent to which observing one variable reduces the uncertainty about the other. It is minimized (equal to zero) when X and Y are independent, so that one variable provides no knowledge about the other; in this case p(X, Y) = p(X)p(Y).
Figure 2.8: Mutual information between two variables X and Y.
The mutual information I(x, y) of two random variables x and y can equivalently be written in terms of entropies:

I(x, y) = H(x) + H(y) − H(x, y) = H(x) − H(x|y) = H(y) − H(y|x)    (2.27)

Obviously, when x and y are independent, H(y|x) = H(y) and H(x|y) = H(x), so their mutual information I(x, y) is zero.

Similarly, the mutual information I(y_1, ..., y_n) of a set of n variables y_i (i = 1, ..., n) is defined as

I(y_1, ..., y_n) = Σ_{i=1}^{n} H(y_i) − H(y_1, ..., y_n)    (2.28)

If the random vector y = [y_1, ..., y_n]^T is a linear transform of another random vector x = [x_1, ..., x_n]^T,

y_i = Σ_{j=1}^{n} w_{ij} x_j,  or  y = Wx    (2.29)

then the entropy of y is related to that of x by

H(y_1, ..., y_n) = H(x_1, ..., x_n) + E{log J(x_1, ..., x_n)} = H(x_1, ..., x_n) + log |det W|

where J(x_1, ..., x_n) is the Jacobian of the above transformation:

J(x_1, ..., x_n) = |det(∂y/∂x)| = |det W|    (2.30)

The mutual information above can then be written as

I(y_1, ..., y_n) = Σ_{i=1}^{n} H(y_i) − H(y_1, ..., y_n) = Σ_{i=1}^{n} H(y_i) − H(x_1, ..., x_n) − log |det W|

We further assume the y_i to be uncorrelated and of unit variance, i.e., the covariance matrix of y is

E{yy^T} = W E{xx^T} W^T = I    (2.31)

and its determinant is

det I = 1 = (det W)(det E{xx^T})(det W^T)    (2.32)
This means that det W is a constant (the same for any admissible W). Also, as the term H(x_1, ..., x_n) in the mutual information expression is a constant (invariant with respect to W), we have

I(y_1, ..., y_n) = Σ_{i=1}^{n} H(y_i) + constant    (2.33)

i.e., minimization of the mutual information I(y_1, ..., y_n) is achieved by minimizing the marginal entropies

H(y_i) = −∫ p_i(y_i) log p_i(y_i) dy_i = −E{log p_i(y_i)}    (2.34)

As the gaussian density has maximal entropy, minimizing entropy is equivalent to minimizing gaussianity. Moreover, since all the y_i have the same unit variance, their negentropy becomes

J(y_i) = H(y_G) − H(y_i) = C − H(y_i)    (2.35)

where C = H(y_G) is the entropy of a unit-variance gaussian, the same for all y_i. Substituting H(y_i) = C − J(y_i) into the expression for the mutual information, and noting that the other two terms H(x) and log |det W| are both constant (the same for any W), we get

I(y_1, ..., y_n) = Const − Σ_{i=1}^{n} J(y_i)    (2.36)

where Const is a constant (collecting the terms C, H(x) and log |det W|) which is the same for any linear transform matrix W. This is the fundamental relation between the mutual information and the negentropies of the variables y_i: if the mutual information of a set of variables decreases (indicating that the variables are less dependent), then the total negentropy increases, and the y_i become less gaussian. We want to find a linear transform matrix W that minimizes the mutual information I(y_1, ..., y_n) or, equivalently, maximizes the negentropy (under the assumption that the y_i are uncorrelated).
2.3.6 ICA and Projection Pursuit
Projection pursuit [44, 45, 72, 104, 105, 106] is a technique developed in statistics for finding interesting projections of multidimensional data. These projections can then be used for optimal visualization of the data, and for such purposes as density estimation and regression. It has been argued by [72] and others [104] in the field of projection pursuit that the gaussian distribution is the least interesting one, and that the most interesting projections are those with the least gaussian distribution. This is almost exactly what is done during independent component estimation, so ICA can be considered a variant of projection pursuit. The difference is that projection pursuit extracts one maximally non-gaussian projected signal at a time, whereas independent component analysis extracts n signals from n signal mixtures simultaneously.
Figure 2.9: An illustration of projection pursuit and the "interestingness" of non-gaussian projections. The data in this figure are clearly divided into two clusters. However, the principal component, i.e. the direction of maximum variance, would be vertical, providing no separation between the clusters. In contrast, the strongly non-gaussian projection pursuit direction is horizontal, providing optimal separation of the clusters.
Specifically, projection pursuit allows us to tackle the situation where there are fewer independent components s_i than original variables x_i. However, it should be noted that the formulation of projection pursuit makes no data model or assumption about independent components. In the ICA model, optimizing the non-gaussianity measures produces independent components; if the model does not hold, projection pursuit directions are produced instead.
2.3.7 Data preprocessing
Centering
Typical ICA algorithms use centering, whitening and dimensionality reduction as preprocessing steps, in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. Let x′ denote the observed data. After centering we have

x = x′ − E(x′)
Whitening
Whitening is a slightly stronger property than uncorrelatedness. A zero-mean random vector is white when its components are uncorrelated and their variances equal unity; in other words, its covariance matrix equals the identity matrix. Whitening is a linear transformation of the observed data vector,

z = Vx

such that z is white; it is sometimes called sphering. Many methods are available for whitening. A popular one uses the eigenvalue decomposition (EVD)

E(xx^T) = EDE^T
Here E is an orthogonal matrix of eigenvectors of E(xx^T) and D is the diagonal matrix of eigenvalues. Taking

V = ED^{-1/2}E^T

gives

E(zz^T) = V E(xx^T) V^T = ED^{-1/2}E^T EDE^T ED^{-1/2}E^T = ED^{-1/2}DD^{-1/2}E^T = I
Whitening transforms the mixing matrix A into a new one, Ã. From the above we have
z = Vx = VAs = Ãs
One could hope that whitening solves the ICA problem, since whiteness or uncorrelatedness
is related to independence. This is, however, not so. Uncorrelatedness is
weaker than independence, and is not in itself sufficient for estimation of the ICA
model: whitening gives the ICs only up to an orthogonal transformation, which is not
sufficient in most applications. Note that the new mixing matrix Ã is orthogonal:
E(zz^T) = ÃE(ss^T)Ã^T = ÃÃ^T = I
2.3.8 FastICA algorithm
The FastICA algorithm was developed at the Laboratory of Computer and Information
Science at the Helsinki University of Technology by Hugo Gävert, Jarmo Hurri,
Jaakko Särelä, and Aapo Hyvärinen. The FastICA algorithm is a highly computationally
efficient method for performing the estimation of ICA. It uses a fixed-point iteration
scheme that has been found, in independent experiments, to be 10-100 times
faster than conventional gradient descent methods for ICA. Another advantage of the
FastICA algorithm is that it can be used to perform projection pursuit as well, thus
providing a general-purpose data analysis method that can be used both in an exploratory
fashion and for estimation of independent components (or sources) (FastICA
website, http://www.cis.hut.fi/projects/ica/fastica/).
The advantage of using negentropy, which is based on differential entropy, as a measure
of nongaussianity is that it is well justified by statistical theory. The problem in
using negentropy is, however, that it is computationally very difficult [14]. Approximations
for calculating negentropy in a much more computationally efficient manner
have been proposed, as seen in equation 2.13. Varying the formula used for G can
provide further approximations with a minimal loss of information.
The finite-sample statistical properties of estimators based on optimizing such
a general contrast function have been analysed. It was found that for a suitable choice of
G, the statistical properties of the estimator (asymptotic variance and robustness)
are considerably better than the properties of cumulant-based estimators. The following
choices of G were proposed [?]:
G1(u) = log{cosh(a1 u)} (2.37)
G2(u) = exp(−a2 u^2 / 2) (2.38)
where a1, a2 ≥ 1 are suitable constants. Experimentally, it is found that the
values 1 ≤ a1 ≤ 2 and a2 = 1 for the constants give good approximations.
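To make the approximation concrete, the following sketch (Python as a stand-in for the thesis's R/Matlab code) estimates J(y) ≈ (E[G(y)] − E[G(v)])^2 with v standard gaussian, and checks that nongaussian samples score higher than a gaussian one; estimating the gaussian expectation by Monte Carlo is a simplification made here:

```python
import numpy as np

def negentropy_approx(y, G=lambda u: np.log(np.cosh(u))):
    """Approximate negentropy J(y) ~ (E[G(y)] - E[G(v)])^2 for a
    zero-mean, unit-variance sample y, with v ~ N(0, 1).
    E[G(v)] is estimated by Monte Carlo for simplicity."""
    v = np.random.default_rng(1).standard_normal(100_000)
    return (G(y).mean() - G(v).mean()) ** 2

rng = np.random.default_rng(0)
n = 50_000
gauss = rng.standard_normal(n)
laplace = rng.laplace(size=n) / np.sqrt(2)          # unit variance, super-gaussian
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # unit variance, sub-gaussian

# both nongaussian samples have clearly larger negentropy than the gaussian one
```

By construction the gaussian sample scores near zero, so any nongaussian signal stands out.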
The basic process of the algorithm is best first described as a one-unit version,
in which there is only one computational unit with a weight vector w that is
updated by a learning rule. The FastICA learning rule finds a unit vector w such that
w^T x maximizes nongaussianity, which in this case is measured by the approximation
of negentropy J(w^T x). The steps of the FastICA algorithm for extracting a single
independent component are outlined below.
• Take a random initial vector w(0) of norm 1, and let k = 1.
• Let w(k) = E(x(w(k − 1)^T x)^3) − 3w(k − 1).
• Divide w(k) by its norm.
• If |w(k)^T w(k − 1)| is not close enough to 1, let k = k + 1 and go back to step 2.
Otherwise output the vector w(k).
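The four steps above can be sketched directly in NumPy (a Python illustration, assuming the data z have already been centered and whitened):

```python
import numpy as np

def fastica_one_unit(z, tol=1e-6, max_iter=200, seed=0):
    """One-unit FastICA with the kurtosis-based rule above.
    z: whitened data of shape (n, T). Returns one unmixing vector w."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(z.shape[0])
    w /= np.linalg.norm(w)                          # step 1: random w(0), norm 1
    for _ in range(max_iter):
        w_old = w
        wz = w_old @ z
        w = (z * wz ** 3).mean(axis=1) - 3 * w_old  # step 2: E[z (w^T z)^3] - 3w
        w /= np.linalg.norm(w)                      # step 3: renormalize
        if abs(w @ w_old) > 1 - tol:                # step 4: convergence check
            return w
    return w

# demo: whiten a mixture of two uniform sources, then extract one component
rng = np.random.default_rng(1)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20_000))
x = np.array([[2.0, 1.0], [1.0, 1.0]]) @ s
x -= x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / x.shape[1])
z = E @ np.diag(d ** -0.5) @ E.T @ x
w = fastica_one_unit(z)
y = w @ z                                           # matches one source up to sign
corr = np.abs(np.corrcoef(np.vstack([y, s]))[0, 1:])
```

The convergence test uses |w(k)^T w(k − 1)| because the fixed point may flip sign between iterations.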
After starting with a random guess vector for w, the second step is the equation
finding maximum independence, with the third checking the convergence to a local
maximum. The final w(k) vector produced by the FastICA algorithm is one of the
columns of the orthogonal unmixing matrix W. In the case of blind source separation,
this means that w(k) extracts one of the nongaussian source signals from the set
of mixtures x. This set of steps only estimates one of the independent components,
so it must be run n times to determine all of the requested independent components.
To guard against extracting the same independent component more than once, an
orthogonalizing projection is inserted at the beginning of step three, changing it to
w(k) = w(k) − WW^T w(k)
Because the unmixing matrix W is orthogonal, independent components can be estimated
one by one by projecting the current solution w(k) on the space orthogonal to the
previously found columns of W; here W denotes the matrix whose columns are the
vectors found so far. This decorrelation of the outputs after each iteration solves the
problem of any two independent components converging to the same local maximum.
The convergence of this algorithm is cubic, which is unusual for an independent component
analysis algorithm; many algorithms use the power method and converge linearly.
The FastICA algorithm is also hierarchical, allowing it to find independent
components one at a time instead of estimating the entire unmixing matrix at
once. Therefore, it is possible to estimate only certain independent components with
FastICA if there is enough prior information known about the weight matrices. The
FastICA algorithm was developed to make kurtosis-based learning faster, and thus
to provide a much more computationally efficient way of estimating independent components.

Figure 2.10: Flowchart of the FastICA algorithm.

Its performance and theoretical assumptions have been verified in independent
studies [?], and this has given cause for it to be used in the development of a process that
will quickly evaluate and separate the high-level components of acoustic information.
The original Matlab implementation of FastICA can be found at
http://www.cis.hut.fi/projects/ica/fastica/, and implementations in other languages
such as R, C and Python are also available.
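A deflationary version of the one-unit sketch, with the orthogonalizing projection w ← w − WW^T w applied before each renormalization, might look as follows (a Python illustration, not the official FastICA code; z is assumed whitened):

```python
import numpy as np

def fastica_deflation(z, n_components, tol=1e-6, max_iter=200, seed=0):
    """Estimate several ICs one by one: each new w is projected onto the
    space orthogonal to the previously found columns of W (deflation)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((z.shape[0], 0))
    for _ in range(n_components):
        w = rng.standard_normal(z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            w_old = w
            w = (z * (w_old @ z) ** 3).mean(axis=1) - 3 * w_old
            w -= W @ (W.T @ w)             # w <- w - W W^T w
            w /= np.linalg.norm(w)
            if abs(w @ w_old) > 1 - tol:
                break
        W = np.column_stack([W, w])
    return W                                # columns are unmixing vectors

# demo: three uniform sources, mixed, whitened, then unmixed by deflation
rng = np.random.default_rng(2)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(3, 20_000))
A = np.array([[2.0, 1.0, 0.5], [1.0, 2.0, 1.0], [0.5, 1.0, 2.0]])
x = A @ s
x -= x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / x.shape[1])
z = E @ np.diag(d ** -0.5) @ E.T @ x
W = fastica_deflation(z, 3)
corr = np.abs(np.corrcoef(np.vstack([W.T @ z, s]))[:3, 3:])
```

The projection makes the columns of W exactly orthogonal, so each new component is decorrelated from the ones already found.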
2.3.9 Infomax learning algorithm
Infomax is an implementation of ICA from a neural network viewpoint, based on
minimization of the mutual information between independent components [17]. In the
Infomax framework, a self-organizing learning algorithm is chosen to maximize the
output entropy, or the information flow, of a neural network of non-linear units. The
network has N input and output neurons, and an N × N weight matrix W connecting
the input layer neurons with the output layer neurons. x is an input to the neural
network. Assuming sigmoidal units, the neuron outputs are given by
s = g(D) with D = Wx (2.39)
where g(.) is a specified non-linear function. This non-linear function, which provides
the necessary higher-order statistical information, is chosen to be the logistic function
g(D_i) = 1 / (1 + e^(−D_i)) (2.40)
where D_i represents a row of the matrix D, for i = 1, ..., N. The main idea of this
algorithm is to find an optimal weight matrix W iteratively such that the output
joint entropy H(s) is maximized. In the simplified case of only two outputs, where
s = (s1, s2), I(s) = H(s1) + H(s2) − H(s) holds by the definition of mutual information.
Hence, we can minimize the mutual information by maximizing the joint entropy.
Then, by another equivalent definition of mutual information, I(x, s) = H(s) − H(s|x),
the information flow between the input and the output is maximized by maximizing
the joint entropy H(s), since the last term vanishes due to the deterministic nature of
s given x and g(.).
To find an optimal weight matrix W, the algorithm first initializes W to the identity
matrix I. Using small batches of data drawn randomly from X without replacement,
the elements of W are updated with the rule
ΔW = ε (∂H(s)/∂W) W^T W = ε (I + f(D)D^T) W (2.41)
where ε is the learning rate (typically near 0.01) and the vector function f has
elements
f_i(D_i) = ∂/∂D_i ln(∂g_i/∂D_i) = 1 − 2s_i (2.42)
The update rule (2.41), together with (2.42), is known as the Infomax algorithm. The W^T W
term in Equ. (2.41), first proposed by Amari et al. [78], avoids matrix inversions and
speeds up convergence. During training, the learning rate is reduced gradually until
the weight matrix stops changing appreciably. The choice of nonlinearity depends on
the application type. In the context of fMRI, where relatively few highly active voxels
are usually expected in a large volume, the distribution of the estimated components
is assumed to be super-gaussian. Therefore, a sigmoidal function is appropriate for
such an application [68].
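The Infomax update of Equ. (2.41)-(2.42) can be sketched as follows (a Python illustration; full-batch updates with a fixed learning rate, and pre-whitening of the mixture, are simplifications made here):

```python
import numpy as np

def infomax_step(W, x, eps=0.01):
    """One natural-gradient Infomax update, Equ. (2.41):
    W <- W + eps * (I + f(D) D^T / T) W, with D = Wx, s = g(D) the
    logistic output of Equ. (2.40) and f(D) = 1 - 2s as in Equ. (2.42).
    Averaging over the T samples keeps the step size batch-size independent."""
    D = W @ x
    s = 1.0 / (1.0 + np.exp(-D))
    f = 1.0 - 2.0 * s
    grad = np.eye(W.shape[0]) + (f @ D.T) / x.shape[1]
    return W + eps * grad @ W

# demo: unmix two Laplace (super-gaussian) sources from a whitened mixture
rng = np.random.default_rng(0)
s_true = rng.laplace(size=(2, 10_000))
A = np.array([[2.0, 0.5], [1.0, 1.5]])
x = A @ s_true
x -= x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / x.shape[1])
V = E @ np.diag(d ** -0.5) @ E.T
z = V @ x

W = np.eye(2)                       # initialize W to the identity matrix
for _ in range(3000):
    W = infomax_step(W, z, eps=0.02)
M = W @ V @ A                       # overall system: close to a scaled permutation
```

After training, each row of the overall matrix M is dominated by a single entry, i.e. each output tracks one source.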
2.4 Data Description
In this thesis I apply three simulated datasets for the shape study; one simulated and
three real datasets for clustering, namely the Australian Crabs, Olive oils and Fisher
Iris data; and four real datasets for outlier detection, namely the Epilepsy, Education
expenditure, Stackloss and Scottish hill racing data.
2.5 PCA vs ICA
PCA can be interpreted in terms of blind source separation methods inasmuch as
PCA is like a version of ICA in which the source variables are assumed to be gaussian.
However, the essential difference between ICA and PCA is that PCA decomposes a
set of mixture variables into a set of uncorrelated variables, whereas ICA decomposes
a set of mixture variables into a set of independent variables. Independence is a much
stronger property than uncorrelatedness.
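A two-line numerical example shows why: x and y = x^2 − 1 are uncorrelated, yet y is a deterministic function of x, and the dependence is visible only in higher-order statistics (Python sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = x ** 2 - 1                   # uncorrelated with x, but fully dependent on it

corr = np.corrcoef(x, y)[0, 1]   # ~0: second-order statistics see nothing
dep = np.mean(x**2 * y) - np.mean(x**2) * np.mean(y)  # E[x^2 y] != E[x^2]E[y]
```

Decorrelation (what PCA enforces) cannot detect this kind of structure; independence (what ICA enforces) rules it out.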
Figure 2.12: (a) Vectors IC1 and IC2 show the directions determined by the relative actions
of the two component processes. The data fall independently along these two component
vectors. Vectors PC1 and PC2 show the two perpendicular principal component directions
indicating maximum variance in the data. (b) Shows that IC1 and IC2 can be indirectly
determined by finding a linear transformation matrix W which results in a rectangular
distribution. The sigmoid transformation g(Wx) makes the distribution more uniform, and
the ICA algorithm of Bell and Sejnowski (1995) further adjusts IC1 and IC2 to maximize
the entropy of the distribution.
Actually, PCA does more than simply find a transformation of the mixture variables
such that the new variables are uncorrelated. PCA orders the extracted signals
according to their variances (the variance can be equated with the eigenvalue associated
with a component), so that variables associated with high variance are deemed more
important than those with low variance. In contrast, ICA is essentially blind to the
variance associated with each extracted variable.
Specifying that a set of uncorrelated gaussian variables is required places very
few constraints on the variables obtained. So few, in fact, that an infinite number of sets
of independent gaussian variables can be obtained from any set of mixture variables
(recall that mixture variables tend to be gaussian). For example, a relatively simple
procedure such as Gram-Schmidt Orthogonalisation (GSO) can be used to obtain a
set of uncorrelated variables, and the set so obtained depends entirely on the variable
used to initialize the GSO procedure. This is why any procedure which obtains
a unique set of variables requires more constraints than simple decorrelation can
provide. In the case of ICA, these extra constraints involve higher-order moments of
the joint pdf of the set of mixtures. In the case of PCA, these extra constraints involve
an ordering of the gaussian variables obtained. Specifically, PCA finds an ordered set
of uncorrelated gaussian variables such that each variable accounts for a decreasing
proportion of the variability of the set of mixture variables. The uncorrelated nature
of the variables obtained ensures that different variables account for non-overlapping, or
disjoint, amounts of the variability in the set of mixture variables, where this variability
is formalized as variance.
2.6 Computer Program Used in the Analysis
In this thesis all computations are done using several pieces of software:
• R version 3.0.1, R Development Core Team (16-05-2013), with several downloaded
packages: fastICA, moments, VGAM, MASS, rattle, etc.
• Matlab version 7.10.0.499 (R2010a), with several toolboxes: FastICA, BSS, fmrlab,
ICALAB, etc.
All typesetting was completed using the software MiKTeX (LaTeX) version 2.9
and TeXworks. A personal computer, an HP Pavilion g6 notebook with an Intel(R) Core(TM) i3
2.53 GHz processor, 4 GB RAM and Windows 8 Enterprise N 32-bit, was used for the
analysis.
2.7 Summary
In this chapter, we described the basic ICA and PCA models along with the mathematical
and logical differences of interest to us. We also reviewed the most popular algorithms
for ICA. In addition, we mentioned the data and software used in the analysis. We
apply these methods to identifying the shape of the sources, cluster detection and outlier
detection in the following chapters and discuss their similarities and differences.
Chapter 3
Latent Structure Detection
3.1 Introduction
The main task of ICA is to recover the original sources from a mixture distribution. Suppose
we are having a conversation at a crowded cocktail party. It is usually no
problem to focus on the person we are talking to, although our two ears are receiving
a wild mixture of different sounds originating from various sources: for example,
the conversation of the people next to us or the stereo system playing background
music. Despite all the background noise, the brain enables us to understand the person
we are trying to listen to. Replace the two ears with microphones and the brain
with a computer. Can we program a computer so that it separates the microphone
recordings into the different sound sources of the cocktail party? Can it single out
the words of the person in front of us? This is the cocktail-party problem, which is
quite difficult to solve, but it illustrates the goal of blind source separation (BSS):
decompose signals that have been recorded by an array of sensors (i.e. multichannel
recordings) into the underlying sources. This source separation problem is called
blind because neither the mixing process nor the characteristics of the source signals
are known.
PCA belongs to the standard techniques of statistical data analysis. By making
use of the correlations between the mixture variables, PCA is able to obtain a representation
with less redundancy. However, it can only identify an orthogonal basis of the
subspace that contains the observations; it cannot determine the directions of the
sources inside this subspace. Thus PCA does not recover the original sources of the
data. One of the goals of this chapter is to find the source variables from the observed
mixture variables by using PCA and ICA and to compare their performance.
3.2 Experimental Setup
In this section we analyze three simulated datasets, generated from various super-
and sub-gaussian distributions. The source variables of the simulated datasets are
blended with a mixing matrix. After that, the source variables are recovered from
the observed mixtures by using the two projection pursuit techniques, PCA and ICA,
and their performance is compared.
3.2.1 Simulated data set-1
The first example of a simulated dataset consists of 10 standard uniform (sub-gaussian)
variables, each with 1000 observations. Figure 3.1 shows the matrix plot of the 10 uniform
source variables. The source variables are blended with a mixing matrix; figure 3.2 shows
the matrix plot of the mixture data. Lastly, the two projection pursuit techniques, PCA
and ICA, are applied to the mixture data to recover the original sources. Figures 3.3 and
3.4 show the performance of recovering the original sources using PCA and ICA respectively.
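A scaled-down, two-source version of this experiment can be reproduced with the following Python sketch, in which ICA is implemented literally as projection pursuit: whiten, then search for the rotation that maximizes squared kurtosis (the grid search over a single angle works only in two dimensions and is used here for transparency):

```python
import numpy as np

def kurt(y):
    y = (y - y.mean()) / y.std()
    return (y ** 4).mean() - 3.0

rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20_000))  # sub-gaussian sources
x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ s                  # observed mixtures
xc = x - x.mean(axis=1, keepdims=True)

# PCA: project onto eigenvectors of the covariance matrix
d, E = np.linalg.eigh(np.cov(x))
pcs = E.T @ xc

# ICA as projection pursuit: whiten, then rotate to maximize |kurtosis|
z = E @ np.diag(d ** -0.5) @ E.T @ xc
thetas = np.linspace(0.0, np.pi / 2, 500)
t = thetas[np.argmax([kurt(np.cos(a) * z[0] + np.sin(a) * z[1]) ** 2
                      for a in thetas])]
ics = np.array([[np.cos(t), np.sin(t)], [-np.sin(t), np.cos(t)]]) @ z

corr_ica = np.abs(np.corrcoef(np.vstack([ics, s]))[:2, 2:])  # each IC ~ one source
corr_pca = np.abs(np.corrcoef(np.vstack([pcs, s]))[:2, 2:])  # each PC a mixture
```

Each IC correlates almost perfectly with one source, while every PC remains a blend of both, which is the pattern seen in figures 3.3 and 3.4.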
Figure 3.1: Matrix plot of original source of 10 uniform (sub-gaussian) distribution.
Figure 3.2: Matrix plot of observable mixture of 10 uniform (sub-gaussian) distribution.
Figure 3.3: Matrix plot of 10 principal components.
Figure 3.4: Matrix plot of 10 independent components
If we compare figures 3.1 and 3.4 visually, they look nearly identical, whereas figure
3.3 looks nothing like figure 3.1. Therefore ICA successfully gets back to the source
variables whereas PCA fails.
3.2.2 Simulated data set-2
The second example of a simulated data set consists of 5 Laplace (super-gaussian),
3 binomial and 2 multinomial (p=(0.1,0.2,0.3,0.2,0.1)) variables, each with 1000
observations. Figure 3.5 shows the matrix plot of the 10 source variables. Figure 3.6
shows the matrix plot of the observed mixture data. Lastly, the two projection pursuit
techniques, PCA and ICA, are applied to the mixture data to recover the original
sources. Figures 3.7 and 3.8 show the performance of recovering the original sources.
Figure 3.5: Matrix plot of original source of 5 Laplace (super-gaussian), 3 binomial, 2 multinomial
distribution.
Figure 3.6: Matrix plot of observed mixture of 5 Laplace (super-gaussian), 3 binomial, 2
multinomial distribution.
Figure 3.7: Matrix plot of principal components of 5 Laplace (super-gaussian), 3 binomial,
2 multinomial distribution.
Figure 3.8: Matrix plot of independent components of 5 Laplace (super-gaussian), 3 bino-
mial, 2 multinomial distribution after applying ICA.
From the above (figures 3.5 and 3.8), ICA almost recovers the source variables, whereas
PCA (figures 3.5 and 3.7) fails to recover them.
3.2.3 Simulated data set-3
The third example of a simulated data set consists of 5 variables coming from the uniform
(sub-gaussian), Laplace (super-gaussian), binomial, multinomial and normal distributions,
each with 1000 observations. Figure 3.9 shows the matrix plot of the original sources.
The original data matrix is then mixed with a mixing matrix; figure 3.10 shows the matrix
plot of the mixture data. Lastly, the two projection pursuit techniques, PCA and ICA,
are applied to the mixture data to recover the original sources. Figures 3.11 and 3.12
show the performance of recovering the original sources.
Figure 3.9: Matrix plot of 5 original source variables coming from the uniform (sub-gaussian),
Laplace (super-gaussian), binomial, multinomial and normal distributions.
Figure 3.10: Matrix plot of observed mixture of 5 variables.
Figure 3.11: Matrix plot of all principal components.
Figure 3.12: Matrix plot of all independent components.
Since ICA cannot separate gaussian variables [15], in the last simulated study we
built the data matrix with one gaussian variable. From figures 3.9 and 3.12, ICA
extracted the source variables: when there is only one gaussian component, all source
variables can still be extracted, but when there is more than one gaussian variable
only the non-gaussian variables can be recovered, and the gaussian variables remain
entangled with each other.
3.3 Summary
ICA successfully recovers the original sources after mixing because ICA takes into
account higher-order statistics, which are ignored by PCA, which relies only on
second-order statistics. Since its introduction, ICA has become an indispensable tool for
statistical data analysis and the processing of multi-channel data.
Chapter 4
Visualization of Clusters
4.1 Introduction
One of the useful ways of getting summary information about a dataset is to perform
clustering. Clustering is an unsupervised way of learning. The objective of any
clustering algorithm is to maximize the similarity between members of the same cluster
(intra-cluster similarity) and to minimize the similarity between members belonging to
different clusters (inter-cluster similarity). One motivation of this chapter is to find
clusters in multivariate datasets by using ICA and to compare the results with
PCA-based clustering.
One of the goals of the projection pursuit [104] technique in ICA is to find interesting
directions. From purely gaussian distributed data no unique ICs can be extracted.
Therefore, ICA should only be applied to datasets where we can find components that
have a non-gaussian distribution. It is usually agreed that the gaussian distribution is the
least interesting one, whereas non-gaussian distributions are the most interesting. Cluster
structure is typically associated with sub-gaussianity (Bugrin, 2009): negative
kurtosis can indicate a cluster structure, or at least a uniformly distributed factor
[117]. Finucan (1964) noted that, because bimodal distributions can be viewed as
having heavy shoulders, they should tend to negative kurtosis; i.e. a bimodal curve in
general also has a strong negative kurtosis. For example, the uniform distribution has
negative excess kurtosis β2 − 3 = −1.2, while the Laplace distribution has a heavier
tail and higher peak than the normal, with positive excess kurtosis β2 − 3 = 3.
Figure 4.1: Density plots of various distributions and their kurtosis.
ICA is a step forward from principal component analysis (PCA): the data
are first standardized to be uncorrelated (PCA) and then rotated so that independent
factors can be found. Huber [72] emphasized that interesting projections are those
that produce non-normal distributions, and therefore non-normality is one of the criteria
used to find the factors. Principal components can be ordered according to the order
of the eigenvalues, but ICs have no such order; in this thesis, we use the kurtosis value
for ordering the ICs. Clustering using PCA is a popular technique [38]; gene expression
data have been clustered using PCA in [122, 121]. Some authors use ICA for cluster
analysis instead of PCA in multivariate datasets [69, 116, 117].
4.2 Experimental Setup
In this section we discuss one simulated and three real datasets. The real datasets
are the Australian crabs, Fisher Iris, and Italian olive oils datasets. Generally, clusters
can be visualized by plotting the first two PCs or the last two ICs. PCs can be ordered
according to the order of the eigenvalues; for ICs there is no such concept of ordering,
so in this thesis we order the ICs according to their kurtosis values.
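The kurtosis ordering can be sketched as follows (a Python illustration; putting the most sub-gaussian components last mirrors the convention used in this thesis):

```python
import numpy as np

def order_by_kurtosis(ics):
    """Order components (rows) by excess kurtosis, descending, so that
    the most sub-gaussian components -- the ones most likely to carry
    cluster structure -- come last."""
    y = (ics - ics.mean(axis=1, keepdims=True)) / ics.std(axis=1, keepdims=True)
    k = (y ** 4).mean(axis=1) - 3.0
    idx = np.argsort(k)[::-1]
    return ics[idx], k[idx]

rng = np.random.default_rng(0)
comps = np.vstack([
    rng.uniform(-1.0, 1.0, 20_000),   # sub-gaussian (negative excess kurtosis)
    rng.laplace(size=20_000),         # super-gaussian (positive excess kurtosis)
    rng.standard_normal(20_000),      # gaussian (excess kurtosis near 0)
])
ordered, k = order_by_kurtosis(comps) # Laplace first, uniform last
```

With this ordering, "the last two ICs" always refers to the most sub-gaussian, cluster-revealing components.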
4.2.1 Simulated dataset 1
This dataset is formed simply to illustrate cluster analysis using ICA. The
dataset has two variables and 11 observations in two groups.
Figure 4.2: Scatter plot of two variables.
Figure 4.3: (left) Scatter plot of the first principal component. (right) Scatter plot of the last
independent component.
The first PC and the last IC are each projected onto a vector. From figure 4.3, PCA cannot
separate the observations, whereas ICA completely separates the two groups of observations.
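The same effect can be reproduced numerically with data shaped like Figure 2.9: two clusters with small separation in one direction and a large gaussian spread in another. This Python sketch finds the most sub-gaussian (minimum-kurtosis) projection of the whitened data and compares it with the first PC:

```python
import numpy as np

def kurt(y):
    return ((y - y.mean()) ** 4).mean() / y.var() ** 2 - 3.0

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
x = np.vstack([labels * 4.0 + rng.normal(0.0, 0.5, 1000),  # cluster direction
               rng.normal(0.0, 10.0, 1000)])               # high-variance noise
xc = x - x.mean(axis=1, keepdims=True)

d, E = np.linalg.eigh(np.cov(x))
pc1 = E[:, np.argmax(d)] @ xc           # max-variance direction: the noise axis

z = E @ np.diag(d ** -0.5) @ E.T @ xc   # whitened data
thetas = np.linspace(0.0, np.pi, 720)
proj = min((np.cos(t) * z[0] + np.sin(t) * z[1] for t in thetas), key=kurt)

def separation(p):
    """Distance between cluster means in units of overall spread."""
    return abs(p[labels == 0].mean() - p[labels == 1].mean()) / p.std()
```

Here `separation(proj)` is large while `separation(pc1)` is near zero: the minimum-kurtosis projection splits the clusters, the maximum-variance one does not.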
4.2.2 Australian crabs dataset
The first real data set used for clustering is the Australian crabs data set, which has
200 rows and 8 columns describing 5 morphological measurements (frontal lobe size,
rear width, carapace length, carapace width, body depth). There are two species in
the data set, each with both sexes (male, female), of the genus Leptograpsus. There
are 50 specimens of each sex of each species, collected on site at Fremantle, Western
Australia (N. A. Campbell et al., 1974).
Figure 4.4: (a) Matrix plot of the Australian Crabs data set. (b) Matrix plot of all principal
components of the Australian Crabs data set.
Figure 4.5: (a) Scatter plot of the first two principal components. (b) Scatter plot of the last
two independent components of the Australian Crabs data.
4.2.3 Iris data set
The second example of a real data set is the world-famous Fisher Iris data set, in which
the data report four characteristics (sepal width, sepal length, petal width and petal
length) of three species (setosa, versicolor, virginica) of Iris flower.
Figure 4.6: (a) Matrix plot of the Fisher Iris data set. (b) Matrix plot of the principal
components of the Iris data.
Figure 4.7: (a) Scatter plot of the first two principal components. (b) Scatter plot of the last
two independent components of the Iris data.
From figure 4.7, it is clear that ICA is a much better option than PCA for the
clustering approach.
4.2.4 Italian Olive oil’s data set
The third example of a real data set is the Italian olive oils data set (Forina et al., 1983).
These data consist of the percentage composition of fatty acids found in the lipid fraction
of Italian olive oils. The data arise from a study to determine the authenticity
of an olive oil. There are nine classes (areas) in these data.
Figure 4.8: (a) Matrix plot of the Italian Olive oil data set. (b) Matrix plot of all principal
components.
Figure 4.9: (a) Scatter plot of the first two principal components. (b) Scatter plot of the last
two independent components of the Olive oils data.
4.3 Summary
The clustering approach presented here is based on the fast fixed-point algorithm
of ICA, used to find the clusters in multivariate data sets. The approach is demonstrated
on three real-world datasets, and its clustering results are more effective than those of PCA.
Chapter 5
Outlier Detection
5.1 Introduction
In statistics, an outlier is an observation that lies an abnormal distance from other
observations in a random sample from a population. Outliers are important because
they can change the results of our data analysis. Univariate outliers are cases that
have an unusual value for a single variable. Multivariate outliers are cases that have
an unusual combination of values for a number of variables: the value of any of the
individual variables may not be a univariate outlier, but, in combination with the other
variables, it is a case that occurs very rarely.
The main motivation of this chapter is to find outliers in multivariate analysis by
using ICA and PCA; ICA is a very promising technique for outlier detection.
Some work on outlier detection using ICA has been done in the past decade, such as
time series outlier detection [120] and outlier detection on ecological data by Jackson.
5.2 Experimental Setup
In this section we use 4 real datasets for outlier detection: the Epilepsy,
Education expenditure, Stackloss, and Scottish hill racing datasets. In most cases,
the last two PCs are used to visualize outliers, although many researchers use the first
two PCs as well. Generally, the first two ICs are used to detect outliers in multivariate
statistics. Since ICs cannot be ordered in the way PCs can, in this thesis we use the
kurtosis values of the independent components to order them.
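A minimal sketch of the idea on synthetic data: a single planted multivariate outlier makes one projection of the whitened data heavy-tailed, so the maximum-kurtosis (most super-gaussian) direction exposes it. This Python illustration uses a brute-force angle search in two dimensions instead of FastICA, purely for transparency:

```python
import numpy as np

def most_kurtotic_projection(X, n_angles=720):
    """Whiten 2-D data (variables in rows) and return the 1-D projection
    with maximum excess kurtosis."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(Xc))
    z = E @ np.diag(d ** -0.5) @ E.T @ Xc
    def kurt(y):
        return ((y - y.mean()) ** 4).mean() / y.var() ** 2 - 3.0
    thetas = np.linspace(0.0, np.pi, n_angles)
    return max((np.cos(t) * z[0] + np.sin(t) * z[1] for t in thetas), key=kurt)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.8], [1.8, 1.0]], 200).T
X[:, 0] = [6.0, -3.0]       # plant one multivariate outlier at index 0:
                            # each coordinate is only mildly extreme, but the
                            # combination goes against the positive correlation

proj = most_kurtotic_projection(X)
score = np.abs(proj - proj.mean()) / proj.std()   # standardized IC-style scores
```

The planted point dominates the most kurtotic projection even though neither of its coordinates is wildly extreme on its own, which is exactly the multivariate-outlier situation described above.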
5.2.1 Epilepsy dataset
Thall and Vail [107] reported data from a clinical trial of 59 patients with epilepsy,
31 of whom were randomized to receive the anti-epilepsy drug Progabide and 28 of
whom received a placebo. Baseline data consisted of the patient's age and the number
of epileptic seizures recorded during an 8 week period prior to randomization. The
response consisted of counts of seizures occurring during the two week period prior
to each of four follow-up visits. The dataset consists of 12 variables (ID, number of
epilepsy attacks the patient has during the first, second, third and fourth follow-up,
Baseline, Age, Treatment, etc.).
Figure 5.1: (a) Text plot of the two largest PCs. (b) Text plot of the two smallest PCs.
Figure 5.2: (a) Text plot of the two largest ICs. (b) Text plot of the two smallest ICs.
Breslow (1996) identified observation 49 (ID no. 207) as a high-leverage point. The first
two PCs and the last two ICs identify observations 49 and 25 as outliers.
5.2.2 Education expenditure data
These data were used by Chatterjee, Hadi, and Price [?] as an example of
heteroscedasticity. The data give the education expenditures for the 50 U.S. states
as projected in 1975. The data were also studied by Rousseeuw and Leroy (1987, pp.
109-112). There are three explanatory variables, X1 (number of residents per thousand
residing in urban areas in 1970), X2 (per capita personal income in 1973), and X3
(number of residents per thousand under 18 years of age in 1974), and one response
variable, Y (per capita expenditure on public education in 1975).
Figure 5.3: (a) Text plot of the first two principal components. (b) Text plot of the last two
principal components of the Education expenditure data set.
Figure 5.4: (a) Text plot of the first two independent components. (b) Text plot of the last two
independent components of the Education expenditure data set.
Chatterjee and Price analyzed these data using weighted least-squares regression. They
considered the fiftieth case (Hawaii) an outlier and decided to omit it. Rousseeuw and
Leroy also identified Hawaii as an outlier by analyzing the residual plot from a least
median-of-squares regression. Applying ICA, we see that Alaska (49th) and Utah (44th)
are outliers.
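As a rough sketch of this outlier-screening idea (project the data onto estimated independent components and flag observations with extreme coordinates), the following Python fragment uses synthetic data in place of the expenditure table; the planted outliers, the MAD-based scale estimate and the cutoff of 5 are illustrative assumptions, not values from the thesis.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))           # 50 "states", 4 variables
X[43] += 10.0                          # plant two gross outliers
X[48] -= 10.0

ica = FastICA(n_components=4, random_state=0)
S = ica.fit_transform(X)               # columns are the estimated ICs

# flag observations whose coordinate on any IC is extreme relative to a
# robust (MAD-based) scale estimate of that component
med = np.median(S, axis=0)
mad = np.median(np.abs(S - med), axis=0)
z = np.abs(S - med) / (1.4826 * mad)
outliers = np.where((z > 5).any(axis=1))[0]
print(sorted(outliers.tolist()))
```

On real data the flagged indices are read off a text plot of the chosen ICs, as in the figures in this chapter.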
5.2.3 Stackloss data
The stack loss data (Brownlee, 1965) consist of 21 days of operation of a plant for
the oxidation of ammonia as a stage in the production of nitric acid. The response,
called stack loss, is the percentage of unconverted ammonia that escapes from the
plant. There are three explanatory variables and a single response variable in the dataset.
Figure 5.5: (a) Text plot of the third and fourth principal components. (b) Text plot of the first two principal components of the Stackloss data set.
Figure 5.6: (a) Text plot of the first two independent components of the Stackloss data set. (b) Text plot of the first and second estimated source signals.
Several authors, including Daniel and Wood (1980) and Atkinson (1985), treated
observations 1-4 and 21 as outliers. PCA fails to detect the actual outliers suggested
by these authors, whereas in figure 5.6(a) the first two ICs detect observations 1, 2,
3, 4 and 21 as outliers.
5.2.4 Scottish hill racing data
The data give the record-winning times for 35 hill races in Scotland, as reported by
Atkinson (1986). The distance travelled and the height climbed in each race are also
given. The purpose of the study is to investigate the relationship between the record
times of the 35 hill races and two predictors: distance, the total length of the race
measured in miles, and climb, the total elevation gained in the race measured in feet.
One would expect longer races and larger climbs to be associated with longer record
times.
Figure 5.7: (a) Scatter text plot of the first two principal components. (b) Scatter text plot of the second and third principal components of the Scottish hill racing data set.
Figure 5.8: (a) Scatter text plot of the first two independent components. (b) Scatter text plot of the second and third independent components of the Scottish hill racing data set.
The data contain a known error: Atkinson (1986) reports that the record for Knock Hill
(observation 18) should actually be 18 minutes rather than 78 minutes. Hadi (1992)
concluded that races 7 and 18 are outliers; after observations 7 and 18 were removed,
the method indicated that observation 33 is also an outlier. Using the technique of
ICA, the last two ICs identify observations 7, 18 and 33 as outliers, whereas PCA
fails completely.
5.3 Summary and Outlook
PCA and ICA are both very useful for outlier detection, but in some cases (figures 5.5
and 5.6) ICA performs better than PCA.
Chapter 6
Special Application of ICA
6.1 Introduction
Although ICA is a relatively new technique in multivariate analysis, it has extensive
applications in the real world. In this chapter some of these real-world applications
of ICA are presented.
6.2 ICA in Audio source separation
The goal of this section is to tackle the problem of separating voices recorded in real
environments. This problem is related to the cocktail-party problem, in which a listener
can extract one voice from an ensemble of different voices corrupted by music or noise
in the background. In this study we used two audio signals, blended with each other,
and extracted the source signals by applying the FastICA algorithm.
Figure 6.1: Blind source separation of two speech signals. (Top row) Time courses of the two speech signals. (Middle row) The two observed mixtures. (Bottom row) The two separated signals.
Figure 6.2: Scatter plot of two audio mixture signals.
Figure 6.3: Scatter plot of two principal components of audio signals.
Figure 6.4: Scatter plot of the two independent components of the audio signals.
6.3 ICA in Biomedical Application
This section deals with the application of ICA to biomedical data. In biomedical data,
multiple receivers are used to record some physical phenomenon. Often these receivers
are located close to each other, so that they simultaneously receive signals that are
highly correlated with each other.
6.3.1 ICA of Electroencephalographic Data
In electroencephalographic recordings, receivers are placed at different points over the
head. EEG recordings of brain electrical activity measure changes in potential difference
between pairs of points on the human scalp. Scalp recordings also include artifacts
such as line noise, eye movements, blinks and cardiac signals (ECG), which can present
serious problems for analyzing and interpreting EEG recordings (Berg and Scherg, 1991).
The following figure shows the channel list used in this electroencephalographic data
analysis.
Figure 6.5: (a) Graphical output of co-registration of EEG data, showing (upper panel) cortex (blue), inner skull (red) and scalp (black) meshes, electrode locations (green), MRI/Polhemus fiducials (cyan/magenta), and headshape (red dots). (b) The full EEG channel list for this dataset.
Figure 6.6: Flowchart of EEG data analysis using ICA.
Figure 6.7: A 5 sec. portion of the 32-channel EEG time series (channels FPz through O2) with prominent alpha rhythms (8-12 Hz).
Figure 6.8: The 32 ICA components extracted from the EEG data in figure 6.7.
Figure 6.7 shows the observed mixtures from the 32 channels over the scalp. After
applying ICA we obtain the much more stable signals of figure 6.8. ICA characteristically
separates several important classes of non-brain EEG artifact activity from the rest of
the EEG signal into separate sources, including eye blinks, eye-movement potentials,
electromyographic (EMG) and electrocardiographic (ECG) signals, line noise, and
single-channel noise (Jung et al., 2000b; Jung et al., 2000a). This important benefit
of ICA decomposition of EEG data was apparent from the first attempt to apply it
(Makeig, 1996). ICA has thus found initial use in many EEG laboratories simply as a
method for removing eye blinks and other artifacts from data. For data sets heavily
contaminated by eye blinks or other artifacts, for instance data collected from young
children, the ability to analyze brain activity in data trials that include eye-movement
artifacts can mean the difference between analyzing and rejecting the subject's data
altogether.
Figure 6.9: Scalp map projection of all 32 channels
The figure above shows the scalp map projection of the selected components. Note that
the scale in the plot uses arbitrary units, as does the scale of each component's
activity time course. However, a component's scalp-map values multiplied by its
activity time course are in the same units as the data.
Figure 6.10: Second independent component (IC2) properties: component activity over the continuous data (trials versus time in ms) and the activity power spectrum (power, 10*log10(µV²/Hz), versus frequency in Hz).
This component has a strong alpha-band peak near 10 Hz and a scalp map distribution
compatible with a left occipital cortex brain source. When we localize ICA sources
using single-dipole or dipole-pair source localization, many of the 'EEG-like'
components can be fitted with very low residual variance (e.g., under 5%).
The ICA algorithm appears to be very effective for performing source separation in
domains where (1) the mixing medium is linear and propagation delays are negligible,
and (2) the time courses of the sources are independent. ICA appears to be a generally
applicable and effective method for removing a wide variety of artifacts from EEG
records. There are several advantages of the method: (1) ICA is computationally
efficient. (2) ICA is generally applicable to the removal of a wide variety of EEG
artifacts. (3) It simultaneously separates both the EEG and its artifacts into
independent components based on the statistics of the data, without relying on the
availability of 'clean' reference channels. This avoids the problem that thresholds
(variable across sessions) are needed to determine when regression should be performed.
(4) Separate analyses are not required to remove different classes of artifacts. Once
the training is complete, artifact-free EEG records in all channels can be derived by
simultaneously eliminating the contributions of the various identified artifactual
sources in the EEG record.
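The artifact-removal procedure described above (decompose the channels, identify the artifact component, zero it, back-project) can be sketched on simulated data; the sinusoidal 'alpha' source, the sparse blink-like artifact, the 3-channel mixing matrix and the largest-kurtosis rule for picking the artifact component are all illustrative assumptions, not the exact procedure used in the thesis.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
n = 2000
t = np.arange(n) / 250.0                        # assumed 250 Hz sampling rate
rhythm = np.sin(2 * np.pi * 10 * t)             # 10 Hz alpha-like brain source
blink = (rng.random(n) < 0.01) * 20.0           # sparse, high-amplitude artifact
S = np.c_[rhythm, blink]

A = np.array([[1.0, 0.8],                       # mixing onto 3 "scalp channels"
              [0.6, 1.0],
              [0.9, 0.3]])
X = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)

# the artifact component is taken to be the one with the largest kurtosis
art = int(np.argmax(kurtosis(S_hat, axis=0)))

S_hat[:, art] = 0.0                              # zero out the artifact source
X_clean = ica.inverse_transform(S_hat)           # back-project to channel space

# each cleaned channel should now follow the rhythm source almost exactly
r = abs(np.corrcoef(X_clean[:, 0], rhythm)[0, 1])
```

Zeroing a column of the source matrix before back-projection is the standard "remove and reconstruct" step; in practice the artifact component is usually identified by inspecting scalp maps and time courses rather than kurtosis alone.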
6.3.2 ICA of Functional Magnetic Resonance Imaging (fMRI) Analysis
Functional Magnetic Resonance Imaging (fMRI) is a non-invasive technique used to
detect neural activations, which has been widely applied to mapping functions of the
human brain (S. Ogawa et al. 1998). Independent Component Analysis (ICA) is
the most commonly used and most diversely applicable exploratory method for the
analysis of functional Magnetic Resonance Imaging (fMRI) data. Over the last ten
years it has offered a wealth of insights into brain function during task execution and
in the resting state. ICA is a blind source separation method that was originally
applied to identify technical and physiological artifacts and allow their removal prior
to analysis with model-based approaches.
Figure 6.11: Cortex (blue), inner skull (red), outer skull (orange) and scalp (pink) meshes
with transverse slices of the subject’s MRI.
It has matured into a method capable of offering a stand-alone assessment of activation
on a sound statistical footing. Recent innovations have taken on the complex challenges
of how components should be combined over subjects to allow group inferences, and how
activation identified with ICA might be compared, for instance, between groups of
patients and controls. Having proved its worth in the investigation of resting-state
networks, ICA is being applied in other cutting-edge uses of fMRI: in multivariate
pattern analysis, real-time fMRI, in utero studies, and a wide variety of paradigms
and stimulus types, including challenging tasks with patients at ultra-high field.
These are testament both to ICA's flexibility and its central role in basic
neuroscience and clinical applications of fMRI. When neurons in the brain are ac-
tivated, the increase in electrical activity causes an increase in the local metabolic
rate. The increased consumption of oxygen results in fluctuations in the levels of
paramagnetic deoxyhemoglobin in the blood which is sensed using a magnetic field.
The changing level of deoxyhemoglobin measured in the brain is referred to as the
blood oxygenation level dependent (BOLD) signal. During an fMRI scan session, a
series of three-dimensional images are captured consecutively, usually two to three
seconds apart. The value of the image at each small volume unit, called a voxel, is
the BOLD signal intensity.
The patient is usually presented with stimuli to induce neural activations. While the
types of stimuli vary, the standard experimental design is a simple on-off or boxcar
design, in which the patient is repeatedly presented with a task followed by a pause.
Figure 6.12: Comparison of brain networks obtained using ICA independently on fMRI
data.
Functional connectivity is defined as correlations between spatially remote neural
events. If activity in two brain regions is observed to covary, it suggests that the
neurons generating that activity may be interacting (B. Horwitz et al. 2005). This
covariance can be measured by observing the BOLD signal from two locations in
an fMRI scan. The time-series from different voxel locations can be compared in
several different ways. However, ICA seems to be sensitive to both transiently and
consistently task-related brain activations. The method gives highly reproducible
results that are consistent across different trials and different subjects. It may
also be used to isolate artifactual components from fMRI data.
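The correlation measure of functional connectivity just described can be sketched on simulated voxel time series; the boxcar design, amplitudes and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_scans = 120                                     # one volume every 2-3 s
boxcar = np.tile(np.r_[np.zeros(10), np.ones(10)], 6)   # on-off task design

# BOLD-like time series from three voxels (amplitudes and noise are invented)
voxel_a = 1.0 * boxcar + rng.normal(0, 0.3, n_scans)
voxel_b = 0.8 * boxcar + rng.normal(0, 0.3, n_scans)
voxel_c = rng.normal(0, 0.3, n_scans)             # voxel in an unrelated region

r_ab = np.corrcoef(voxel_a, voxel_b)[0, 1]        # high: functionally connected
r_ac = np.corrcoef(voxel_a, voxel_c)[0, 1]        # near zero: not connected
```

The first pair covaries because both voxels follow the shared task signal; the third voxel, driven by noise alone, shows negligible correlation.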
Chapter 7
Summary, Conclusions and Future
Research
7.1 Summary
In this section we discuss the performance of all the results in our study. We have
used two different techniques, ICA and PCA, to compare structure detection, cluster
analysis and outlier detection in multivariate statistics.
In the Blind Source Separation chapter, we used three simulated datasets. The first
dataset consists of 10 variables, each generated from the standard uniform (sub-gaussian)
distribution; the second dataset consists of 10 variables generated from 5 Laplace
(super-gaussian), 3 binomial and 2 multinomial distributions; and the third dataset
consists of 5 variables generated from the uniform, Laplace, binomial, multinomial and
normal distributions, respectively. In all cases ICA almost completely recovers the
structure. We used one normal distribution for the third dataset. It should be noted
that more than one gaussian variable cannot be separated by ICA. With only one gaussian
component, however, we can still estimate the ICA model, because the single gaussian
component has no other gaussian components that it could be mixed with.
In the fourth chapter we used one simulated and three real datasets: the Australian
crabs, Fisher's iris and Italian olive oils data. Generally the first two PCs are used
for detecting clusters in multivariate analysis. Since ICA does not order its components
as PCA does, in this thesis we order the ICs according to their kurtosis. In all four
cases ICA performs better than PCA.
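The kurtosis-based ordering of ICs can be sketched as follows; the simulated sources and the FastICA call are illustrative, not the thesis's exact computation.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 2000
S = np.c_[rng.uniform(-1, 1, n),      # sub-gaussian source (negative excess kurtosis)
          rng.laplace(0, 1, n),       # super-gaussian source (positive excess kurtosis)
          rng.normal(0, 1, n)]        # at most one gaussian source is allowed
A = rng.normal(size=(3, 3))           # random mixing matrix
X = S @ A.T

ics = FastICA(n_components=3, random_state=0).fit_transform(X)

# impose an order on the ICs: largest excess kurtosis first
order = np.argsort(kurtosis(ics, axis=0))[::-1]
ics_ordered = ics[:, order]
k = kurtosis(ics_ordered, axis=0)     # now non-increasing
```

After sorting, "the first two ICs" and "the last two ICs" are well defined, which is what makes the cluster and outlier plots in the earlier chapters reproducible.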
In the outlier detection chapter we used the Epilepsy, Stackloss, Education expenditure
and Scottish hill racing datasets. Generally the last two PCs are used to detect
outliers, though many researchers use the first two PCs as well. In our study we use
the two largest ICs for outlier detection.
The ICA algorithm has been successfully applied in many areas. In the last chapter we
analyzed a biomedical signal-processing problem using EEG data, and we also applied
ICA to extract the sources from two mixtures of audio signals.
7.1.1 Conclusions
ICA solves several instances of the BSS problem by taking into account higher-order
statistics, which are ignored by PCA, which relies only on second-order statistics.
Since its introduction it has become an indispensable tool for statistical data
analysis and the processing of multi-channel data. In our cluster analyses ICA
consistently performed better than PCA. ICA and PCA are both useful for outlier
detection, but ICA is sometimes more fruitful than PCA. We recommend using ICA in
place of PCA for detecting clusters as well as outliers. Furthermore, we suggest that
if the subject domain supports the assumption of independent non-gaussian source
variables, ICA, not PCA, be used to identify the latent structures.
7.1.2 Future Research
The following are the areas in which we want to study further:
• Use kernel techniques of ICA for shape study, clustering and outlier detection.
• Separation of nonlinear mixtures.
• Data mining (sometimes called data or knowledge discovery) is the most recent
technique in multivariate analysis to extract information from a data set and
transform it into an understandable structure for further use. Text data mining
or medical data mining using ICA would be future research.
Bibliography
[1] A. Cichocki, R.E. Bogner, L. Moszczynski, & K. Pope. Modified Herault-Jutten
algorithms for blind separation of sources. Digital Signal Processing, 7:80-93,
1997.
[2] A. Cichocki and R. Unbehauen. Robust neural networks with on-line learning
for blind identification & blind separation of sources. IEEE Trans. on Circuits
and Systems, 43(11):894-906, 1996.
[3] A. Cichocki, R. Unbehauen, L. Moszczynski, & E. Rummert. A new on-line
adaptive algorithm for blind separation of source signals. In Proc. Int. Sympo-
sium on Artificial Neural Networks ISANN-94, pages 406-411, Tainan, Taiwan,
1994.
[4] A. D. Back and A. S. Weigend. A first application of independent component
analysis to extracting structure from stock returns. Int. J. on Neural Systems,
8(4):473-484, 1997.
[5] A. Hyvarinen. Fast and robust fixed-point algorithms for independent compo-
nent analysis. IEEE Transactions on Neural Networks, 10(3):626-634, 1999.
[6] A. Hyvarinen. The fixed-point algorithm and maximum likelihood estimation
for independent component analysis. Neural Processing Letters, 10(1):1-5, 1999.
[7] A. Hyvarinen. Gaussian moments for noisy independent component analysis.
IEEE Signal Processing Letters, 6(6):145-147, 1999.
[8] A. Hyvarinen. Sparse code shrinkage: Denoising of nongaussian data by maxi-
mum likelihood estimation. Neural Computation, 11(7):1739-1768, 1999.
[9] A. Hyvarinen. Survey on independent component analysis. Neural Computing
Surveys, 2:94-128, 1999.
[10] A. Hyvarinen & E. Oja. A fast fixed-point algorithm for independent component
analysis. Neural Computation, 9(7):1483-1492, 1997.
[11] A. Hyvarinen & E. Oja. Independent component analysis by general nonlinear
Hebbian-like learning rules. Signal Processing, 64(3):301-313, 1998.
[12] A. Hyvarinen, J. Sarela, & R. Vigario. Spikes and bumps: Artefacts generated
by independent component analysis with insufficient sample size. In Proc. Int.
Workshop on Independent Component Analysis and Signal Separation (ICA'99),
pages 425-429, Aussois, France, 1999.
[13] A. Hyvarinen and P. Pajunen. Nonlinear independent component analysis: Ex-
istence and uniqueness results. Neural Networks, 12(3):429-439, 1999.
[14] A. Hyvarinen, P. O. Hoyer, and E. Oja. Sparse Code Shrinkage: Denoising by
Nonlinear Maximum Likelihood Estimation. Advances in Neural Information
Processing Systems, (11):473-479, 1999.
[15] A. Hyvarinen, Juha Karhunen, and Erkki Oja. Independent Component Analy-
sis. J. Wiley, 2001.
[16] A. Hyvarinen. Independent component analysis in the presence of gaussian noise
by maximizing joint likelihood. Neurocomputing, 22:49-67, 1998.
[17] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind
separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.
[18] A. J. Bell and T. J. Sejnowski. The independent components of natural scenes
are edge filters. Vision Research, 37:3327-3338, 1997.
[19] A. Koutras, E. Dermatas, & G. Kokkinakis. Blind signal separation and speech
recognition in the frequency domain. Proceedings of the IEEE International
Conference on Electronics, Circuits and Systems, Vol. 1, pp. 427-430, 1999.
[20] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-
Hill, 3rd edition, 1991.
[21] B. A. Olshausen & D. J. Field. Emergence of simple-cell receptive field properties
by learning a sparse code for natural images. Nature, 381:607-609, 1996.
[22] B. A. Pearlmutter & L. C. Parra. Maximum likelihood blind source separation:
A context-sensitive generalization of ICA. In Advances in Neural Information
Processing Systems, volume 9, pages 613-619, 1997.
[23] B. Laheld and J. F. Cardoso. Adaptive source separation with uniform perfor-
mance. In Proc. EUSIPCO, pages 183-186, Edinburgh, 1994.
[25] C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive
algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991.
[26] C. Fyfe and R. Baddeley. Non-linear data structure extraction using simple
Hebbian networks. Biological Cybernetics, 72:533-541, 1995.
[27] D. Zhang, S. Chen, and J. Liu. Representing Image Matrices: Eigenimages
Versus Eigenvectors. Lecture Notes in Computer Science, 3497/2005:659-664,
2005.
[28] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet
shrinkage: asymptopia? Journal of the Royal Statistical Society, Ser. B,
57:301-337, 1995.
[29] D. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.
[30] D. N. Lawley. Tests of significance for the latent roots of the covariance and
correlation matrices. Biometrika, 43:128-136, 1956.
[31] D.-T. Pham, P. Garrat, and C. Jutten. Separation of a mixture of independent
sources through a maximum likelihood approach. In Proc. EUSIPCO, pages
771-774, 1992.
[32] D. Nion,K. N. Mokios, N. D. Sidiropoulos & A. Potamianos. Batch and adap-
tive PARAFAC based blind separation of convolutive speech mixtures. IEEE
Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 6, Au-
gust 2010, pp. 1193-1207, ISSN 1558-7916, 2010.
[33] E.Oja. A Simplified neuron model as a principal component analyzer. J. of
Mathematical Biology, 15:267-273, 1982.
[34] E.Oja and J. Karhunen. On stochastic approximation of the eigenvectors and
eigenvalues of the expectation of random matrix. J. of Math. Analysis and
Applications, 106:69-84, 1985.
[35] E. Oja, H. Ogawa, and J. Wangviwattana. Principal component analysis by
homogeneous neural networks, part I: the weighted subspace criterion. IEICE
Trans. on Information and Systems, E75-D(3):366-375, 1992.
[36] F. J. Mato-Mendez, & M. A. Sobreira-Seoane. Blind separation to improve
classification of traffic noise. Applied Acoustics, Vol. 72, No. 8 (Special Issue on
Noise Mapping), July 2011, pp. 590-598, ISSN 0003-682X, 2011.
[37] H. Sawada, S. Araki, & S. Makino. Underdetermined convolutive blind source
separation via frequency bin-wise clustering and permutation alignment. IEEE
Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 3, March
2011, pp. 516-527, ISSN 1558-7916, 2011.
[38] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
[39] J. Antoni. Blind separation of vibration components: Principles and demon-
strations. Mechanical Systems and Signal Processing, Vol. 19, No. 6, November
2005, pp. 1166-1180, ISSN 0888-3270, 2005.
[40] J.-F. Cardoso and B. Hvam Laheld. Equivariant adaptive source separation.
IEEE Trans. on Signal Processing, 44(12):3017-3030, 1996.
[41] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non Gaussian signals.
IEEE Proceedings-F, 140(6):362-370, 1993.
[42] J.-F. Cardoso. Infomax and maximum likelihood for source separation. IEEE
Letters on Signal Processing, 4:112-114, 1997.
[43] J.-F. Cardoso. Entropic contrasts for source separation. In S. Haykin, editor,
Adaptive Unsupervised Learning, 1999.
[44] J. H. Friedman. Exploratory projection pursuit. J. of the American Statistical
Association, 82(397):249-266, 1987.
[45] J. H. Friedman & J. W. Tukey. A projection pursuit algorithm for exploratory
data analysis. IEEE Trans. on Computers, C-23(9):881-890, 1974.
[46] J. Karhunen, A. Hyvarinen, R. Vigario, J. Hurri, and E. Oja. Applications of
neural blind separation to signal and image processing. In Proc. IEEE Int.
Conf. on Acoustics, Speech and Signal Processing (ICASSP’97), pages 131-134,
Munich, Germany, 1997.
[47] J. Karhunen and J. Joutsensalo. Representation and separation of signals using
nonlinear PCA type learning. Neural Networks, 7(1):113-127, 1994.
[48] J. Karhunen and J. Joutsensalo. Generalizations of principal component analy-
sis, optimization problems, and neural networks. Neural Networks, 8(4):549-562,
1995.
[49] J. Karhunen, E. Oja, L. Wang, R. Vigario, and J. Joutsensalo. A class of neural
networks for independent component analysis. IEEE Trans. on Neural Net-
works, 8(3):486-504, 1997.
[50] J. Karhunen and P. Pajunen. Blind source separation using least-squares type
adaptive algorithms. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP’97), pages 3048-3051, Munich, Germany, 1997.
[51] J. Karhunen, P. Pajunen, and E. Oja. The nonlinear PCA criterion in blind
source separation: Relations with other approaches. Neurocomputing, 22:5-20,
1998.
[52] J. Karhunen, E. Oja, L. Wang, R. Vigario, & J. Joutsensalo. A class of neural
networks for independent component analysis. IEEE Trans. on Neural Networks,
8(3):486-504, 1997.
[53] J.-P. Nadal & N. Parga. Non-linear neurons in the low noise limit: a factorial
code maximizes information transfer. Network, 5:565-581, 1994.
[54] J. V. Stone. Independent Component Analysis. The MIT Press, 2004.
[55] K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks: The-
ory and Applications. Wiley, 1996.
[56] K. Kiviluoto & E. Oja. Independent component analysis for parallel financial
time series. In Proc. Int. Conf. on Neural Information Processing (ICONIP'98),
volume 2, pages 895-898, Tokyo, Japan, 1998.
[57] L. De Lathauwer, B. De Moor, and J. Vandewalle. A technique for higher-order-
only blind source separation. In Proc. ICONIP, Hong Kong, 1996.
[58] L. Molgedey and H. G. Schuster. Separation of a mixture of independent signals
using time delayed correlations. Phys. Rev. Lett., 72:3634-3636, 1994.
[59] M.C. Jones and R. Sibson. What is projection pursuit? J. of the Royal Statis-
tical Society, ser. A, 150:1-36, 1987.
[60] M. Kendall. Multivariate Analysis. Charles Griffin& Co., 1975.
[61] M. Kendall and A. Stuart. The Advanced Theory of Statistics. Charles Griffin
& Company, 1958.
[62] M. Lewicki and B. Olshausen. Inferring sparse, overcomplete image codes using
an efficient coding framework. In Advances in Neural Information Processing
10 (Proc. NIPS*97), pages 815-821. MIT Press, 1998.
[63] M. Lewicki and T. J. Sejnowski. Learning overcomplete representations.
[64] M. McKeown, S. Makeig, S. Brown, T.-P. Jung, S. Kindermann, A.J. Bell, V.
Iragui, and T. Sejnowski. Blind separation of functional magnetic resonance
imaging (fMRI) data. Human Brain Mapping, 6(5-6):368-372, 1998.
[65] M. Knaak & D. Filberi Acoustical semi-blind source separation for machine
monitoring. Proceedings of the International Conference on Independent Com-
ponent Analysis and Blind Source Separation, pp. 361-366, December 2001, San
Diego, USA.
[66] M. Knaak, M. Kunter, & D. Filberi. Blind Source Separation for Acoustical Ma-
chine Diagnosis. Proceedings of the International Conference on Digital Signal
Processing, pp. 159-162, July 2002, Santorini, Greece.
[67] M. Knaak, S. Araki & S. Makino. Geometrically constrained ICA for robust
separation of sound mixtures. Proceedings of the International Conference on
Independent Component Analysis and Blind Source Separation, pp. 951-956,
April 2003, Nara, Japan, 2003.
[68] M. McKeown, S. Makeig, G. Brown, T.-P. Jung, S. Kindermann, A. J. Bell, and
T. J. Sejnowski. Analysis of fMRI by blind separation into independent spatial
components. Human Brain Mapping, 6:1-31, 1998.
[69] M.S. Reza, M. Nasser, and M. Shahjaman. An Improved Version of Kurto-
sis Measure and Their Application in ICA. International Journal of Wireless
Communication and Information Systems (IJWCIS), Vol 1, No 1., 2011.
[70] M. Jones & R. Sibson. What is projection pursuit? J. of the Royal Statistical
Society, Ser. A, 150:1-36, 1987.
[71] N. Delfosse, and P. Loubaton. Adaptive blind separation of independent sources:
a deflation approach. Signal Processing, 45:59-83, 1995.
[72] P. Huber. Projection pursuit. The Annals of Statistics, 13(2):435-475, 1985.
[73] P. Comon. Independent Component Analysis-a new concept? Signal Processing,
36:287-314, 1994
[74] R. Gonzales & P. Wintz. Digital Image Processing. Addison-Wesley, 1987.
[75] R. Vigario, V. Jousmaki, M. Hamalainen, R. Hari, and E. Oja. Independent
component analysis for identification of artifacts in magnetoencephalographic
recordings. In Advances in Neural Information Processing Systems, volume 10,
pages 229-235. MIT Press, 1998.
[76] R. Vigario. Extraction of ocular artifacts from EEG using independent component
analysis. Electroenceph. Clin. Neurophysiol., 103(3):395-404, 1997.
[77] R. H. Lambert. Multichannel Blind Deconvolution: FIR Matrix Algebra and
Separation of Multipath Mixtures. PhD thesis, Univ. of Southern California,
1996.
[78] S. Amari, A.Cichocki, and H.H.Yang. A new learning algorithm for blind source
separation. In Advances in Neural Information Processing 8, pages 757-763.
MIT Press, Cambridge, MA, 1996.
[79] S.-I. Amari. Neural learning in structured parameter spaces natural riemannian
gradient. In Advances in Neural Information Processing 9, pages 127-133. MIT
Press, Cambridge, MA, 1997.
[80] S.-I. Amari and A. Cichocki. Adaptive blind signal processing neural network
approaches. Proceedings of the IEEE, 9, 1998.
[81] S. Makeig, A.J. Bell, T.-P. Jung, and T.-J. Sejnowski. Independent component
analysis of electroencephalographic data. In Advances in Neural Information
PRocessing Systems 8, pages 145-151. MIT Press, 1996.
[82] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet
representation. IEEE Trans. on PAMI, 11:674-693, 1989.
[83] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, Heidelberg, New
York, 1995.
[84] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley
& Sons, 1991.
[85] T-W. Lee, B.U. Koehler, and R. Orglmeister. Blind source separation of con-
volved and delayed sources. In Information Processing Systems 9, pages 758-764,
1997a.
[86] T-W. Lee, B.U. Koehler, and R. Orglmeister. Blind source separation of real-
world signals. In Proc. ICNN, pages 2129-2135, 1997b.
[87] T-W. Lee, B.U. Koehler, and R. Orglmeister. Blind source separation of nonlin-
ear mixing models. In Neural networks for Signal Processing VII, pages 406-415,
1997c.
[88] T.-W. Lee, M. Girolami, and T. J. Sejnowski. Independent component analysis
using an extended infomax algorithm for mixed sub-gaussian and super-gaussian
sources. Neural Computation, pages 609-633, 1998.
[89] T.-W. Lee, M. Girolami, A.J. Bell, and T.J. Sejnowski. A unifying information-
theoretic framework for independent component analysis. International Journal
on Mathematical and Computer Models, 1999.
[90] U. Lindgren, T. Wigren, and H. Broman. On local convergence of a class of
blind separation algorithms. IEEE Trans. on Signal Processing, 43:3054-3058,
1995.
[91] Z. Malouche and O. Macchi. Extended anti-Hebbian adaptation for unsuper-
vised source extraction. In Proc. ICASSP’96, pages 1664-1667, Atlanta, Geor-
gia, 1996.
[92] J.-F. Cardoso. Eigen-structure of the fourth-order cumulant tensor with
application to the blind source separation problem. In Proc. ICASSP'90, pages
2655-2658, Albuquerque, NM, USA, 1990.
[93] J.-F. Cardoso. Super-symmetric decomposition of the fourth-order cumulant
tensor: blind identification of more sources than sensors. In Proc. ICASSP'91,
pages 3109-3112, 1991.
[94] J.-F. Cardoso. Source separation using higher order moments. In Proc.
ICASSP'89, pages 2109-2112, 1989.
[95] R. Linsker. Local synaptic learning rules suffice to maximize mutual information
in a linear network. Neural Computation, 4:691-702.
[96] M. Gaeta and J.-L. Lacoume. Source separation without prior knowledge: the
maximum likelihood solution. In Proc. EUSIPCO, pages 621-624, 1990.
[97] J. Atick. Could information theory provide an ecological theory of sensory
processing? Network, 3:213-251, 1992.
[98] M. Girolami and C. Fyfe. Extraction of independent signal sources using a de-
flationary exploratory projection pursuit network with lateral inhibition. IEE
Proceedings on Vision, Image and Signal Processing, 144(5):299-306, 1997b.
[99] M. Scholz, S. Gatzek, A. Sterling, O. Fiehn, and J. Selbig. Metabolite fingerprint-
ing: detecting biological features by independent component analysis. Bioinfor-
matics, 20:2447-2454, 2004.
[100] S. Hochreiter and J. Schmidhuber. Feature extraction through LOCOCODE.
Neural Computation 11(3): 679-714, 1998.
[101] T.-P. Jung, C. Humphries, T.-W. Lee, S. Makeig, M. McKeown, V. Iragui, and
T. Sejnowski. Extended ICA removes artifacts from electroencephalographic
recordings. Submitted to Advances in Neural Information Processing Sys-
tems, May 1997.
[102] K. I. Diamantaras and S. Y. Kung. Principal Component Neural Networks:
Theory and Applications. Wiley, 1996.
[103] E. Oja, H. Ogawa, and J. Wangviwattana. Principal component analysis by
homogeneous neural networks, part I: the weighted subspace criterion. IEICE
Trans. on Information and Systems, E75-D(3):366-375, 1992.
[104] M. C. Jones and R. Sibson. What is projection pursuit? Journal of the Royal
Statistical Society, Series A, 150:1-36, 1987.
[105] D. Cook, A. Buja, and J. Cabrera. Projection pursuit indexes based on or-
thonormal function expansions. J. of Computational and Graphical Statistics,
2(3):225-250, 1993.
[106] J. Sun. Some practical aspects of exploratory projection pursuit. SIAM J. of
Sci. Comput., 14:68-80, 1993.
[107] P. F. Thall and S. C. Vail. Some covariance models for longitudinal count data
with overdispersion. Biometrics, 46:657-671, 1990.
[108] D. Yellin and E. Weinstein. Multichannel signal separation: Methods and anal-
ysis. IEEE Transactions on Signal Processing, 44(1):106-118, 1994.
[109] H.-L. Nguyen-Thi and C. Jutten. Blind source separation for convolutive mix-
tures. Signal Processing, 45(2), 1995.
[110] K. Torkkola. Blind separation of convolved sources based on information max-
imization. In IEEE Workshop on Neural Networks for Signal Processing, pages
423-432, Kyoto, Japan, 1996.
[111] R. Lambert and C. Nikias. Polynomial matrix whitening and application to
the multichannel blind deconvolution problem. In IEEE Conference on Military
Communications, pages 21-24, San Diego, CA, 1995a.
[112] M. Herrmann and H. H. Yang. Perspectives and limitations of self-organizing
maps. In Proc. ICONIP'96, 1996.
[113] J. Lin, D. Grier, and J. Cowan. Feature extraction approach to blind source
separation. In IEEE Workshop on Neural Networks for Signal Processing, 1997.
[114] G. Burel. Blind separation of sources: A nonlinear neural algorithm. Neural
Networks, 5:937-947, 1992.
[115] S. Chatterjee, A. Hadi, and B. Price. Regression Analysis by Example. Wiley,
New York, 2000.
[116] M. Saimul, M. Sahidul, and M. Nasser. PCA vs ICA in visualization of clusters.
In International Conference on Statistical Data Mining for Bioinformatics, Health,
Agriculture and Environment, pages 169-176, 21-24 Dec. 2012.
[117] M. Scholz, S. Gatzek, A. Sterling, O. Fiehn, and J. Selbig. Metabolite finger-
printing: Detecting biological features by independent component analysis. Bioin-
formatics, 20:2447-2454, 2004.
[118] J. B. Bugrien and J. T. Kent. Independent component analysis: An ap-
proach to clustering. In Proceedings of the 2009 International Confer-
ence on Modeling, Simulation & Visualization Methods (MSV 2009), Las Vegas,
Nevada, USA, July 13-16, 2009.
[119] I. R. Keck, S. Nassabay, C. G. Puntonet, and E. W. Lang. A new approach to clus-
tering and object detection with independent component analysis. In Artificial
Intelligence and Knowledge Engineering Applications: A Bioinspired Approach,
Lecture Notes in Computer Science, volume 3562, pages 558-566, 2005.
[120] R. Baragona and F. Battaglia. Outliers detection in multivariate time series by
independent component analysis. Neural Computation, 19(7):1962-1984, 2007.
[121] A. Ben-Hur and I. Guyon. Detecting stable clusters using principal component
analysis. In M. J. Brownstein and A. Khodursky (eds.), Functional Genomics:
Methods and Protocols, pages 159-182. Humana Press, 2003.
[122] K. Yeung and W. Ruzzo. An empirical study of principal component analysis
for clustering gene expression data. Bioinformatics, 17:763-774, 2001.