Pre-processing
Idea: X O X O X O -> X X X O O O
[Diagram: Input data -> Pre-processing -> Network -> Post-processing -> Output data]
Pre-processing works well with networks: because the network is trained afterwards, the pre-processing does not need to be exact
Why pre-process? Although in principle networks can approximate any function, in practice it is easier if pre-processing is performed first
Types of pre-processing:
1. Linear transformations
e.g. input normalisation
2. Dimensionality reduction
loss of info. Good pre-proc => lose irrelevant info and retain salient features
3. Incorporate prior knowledge
look for edges / translational invariants
4. Feature extraction
use a combination of input variables: can incorporate 1, 2 and 3
5. Feature selection
decide which features to use
e.g. Character recognition
For a 256 x 256 character we have 65,536 pixels. One input for each pixel is bad for many reasons:
1. Poor generalisation: data set would have to be vast to be able to properly constrain all the parameters (Curse of Dimensionality)
2. Takes forever to train
Answer: use e.g. averages of $N^2$ pixels (dimensionality reduction): each average could be a feature.
Which ones to use (select)? Use prior knowledge of where salient bits are for different letters
Be careful not to over-specify
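The averaging idea can be sketched in NumPy as follows. This is a minimal illustration, not code from the lecture: the function name `block_average`, the block size of 16 and the stand-in image are all assumptions.

```python
import numpy as np

def block_average(image, n):
    """Average non-overlapping n x n blocks of a 2-D image (hypothetical helper)."""
    h, w = image.shape
    if h % n or w % n:
        raise ValueError("image sides must be divisible by n")
    # Split each axis into (number of blocks, n) and average inside each block.
    return image.reshape(h // n, n, w // n, n).mean(axis=(1, 3))

# Stand-in for a 256 x 256 character image (assumed data, not from the slides).
image = np.arange(256 * 256, dtype=float).reshape(256, 256)

# 16 x 16 blocks reduce 65,536 pixel inputs to 256 averaged features.
features = block_average(image, 16).ravel()
print(features.shape)  # (256,)
```

Which of these averages to keep as inputs is then a feature selection question, as described above.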
e.g. if X was in one of k classes we could use the posterior probabilities $P(C_k | X)$ as features.
Since these probabilities sum to 1, in principle only k-1 features are needed.
In practice, it is hard to obtain $P(C_k | X)$ and so we would use a much larger number of features to ensure we don't throw out the wrong thing
Notice that the distinction between network training and pre-proc. is artificial:
If we got all the posterior probs. the classification is complete. Leave some work for the network to do.
Input normalisation
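Useful for RBFNs (and MLPs): if variation in one parameter is small with respect to the others it will contribute very little to distance measures, since $(l + \epsilon)^2 \approx l^2$. Therefore, pre-process the data to give each input zero mean and unit variance via the simple transformation:
$x_i^* = (x_i - \mu_i) / \sigma_i$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of input i over the data set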
However, this does not take into account correlations in the data.
Can be better to use whitening (Bishop, 1995, pp 299-300)
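A minimal sketch of zero-mean, unit-variance normalisation, assuming the data are stored as a NumPy array with one row per pattern. The function name `standardise` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def standardise(X):
    """Give every column of X zero mean and unit variance (hypothetical helper)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant inputs unscaled
    return (X - mean) / std, mean, std

# Two inputs on very different scales (assumed example data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([1.0, 1000.0]) + np.array([5.0, -3.0])

X_star, mean, std = standardise(X)
print(X_star.mean(axis=0), X_star.std(axis=0))
```

The stored `mean` and `std` would normally be reused to transform any later test data in the same way.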
Eigenvectors and eigenvalues
If: $Ax = \lambda x$
for some scalar $\lambda$ and some vector $x \neq 0$, then we say that x is an eigenvector with eigenvalue $\lambda$.
Clearly, x is not unique [e.g. if $Ax = \lambda x$, then $A(2x) = \lambda (2x)$], so it is usual to scale x so that it has unit length.
Intuition: the direction of x is unchanged by being transformed by A, so it in some sense reflects the principal axis of the transformation.
Eigenvector Facts
If the data is D-dimensional there will be D eigenvectors
If A is symmetric (true if A is the covariance matrix), the eigenvectors $u_i$ will be orthogonal, and scaled to unit length they satisfy:
$u_i^T u_j = 1$ if i = j
$u_i^T u_j = 0$ otherwise
This means that the eigenvectors form a set of basis vectors. That is, any vector x can be expressed as a linear sum of the eigenvectors:
$x = \sum_{i=1}^{d} z_i u_i$
Let U be a matrix whose columns are the eigenvectors $u_i$ of A, and $\Lambda$ a matrix with the corresponding eigenvalues $\lambda_i$ on the diagonal, i.e.:
$U = (u_1, \ldots, u_d)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$
So: $AU = U\Lambda$
Because of the orthogonality of the eigenvectors, U is orthonormal, i.e.:
$U^T U = U^{-1} U = I$ (that is, diag(1, ..., 1))
Thus we have the orthogonal similarity transformation:
$U^T A U = U^T U \Lambda = \Lambda$
by which we can transform A into a diagonal matrix
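These identities can be checked numerically. The sketch below uses `numpy.linalg.eigh`, which is intended for symmetric matrices; the 2 x 2 matrix is an arbitrary assumed example.

```python
import numpy as np

# An arbitrary symmetric matrix (assumed example).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigh returns eigenvalues in ascending order and unit-length eigenvectors as columns.
eigvals, U = np.linalg.eigh(A)
Lam = np.diag(eigvals)

print(np.allclose(U.T @ U, np.eye(2)))   # U is orthonormal
print(np.allclose(A @ U, U @ Lam))       # A U = U Lambda
print(np.allclose(U.T @ A @ U, Lam))     # similarity transform diagonalises A
```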
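Also, if A is the covariance matrix of multivariate normal data, the eigenvectors/eigenvalues reflect the direction and extent of variation
[Figure: data cloud with principal axes $u_1$ and $u_2$, labelled with $\lambda_1$ and $\lambda_2$]
Variance in each direction = eigenvalue, so standard deviation in each direction = $\sqrt{\lambda_i}$
If A is diagonal, eigenvectors are oriented along the axes
If A is the identity, the distribution is circular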
Whitening
$x^* = \Lambda^{-1/2} U^T (x - \mu)$
where U is a matrix whose columns are the eigenvectors $u_i$ of $\Sigma$, the covariance matrix of the data, $\Lambda$ is a matrix with the corresponding eigenvalues $\lambda_i$ on the diagonal, and $\mu$ is the mean of the data
Why? Because the new covariance matrix will be approximately the identity matrix
[Figure: data cloud with principal axes $u_1$ and $u_2$, labelled with $\lambda_1$ and $\lambda_2$]
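A minimal sketch of the whitening transformation, assuming one pattern per row. The function name `whiten` and the correlated test data are illustrative assumptions.

```python
import numpy as np

def whiten(X):
    """Apply x* = Lambda^{-1/2} U^T (x - mu) to every row of X (hypothetical helper)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)         # sample covariance of the data
    eigvals, U = np.linalg.eigh(Sigma)      # Sigma is symmetric
    # (X - mu) @ U computes U^T (x - mu) row by row; then scale by Lambda^{-1/2}.
    return (X - mu) @ U / np.sqrt(eigvals)

# Correlated 2-D data (assumed example).
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0],
                [1.5, 0.5]])
X = rng.normal(size=(2000, 2)) @ mix.T + np.array([1.0, -2.0])

X_white = whiten(X)
print(np.round(np.cov(X_white, rowvar=False), 3))
```

On the data used to estimate $\mu$ and $\Sigma$ the whitened covariance is the identity up to rounding; on new data drawn from the same source it would only be approximately so.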
Dimensionality Reduction
Clearly losing some information but this can be helpful due to curse of dimensionality
Need some way of deciding what dimensions to keep
1. Random choice
2. Principal components analysis (PCA)
3. Independent components analysis (ICA)
4. Self-organised maps (SOM) etc
Random subset selection
Any suitable algorithm can be used, especially ones used in selecting the number of hidden units
• Sequential forward search
• Sequential backward search
• Plus-l take away r
• etc
Principal Components Analysis
Transform the data into a lower dimensional space but lose as little information as possible
Project the data onto unit vectors to reduce the dimensionality of the data. What vectors to use?
[Figure: projection of x onto a unit vector y: $x^* = x^T y = y^T x$, with $\|y\| = 1$]
Want to reduce the dimensionality of x from d to M
Write $x = \sum_{i=1}^{d} z_i u_i$ where the $u_i$ are orthonormal
Thus $z_i = u_i^T x$, as $u_i^T u_j = \delta_{ij}$, i.e. $z_i$ is the projection of x onto the principal direction $u_i$ and is known as a principal component.
[Figure: data points x scattered about the principal axes $u_1$ and $u_2$, labelled with $\lambda_1$ and $\lambda_2$]
Now form a new vector in only M dimensions:
$\tilde{x} = \sum_{i=1}^{M} z_i u_i$
Thus the error is:
$x - \tilde{x} = \sum_{i=M+1}^{d} z_i u_i$
Calculating the mean square error over all the points in the data set, at the minimum we have (Appendix E, Bishop):
$E = \frac{1}{2} \sum_{i=M+1}^{d} \lambda_i$
where the $\lambda_i$ are the eigenvalues of the eigenvectors (which form an orthonormal basis) of the covariance matrix of the data set minus its mean, i.e. $X - \bar{x}$.
Therefore to minimise E we discard the dimensions with the smallest eigenvalues
PCA can also be motivated from considerations of the variance along an axis specified by the eigenvectors
Total variance of the d components of x is:
$\sum_{i=1}^{d} \sigma_i^2 = \sum_{i=1}^{d} \lambda_i$
where $\lambda_i$ is the variance in the i'th principal component $z_i = u_i^T x$, i.e. the variance along the projection of x onto $u_i$
Thus the variance in the last d-M elements of the approximation error $x - \tilde{x} = \sum_{i=M+1}^{d} z_i u_i$ is:
$\sum_{i=M+1}^{d} \sigma_i^2 = \sum_{i=M+1}^{d} \lambda_i$
Thus the missing principal components are the directions with the least variance
PCA procedure:
1. Given a data set X = {x_1, ..., x_N} normalise the data (subtract the mean and divide by the std deviation) and calculate the covariance matrix C
2. Calculate the eigenvalues $\lambda_i$ and eigenvectors $u_i$ of C and order them from 1 to d in descending order starting with the largest eigenvalue
3. Discard the last d-M dimensions and transform the data via:
$x \to \tilde{x} = \sum_{i=1}^{M} z_i u_i$ where $z_i = u_i^T x$
I.e. the $z_i$ are the principal components (NB some books refer to the $u_i$ as the principal components)
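The batch PCA procedure can be sketched as below. This is an illustrative implementation under the assumption of one pattern per row; the function name `pca` and the synthetic data are not from the lecture.

```python
import numpy as np

def pca(X, M):
    """Batch PCA following the three steps of the procedure (hypothetical helper)."""
    # Step 1: normalise (subtract mean, divide by std deviation), then covariance.
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
    C = np.cov(X_norm, rowvar=False)
    # Step 2: eigen-decomposition, reordered so the largest eigenvalue comes first.
    eigvals, U = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]
    # Step 3: keep the first M directions; z_i = u_i^T x for every pattern.
    Z = X_norm @ U[:, :M]
    return Z, U[:, :M], eigvals

# Three inputs, two of them strongly correlated (assumed example data).
rng = np.random.default_rng(1)
t = rng.normal(size=500)
X = np.column_stack([t,
                     2 * t + 0.1 * rng.normal(size=500),
                     rng.normal(size=500)])

Z, U_M, eigvals = pca(X, M=2)
print(Z.shape, eigvals)
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is the link between the error-based and variance-based motivations of PCA.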
Why use input normalisation?
Must subtract the mean vector as the theory requires that the data are centred at the origin
Also, we divide by the standard deviation as we must do something to ensure that input dimensions with a large range do not dominate the variance terms
Why not use whitening?
Since this removes the correlations that we are trying to find and makes all the eigenvalues similar
Should result in losing unnecessary information
[Figure] Here the data is best viewed along the dimension of the eigenvector with the most variance as this shows the 2 clusters clearly
But it is not guaranteed to work …
[Figure] Here projecting the data onto u1, the eigenvector with the most variance, loses all discriminatory information
Finally: How to decide M ie how many/which dimensions to leave out?
This may be decided in advance due to constraints on processing power
Another technique (used in eg Matlab) is to look at the contribution to the overall variance of each principal component and leave out any dimensions which fall below a certain threshold
As ever, no one answer: may just want to try a few combinations or could even keep them all
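One way to apply the variance-threshold idea is sketched below. The function name `choose_m`, the example eigenvalues and the threshold of 0.1 are assumptions for illustration; this is not a reproduction of any particular package's behaviour.

```python
import numpy as np

def choose_m(eigvals, threshold=0.05):
    """Count the components whose share of the total variance is at least `threshold`."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    share = eigvals / eigvals.sum()   # fraction of total variance per component
    return int(np.sum(share >= threshold)), share

# Assumed eigenvalues of a covariance matrix.
M, share = choose_m([4.0, 3.0, 2.0, 0.5, 0.5], threshold=0.1)
print(M, share)
```

Here the last two components each contribute 5% of the variance and fall below the threshold, so three dimensions are kept.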
PCA is very powerful in practical applications. But how do we compute eigenvectors and thus principal components in real situations?
Basically two ways: Batch and sequential
• We have seen the batch method, but this can be impractical if the dimensionality or no. of data points is too large
• Also, in a nonstationary environment, sequential can track gradual changes in the data
• It requires less storage space
• Sequential mode is used in modelling self-organization
• It mimics Hebbian learning rule …
Hebbian Learning
Hebb's postulate of learning (or simply Hebb's rule) (1949) is the following:
"When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth processes or metabolic changes take place in one or both cells such that A's efficiency as one of the cells firing B, is increased".
I.e. if $v = w \cdot u$ then:
$\tau_w \frac{dw}{dt} = v u$
However, simple Hebbian learning causes uncontrolled growth of the weights to a max value, so we need to impose a normalisation constraint:
$\tau_w \frac{dw}{dt} = v u - \alpha v^2 w$
where $\alpha$ is a +ve constant: known as Oja's rule (1982), which makes $|w|^2$ gradually relax to $1/\alpha$, a form of competition between synapses
In this way networks can exhibit selective amplification if there is one dominant eigenvector (cf PCA)
How can such precise tuning come about? Hebbian learning
Relationship between PCA and Hebbian learning
Consider a single neuron with a Hebbian learning rule:
[Figure: single linear neuron with input x(t), weights $w_1(t), \ldots, w_d(t)$ and output $y(t) = w^T(t)\,x(t)$]
Oja's learning rule (Oja, 1982):
$w_i(t+1) = w_i(t) + \eta\,[y(t)\,x_i(t) - y^2(t)\,w_i(t)]$
where $\eta\,y(t)\,x_i(t)$ is the Hebbian term and $-\eta\,y^2(t)\,w_i(t)$ is the normalisation term which avoids uncontrolled growth of the weights (=> ||w|| = 1 at convergence)
This can be shown to have a stable minimum at $C w = \lambda_1 w$
where C is the covariance matrix of the training data. Result: w(t) converges to w, the eigenvector of C which has the largest eigenvalue $\lambda_1$.
The output is therefore: $y = w^T x = u_1^T x$, i.e. the first principal component of the data
Thus a single linear neuron with a Hebbian learning rule can evolve into a filter for the first principal component
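Intuitively, consider the 1D case: here the eigenvector w is either 1 or -1. At convergence of Oja's learning rule the weight update vanishes:
$y(t)\,(x(t) - y(t)\,w(t)) = 0$
Since $y = w x$, this is $w x^2 (1 - w^2) = 0$, which (apart from the unstable solution w = 0) is satisfied if w(t) = 1 or -1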
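A minimal sketch of a single linear neuron trained with Oja's rule, compared against the leading eigenvector from the batch method. The learning rate `eta = 0.002`, the synthetic data and the function name `oja` are assumptions chosen for illustration.

```python
import numpy as np

def oja(X, eta=0.002, seed=0):
    """Sequentially train one linear neuron with Oja's rule (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    w = 0.1 * rng.normal(size=X.shape[1])   # small random initial weights
    for x in X:
        y = w @ x                           # output y(t) = w^T(t) x(t)
        w += eta * (y * x - y**2 * w)       # Hebbian term minus normalisation term
    return w

# Zero-mean data with variances 9, 4 and 1 along the axes (assumed example).
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 3)) * np.array([3.0, 2.0, 1.0])
X -= X.mean(axis=0)

w = oja(X)

# Batch answer for comparison: eigenvector of C with the largest eigenvalue.
eigvals, U = np.linalg.eigh(np.cov(X, rowvar=False))
u1 = U[:, -1]
print(abs(w @ u1), np.linalg.norm(w))
```

The sign of `w` is arbitrary, so the comparison uses the absolute value of the inner product with `u1`. With a fixed learning rate the weights keep fluctuating slightly around the eigenvector rather than settling exactly on it.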
We now introduce a special PCA learning rule called APEX developed by Kung and Diamantaras, 1990. This is a generalisation of the single neuron case to multiple neurons where the outputs are connected via inhibitory links
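[Figure: APEX network with input x(t), feedforward weights $w_{11}(t), \ldots, w_{1d}(t), \ldots, w_{dd}(t)$, lateral weights $a_{j1}(t), \ldots$ and outputs $y_1(t), y_2(t), \ldots$]
output j: $y_j(t) = w_j^T(t)\,x(t) + a_j^T(t)\,\mathbf{y}_{j-1}(t)$
where we define the feedback vector:
$\mathbf{y}_{j-1}(t) = [y_1(t), y_2(t), \ldots, y_{j-1}(t)]$
and the weight vectors $w_j(t) = [w_{j1}(t), w_{j2}(t), \ldots, w_{jd}(t)]$ and $a_j(t) = [a_{j1}(t), a_{j2}(t), \ldots, a_{j,j-1}(t)]$ (one lateral weight per earlier output)
The update rules for $w_j$ and $a_j$ are:
$w_j(t+1) = w_j(t) + \eta\,[y_j(t)\,x(t) - y_j^2(t)\,w_j(t)]$
(Hebbian + normalisation)
$a_j(t+1) = a_j(t) - \eta\,[y_j(t)\,\mathbf{y}_{j-1}(t) + y_j^2(t)\,a_j(t)]$
(anti-Hebbian (inhibitory) + normalisation)
The procedure to find the $y_j$ (i.e. the principal components) is analogous to proof by induction: if we have found $(y_1, y_2, \ldots, y_{j-1})$ we can determine the feedback vector:
$\mathbf{y}_{j-1}(t) = [y_1(t), \ldots, y_{j-1}(t)]$
and hence $y_j$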
APEX algorithm
1. Initialize the feedforward weight vector $w_j$ and the feedback weight vector $a_j$ to small random values at time t = 1, where j = 1, 2, …, d. Assign a small positive value to the learning rate $\eta$
2. Set j = 1 and compute the first principal component $y_1$ as for the single neuron, i.e. for t = 1, 2, 3, … compute:
$y_1(t) = w_1^T(t)\,x(t)$
$w_1(t+1) = w_1(t) + \eta\,y_1(t)\,(x(t) - y_1(t)\,w_1(t))$
3. Set j = 2 and for t = 1, 2, 3, … compute:
$\mathbf{y}_{j-1}(t) = [y_1(t), \ldots, y_{j-1}(t)]$ (the feedback)
$y_j(t) = w_j^T(t)\,x(t) + a_j^T(t)\,\mathbf{y}_{j-1}(t)$
$w_j(t+1) = w_j(t) + \eta\,y_j(t)\,(x(t) - y_j(t)\,w_j(t))$
$a_j(t+1) = a_j(t) - \eta\,y_j(t)\,(\mathbf{y}_{j-1}(t) + y_j(t)\,a_j(t))$
4. Increase j by 1 and go to step 3. Repeat until j = M, the desired number of dimensions
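The steps of the algorithm can be sketched as follows. This is an illustrative reading rather than a reference implementation: neurons are trained one after another on a full pass through the data, each lateral vector $a_j$ is given one weight per earlier output, and the learning rate, data and function name `apex` are assumptions.

```python
import numpy as np

def apex(X, M, eta=0.002, seed=0):
    """Train M neurons one after another with the APEX update rules (sketch)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = 0.1 * rng.normal(size=(M, d))                   # row j: feedforward vector w_j
    A = [0.1 * rng.normal(size=j) for j in range(M)]    # a_j: one weight per earlier output
    for j in range(M):                                  # steps 2 to 4
        for x in X:
            # Feedback vector: outputs of the neurons that are already trained.
            y_prev = np.zeros(j)
            for k in range(j):
                y_prev[k] = W[k] @ x + A[k] @ y_prev[:k]
            y = W[j] @ x + A[j] @ y_prev
            W[j] += eta * y * (x - y * W[j])            # Hebbian + normalisation
            A[j] -= eta * y * (y_prev + y * A[j])       # anti-Hebbian + normalisation
    return W, A

# Zero-mean data with variances 9, 4 and 1 along the axes (assumed example).
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 3)) * np.array([3.0, 2.0, 1.0])
X -= X.mean(axis=0)

W, A = apex(X, M=2)

# Compare with the batch eigenvectors (eigh sorts eigenvalues in ascending order).
eigvals, U = np.linalg.eigh(np.cov(X, rowvar=False))
cos1 = abs(W[0] @ U[:, -1]) / np.linalg.norm(W[0])
cos2 = abs(W[1] @ U[:, -2]) / np.linalg.norm(W[1])
print(cos1, cos2)
```

For the first neuron the lateral vector is empty, so the update reduces to the single-neuron Oja rule.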
Theoretically, PCA is the optimal linear way (in terms of losing as little information as possible, measured by the squared reconstruction error) to encode high dimensional data onto a lower dimensional subspace
Can be used for data compression where intuition is that getting rid of dimensions with little variance gets rid of noise
Independent Components Analysis (ICA)
As the name implies, an extension of PCA but rooted in information theory. Starting point: suppose we have the following situation:
[Diagram: Source vector u(n) -> Mixer A -> Observation vector x(n) -> Demixer W -> Output vector y(n); the source and the mixer form the unknown environment]
That is we have a number of vectors of values (indexed by n eg data at various time-steps) generated by d independent sources
u(n) = (u1 (n), …, ud (n) )
(assumed to have zero mean) which have been mixed by a d x d matrix A to give a vector of observations:
x(n) = (x1 (n), …, xd (n) )
(also zero mean as u zero mean). That is:
x (n) = A u (n)
Where A and u(n) are unknown. The problem is to recover u when all we know (all we can see) are the observation vectors x
Problem therefore known as blind source separation
Example:
u1 (t) = 0.1 sin (400 t )cos( 30 t)
u2 (t) = 0.001 sign (sin(500 t+ 9cos (40 t)))
u3 (t) = uniformly distributed noise in the range [-1,1]
$A = \begin{pmatrix} 0.56 & 0.79 & -0.37 \\ -0.75 & 0.65 & 0.86 \\ 0.17 & 0.32 & -0.48 \end{pmatrix}$
x1 (t) = 0.56 u1 (t) + 0.79 u2 (t) -0.37 u3 (t)
x2 (t) = -0.75 u1 (t) + 0.65 u2 (t) +0.86 u3 (t)
x3 (t) = 0.17 u1 (t) + 0.32 u2 (t) -0.48 u3 (t)
Problem: we receive signals x(t), how do we recover u(t)?
[Figure: the source signals u1(t), u2(t), u3(t) and the mixed signals x1(t), x2(t), x3(t), plotted against t from 0 to 1 with amplitudes between -1 and 1]
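The mixing in this example can be reproduced numerically. The sketch below takes the source definitions and the coefficients of A from the equations for x1, x2 and x3; the time grid (10000 samples on [0, 1)) and the random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 10000, endpoint=False)   # assumed time grid

# The three independent sources.
u1 = 0.1 * np.sin(400 * t) * np.cos(30 * t)
u2 = 0.001 * np.sign(np.sin(500 * t + 9 * np.cos(40 * t)))
u3 = rng.uniform(-1.0, 1.0, size=t.size)
U = np.vstack([u1, u2, u3])                        # one source per row

# Mixing matrix read off the equations for x1, x2, x3.
A = np.array([[ 0.56, 0.79, -0.37],
              [-0.75, 0.65,  0.86],
              [ 0.17, 0.32, -0.48]])

X = A @ U                                          # observations x(t) = A u(t)

# With A known the sources come back exactly; in blind source separation
# A is unknown, so this line only serves as a check on the mixing.
U_rec = np.linalg.inv(A) @ X
print(np.allclose(U_rec, U))
```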
To solve this we need to find a matrix W such that: y(n) = W x(n)
with the property that u can be recovered from the outputs y. Thus the blind source separation problem can be stated as:
Given N independent realisations of the observation vector x, find an estimate for the inverse of the mixing matrix A
since: $y(n) = W x(n) = A^{-1} x(n) = A^{-1} A u(n) = u(n)$
Neurobiological correlate: the cocktail party problem
The brain has the ability to selectively tune to and follow one of a number of (independent) voices despite noise, delays, water in your ear, lecturer droning on etc etc
Very many applications including:
Speech analysis for eg teleconferencing
Financial analysis: extract the underlying set of dominant components
Medical sensor interpretation: eg separate a foetus's heartbeat from the mother's
(Sussex) neuroscience (Ossorio, Baddeley and Anderson): analysis of cuttlefish patterns. Try to find an underlying alphabet/language of patterns used to convey information
Use Independent Component Analysis (Comon, 1994)
Can be viewed as an extension of PCA as both aim to find linear sums of components to re-represent the data
In ICA, however, we impose statistical independence on the vectors found and lose the orthogonality constraint
Definition: random variables X and Y are statistically independent if joint probability density function can be expressed as a product of the marginal density functions (ie pdf’s of X and Y as if they were on their own):
f(x, y) = f(x) f(y)
[NB discrete analogy: if A and B are independent events then:
P(A and B) = P(A, B) = P(A) P(B) ]
[Figure: two panels, PCA and ICA]
PCA is good for gaussian data, ICA is good for non-gaussian data, as independence => non-gaussianity
In fact, independent components MUST be non-gaussian (more interesting distributions if non-gaussian) and to get the components we maximise the non-gaussianity (measured by e.g. the kurtosis) of the data
Why? Because a linear sum of gaussians is itself gaussian and one cannot distinguish the components from the mixture model
Young field (mid 90’s), still developing, somewhat in concurrence with kernel techniques (eg kernel PCA and kernel ICA: find non-linear combinations of components to represent the data)
Need some measure of statistical independence of X and Y: Can use mutual information I(X, Y)
Concept from information theory: defined in terms of entropy which is a measure of the average amount of information a variable conveys, or analogously our uncertainty about the variable
If X is the system input and Y the system output, the mutual information I(X, Y) is the difference in our levels of uncertainty about the system input (its entropy) before and after observing the system output. Thus if:
I(X, Y) = 0 then X and Y are statistically independent
[or intuitively: no information about X from Y and vice versa => X, Y independent]
The idea, therefore, is to minimise the mutual info $I(y_i, y_j)$ between all pairs i and j of the outputs (which we want to be equal to the original inputs, which are independent)
This is equivalent to minimising the Kullback-Leibler (KL) divergence which measures the difference between the joint pdf f(y,W) and the product of the marginal densities f(yi,W) with respect to W.
Thus we have (a variant of) the Infomax Principle (Comon):
Given a d-by-1 vector x representing a linear combination of d independent source signals, the transformation of the observation vector x by a neural system into a new vector y should be carried out in a way that the KL divergence between the parameterised probability density function f(y,W) and the product of the marginal densities f(yi,W) is minimised with respect to the unknown parameter matrix W
Which after some hard maths (Haykin 10.11 and other bits of chapter 10) leads us to the following algorithm for finding W
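$W(n+1) - W(n) = \eta(n)\,[I - \varphi(y(n))\,y^T(n)]\,W(n)$
where:
$\varphi(y) = [\varphi(y_1), \varphi(y_2), \ldots, \varphi(y_m)]^T$
and:
$\varphi(y_i) = \frac{1}{2} y_i^5 + \frac{2}{3} y_i^7 + \frac{15}{2} y_i^9 + \frac{2}{15} y_i^{11} - \frac{112}{3} y_i^{13} + 128 y_i^{15} - \frac{512}{3} y_i^{17}$
NB $\eta$ must be chosen to be sufficiently small for stability of the algorithm (see Haykin).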
Many other versions are available (FastICA seems quite good)
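A single step of the learning rule for W can be sketched as below. The polynomial coefficients and signs in `phi_poly` are an assumed reading of the activation function quoted from Haykin and should be checked against that text; the function names and the small worked example (which swaps in a simple cubic so the arithmetic can be followed by hand) are illustrative.

```python
import numpy as np

def phi_poly(y):
    """Polynomial activation; coefficients and signs are an assumed reading."""
    return (0.5 * y**5 + (2 / 3) * y**7 + 7.5 * y**9 + (2 / 15) * y**11
            - (112 / 3) * y**13 + 128 * y**15 - (512 / 3) * y**17)

def ica_step(W, x, eta, phi=phi_poly):
    """One update W(n+1) = W(n) + eta [I - phi(y) y^T] W(n), with y = W x."""
    y = W @ x
    return W + eta * (np.eye(len(y)) - np.outer(phi(y), y)) @ W

# Hand-checkable example with phi(y) = y^3 (a stand-in, not the polynomial above).
W = np.eye(2)
x = np.array([1.0, 0.0])
W_new = ica_step(W, x, eta=0.1, phi=lambda y: y**3)
print(W_new)
```

In this example y = (1, 0), so the first row of the bracketed matrix vanishes and only the second diagonal entry of W grows, by eta.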
Return to the problem described earlier …
Using the blind separation learning rule starting from random weights in the range [0, 0.05], $\eta = 0.1$, N = 65000, timestep = $1 \times 10^{-4}$, batch version of algorithm for stability (Haykin, p.544):
$W(0) = \begin{pmatrix} 0.0109 & 0.0340 & 0.0260 \\ 0.0024 & 0.0467 & 0.0415 \\ 0.0339 & 0.0192 & 0.0017 \end{pmatrix}$
W(t) converges to
$\begin{pmatrix} 0.2222 & 0.0294 & -0.6213 \\ -10.1932 & -9.8131 & -9.7259 \\ 4.1191 & -1.7879 & -6.3765 \end{pmatrix}$
around t = 300, where
$WA \approx \begin{pmatrix} 2.5 & 0 & 0 \\ 0 & 17.5 & 0 \\ 0 & 0 & 0.24 \end{pmatrix}$
W is almost an inverse of A (with scaling of the original signals as the solution not unique) and so the signal is recovered
Components can only be estimated up to a rescaling (since if x is a component multiplied by a, then 2x multiplied by a/2 is also a component)
Note that this means we often get –x instead of x
Pre-processing in ICA
Must pre-process before performing ICA to give the data zero mean
Also helps to whiten the data as it makes the mixing matrix orthogonal, which means there are fewer parameters to estimate (since $A^T = A^{-1}$)
Often good to reduce the dimensionality (via PCA etc) to get rid of/reduce noise
[Figures: ICA example 2: original sources; mixed images; PCA/whitened images; extracted components; rescaling of -ve components]
Example 1: Speech - Music Separation
A speaker has been recorded with two distance talking microphones (sampling rate 16kHz) in a normal office room with loud music in the background. The distance between the speaker, cassette player and the microphones is about 60cm in a square ordering.
[Audio: Microphone 1; Microphone 2; Separated source 1; Separated source 2]
Example 2: Speech - Speech Separation: a real Cocktail Party Effect
Two speakers have been recorded speaking simultaneously. Speaker 1 says the digits from one to ten in English and speaker 2 counts at the same time the digits in Spanish (uno dos ...). The recording has been done in a normal office room. The distance between the speakers and the microphones is about 60cm in a square ordering.
[Audio: Microphone 1; Microphone 2; Separated source 1; Separated source 2]
Example 3: Speech - Speech Separation in difficult environments: a real Cocktail Party Effect II
Two speakers have been recorded speaking simultaneously. This time the recording was in a conference room (5.5m by 8m). The conference room had some air-conditioning noise. Both speakers are reading a section from the newspaper for 16sec. The mics were placed 120cm away from the speakers. The unmixing filters need to be sufficiently long. We used a filter size of 2048 taps for each filter.
[Audio: Microphone 1; Microphone 2; Separated source 1; Separated source 2]
Pre-processing for time series
Problem: prediction of time series data ie sequence of measurements taken at regular time intervals. e.g. share price, weather forecast, speech signals etc
Take 1D case for simplicity, x(t): network attempts to approximate the x(t) from the previous d outputs [x(t - d), … , x(t –1)] used as inputs: one-step ahead prediction
Could try to predict more steps ahead (multi-step ahead prediction) but the errors tend to accumulate quickly and such efforts are usually characterised by a sharp decrease in performance
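Building the training set for one-step ahead prediction amounts to sliding a window of length d along the series. A minimal sketch, with the function name `make_windows` and the toy series as assumptions:

```python
import numpy as np

def make_windows(series, d):
    """Pair each window [x(t-d), ..., x(t-1)] with the target x(t) (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    inputs = np.array([series[i:i + d] for i in range(len(series) - d)])
    targets = series[d:]
    return inputs, targets

inputs, targets = make_windows([1, 2, 3, 4, 5, 6], d=3)
print(inputs)   # rows [1 2 3], [2 3 4], [3 4 5]
print(targets)  # [4. 5. 6.]
```

Each row of `inputs` is one network input pattern and the matching entry of `targets` is the value the network should predict.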
Here we are attempting to fit the static function which is underlying the fluctuations
That is, if there is a general trend (eg increase with time) we want to remove it: de-trending
That is, we first attempt to fit a simple (eg linear) function of time to the data and then take this away from the inputs
However, if the trend itself evolves with time this is inappropriate and on-line techniques are needed to track the data. How to do this well is an open research issue
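A minimal sketch of de-trending with a linear function of time, using `numpy.polyfit` for the least-squares fit. The synthetic series and the function name `detrend_linear` are assumptions; a trend that itself evolves would need the on-line techniques mentioned above instead.

```python
import numpy as np

def detrend_linear(t, x):
    """Fit x ~ slope * t + intercept by least squares and subtract the fit."""
    slope, intercept = np.polyfit(t, x, deg=1)   # highest power first
    trend = slope * t + intercept
    return x - trend, trend

# A fluctuation riding on a steady increase (assumed example data).
t = np.arange(100, dtype=float)
x = 0.5 * t + 3.0 + np.sin(0.3 * t)

residual, trend = detrend_linear(t, x)
print(residual.mean())
```

The network would then be trained on `residual`, and the fitted trend added back to its predictions.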
Prior knowledge can also be used to improve network training and performance
Several ways of incorporating prior knowledge eg invariances
1. If we know the network output should be invariant to some transformations of the data we could ‘pre-process’ the data set X by forming transformed versions of members of the data set and adding them to a new data set X*
however, this can quickly lead to very large data sets
2. Simply ‘remove’ invariant points by pre-processing
If we are classifying lists of properties and the order is unimportant, pre-process the data so that the lists are all mapped to a unique ordering, eg alphabetical
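The unique-ordering idea is simple to express in code. A minimal sketch, where the function name `canonical` and the example property lists are assumptions:

```python
def canonical(properties):
    """Map a list of properties to a unique (alphabetical) ordering."""
    return tuple(sorted(properties))

a = canonical(["red", "large", "round"])
b = canonical(["round", "red", "large"])
print(a == b, a)
```

Both orderings of the same properties map to the same input, so the network never has to learn that the order is irrelevant.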