Efficient Independent Component Analysis on a GPU

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

technology from seed

29-06-10 Frontiers of GPU Computing 2010


Rui Ramalho, Pedro Tomás, Leonel Sousa



Outline

• Motivation
• Independent Component Analysis
• FastICA Algorithm
• Experimental Results
• Conclusions


Blind Source Separation

• Blind Source Separation (BSS) is a signal processing technique that separates a set of signals (sources) from a set of mixed signals.

• Little is known about the original signals or the mixing process, only that the original signals are uncorrelated.


[Figure: three groups of signal plots over sample index 0 to 500. The original signals are mixed ("mix") into the mixed signals, which are then separated ("sep") into the independent components.]



Blind Source Separation: Cocktail Party

• A classical example of blind source separation is the cocktail party problem.

– A number of people are talking simultaneously in a crowded room (at a cocktail party).

– Despite all the noise and cross talking, a human brain has little difficulty following a conversation.

– Machines have to rely on blind source separation.




Blind Source Separation: Applications

• Blind Source Separation has also been used in several other domains:

– EEG/MEG measurements (each sensor picks up a mixture of the brain's electrical activity, and BSS can be used to separate and identify the underlying sources).

– Denoising images (by treating the noise as an independent source it is possible to separate it from the image’s original components).

– Financial analysis (BSS can be used to uncover hidden factors in financial data).




Independent Component Analysis

• Independent Component Analysis (ICA) is a special case of Blind Source Separation.

• In ICA, the sources of the mixed signals are assumed to be statistically independent (BSS in general only assumes that the sources are uncorrelated).




Independent Component Analysis: ICA Model

• Under the ICA model, the observed variables x are assumed to be a linear combination of several independent sources/signals s: $x = As$

• The objective of ICA is to find the matrix W that inverts the mixing operation performed by the matrix A, without knowledge of A or s: $y = Wx \approx s$
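
The mixing model can be illustrated with a minimal NumPy sketch. The two sources, the mixing matrix `A`, and the sample count are hypothetical values chosen for illustration; in a real problem `A` is unknown and `W` has to be estimated, whereas here the ideal `W` is simply the inverse of `A`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500

# Two illustrative, statistically independent sources (rows of s).
t = np.arange(n_samples)
s = np.vstack([
    np.sin(2 * np.pi * t / 50),          # sinusoid
    rng.uniform(-1.0, 1.0, n_samples),   # uniform noise
])

# Hypothetical mixing matrix; unknown in a real BSS problem.
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])

x = A @ s              # observed mixtures: x = A s
W = np.linalg.inv(A)   # ideal unmixing matrix that ICA tries to estimate
y = W @ x              # recovered components: y = W x

print(np.allclose(y, s))
```

ICA can only recover the sources up to permutation, sign, and scale, so an estimated `W` would generally differ from `inv(A)` by those ambiguities.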




Independent Component Analysis: Measuring Statistical Independence

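• One of the ways of measuring statistical independence is through negentropy: $J(y) = H(y_{gauss}) - H(y)$

– H(y) is the differential information entropy of y: $H(y) = -\int f(y) \log f(y)\, dy$

– $y_{gauss}$ is a Gaussian variable with the same covariance as y

• In practice J(y) needs to be estimated. The estimator used by FastICA is: $J(y) \propto \left[ E\{G(y)\} - E\{G(\nu)\} \right]^2$

– G is a nonquadratic nonlinear function

– $\nu$ is a Gaussian variable of zero mean and unit variance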
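
As a rough illustration of this estimator, the sketch below evaluates the standard FastICA approximation (E{G(y)} - E{G(ν)})² with sample means. The choice G(u) = log cosh(u) is an assumption (its derivative is the hyperbolic tangent), and the function name, sample sizes, and test distributions are hypothetical.

```python
import numpy as np

def G(u):
    # Assumed nonquadratic function; its derivative is tanh(u).
    return np.log(np.cosh(u))

def negentropy_estimate(y, n_gauss=200_000, seed=0):
    """Estimate J(y) up to a positive constant: (E{G(y)} - E{G(nu)})**2."""
    y = (y - y.mean()) / y.std()   # zero mean, unit variance
    nu = np.random.default_rng(seed).standard_normal(n_gauss)
    return (G(y).mean() - G(nu).mean()) ** 2

rng = np.random.default_rng(1)
gauss = rng.standard_normal(100_000)
uniform = rng.uniform(-1.0, 1.0, 100_000)
laplace = rng.laplace(size=100_000)

for name, sample in [("gauss", gauss), ("uniform", uniform), ("laplace", laplace)]:
    print(name, negentropy_estimate(sample))
```

The Gaussian sample should give an estimate close to zero, while the uniform and Laplace samples should give clearly larger values, reflecting their non-Gaussianity.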




FastICA Algorithm

• The procedure for computing the independent components can be divided into three stages:

– Preprocessing: allows a number of simplifications in the FastICA algorithm.

– Weight vector computation: the FastICA algorithm itself.

– Decorrelation: prevents the weight vectors from converging to the same solutions.



FastICA Algorithm: Preprocessing & Weight Vector Computation

• Preprocessing includes general tasks such as centering, whitening or filtering the data.

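• The computation of each of the weight vectors is done by iterating until convergence: $w^+ = E\{x\, g(w^T x)\} - E\{g'(w^T x)\}\, w$, followed by the normalization $w = w^+ / \|w^+\|$

– g is the derivative of the nonquadratic function G used in the negentropy estimator, and g' is the derivative of g

– This algorithm can be modified to compute all the ICs simultaneously (a symmetric approach).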
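
The one-unit fixed-point iteration can be sketched in NumPy as follows. This is a CPU illustration under stated assumptions, not the GPU implementation: it uses g = tanh, whitens the data with an eigendecomposition of the covariance matrix, and the function names, mixing matrix, and test sources are hypothetical.

```python
import numpy as np

def whiten(x):
    """Center and whiten x (rows = signals, columns = samples)."""
    x = x - x.mean(axis=1, keepdims=True)
    d, e = np.linalg.eigh(np.cov(x))
    return (e / np.sqrt(d)) @ e.T @ x      # E D^{-1/2} E^T x

def fastica_one_unit(z, max_iter=200, tol=1e-8, seed=0):
    """Estimate one weight vector from whitened data z with g = tanh."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        u = w @ z                          # projections w^T z
        g = np.tanh(u)
        g_prime = 1.0 - g ** 2
        # w+ = E{z g(w^T z)} - E{g'(w^T z)} w, then normalize
        w_new = (z * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1.0) < tol   # sign flips are allowed
        w = w_new
        if converged:
            break
    return w

rng = np.random.default_rng(42)
s = np.vstack([rng.uniform(-1, 1, 5000), rng.laplace(size=5000)])
A = np.array([[1.0, 0.5], [0.3, 2.0]])     # hypothetical mixing matrix
z = whiten(A @ s)
w = fastica_one_unit(z)
y = w @ z
corr = [abs(np.corrcoef(y, src)[0, 1]) for src in s]
print(max(corr))
```

The recovered component `y` should be strongly correlated with exactly one of the two sources; which one depends on the random initial vector.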



FastICA Algorithm: GPU Implementation

• The preprocessing stage is generally inexpensive and was implemented on the CPU.

• The FastICA algorithm is composed mostly of matrix operations that can be efficiently implemented using CUBLAS.

– The computations of the nonlinear functions g and g' have no data dependencies, so every element can be evaluated in parallel.

– The expected value is computed using hierarchical additions, storing the intermediate results in the GPU's shared memory.
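
The hierarchical addition pattern can be illustrated on the CPU with a pairwise (tree) reduction. This is only a stand-in sketch: the buffer plays the role of shared memory, each vectorized addition corresponds to one parallel step across threads, and the function name and zero-padding strategy are assumptions.

```python
import numpy as np

def tree_mean(values):
    """Mean computed by pairwise (tree) reduction over a scratch buffer."""
    buf = np.asarray(values, dtype=np.float64)
    n = buf.size
    # Pad with zeros to the next power of two so every level halves cleanly.
    size = 1 << (n - 1).bit_length()
    buf = np.concatenate([buf, np.zeros(size - n)])
    stride = size // 2
    while stride > 0:
        # On a GPU, each of these `stride` additions would be one thread,
        # followed by a synchronization barrier before the next level.
        buf[:stride] += buf[stride:2 * stride]
        stride //= 2
    return buf[0] / n

print(tree_mean(np.arange(10)))
```

A reduction over n values takes about log2(n) parallel steps instead of n - 1 sequential additions, which is why the pattern suits the expected-value computation.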



FastICA Algorithm: Decorrelation

• To keep the estimated weight vectors from converging to the same results, they need to be decorrelated:

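– After estimating p independent components, subtract the projections onto the previous p weight vectors from the (p+1)-th estimate and renormalize: $w_{p+1} \leftarrow w_{p+1} - \sum_{j=1}^{p} (w_{p+1}^T w_j)\, w_j$

– An alternative is to apply a symmetric decorrelation after every iteration: $W \leftarrow (W W^T)^{-1/2}\, W$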
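
The deflation variant amounts to a Gram-Schmidt step. A minimal sketch, assuming the previously estimated weight vectors are stored as orthonormal rows of a matrix (the function and variable names are hypothetical):

```python
import numpy as np

def deflate(w_new, W_prev):
    """Remove from w_new its projections onto the rows of W_prev, then renormalize."""
    w = w_new - W_prev.T @ (W_prev @ w_new)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Two orthonormal "previous" weight vectors in a 4-dimensional space.
Q, _ = np.linalg.qr(rng.standard_normal((4, 2)))
W_prev = Q.T
w3 = deflate(rng.standard_normal(4), W_prev)
print(W_prev @ w3)   # projections onto earlier vectors, numerically zero
```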



Decorrelation: The Tricky Bit

• The computation of $(WW^T)^{-1/2}$ is complex and can be done using the eigenvalue decomposition of $WW^T$: if $WW^T = E D E^T$, then $(WW^T)^{-1/2} = E D^{-1/2} E^T$.

– This can be done using the already available CPU-based high performance libraries (LAPACK).

– Alternatively, the eigenvalues can be computed directly on the GPU.
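
A CPU sketch of symmetric decorrelation through the eigendecomposition is shown below. It relies on `numpy.linalg.eigh`, which calls LAPACK internally; the function name and the random test matrix are hypothetical.

```python
import numpy as np

def symmetric_decorrelation(W):
    """Return (W W^T)^{-1/2} W using the eigendecomposition of W W^T."""
    d, E = np.linalg.eigh(W @ W.T)        # W W^T = E diag(d) E^T
    inv_sqrt = (E / np.sqrt(d)) @ E.T     # E diag(d^{-1/2}) E^T
    return inv_sqrt @ W

W = np.random.default_rng(0).standard_normal((4, 4))
W_dec = symmetric_decorrelation(W)
print(np.allclose(W_dec @ W_dec.T, np.eye(4)))
```

After the step the rows of the result are orthonormal, so no two weight vectors can point to the same component.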




Jacobi Eigenvalue Algorithm

• The Jacobi Eigenvalue Algorithm successively uses Jacobi rotations to annihilate the off-diagonal elements of a given matrix A.

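• A Jacobi rotation is given by: $A' = J^T A J$

– J is a Jacobi rotation matrix: the identity except for the entries $J_{pp} = J_{qq} = c$, $J_{pq} = s$, $J_{qp} = -s$

– $c = \cos(\theta)$

– $s = \sin(\theta)$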
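
A serial NumPy sketch of the classical cyclic Jacobi method for a symmetric matrix follows. It assumes the common convention J[p,p] = J[q,q] = c, J[p,q] = s, J[q,p] = -s, with the angle chosen so that the (p, q) element of J^T A J vanishes; the function names and sweep count are hypothetical, and dense matrix products are used only for clarity.

```python
import numpy as np

def jacobi_rotation(A, p, q):
    """Rotation J such that (J^T A J)[p, q] == 0 for symmetric A."""
    diff = A[q, q] - A[p, p]
    sgn = 1.0 if diff >= 0 else -1.0
    # tan(2*theta) = 2*A[p,q] / (A[q,q] - A[p,p]), taking |theta| <= pi/4
    theta = 0.5 * np.arctan2(sgn * 2.0 * A[p, q], abs(diff))
    c, s = np.cos(theta), np.sin(theta)
    J = np.eye(A.shape[0])
    J[p, p] = c
    J[q, q] = c
    J[p, q] = s
    J[q, p] = -s
    return J

def jacobi_eigenvalues(A, sweeps=10):
    """Eigenvalues of symmetric A by cyclic sweeps of Jacobi rotations."""
    A = np.array(A, dtype=np.float64)
    n = A.shape[0]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                J = jacobi_rotation(A, p, q)
                A = J.T @ A @ J   # only rows/columns p and q actually change
    return np.sort(np.diag(A))

M = np.random.default_rng(0).standard_normal((6, 6))
S = M @ M.T                       # symmetric test matrix
print(np.allclose(jacobi_eigenvalues(S), np.linalg.eigvalsh(S)))
```

A practical implementation would update only the two affected rows and columns instead of forming J explicitly.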




Jacobi Eigenvalue Algorithm

• Each Jacobi rotation only changes two columns and two rows of the matrix A. By carefully choosing the order of the rotations, up to N/2 rotations can be done simultaneously.
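
One way to obtain N/2 independent rotations per step is a round-robin (tournament) ordering, in which every round pairs all N indices into disjoint (p, q) pairs. The sketch below is an assumption about how such an ordering could be generated, not necessarily the ordering used in this work; it requires N to be even.

```python
def round_robin_rounds(n):
    """Yield n-1 rounds, each holding n/2 disjoint index pairs (n even)."""
    players = list(range(n))
    for _ in range(n - 1):
        yield [tuple(sorted((players[i], players[n - 1 - i])))
               for i in range(n // 2)]
        # Keep the first index fixed and rotate the remaining ones.
        players = [players[0], players[-1]] + players[1:-1]

rounds = list(round_robin_rounds(8))
for r in rounds:
    print(r)
```

Because the pairs within a round share no index, their rotations touch different rows and columns and can be applied concurrently; over the N-1 rounds every off-diagonal position is visited exactly once.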

• The matrix J is very sparse, which makes the dense matrix routines of CUBLAS unsuitable for this algorithm.



Decorrelation: Iterative Algorithm

• Another alternative to the eigenvalue problem is to avoid its computation altogether. Algorithm 4 converges to the decorrelation expression presented earlier.
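
The algorithm itself is not reproduced in the text. A minimal sketch, assuming it is the standard iterative scheme from the FastICA literature (normalize W, then repeat W ← 3/2 W − 1/2 W Wᵀ W until convergence), is given below; the function name, tolerance, and iteration limit are hypothetical.

```python
import numpy as np

def iterative_decorrelation(W, tol=1e-12, max_iter=100):
    """Symmetric decorrelation using only matrix products (no eigenvalues)."""
    # Scale so that all singular values of W are at most 1.
    W = W / np.sqrt(np.linalg.norm(W @ W.T, 2))
    for _ in range(max_iter):
        W_next = 1.5 * W - 0.5 * (W @ W.T) @ W
        done = np.max(np.abs(W_next - W)) < tol
        W = W_next
        if done:
            break
    return W

W0 = np.random.default_rng(0).standard_normal((5, 5))
W_it = iterative_decorrelation(W0)
print(np.allclose(W_it @ W_it.T, np.eye(5)))
```

Each iteration consists only of matrix multiplications and additions, which map directly onto CUBLAS calls; this is consistent with the scheme being attractive on a GPU.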



Decorrelation: Comparison of Decorrelation Algorithms

• Experimental results show that the proposed GPU-based Jacobi eigenvalue algorithm is outperformed by a CPU-based LAPACK eigenvalue algorithm using multiple relatively robust representations (MRRR).

• However, avoiding the explicit computation of the eigenvalues is still the fastest process.

Experimental Results: Experimental Setup

• Experimental setup:

– A hyperbolic tangent was chosen as a typical nonlinear function g.

– The iterative decorrelation algorithm that avoids the explicit computation of the eigenvalues is used in the decorrelation step.


                   CPU                GPU
Model              AMD Opteron 170    NVIDIA GeForce 8800 GTX
Number of cores    2                  128
Clock frequency    2 GHz              1.35 GHz
Main memory        2 GB               768 MB


Experimental Results: Single Core CPU vs GPU

• The accelerated portion of the algorithm (the loop) is sped up by up to 110x when estimating 256 ICs with 10 000 samples.

• As the accelerated portion gets faster, the influence of the unaccelerated part of the algorithm (the preprocessing stage) grows. This noticeably reduces the global speedup.
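
This effect is Amdahl's law. The short sketch below shows the arithmetic; the 99% accelerated fraction is a hypothetical value chosen for illustration, not a measurement from these experiments.

```python
def global_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: overall speedup when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)

# Hypothetical: the loop takes 99% of the serial runtime and is sped up 110x.
print(round(global_speedup(0.99, 110.0), 2))   # 52.63
```

Even a 1% unaccelerated share roughly halves the overall gain in this example, so the preprocessing stage quickly becomes the limiting factor.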



Experimental Results: Single Core CPU vs GPU

• The accelerated loop component ceases to be the bottleneck.

• The additional penalty of transferring data to and from the GPU is negligible.



Experimental Results: Multicore CPU vs GPU

• The parallelized GPU algorithm was also tested on a more powerful GeForce GTX 285, with 240 cores. This implementation was compared with a CPU-based implementation on an Intel Core 2 Quad Q9550 (@2.83GHz) using Intel's high performance MKL library.

• It was possible to attain a speedup of around 12x.




Conclusions

• By using a GPU it was possible to speed up the FastICA algorithm by 55x when estimating 256 ICs with 1000 samples each, in comparison with a serial version running on a single core of a CPU.

• These results can be further improved as the current bottleneck lies in the preprocessing stage, which is still done on the CPU.


