Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed
29-06-10, Frontiers of GPU Computing 2010
Efficient Independent Component Analysis on a GPU
Rui Ramalho, Pedro Tomás, Leonel Sousa
Outline
• Motivation
• Independent Component Analysis
• FastICA Algorithm
• Experimental Results
• Conclusions
Blind Source Separation
• Blind Source Separation (BSS) is a signal processing technique that separates a set of signals (sources) from a set of mixed signals.
• Little is known about the original signals or the mixing process, only that the original signals are uncorrelated.
[Figure: three panels of four traces each (horizontal axis 0 to 500): "Original signals", "Mixed signals" and "Independent components". The original signals are mixed ("mix") and the mixtures are then separated ("sep").]
Blind Source Separation: Cocktail Party
• A classical example of blind source separation is the cocktail party problem.
– A number of people are talking simultaneously in a crowded room (at a cocktail party).
– Despite all the noise and cross talking, a human brain has little difficulty following a conversation.
– Machines have to rely on blind source separation.
Blind Source Separation: Applications
• Blind Source Separation has also been used in several other domains:
– EEG/MEG measurements (each sensor picks up a mixture of the brain's electrical activity, and BSS can be used to separate and identify the underlying sources).
– Denoising images (by treating the noise as an independent source it is possible to separate it from the image’s original components).
– Financial analysis (BSS can be used to uncover hidden factors in financial data).
Independent Component Analysis
• Independent Component Analysis (ICA) is a special case of Blind Source Separation.
• The sources of the mixed signals are assumed to be statistically independent (general BSS only assumes that the sources are uncorrelated).
Independent Component Analysis: ICA Model
• Under the ICA model, the observed variables are assumed to be a linear combination of several independent sources/signals.
• The objective of ICA is to find the matrix W that inverts the mixing operation performed by the matrix A, without knowledge of A or s.
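In the usual notation, the model can be written as below. A, W and s are the symbols used on the slide; x (the observed mixtures) and y (the estimated sources) are assumed names.

```latex
\mathbf{x} = A\,\mathbf{s}, \qquad \mathbf{y} = W\,\mathbf{x} \approx \mathbf{s}
```

Because neither A nor s is known, y can recover s only up to the order and scale of its components.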
Independent Component Analysis: Measuring Statistical Independence
• One of the ways of measuring statistical independence is through negentropy:
– H(y) is the differential information entropy of y:
• In practice J(y) needs to be estimated. The estimator used by FastICA is:
– G is a nonquadratic, nonlinear function
– The estimator compares y against a Gaussian variable of zero mean and unit variance
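For reference, the standard definitions from the FastICA literature are sketched below. They are assumed to match the expressions on the slide; the symbols y_gauss (a Gaussian variable with the same variance as y), f (the density of y) and ν (the zero-mean, unit-variance Gaussian variable) are naming assumptions.

```latex
J(y) = H(y_{\text{gauss}}) - H(y), \qquad
H(y) = -\int f(y)\,\log f(y)\,dy, \qquad
J(y) \propto \bigl[\,E\{G(y)\} - E\{G(\nu)\}\,\bigr]^{2}
```

Negentropy is zero for a Gaussian variable and positive otherwise, so maximizing it pushes y away from Gaussianity.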
FastICA Algorithm
• The procedure for computing the independent components can be divided into three stages:
– Pre-processing: allows a number of simplifications in the FastICA algorithm.
– Weight vector computation: the FastICA algorithm itself.
– Decorrelation: prevents the weight vectors from converging to the same solution.
FastICA Algorithm: Preprocessing & Weight Vector Computation
• Preprocessing includes general tasks such as centering, whitening or filtering the data.
• The computation of each of the weight vectors is done by:
– g is the derivative of the nonquadratic function G
– This algorithm can be modified to compute all the ICs simultaneously (a symmetric approach).
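The two steps above can be sketched in NumPy as follows. This is a minimal CPU illustration, not the authors' CUBLAS implementation: it assumes the standard one-unit fixed-point update w⁺ = E{z g(wᵀz)} − E{g'(wᵀz)} w followed by normalization, with g = tanh. The function names, the demo sources and the mixing matrix are all hypothetical.

```python
import numpy as np

def whiten(x):
    """Center and whiten x (rows are signals, columns are samples)."""
    x = x - x.mean(axis=1, keepdims=True)          # centering
    d, e = np.linalg.eigh(x @ x.T / x.shape[1])    # eigendecomposition of the covariance
    return (e / np.sqrt(d)) @ e.T @ x              # E D^(-1/2) E^T x

def one_unit_update(w, z):
    """One fixed-point step for a single weight vector w on whitened data z."""
    y = w @ z                                      # w^T z for every sample
    g = np.tanh(y)                                 # nonlinearity g
    g_prime = 1.0 - g ** 2                         # its derivative g'
    w_new = (z * g).mean(axis=1) - g_prime.mean() * w
    return w_new / np.linalg.norm(w_new)           # keep w on the unit sphere

# Hypothetical demo data: two non-Gaussian sources and an arbitrary mixing matrix.
rng = np.random.default_rng(0)
s = np.vstack([rng.uniform(-1, 1, 5000), rng.laplace(size=5000)])
z = whiten(np.array([[1.0, 0.5], [0.3, 1.0]]) @ s)
w = np.array([1.0, 0.0])
for _ in range(50):
    w = one_unit_update(w, z)
```

Whitening is what makes the update this simple: once the data has identity covariance, the weight vectors only need to be kept at unit norm and mutually orthogonal.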
FastICA Algorithm: GPU Implementation
• The preprocessing stage is generally inexpensive and was implemented on the CPU.
• The FastICA algorithm is composed mostly of matrix operations that can be implemented efficiently using CUBLAS.
– The computations of the nonlinear function g and its derivative g' have no data dependencies, so they can be carried out in parallel.
– The expected value is computed using hierarchical additions, storing the intermediate results in the GPU's shared memory.
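The hierarchical addition can be illustrated with a small NumPy emulation. This is only a sketch of the access pattern: on the GPU each pass of the loop would be one step in which many threads add pairs of values held in shared memory, whereas here the passes simply run one after another on the CPU. The function name `tree_mean` is hypothetical.

```python
import numpy as np

def tree_mean(values):
    """Mean computed with pairwise (hierarchical) additions, as in a parallel reduction."""
    data = np.asarray(values, dtype=np.float64)
    n = data.size
    size = 1
    while size < n:                     # pad to a power of two with zeros
        size *= 2
    buf = np.concatenate([data, np.zeros(size - n)])
    stride = size // 2
    while stride > 0:
        # One reduction step: element i accumulates element i + stride.
        buf[:stride] += buf[stride:2 * stride]
        stride //= 2
    return buf[0] / n                   # the total ends up in the first slot
```

With N values the sum takes about log2(N) steps instead of N − 1 sequential additions, which is why this pattern suits a GPU.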
FastICA Algorithm: Decorrelation
• To keep the estimated weight vectors from converging to the same results, they need to be decorrelated:
– After estimating p independent components, subtract from the (p+1)-th estimate its projections onto the previous p components:
– An alternative is to apply a symmetric decorrelation after every iteration:
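The first (deflation) scheme is a Gram-Schmidt step. The sketch below assumes the previously estimated weight vectors are stored as orthonormal rows of a matrix; the names `deflate`, `w_new` and `W_prev` are hypothetical.

```python
import numpy as np

def deflate(w_new, W_prev):
    """Subtract from w_new its projections onto the rows of W_prev, then renormalize."""
    w = w_new - W_prev.T @ (W_prev @ w_new)   # remove the components along earlier vectors
    return w / np.linalg.norm(w)

# Example: with the first two axes already taken, only the third direction remains.
W_prev = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
w = deflate(np.array([1.0, 2.0, 3.0]), W_prev)
```

Deflation estimates the components one after another, so errors in early vectors carry over to later ones; the symmetric alternative treats all vectors equally.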
Decorrelation: The Tricky Bit
• The computation of $(WW^T)^{-1/2}$ is complex and can be done using the eigenvalue decomposition of $WW^T$.
– This can be done using the already available CPU-based high-performance libraries (LAPACK).
– Alternatively, the eigenvalues can be computed directly on the GPU.
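As a CPU-side sketch of the eigenvalue route, symmetric decorrelation can be written with NumPy's `eigh`, which relies on LAPACK. It assumes the standard form W ← (WWᵀ)^(-1/2) W; the function name `sym_decorrelate` and the test matrix are hypothetical.

```python
import numpy as np

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W using the eigendecomposition W W^T = E D E^T."""
    d, E = np.linalg.eigh(W @ W.T)        # eigenvalues d, eigenvectors in the columns of E
    inv_sqrt = (E / np.sqrt(d)) @ E.T     # E D^(-1/2) E^T
    return inv_sqrt @ W

W = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
W_dec = sym_decorrelate(W)                # rows of W_dec are now orthonormal
```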
Jacobi Eigenvalue Algorithm
• The Jacobi Eigenvalue Algorithm successively uses Jacobi rotations to annihilate the off-diagonal elements of a given matrix A.
• A Jacobi rotation is given by:
– J is a Jacobi rotation matrix
– c = cos(θ)
– s = sin(θ)
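A minimal serial sketch of the method is given below, assuming the rotation is applied as A' = JᵀAJ to a symmetric matrix and that θ is chosen so that the element A'[p, q] becomes zero. The function names and the cyclic sweep order are assumptions, not taken from the slides.

```python
import numpy as np

def jacobi_rotate(A, p, q):
    """Apply one Jacobi rotation that zeroes A[p, q] of a symmetric matrix A."""
    diff = A[q, q] - A[p, p]
    sgn = 1.0 if diff >= 0 else -1.0
    # tan(2*theta) = 2*A[p, q] / (A[q, q] - A[p, p]), with |theta| <= pi/4
    theta = 0.5 * np.arctan2(sgn * 2.0 * A[p, q], sgn * diff)
    c, s = np.cos(theta), np.sin(theta)
    J = np.eye(A.shape[0])
    J[p, p] = J[q, q] = c
    J[p, q], J[q, p] = s, -s
    return J.T @ A @ J, J

def jacobi_eigenvalues(A, sweeps=10):
    """Eigenvalues of a symmetric matrix via repeated cyclic sweeps of rotations."""
    A = A.copy()
    n = A.shape[0]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                A, _ = jacobi_rotate(A, p, q)
    return np.sort(np.diag(A))            # the diagonal converges to the eigenvalues
```

Building the full matrix J and multiplying is wasteful, since only rows and columns p and q change; it is done here purely to mirror the formula.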
Jacobi Eigenvalue Algorithm
• Each Jacobi rotation only changes two columns and two rows of the matrix A. By carefully choosing the order of the rotations, up to N/2 rotations can be done simultaneously.
• The matrix J is very sparse, which makes CUBLAS (a dense linear algebra library) unsuitable for this algorithm.
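One common way to obtain N/2 independent rotations per step is a round-robin (tournament) ordering, sketched below. The slides do not say which ordering was used, so this choice and the name `rotation_rounds` are assumptions. Each round contains N/2 index pairs that share no row or column, and N − 1 rounds cover every off-diagonal pair once.

```python
def rotation_rounds(n):
    """Yield n - 1 rounds, each a list of n / 2 disjoint (p, q) pairs (n must be even)."""
    idx = list(range(n))
    for _ in range(n - 1):
        # Pair the first entry with the last, the second with the second-to-last, ...
        yield [tuple(sorted((idx[i], idx[n - 1 - i]))) for i in range(n // 2)]
        # Keep idx[0] fixed and rotate the remaining entries by one position.
        idx = [idx[0], idx[-1]] + idx[1:-1]

rounds = list(rotation_rounds(8))   # 7 rounds of 4 pairs each
```

All rotations within one round touch different rows and columns, so they could be assigned to separate thread blocks.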
Decorrelation: Iterative Algorithm
• Another alternative is to avoid the eigenvalue computation altogether. Algorithm 4 converges iteratively to the decorrelation expression presented earlier.
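Algorithm 4 itself is not reproduced in this transcript. The sketch below assumes it is the iterative scheme commonly given in the FastICA literature: scale W by 1/sqrt(‖WWᵀ‖), then repeat W ← (3/2)W − (1/2)WWᵀW until it stops changing. The function name, tolerance and iteration cap are hypothetical.

```python
import numpy as np

def iterative_decorrelate(W, tol=1e-12, max_iter=100):
    """Symmetric decorrelation using only matrix products (no eigendecomposition)."""
    # Scale so that the eigenvalues of W W^T are at most 1 (spectral norm used here).
    W = W / np.sqrt(np.linalg.norm(W @ W.T, 2))
    for _ in range(max_iter):
        W_next = 1.5 * W - 0.5 * (W @ W.T @ W)
        if np.max(np.abs(W_next - W)) < tol:   # converged: W W^T is close to identity
            return W_next
        W = W_next
    return W

W = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
W_it = iterative_decorrelate(W)
```

The loop consists of dense matrix multiplications only, which is the kind of work a BLAS library on the GPU handles well.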
Decorrelation: Comparison of Decorrelation Algorithms
• Experimental results show that the proposed GPU-based Jacobi eigenvalue algorithm is outperformed by a CPU-based LAPACK eigenvalue algorithm using multiple relatively robust representations (MRRR).
• However, avoiding the explicit computation of the eigenvalues is still the fastest approach.
Experimental Results: Experimental Setup
• Experimental Setup
– A hyperbolic tangent was chosen as a typical nonlinear function g.
– The iterative decorrelation algorithm that avoids the explicit computation of the eigenvalues is used in the decorrelation step.
                 | CPU             | GPU
Model            | AMD Opteron 170 | NVIDIA GeForce 8800 GTX
Number of cores  | 2               | 128
Clock frequency  | 2 GHz           | 1.35 GHz
Main memory      | 2 GB            | 768 MB
Experimental Results: Single Core CPU vs GPU
• The accelerated portion of the algorithm (the loop) is sped up by up to 110x when estimating 256 ICs with 10 000 samples.
• As the accelerated portion gets faster, the influence of the unaccelerated part of the algorithm (the preprocessing stage) grows. This noticeably reduces the global speedup.
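This effect is Amdahl's law. The short sketch below shows the arithmetic; the 99% accelerated fraction is a hypothetical figure chosen for illustration, not a measurement from the talk.

```python
def overall_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: global speedup when only part of the run time is accelerated."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)

# Hypothetical: the loop takes 99% of the serial run time and is sped up 110x.
example = overall_speedup(0.99, 110.0)   # about 52.6x overall
```

Even a 1% unaccelerated share roughly halves the speedup in this example, because that share dominates the run time once the loop has become fast.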
Experimental Results: Single Core CPU vs GPU
• The accelerated loop component ceases to be the bottleneck.
• The additional penalty of transferring data to and from the GPU is negligible.
Experimental Results: Multicore CPU vs GPU
• The parallelized GPU algorithm was also tested on a more powerful GeForce GTX 285, with 240 cores. This implementation was compared with a CPU-based implementation on an Intel Core 2 Quad Q9550 (2.83 GHz) using Intel's high-performance MKL library.
• It was possible to attain a speedup of around 12x.
Conclusions
• By using a GPU it was possible to speed up the FastICA algorithm by 55x when estimating 256 ICs with 1000 samples each, in comparison with a serial version running on a single core of a CPU.
• These results can be further improved as the current bottleneck lies in the preprocessing stage, which is still done on the CPU.