Parallelization of FFT in AFNI
Huang, Jingshan Xi, Hong
Department of Computer Science and EngineeringUniversity of South Carolina
Motivation
AFNI: a widely used software package for medical image processing
Drawback: not a real-time system
Our goal: make a parallelized version of AFNI
First step: parallelize the FFT part of AFNI
Outline
What is AFNI
FFT in AFNI
Introduction of MPI
Our method of parallelization
Experiment result and analysis
Conclusion
What is AFNI?
AFNI stands for Analysis of Functional NeuroImages.
It is a set of C programs (over 1,000 source code files) for processing, analyzing, and displaying functional MRI (FMRI) data - a technique for mapping human brain activity.
AFNI is an interactive program for viewing the results of 3D functional neuroimaging.
How to run AFNI?
Log on to clustering machine (daniel.cse.sc.edu)
Go to directory /home/ramsey/newafnigo
Run “afni”
Interface should show up at this time
AFNI Interfaces
AFNI Interfaces --- Cont.
AFNI Interfaces --- Cont.
AFNI Interfaces --- Cont.
Axial Sagittal Coronal
AFNI Interfaces --- Cont.
Axial Sagittal Coronal
AFNI Interfaces --- Cont.
Axial Sagittal Coronal
FFT in AFNI
Fast Fourier Transform: a kind of finite FT from discrete time domain to discrete spatial domain
Reduces the number of computations needed for N points from O(N2)to O(NlgN)
Extensively used in AFNI
To parallelize FFT has great significance for AFNI
What is MPI?
MPI stands for Message-Passing Interface.
MPI is the most widely used approach to develop a parallel system.
MPI has specified a library of functions that can be called from a C or Fortran program.
The foundation of this library is a small group of functions that can be used to achieve parallelism by message passing.
What is Message Passing?
Explicitly transmits data from one process to another
Powerful and very general method of expressing parallelism
Drawback --- “assembly language of parallel computing”
What does MPI do for us?
Makes it possible to write libraries of parallel programs that are both portable and efficient
Use of these libraries will hide many of the details of parallel programming
Therefore make parallel computing much more accessible to professionals in all branches of science and engineering
Our Objective
To parallelize FFT part of AFNI
In AFNI, when we call FFT function, we are in fact calling the csfft_cox() function, which we will see the detail in next slide
Flow Chart of csfft_cox
fft32
fft128
fft2 fft43
fft8 fft16
fft64
fft256
fft512
fft1024
fft2048
fft4096
fft8192
fft16384
fft32768
SCLINV
fft_4dec
return
csfft_cox start
fft_4dec
fft_4dec
fft_4dec
fft_4dec
fft_4dec
3n
5n
fft_3dec
fft_5dec
One-level parallelization
There are several options for us to parallel the csfft_cox() function.
At present, we adopt the one-level parallelization method, that is, when fft4096() calls fft1024() and when fft8192() calls fft2048().
Correctness of our parallel code
By doing FFT and IFFT consequently, we obtain a set of complex numbers that are almost the same as the ones in the original data file
The only difference comes from the storage error of floating point number (in the original code, such phenomena also exists)
So, what is the speedup then?
Two Kinds of Time
There are two kinds of time in analyzing our experiment result: CPU Time and Wall Clock Time (Elapsed Time).
CPU time is the time spent in the calculation part of the code.
Wall Clock Time is the total elapsed time from the user’s point of view.
Experiments
Time analysis of Original code (4096 * 200,000 * 1)
starting 200000 FFTs of length 4096 -- 1 at a timeTIME 1
**********************************************************************TIME 1 beginning 0 0.00TIME 1 Abeginning 0.00 u 0.00 s: 0.00 u_t 0.00 s_tTIME 1 Bbeginning 0.00 u 0.00 s: 0.00 u_t 0.00 s_tTIME 1
**********************************************************************Using csfftTIME 2
**********************************************************************TIME 2 ending 0 155.09TIME 2 Aending 155.09 u 30.60 s: 155.09 u_t 30.60 s_tTIME 2 Bending 155.09 u 30.60 s: 155.09 u_t 30.60 s_tTIME 2
**********************************************************************wall clock time = 813.324630813.324630
Experiments --- Cont. Time analysis of Parallelized in 2 processors (4096 * 200,000 * 1)starting 200000 FFTs of length 4096 -- 1 at a timeTIME 1 **********************************************************************TIME 1 beginning 0 0.00TIME 1 beginning 1 0.00TIME 1 Abeginning 0.00 u 0.00 s: 0.00 u_t 0.00 s_tTIME 1 Bbeginning 0.00 u 0.00 s: 0.00 u_t 0.00 s_tTIME 1 **********************************************************************Using csfftTIME 2 **********************************************************************TIME 2 ending 0 168.09TIME 2 ending 1 85.66TIME 2 Aending 253.75 u 115.11 s: 253.75 u_t 115.11 s_tTIME 2 Bending 126.87 u 57.55 s: 126.87 u_t 57.55 s_tTIME 2 **********************************************************************
wall clock time = 679.795504679.795504
Experiments --- Cont. Time analysis of Parallelized in 4 processors (4096 * 200,000 * 1)
starting 100000 FFTs of length 4096 -- 1 at a timeTIME 1 **********************************************************************TIME 1 beginning 0 0.00TIME 1 beginning 1 0.00TIME 1 beginning 2 0.00TIME 1 beginning 3 0.00TIME 1 Abeginning 0.00 u 0.00 s: 0.00 u_t 0.00 s_tTIME 1 Bbeginning 0.00 u 0.00 s: 0.00 u_t 0.00 s_tTIME 1 **********************************************************************Using csfftTIME 2 **********************************************************************TIME 2 ending 0 139.71TIME 2 ending 1 71.39TIME 2 ending 2 57.29TIME 2 ending 3 61.77TIME 2 Aending 180.16 u 114.53 s: 180.16 u_t 114.53 s_tTIME 2 Bending 45.04 u 28.63 s: 45.04 u_t 28.63 s_tTIME 2 **********************************************************************
wall clock time = 946.5520413946.5520413
Analysis of speedup
CPU Time Wall Clock Time
Original Code 155.09 813.324630813.324630
Parallelized in 2 processors
168.09 (rank 0)85.66 (rank 1) 679.795504679.795504
Parallelized in 4 processors
139.71 (rank 0)71.39 (rank 1)57.29 (rank 2)61.77 (rank 3)
946.552041946.552041
Analysis of speedup --- Cont.
Two main reasons that we did not obtain the ideal speedup:
1. There exist the competitions among different users in the same CPU.
2. Due to the existing communication cost and some other overhead, it is impossible to obtain the idealspeedup in the real machines.
Conclusion
We have parallelized the FFT part of AFNI software package based on MPI. The result shows that for the FFT algorithm itself, we obtain a speedup of around 30 percent.
Increase the speedup of FFTparallelization of 3dDeconvolve program
Questions?