Parallel Spectral Methods: Solving Elliptic Problems with FFTs


Horst Simon


  • Parallel Spectral Methods: Solving Elliptic Problems with FFTs. Horst Simon

  • Motifs
    The Motifs (formerly "Dwarfs") from "The Berkeley View" (Asanovic et al.)
    Motifs form key computational patterns.
    Topic of this lecture: spectral methods.

  • References
    Previous CS267 lectures
    Lecture by Geoffrey Fox
    FFTW project: http://www.fftw.org
    Spiral project

  • Poisson's equation arises in many models
    Electrostatic or gravitational potential: Potential(position)
    Heat flow: Temperature(position, time)
    Diffusion: Concentration(position, time)
    Fluid flow: Velocity, Pressure, Density(position, time)
    Elasticity: Stress, Strain(position, time)
    Variations of Poisson have variable coefficients.
    3D: $\partial^2 u/\partial x^2 + \partial^2 u/\partial y^2 + \partial^2 u/\partial z^2 = f(x,y,z)$
    2D: $\partial^2 u/\partial x^2 + \partial^2 u/\partial y^2 = f(x,y)$
    1D: $d^2 u/dx^2 = f(x)$
    f represents the sources; boundary conditions are also needed.

  • Algorithms for 2D (3D) Poisson Equation (N = n^2 (n^3) vars)

    Algorithm         Serial               PRAM                      Memory                #Procs
    Dense LU          N^3                  N                         N^2                   N^2
    Band LU           N^2 (N^(7/3))        N                         N^(3/2) (N^(5/3))     N (N^(4/3))
    Jacobi            N^2 (N^(5/3))        N (N^(2/3))               N                     N
    Explicit Inv.     N^2                  log N                     N^2                   N^2
    Conj. Gradients   N^(3/2) (N^(4/3))    N^(1/2 (1/3)) * log N     N                     N
    Red/Black SOR     N^(3/2) (N^(4/3))    N^(1/2) (N^(1/3))         N                     N
    Sparse LU         N^(3/2) (N^2)        N^(1/2)                   N*log N (N^(4/3))     N
    FFT               N*log N              log N                     N                     N
    Multigrid         N                    log^2 N                   N                     N
    Lower bound       N                    log N                     N

    PRAM is an idealized parallel model with zero-cost communication.
    Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.

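  • Solving Poisson's Equation with the FFT
    Express any 2D function defined in $0 \le x, y \le 1$ as a series
    $\phi(x,y) = \sum_j \sum_k \phi_{jk} \sin(\pi j x) \sin(\pi k y)$
    Here the $\phi_{jk}$ are called the Fourier coefficients of $\phi(x,y)$.
    The inverse of this is:
    $\phi_{jk} = 4 \int_0^1 \int_0^1 \phi(x,y) \sin(\pi j x) \sin(\pi k y) \, dx \, dy$

    Poisson's equation $\partial^2 \phi/\partial x^2 + \partial^2 \phi/\partial y^2 = f(x,y)$ becomes
    $\sum_j \sum_k (-\pi^2 j^2 - \pi^2 k^2) \phi_{jk} \sin(\pi j x) \sin(\pi k y) = \sum_j \sum_k f_{jk} \sin(\pi j x) \sin(\pi k y)$
    where the $f_{jk}$ are the Fourier coefficients of f(x,y), and
    $f(x,y) = \sum_j \sum_k f_{jk} \sin(\pi j x) \sin(\pi k y)$
    This implies the PDE can be solved exactly algebraically:
    $\phi_{jk} = f_{jk} / (-\pi^2 j^2 - \pi^2 k^2)$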

  • Solving Poisson's Equation with the FFT
    So the solution of Poisson's equation involves the following steps:
    1) Find the Fourier coefficients $f_{jk}$ of f(x,y) by performing the integral
    2) Form the Fourier coefficients of $\phi$ by $\phi_{jk} = f_{jk} / (-\pi^2 j^2 - \pi^2 k^2)$
    3) Construct the solution by performing the sum $\phi(x,y)$
    There is another version of this (the Discrete Fourier Transform) which deals with functions defined at grid points and not directly with the continuous integral.
    Also, the simplest (mathematically) transform uses $\exp(-2\pi i j x)$, not $\sin(\pi j x)$.
    Let us first consider the 1D discrete version of this case.
    The PDE case normally deals with discretized functions, as these are needed for other parts of the problem.
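
The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `solve_poisson_sine`, the grid of n interior points per direction, and the test right-hand side are all assumptions made here. The integral in step 1 is replaced by a sum over grid points, and the sums are evaluated directly in O(n^4) operations; a fast sine transform would replace them in a real solver.

```python
import math

def solve_poisson_sine(f, n):
    """Solve u_xx + u_yy = f on the unit square with u = 0 on the boundary,
    using a sine series on an n-by-n grid of interior points (hypothetical helper)."""
    h = 1.0 / (n + 1)
    pts = [(i + 1) * h for i in range(n)]
    # S[j][a] = sin(pi * (j+1) * x_a); the same table serves x and y.
    S = [[math.sin(math.pi * (j + 1) * x) for x in pts] for j in range(n)]

    # Step 1: sine coefficients of f. The double integral becomes a sum
    # with weight h*h per grid point; the factor 4 is the one in the series inverse.
    fvals = [[f(x, y) for y in pts] for x in pts]
    fhat = [[4 * h * h * sum(fvals[a][b] * S[j][a] * S[k][b]
                             for a in range(n) for b in range(n))
             for k in range(n)] for j in range(n)]

    # Step 2: divide each coefficient by -(pi^2 j^2 + pi^2 k^2).
    uhat = [[fhat[j][k] / (-(math.pi ** 2) * ((j + 1) ** 2 + (k + 1) ** 2))
             for k in range(n)] for j in range(n)]

    # Step 3: sum the series to get the solution at the grid points.
    u = [[sum(uhat[j][k] * S[j][a] * S[k][b]
              for j in range(n) for k in range(n))
          for b in range(n)] for a in range(n)]
    return pts, u

# Test problem chosen so the exact solution is sin(pi x) sin(2 pi y).
def rhs(x, y):
    return -5 * math.pi ** 2 * math.sin(math.pi * x) * math.sin(2 * math.pi * y)

pts, u = solve_poisson_sine(rhs, 8)
err = max(abs(u[a][b] - math.sin(math.pi * pts[a]) * math.sin(2 * math.pi * pts[b]))
          for a in range(8) for b in range(8))
print("max error:", err)
```

Because the test right-hand side is a single sine mode, the grid sums reproduce its coefficient up to rounding; for a general f the sketch only approximates the continuous integrals.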

  • Serial FFT
    Let $i = \sqrt{-1}$ and index matrices and vectors from 0.
    The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m-by-m matrix defined as
    $F[j,k] = \omega^{jk}$
    and $\omega$ is
    $\omega = e^{2\pi i/m} = \cos(2\pi/m) + i \sin(2\pi/m)$
    $\omega$ is a complex number whose m-th power is $\omega^m = 1$, and it is therefore called an m-th root of unity.
    E.g., for m = 4: $\omega = i$, $\omega^2 = -1$, $\omega^3 = -i$, $\omega^4 = 1$.
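
As a small sketch of this definition, the code below builds F for a given m and multiplies it by a vector directly. The names `dft_matrix` and `dft` are chosen here for illustration; the direct product costs O(m^2) operations, which is the cost the FFT improves on.

```python
import cmath

def dft_matrix(m):
    """Return the m-by-m matrix F with F[j][k] = omega**(j*k), omega = exp(2*pi*i/m)."""
    omega = cmath.exp(2j * cmath.pi / m)
    return [[omega ** (j * k) for k in range(m)] for j in range(m)]

def dft(v):
    """Compute F*v by a direct matrix-vector product: O(m^2) operations."""
    m = len(v)
    F = dft_matrix(m)
    return [sum(F[j][k] * v[k] for k in range(m)) for j in range(m)]

F4 = dft_matrix(4)
# Row 1 of F4 holds the powers of omega: 1, i, -1, -i (up to rounding).
print(F4[1])
```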

  • Using the 1D FFT for filtering
    Signal = sin(7t) + .5 sin(5t) at 128 points
    Noise = random number bounded by .75
    Filter by zeroing out FFT components < .25
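
The filtering experiment can be approximated as follows. Several details are assumptions, since the slide does not give them: the 128 points are taken evenly over one period [0, 2*pi), the noise is uniform in [-0.75, 0.75], and the .25 threshold is read as a fraction of the largest component's magnitude. A naive O(m^2) transform stands in for the FFT to keep the sketch self-contained.

```python
import cmath
import math
import random

def dft(v, sign=1):
    """Naive O(m^2) transform with kernel exp(sign * 2*pi*i*j*k/m)."""
    m = len(v)
    return [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * j * k / m)
                for k in range(m)) for j in range(m)]

def denoise(samples, rel_threshold=0.25):
    """Zero every component smaller than rel_threshold times the largest one."""
    m = len(samples)
    spectrum = dft(samples, +1)
    cutoff = rel_threshold * max(abs(c) for c in spectrum)
    kept = [c if abs(c) >= cutoff else 0j for c in spectrum]
    # Inverse transform: conjugate kernel, divided by m.
    return [c.real / m for c in dft(kept, -1)]

def rms(xs, ys):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

random.seed(0)  # fixed seed so the sketch is repeatable
m = 128
t = [2 * math.pi * k / m for k in range(m)]
clean = [math.sin(7 * tk) + 0.5 * math.sin(5 * tk) for tk in t]
noisy = [c + random.uniform(-0.75, 0.75) for c in clean]
filtered = denoise(noisy)
print("rms error before:", rms(noisy, clean), "after:", rms(filtered, clean))
```

The two sine terms concentrate their energy in a handful of components, while the noise is spread thinly over all 128, which is why a magnitude threshold separates them.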

  • Using the 2D FFT for image compression
    Image = 200x320 matrix of values
    Compress by keeping the largest 2.5% of FFT components
    A similar idea is used by JPEG

  • Related Transforms
    Most applications require multiplication by both F and inverse(F).

    Multiplying by F and by inverse(F) are essentially the same. (inverse(F) is the complex conjugate of F divided by m, the vector length.)

    For solving the Poisson equation and various other applications, we use variations on the FFT:
    The sin transform: imaginary part of F
    The cos transform: real part of F

    Algorithms are similar, so we will focus on the forward FFT.
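
A short sketch of why the forward routine is enough: multiplying by the conjugate of F is the same as conjugating the input, applying F, and conjugating the result. The function names below are illustrative, and the naive O(m^2) `dft` stands in for an FFT.

```python
import cmath
import math

def dft(v):
    """Forward transform F*v with F[j][k] = omega**(j*k), omega = exp(2*pi*i/m)."""
    m = len(v)
    omega = cmath.exp(2j * cmath.pi / m)
    return [sum(omega ** (j * k) * v[k] for k in range(m)) for j in range(m)]

def idft(y):
    """inverse(F)*y using only the forward routine: conj(F * conj(y)) / m."""
    m = len(y)
    return [c.conjugate() / m for c in dft([z.conjugate() for z in y])]

def sine_sums(v):
    """For real v, the imaginary part of F*v: sum_k sin(2*pi*j*k/m) * v[k]."""
    return [c.imag for c in dft(v)]

v = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0]
round_trip = idft(dft(v))
print(max(abs(a - b) for a, b in zip(round_trip, v)))
```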

  • Serial Algorithm for the FFT
    Compute the FFT of an m-element vector v, F*v:
    $(F v)[j] = \sum_{k=0}^{m-1} F(j,k) \, v(k) = \sum_{k=0}^{m-1} \omega^{jk} \, v(k) = \sum_{k=0}^{m-1} (\omega^j)^k \, v(k) = V(\omega^j)$
    where V is defined as the polynomial
    $V(x) = \sum_{k=0}^{m-1} x^k \, v(k)$

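  • Divide and Conquer FFT
    V can be evaluated using divide-and-conquer:
    $V(x) = \sum_{k=0}^{m-1} x^k \, v(k) = v[0] + x^2 v[2] + x^4 v[4] + \dots + x (v[1] + x^2 v[3] + x^4 v[5] + \dots) = V_{even}(x^2) + x \, V_{odd}(x^2)$
    V has degree m-1, so $V_{even}$ and $V_{odd}$ are polynomials of degree m/2-1.
    We evaluate these at the points $(\omega^j)^2$ for $0 \le j \le m-1$. Because $\omega^m = 1$, these are only m/2 distinct points.
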
  • Divide-and-Conquer FFT
    FFT(v, ω, m)
        if m = 1 return v[0]
        else
            v_even = FFT(v[0:2:m-2], ω^2, m/2)
            v_odd  = FFT(v[1:2:m-1], ω^2, m/2)
            ω-vec  = [ω^0, ω^1, ..., ω^(m/2-1)]    (precomputed)
            return [v_even + (ω-vec .* v_odd), v_even - (ω-vec .* v_odd)]
    The .* above is component-wise multiply.
    The [ , ] constructs an m-element vector from two m/2-element vectors.
    This results in an O(m log m) algorithm.
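
The pseudocode translates almost line for line into Python. This is a sketch that assumes m is a power of two; it recomputes the ω-vec at every level rather than precomputing it, and squares ω repeatedly, which is fine for small inputs but accumulates rounding error for large ones.

```python
import cmath

def fft(v, omega=None):
    """Recursive divide-and-conquer FFT; len(v) must be a power of two."""
    m = len(v)
    if omega is None:
        omega = cmath.exp(2j * cmath.pi / m)
    if m == 1:
        return [v[0]]
    v_even = fft(v[0::2], omega * omega)   # V_even at the points (omega^2)^j
    v_odd = fft(v[1::2], omega * omega)    # V_odd at the same points
    half = m // 2
    twiddle = [omega ** j for j in range(half)]          # the omega-vec
    top = [v_even[j] + twiddle[j] * v_odd[j] for j in range(half)]
    bottom = [v_even[j] - twiddle[j] * v_odd[j] for j in range(half)]
    return top + bottom

def naive_dft(v):
    """Direct O(m^2) evaluation of F*v, used only as a reference."""
    m = len(v)
    return [sum(v[k] * cmath.exp(2j * cmath.pi * j * k / m) for k in range(m))
            for j in range(m)]

v = [1, 2, 3, 4, 0, -1, 2.5, 1j]
print(max(abs(a - b) for a, b in zip(fft(v), naive_dft(v))))
```

The minus sign in the second half comes from $\omega^{j+m/2} = -\omega^j$, so both halves reuse the same two sub-results.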

  • An Iterative Algorithm
    The call tree of the d&c FFT algorithm is a complete binary tree of log m levels.

    An iterative algorithm that uses loops rather than recursion visits each level of the tree, starting at the bottom.
    The algorithm overwrites v[i] by (F*v)[bitreverse(i)].
    Practical algorithms combine recursion (for the memory hierarchy) and iteration (to avoid function call overhead).
    [Figure: call tree for m = 16. The root FFT(0,1,2,3,...,15) = FFT(xxxx) splits into an even branch FFT(0,2,...,14) = FFT(xxx0) and an odd branch FFT(1,3,...,15) = FFT(xxx1). The next levels are FFT(xx00), FFT(xx10), FFT(xx01), FFT(xx11), then FFT(x000) through FFT(x111). The leaves, left to right, are FFT(0) FFT(8) FFT(4) FFT(12) FFT(2) FFT(10) FFT(6) FFT(14) FFT(1) FFT(9) FFT(5) FFT(13) FFT(3) FFT(11) FFT(7) FFT(15).]
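
One common iterative variant is sketched below. It is an assumption about the details, not the slide's exact algorithm: it first permutes the input into bit-reversed order (the leaf order of the call tree) and then works up the tree level by level, so the output comes out in natural order. The variant described on the slide instead leaves the result in bit-reversed order.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_iterative(v):
    """Loop-based FFT; len(v) must be a power of two."""
    m = len(v)
    bits = m.bit_length() - 1                     # log2(m)
    a = [v[bit_reverse(i, bits)] for i in range(m)]
    size = 2                                      # sub-transform length at this level
    while size <= m:
        half = size // 2
        w_step = cmath.exp(2j * cmath.pi / size)
        for start in range(0, m, size):
            w = 1
            for j in range(half):
                lo = a[start + j]
                hi = w * a[start + j + half]
                a[start + j] = lo + hi            # butterfly, upper output
                a[start + j + half] = lo - hi     # butterfly, lower output
                w *= w_step
        size *= 2
    return a

# The leaf order of the 16-point call tree is the bit-reversal permutation.
print([bit_reverse(i, 4) for i in range(16)])
```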

  • Parallel 1D FFT
    Data dependencies in the 1D FFT form a butterfly pattern.
    A PRAM algorithm takes O(log m) time: each step to the right is parallel, and there are log m steps.
    What about communication cost? See the LogP paper for details.

  • Block Layout of 1D FFT
    Using a block layout (m/p contiguous elements per processor):

    No communication in the last log(m/p) steps.

    Each of the first log(p) steps requires fine-grained communication.

  • Cyclic Layout of 1D FFT
    Cyclic layout (only 1 element per processor, wrapped)
    No communication in the first log(m/p) steps
    Communication in the last log(p) steps
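
The step counts for both layouts can be checked with a small model. It assumes, as in the usual butterfly diagram, that the first step pairs elements m/2 apart and each later step halves the distance until the last step pairs neighbours; `owner` and `remote_steps` are names made up for this sketch.

```python
def owner(i, m, p, layout):
    """Processor that holds element i under the given layout."""
    if layout == "block":
        return i // (m // p)      # m/p contiguous elements per processor
    if layout == "cyclic":
        return i % p              # elements dealt out round-robin
    raise ValueError("layout must be 'block' or 'cyclic'")

def remote_steps(m, p, layout):
    """For each butterfly step, report whether any pair spans two processors."""
    log_m = m.bit_length() - 1
    flags = []
    for step in range(log_m):
        stride = m >> (step + 1)  # step 0 pairs elements m/2 apart
        flags.append(any(owner(i, m, p, layout) != owner(i ^ stride, m, p, layout)
                         for i in range(m)))
    return flags

print("block: ", remote_steps(16, 4, "block"))
print("cyclic:", remote_steps(16, 4, "cyclic"))
```

With m = 16 and p = 4, the block layout communicates in the first log(p) = 2 steps and the cyclic layout in the last 2, matching the two slides.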

  • Parallel Complexity
    m = vector size, p = number of processors
    f = time per flop = 1
    a = startup for a message (in f units)
    b = time per word in a message (in f units)

    Time(blockFFT) = Time(cyclicFFT) = 2*m*log(m)/p + log(p) * a + m*log(p)/p * b

  • FFT With Transpose
    If we start with a cyclic layout for the first log(p) steps, there is no communication.
    Then transpose the vector for the last log(m/p) steps.
    All communication is in the transpose.
    Note: This example has log(m/p) = log(p).
    If log(m/p) < log(p), more phases/layouts will be needed, because the cyclic and block layouts together cover only 2*log(m/p) of the log(m) steps.
    We will work with this assumption for simplicity.

  • Why is the Communication Step Called a Transpose?
    It is analogous to transposing an array: view the vector as a 2D array of m/p by p.
    Note: the same idea is useful for uniprocessor caches.
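
To make the analogy concrete, the sketch below counts how many elements each processor must send to each other processor when the vector moves from a cyclic layout to a block layout. The function name `transpose_traffic` is hypothetical, and the sketch assumes p divides m/p.

```python
def transpose_traffic(m, p):
    """counts[src][dst] = number of elements moving from cyclic owner src
    to block owner dst when the layout changes."""
    per_proc = m // p
    counts = [[0] * p for _ in range(p)]
    for i in range(m):
        src = i % p             # owner under the cyclic layout
        dst = i // per_proc     # owner under the block layout
        counts[src][dst] += 1
    return counts

for row in transpose_traffic(64, 4):
    print(row)
```

Every processor sends the same amount, m/p^2 elements, to every other processor, which is the all-to-all pattern of a distributed matrix transpose.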

  • Complexity of the FFT with Transpose
    If no communication is pipelined (overestimate!):
    Time(transposeFFT) = 2*m*log(m)/p          (same as before)
                       + (p-1) * a             (was log(p) * a)
                       + m*(p-1)/p^2 * b       (was m*log(p)/p * b)
    If communication is pipelined, so we do not pay for p-1 messages, the second term becomes simply a, rather than (p-1)*a.
    This is close to optimal. See the LogP paper for details.
    See also the following papers on the class resource page:
    A. Sahai, "Hiding Communication Costs in Bandwidth Limited FFT"
    R. Nishtala et al., "Optimizing bandwidth limited problems using one-sided communication"
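
The two cost formulas are easy to compare numerically. The sketch below assumes base-2 logarithms and uses made-up values for a and b purely for illustration; the function names are not from the lecture.

```python
import math

def time_block_or_cyclic(m, p, a, b):
    """2*m*log(m)/p + log(p)*a + m*log(p)/p*b, in units of one flop."""
    return 2 * m * math.log2(m) / p + math.log2(p) * a + m * math.log2(p) / p * b

def time_transpose(m, p, a, b, pipelined=False):
    """2*m*log(m)/p + (p-1)*a + m*(p-1)/p^2*b; one startup if pipelined."""
    messages = 1 if pipelined else p - 1
    return 2 * m * math.log2(m) / p + messages * a + m * (p - 1) / p ** 2 * b

# Illustrative parameters only: a = 100 flops per message startup, b = 2 flops per word.
m, p, a, b = 1024, 4, 100, 2
print(time_block_or_cyclic(m, p, a, b))
print(time_transpose(m, p, a, b))
print(time_transpose(m, p, a, b, pipelined=True))
```

The transpose version trades more message startups, (p-1) instead of log(p), for less data moved per processor, m*(p-1)/p^2 words instead of m*log(p)/p.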

  • Comment on the 1D Parallel FFT
    The above algorithm leaves data in bit-reversed order.
    Some applications can use it this way, like Poisson.
    Others require another transpose-like operation.

    Other parallel algorithms also exist.
    A very different 1D FFT is due to Edelman, based on the Fast Multipole algorithm.
    Less communication for non-bit-reversed algorithm.

  • Higher Dimension FFTs
    FFTs on 2 or 3 dimensions are defined as 1D FFTs on vectors in all dimensions.
    E.g., a 2D FFT does 1D FFTs on all rows and then all columns.
    There are 3 obvious possibilities for the 2D FFT:
    (1) 2D blocked layout for matrix, using 1D algorithms for each row and column
    (2) Block row layout for matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
    (3) Block row layout for matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
    Option 2 is best, if we overlap communication and computation.

    For a 3D FFT the options are similar:
    2 phases done with serial FFTs, followed by a transpose for the 3rd
    Can overlap communication with the 2nd phase in practice
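
Option 2 can be sketched serially: transform the rows, transpose, transform the rows again. In a parallel setting the transpose would be the only communication step; here it is an ordinary in-memory transpose. The naive `dft` stands in for a serial FFT, and the final transpose back is included only so the result is in the usual orientation.

```python
import cmath

def dft(v):
    """Naive 1D transform with kernel exp(2*pi*i*j*k/m), standing in for a serial FFT."""
    m = len(v)
    return [sum(v[k] * cmath.exp(2j * cmath.pi * j * k / m) for k in range(m))
            for j in range(m)]

def transpose(a):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def fft2_via_transpose(a):
    """2D transform: 1D on rows, transpose, 1D on rows again, transpose back."""
    rows_done = [dft(row) for row in a]
    cols_as_rows = transpose(rows_done)           # the communication step in parallel
    cols_done = [dft(row) for row in cols_as_rows]
    return transpose(cols_done)

a = [[1, 2, 3, 4], [0, -1, 2, 5]]
out = fft2_via_transpose(a)
print(out[0][0])
```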

  • FFTW: "Fastest Fourier Transform in the West"
    www.fftw.org
    Produces an FFT implementation optimized for:
    your version of FFT (complex, real, ...)
    your value of n (arbitrary, possibly prime)
    your architecture
    Close to optimal for serial, can be improved for parallel
    Similar in spirit to PHIPAC/ATLAS/Sparsity
    Won 1999 Wilkinson Prize for Numerical Software
    Widely used for serial FFTs
    Had parallel FFTs in version 2, but no longer supporting them
    Layout constraints from users/apps + network differences are hard to support

  • Bisection Bandwidth
    FFT requires one (or more) transpose operations: every processor sends 1/P of its data to each other one.
    Bisection bandwidth limits this performance.
    Bisection bandwidth is the bandwidth across the narrowest part of the network.
    Important in globa