Parallel Spectral Methods: Solving Elliptic Problems with FFTs

04/15/2009 CS267 Lecture 20

Parallel Spectral Methods:Solving Elliptic Problems with FFTs

Horst Simon

[email protected]

www.cs.berkeley.edu/~demmel/cs267_Spr09

04/15/2009 CS267 Lecture 212

Motifs

The Motifs (formerly “Dwarfs”) from “The Berkeley View” (Asanovic et al.)

Motifs form key computational patterns

Topic of this lecture

04/15/2009 CS267 Lecture 21

References

° Previous CS267 lectures

° Lecture by Geoffrey Fox: http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/cps615fft00.ppt

° FFTW projecthttp://www.fftw.org

° Spiral projecthttp://www.spiral.net

http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/cps615fft00.ppt

http://www.fftw.org/

http://www.spiral.net/

04/15/2009 CS267 Lecture 21

Poisson’s equation arises in many models

° Electrostatic or Gravitational Potential: Potential(position)

° Heat flow: Temperature(position, time)

° Diffusion: Concentration(position, time)

° Fluid flow: Velocity,Pressure,Density(position,time)

° Elasticity: Stress,Strain(position,time)

° Variations of Poisson have variable coefficients

3D: 2u/x2 + 2u/y2 + 2u/z2 = f(x,y,z)

2D: 2u/x2 + 2u/y2 = f(x,y)

1D: d2u/dx2 = f(x)

f represents the sources; also need boundary conditions

04/15/2009 CS267 Lecture 21

Algorithms for 2D (3D) Poisson Equation (N = n2 (n3) vars)

Algorithm Serial PRAM Memory #Procs

° Dense LU N3 N N2 N2

° Band LU N2 (N7/3) N N3/2 (N5/3) N (N4/3)

° Jacobi N2 (N5/3) N (N2/3) N N

° Explicit Inv. N2 log N N2 N2

° Conj.Gradients N3/2 (N4/3) N1/2(1/3) *log N N N

° Red/Black SOR N3/2 (N4/3) N1/2 (N1/3) N N

° Sparse LU N3/2 (N2) N1/2 N*log N (N4/3) N

° FFT N*log N log N N N

° Multigrid N log2 N N N

° Lower bound N log N N

PRAM is an idealized parallel model with zero cost communication

Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.

04/15/2009 CS267 Lecture 21

Solving Poisson’s Equation with the FFT

° Express any 2D function defined in 0 x,y 1 as a series x,y= j k jk sin( jx) sin( ky)

• Here jk are called Fourier coefficient of (x,y)

° The inverse of this is: jk = 4 (x,y) sin( jx) sin( ky)

° Poisson’s equation /x2 + 2 /y2 = f(x,y) becomesj k (-j2 - k2) jk sin( jx) sin( ky)

j k fjk sin( jx) sin( ky)

° where fjk are Fourier coefficients of f(x,y)

andfx,y= j k fjk sin( jx) sin( ky)

° This implies PDE can be solved exactly algebraically, jk = fjk / (-j2 - k2)

1 1

0 0dx dy∫ ∫

04/15/2009 CS267 Lecture 21

Solving Poisson’s Equation with the FFT

° So solution of Poisson’s equation involves the following steps

° 1) Find the Fourier coefficients fjk of f(x,y) by performing integral

° 2) Form the Fourier coefficients of by

jk = fjk / (-j2 - k2)

° 3) Construct the solution by performing sum (x,y)

° There is another version of this (Discrete Fourier Transform) which deals with functions defined at grid points and not directly the continuous integral

• Also the simplest (mathematically) transform uses exp(-2ijx) not sin( jx)

• Let us first consider 1D discrete version of this case

• PDE case normally deals with discretized functions as these needed for other parts of problem

04/15/2009 CS267 Lecture 21

Serial FFT° Let i=sqrt(-1) and index matrices and vectors from 0.

° The Discrete Fourier Transform of an m-element vector v is:

F*v

Where F is the m*m matrix defined as:

F[j,k] = (j*k)

Where is:

= e (2i/m) = cos(2/m) + i*sin(2/m)

is a complex number with whose mth power m =1 and is therefore called an mth root of unity

° E.g., for m = 4:

= i, = -1, = -i, = 1,

04/15/2009 CS267 Lecture 21

Using the 1D FFT for filtering

° Signal = sin(7t) + .5 sin(5t) at 128 points

° Noise = random number bounded by .75

° Filter by zeroing out FFT components < .25

04/15/2009 CS267 Lecture 21

Using the 2D FFT for image compression

° Image = 200x320 matrix of values

° Compress by keeping largest 2.5% of FFT components

° Similar idea used by jpeg

04/15/2009 CS267 Lecture 21

Related Transforms

° Most applications require multiplication by both F and inverse(F).

° Multiplying by F and inverse(F) are essentially the same. (inverse(F) is the complex conjugate of F divided by n.)

° For solving the Poisson equation and various other applications, we use variations on the FFT

• The sin transform -- imaginary part of F

• The cos transform -- real part of F

° Algorithms are similar, so we will focus on the forward FFT.

04/15/2009 CS267 Lecture 21

Serial Algorithm for the FFT

° Compute the FFT of an m-element vector v, F*v

(F*v)[j] = F(j,k) * v(k)

= (j*k) * v(k)

= ( j)k * v(k)

= V(j)

° Where V is defined as the polynomial

V(x) = xk * v(k)

m-1

k = 0

m-1

k = 0

m-1

k = 0

m-1

k = 0

04/15/2009 CS267 Lecture 21

Divide and Conquer FFT° V can be evaluated using divide-and-conquer

V(x) = (x)k * v(k)

= v[0] + x2*v[2] + x4*v[4] + …

+ x*(v[1] + x2*v[3] + x4*v[5] + … )

= Veven(x2) + x*Vodd(x2)

° V has degree m-1, so Veven and Vodd are polynomials of degree m/2-1

° We evaluate these at points (j)2 for 0<=j<=m-1

° But this is really just m/2 different points, since

( (j+m/2) )2 = ( j *m/2) )2 = 2j *m = ( j)2

° So FFT on m points reduced to 2 FFTs on m/2 points• Divide and conquer!

m-1

k = 0

04/15/2009 CS267 Lecture 21

Divide-and-Conquer FFT

FFT(v, , m)

if m = 1 return v[0]

else

veven = FFT(v[0:2:m-2], 2, m/2)

vodd = FFT(v[1:2:m-1], 2, m/2)

-vec = [0, 1, … (m/2-1) ]

return [veven + (-vec .* vodd),

veven - (-vec .* vodd) ]

° The .* above is component-wise multiply.

° The […,…] is construction an m-element vector from 2 m/2 element vectors

This results in an O(m log m) algorithm.

precomputed

04/15/2009 CS267 Lecture 21

An Iterative Algorithm

° The call tree of the d&c FFT algorithm is a complete binary tree of log m levels

° An iterative algorithm that uses loops rather than recursion, goes each level in the tree starting at the bottom

• Algorithm overwrites v[i] by (F*v)[bitreverse(i)]

° Practical algorithms combine recursion (for memory hiearchy) and iteration (to avoid function call overhead)

FFT(0,1,2,3,…,15) = FFT(xxxx)

FFT(1,3,…,15) = FFT(xxx1)FFT(0,2,…,14) = FFT(xxx0)

FFT(xx10) FFT(xx01) FFT(xx11)FFT(xx00)

FFT(x100) FFT(x010) FFT(x110) FFT(x001) FFT(x101) FFT(x011) FFT(x111)FFT(x000)

FFT(0) FFT(8) FFT(4) FFT(12) FFT(2) FFT(10) FFT(6) FFT(14) FFT(1) FFT(9) FFT(5) FFT(13) FFT(3) FFT(11) FFT(7) FFT(15)

even odd

04/15/2009 CS267 Lecture 21

Parallel 1D FFT

° Data dependencies in 1D FFT

• Butterfly pattern

° A PRAM algorithm takes O(log m) time

• each step to right is parallel

• there are log m steps

° What about communication cost?

° See LogP paper for details

04/15/2009 CS267 Lecture 21

Block Layout of 1D FFT

° Using a block layout (m/p contiguous elts per processor)

° No communication in last log m/p steps

° Each step requires fine-grained communication in first log p steps

04/15/2009 CS267 Lecture 21

Cyclic Layout of 1D FFT

° Cyclic layout (only 1 element per processor, wrapped)

° No communication in first log(m/p) steps

° Communication in last log(p) steps

04/15/2009 CS267 Lecture 21

Parallel Complexity

° m = vector size, p = number of processors

° f = time per flop = 1

° = startup for message (in f units)

° = time per word in a message (in f units)

° Time(blockFFT) = Time(cyclicFFT) =

2*m*log(m)/p

+ log(p) *

+ m*log(p)/p *

04/15/2009 CS267 Lecture 21

FFT With “Transpose”

° If we start with a cyclic layout for first log(p) steps, there is no communication

° Then transpose the vector for last log(m/p) steps

° All communication is in the transpose

° Note: This example has log(m/p) = log(p)

• If log(m/p) > log(p) more phases/layouts will be needed

• We will work with this assumption for simplicity

04/15/2009 CS267 Lecture 21

Why is the Communication Step Called a Transpose?

° Analogous to transposing an array

° View as a 2D array of n/p by p

° Note: same idea is useful for uniprocessor caches

04/15/2009 CS267 Lecture 21

Complexity of the FFT with Transpose

° If no communication is pipelined (overestimate!)

° Time(transposeFFT) =

2*m*log(m)/p same as before

+ (p-1) * was log(p) *

+ m*(p-1)/p2 * was m* log(p)/p

° If communication is pipelined, so we do not pay for p-1 messages, the second term becomes simply , rather than (p-1).

° This is close to optimal. See LogP paper for details.

° See also following papers on class resource page• A. Sahai, “Hiding Communication Costs in Bandwidth Limited FFT”

• R. Nishtala et al, “Optimizing bandwidth limited problems using one-sided communication”

04/15/2009 CS267 Lecture 21

Comment on the 1D Parallel FFT

° The above algorithm leaves data in bit-reversed order• Some applications can use it this way, like Poisson

• Others require another transpose-like operation

° Other parallel algorithms also exist• A very different 1D FFT is due to Edelman (see http://www-

math.mit.edu/~edelman)

• Based on the Fast Multipole algorithm

• Less communication for non-bit-reversed algorithm

04/15/2009 CS267 Lecture 21

Higher Dimension FFTs

° FFTs on 2 or 3 dimensions are define as 1D FFTs on vectors in all dimensions.

° E.g., a 2D FFT does 1D FFTs on all rows and then all columns

° There are 3 obvious possibilities for the 2D FFT:• (1) 2D blocked layout for matrix, using 1D algorithms for each row

and column

• (2) Block row layout for matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs

• (3) Block row layout for matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns

• Option 2 is best, if we overlap communication and computation

° For a 3D FFT the options are similar• 2 phases done with serial FFTs, followed by a transpose for 3rd

• can overlap communication with 2nd phase in practice

04/15/2009 CS267 Lecture 21

FFTW – Fastest Fourier Transform in the West

° www.fftw.org

° Produces FFT implementation optimized for• Your version of FFT (complex, real,…)

• Your value of n (arbitrary, possibly prime)

• Your architecture

• Close to optimal for serial, can be improved for parallel

° Similar in spirit to PHIPAC/ATLAS/Sparsity

° Won 1999 Wilkinson Prize for Numerical Software

° Widely used for serial FFTs• Had parallel FFTs in version 2, but no longer supporting them

• Layout constraints from users/apps + network differences are hard to support

http://www.fftw.org/

04/15/2009 CS267 Lecture 21

Bisection Bandwidth

° FFT requires one (or more) transpose operations:• Ever processor send 1/P of its data to each other one

° Bisection Bandwidth limits this performance• Bisection bandwidth is the bandwidth across the narrowest part

of the network

• Important in global transpose operations, all-to-all, etc.

° “Full bisection bandwidth” is expensive• Fraction of machine cost in the network is increasing

• Fat-tree and full crossbar topologies may be too expensive

• Especially on machines with 100K and more processors

• SMP clusters often limit bandwidth at the node level

04/15/2009 CS267 Lecture 21

Modified LogGP Model

° LogGP: no overlap

P0

P1

osend

L

orecv

EEL: end to end latency (1/2 roundtrip)g: minimum time between small message sendsG: additional gap per byte for larger messages

° LogGP: no overlap

P0

P1

g

04/15/2009 CS267 Lecture 21

Historical Perspective

0

5

10

15

20

25

T3E/ShmT3E/E-Reg

T3E/MPI IBM/LAPIIBM/MPI

Quadrics/ShmQuadrics/MPIMyrinet/GMMyrinet/MPIGigE/VIPLGigE/MPI

usec

Added Latency

Send Overhead (Alone)

Send & Rec Overhead

Rec Overhead (Alone)

• Potential performance advantage for fine-grained, one-sided programs• Potential productivity advantage for irregular applications

½ round-trip latency

04/15/2009 CS267 Lecture 21

General Observations

° The overlap potential is the difference between the gap and overhead

• No potential if CPU is tied up throughout message send

- E.g., no send size DMA

• Grows with message size for machines with DMA (per byte cost is handled by network)

- Because per-Byte cost is handled by NIC

• Grows with amount of network congestion

- Because gap grows as network becomes saturated

° Remote overhead is 0 for machine with RDMA

04/15/2009 CS267 Lecture 21

GASNet Communications System

GASNet offers put/get communication

° One-sided: no remote CPU involvement required in API (key difference with MPI)

• Message contains remote address

• No need to match with a receive

• No implicit ordering required

Compiler-generated code

Language-specific runtime

GASNet

Network Hardware

° Used in language runtimes (UPC, etc.)

° Fine-grained and bulk xfers

° Split-phase communication

04/15/2009 CS267 Lecture 21

Performance of 1-Sided vs 2-sided Communication: GASNet vs MPI

0

100000

200000

300000

400000

500000

600000

700000

800000

900000

10 100 1,000 10,000 100,000 1,000,000 10,000,000

Size (bytes)

Bandwidth (KB/s)

gasnet_put_nbi_bulk

MPI Flood

Relative BW (put_nbi_bulk/MPI_Flood)

1.0

1.2

1.4

1.6

1.8

2.0

2.2

2.4

10 1000 100000 10000000

Size (bytes)

Up is good

° Comparison on Opteron/InfiniBand – GASNet’s vapi-conduit and OSU MPI 0.9.5

° Up to large message size (> 256 Kb), GASNet provides up to 2.2X improvement in streaming bandwidth

° Half power point (N/2) differs by one order of magnitude

04/15/2009 CS267 Lecture 21

(up

is

go

od

)

GASNet usually reaches saturation bandwidth before MPI - fewer costs to amortize

Usually outperform MPI at medium message sizes - often by a large margin

GASNet: Performance for mid-range message sizes

Flood Bandwidth for 4KB messages

1189547

420

190

702

152

252

(13,196)

750

714231

763223

679

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Elan3/Alpha Elan4/IA64 Myrinet/x86 IB/G5 IB/Opteron SP/Fed Altix

Percent HW peak

MPI

GASNet

04/15/2009 CS267 Lecture 21

NAS FT Case Study

° Performance of Exchange (Alltoall) is critical• Communication to computation ratio increases with faster, more

optimized 1-D FFTs

• Determined by available bisection bandwidth

• Between 30-40% of the applications total runtime

° Two ways to reduce Exchange cost1. Use a better network (higher Bisection BW)

2. Overlap the all-to-all with communication (where possible) – “break up” the exchange

Default NAS FT Fortran/MPI relies on #1

Our approach uses UPC/GASNet and builds on #2

° Started as CS267 project• 1D partition of 3D grid is a limitation

• At most N processors for N^3 grid

• HPC Challenge benchmark has large 1D FFT (can be viewed as 3D or more with proper roots of unity)

04/15/2009 CS267 Lecture 21

3D FFT Operation with Global Exchange

Cachelines

1D-FFT Columns

1D-FFT (Columns)

Transpose + 1D-FFT

(Rows)1D-FFT Rows

Exchange (Alltoall)

send to Thread 0

send to Thread 1

send to Thread 2

Last 1D-FFT(Thread 0’s view)

Transpose + 1D-FFT

Divide rows among threads

° Single Communication Operation (Global Exchange) sends THREADS large messages

° Separate computation and communication phases

04/15/2009 CS267 Lecture 21

Communication Strategies for 3D FFT

Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

chunk = all rows with same destination

pencil = 1 row

° Three approaches:• Chunk:

- Wait for 2nd dim FFTs to finish

- Minimize # messages

• Slab:

- Wait for chunk of rows destined for 1 proc to finish

- Overlap with computation

• Pencil:

- Send each row as it completes

- Maximize overlap and

- Match natural layout

slab = all rows in a single plane with same destination

04/15/2009 CS267 Lecture 21

Decomposing NAS FT Exchange into Smaller Messages

Exchange (Default) 512 Kbytes

Slabs (set of contiguous rows) 65 Kbytes

Pencils (single row) 16 Kbytes

° Three approaches:• Chunk:

- Wait for 2nd dim FFTs to finish

• Slab:

- Wait for chunk of rows destined for 1 proc to finish

• Pencil:

- Send each row as it completes

° Example Message Size Breakdown for • Class D (2048 x 1024 x 1024)

• at 256 processors

04/15/2009 CS267 Lecture 21

Overlapping Communication

° Goal: make use of “all the wires”• Distributed memory machines allow for asynchronous

communication

• Berkeley Non-blocking extensions expose GASNet’s non-blocking operations

° Approach: Break all-to-all communication• Interleave row computations and row communications since 1D-

FFT is independent across rows

• Decomposition can be into slabs (contiguous sets of rows) or pencils (individual row)

• Pencils allow:

- Earlier start for communication “phase” and improved local cache use

- But more smaller messages (same total volume)

04/15/2009 CS267 Lecture 21

NAS FT: UPC Non-blocking MFlops

° Berkeley UPC compiler support non-blocking UPC extensions

° Produce 15-45% speedup over best UPC Blocking version

° Non-blocking version requires about 30 extra lines of UPC code

0

200

400

600

800

1000

Myrinet 64

InfiniBand 256Elan3 256

Elan3 512Elan4 256

Elan4 512

MFlops per Thread MFlop rate in UPC Blocking and Non-blocking FT

UPC BlockingUPC Non-Blocking

04/15/2009 CS267 Lecture 21

NAS FT Variants Performance Summary

° Shown are the largest classes/configurations possible on each test machine

° MPI not particularly tuned for many small/medium size messages in flight (long message matching queue depths)

04/15/2009 CS267 Lecture 21

Pencil/Slab optimizations: UPC vs MPI

° Same data, viewed in the context of what MPI is able to overlap

° “For the amount of time that MPI spends in communication, how much of that time can UPC effectively overlap with computation”

° On Infiniband, UPC overlaps almost all the time the MPI spends in communication

° On Elan3, UPC obtains more overlap than MPI as the problem scales up

0

0.2

0.4

0.6

0.8

1

Fraction of Unoverlapped MPI Communication that UPC Effectively Overlaps with ComputationBest MPI and Best UPC for each System (Class/NProcs)

Myrinet 64

InfiniBand 256Elan3 256

Elan3 512Time Spent in Communication (Best MPI)

04/15/2009 CS267 Lecture 21

Summary of Overlap in FFTs

° One-sided communication has performance advantages• Better match for most networking hardware

- Most cluster networks have RDMA support

- Machines with global address space support (X1, Altix) shown elsewhere

• Smaller messages may make better use of network

- Spread communication over longer period of time

- Postpone bisection bandwidth pain

• Smaller messages can also prevent cache thrashing for packing

- Avoid packing overheads if natural message size is reasonable

04/15/2009 CS267 Lecture 21

FFTW

• C library for real & complex FFTs (arbitrary size/dimensionality)

• Computational kernels (80% of code) automatically generated

• Self-optimizes for your hardware (picks best composition of steps)= portability + performance

(+ parallel versions for threads & MPI)

free software: http://www.fftw.org/

the “FastestFourier Tranform

in the West”

04/15/2009 CS267 Lecture 21

FFTW performancepower-of-two sizes, double precision

833 MHz Alpha EV6 2 GHz PowerPC G5

2 GHz AMD Opteron 500 MHz Ultrasparc IIe

04/15/2009 CS267 Lecture 21

FFTW performancenon-power-of-two sizes, double precision

833 MHz Alpha EV6

2 GHz AMD Opteron

unusual: non-power-of-two sizesreceive as much optimization

as powers of two

…because welet the code do the optimizing

04/15/2009 CS267 Lecture 21

FFTW performancedouble precision, 2.8GHz Pentium IV: 2-way SIMD (SSE2)

powers of two

…because welet the code write itself

non-powers-of-two

exploiting CPU-specificSIMD instructions

(rewriting the code)is easy

04/15/2009 CS267 Lecture 21

Why is FFTW fast?three unusual features

FFTW implements many FFT algorithms: A planner picks the best composition

by measuring the speed of different combinations.

The resulting plan is executedwith explicit recursion:

enhances locality

3

1

2 The base cases of the recursion are codelets:highly-optimized dense code

automatically generated by a special-purpose “compiler”

04/15/2009 CS267 Lecture 21

FFTW is easy to use

{complex x[n];plan p;

p = plan_dft_1d(n, x, x, FORWARD, MEASURE);...execute(p); /* repeat as needed */...destroy_plan(p);

}

Key fact: usually,many transforms of same size

are required.

04/15/2009 CS267 Lecture 21





enhances locality

3

1



04/15/2009 CS267 Lecture 21

But traditional implementation is non-recursive,breadth-first traversal:

log2 n passes over whole array

FFTW Uses Natural Recursion

Size 8 DFT

Size 4 DFT Size 4 DFT

Size 2 DFT Size 2 DFT Size 2 DFT Size 2 DFT

p = 2 (radix 2)

04/15/2009 CS267 Lecture 21

breadth-first, but with blocks of size = cache

Traditional cache solution: Blocking

Size 8 DFT



p = 2 (radix 2)

…requires program specialized for cache size

04/15/2009 CS267 Lecture 21

Recursive Divide & Conquer is Good

Size 8 DFT



p = 2 (radix 2)

eventually small enough to fit in cache…no matter what size the cache is

(depth-first traversal) [Singleton, 1967]

04/15/2009 CS267 Lecture 21

Cache Obliviousness

• A cache-oblivious algorithm does not know the cache size— it can be optimal for any machine & for all levels of cache simultaneously

• Exist for many other algorithms, too [Frigo et al. 1999]

— all via the recursive divide & conquer approach

04/15/2009 CS267 Lecture 21





enhances locality

3

1



Documents

Parallel Spectral Methods: Solving Elliptic Problems with FFTs