
Page 1: Parallel Processing (CS 730), Lecture 9: Distributed Memory FFTs*

Jeremy R. Johnson

Wed. Mar. 1, 2001

*Parts of this lecture were derived from material by Johnson, Johnson, and Pryor.

Page 2: Introduction

• Objective: To derive and implement a distributed-memory parallel program for computing the fast Fourier transform (FFT).

• Topics
  – Derivation of the FFT
    • Iterative version
    • Pease algorithm and generalizations
    • Tensor permutations
  – Distributed implementation of tensor permutations
    • Stride permutation
    • Bit reversal
  – Distributed FFT

Page 3: FFT as a Matrix Factorization

Compute y = F_n x, where F_n is the n-point Fourier matrix.

For n = 4:

$$F_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2$$

$$
\begin{bmatrix}
1 & 1 & 1 & 1\\
1 & i & -1 & -i\\
1 & -1 & 1 & -1\\
1 & -i & -1 & i
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 1 & 0\\
0 & 1 & 0 & 1\\
1 & 0 & -1 & 0\\
0 & 1 & 0 & -1
\end{bmatrix}
\begin{bmatrix}
1 & & & \\
& 1 & & \\
& & 1 & \\
& & & i
\end{bmatrix}
\begin{bmatrix}
1 & 1 & & \\
1 & -1 & & \\
& & 1 & 1\\
& & 1 & -1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 0 & 1
\end{bmatrix}
$$

In general, for n = 2m:

$$F_{2m} = (F_2 \otimes I_m)\, T^{2m}_m\, (I_2 \otimes F_m)\, L^{2m}_2,
\qquad
T^{2m}_m = \begin{bmatrix} I_m & \\ & W_m \end{bmatrix},
\qquad
W_m = \mathrm{diag}(1, \omega_{2m}, \ldots, \omega_{2m}^{m-1}),$$

where $\omega_{2m}$ is a primitive $2m$-th root of unity, $T^{2m}_m$ is the twiddle factor matrix, and $L^{2m}_2$ is the stride permutation separating even- and odd-indexed inputs.
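The following is a small numpy sketch (mine, not from the slides) that builds the four factors above and checks the factorization numerically. The helper names F, L, and T and the root-of-unity sign convention (omega_n = exp(2*pi*i/n), matching the MATLAB code on the next slide) are assumptions.

import numpy as np

def F(n):
    # n-point Fourier matrix, F_n[j,k] = omega_n^(j*k)
    w = np.exp(2j * np.pi / n)
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return w ** (j * k)

def L(N, S):
    # stride permutation L^N_S: gather the input at stride S
    m = N // S
    perm = [i * S + j for j in range(S) for i in range(m)]
    return np.eye(N)[perm]

def T(N, S):
    # twiddle matrix T^N_S = diag(omega_N^(i*j)), index = i*S + j
    w = np.exp(2j * np.pi / N)
    return np.diag([w ** (i * j) for i in range(N // S) for j in range(S)])

I2 = np.eye(2)
F4 = np.kron(F(2), I2) @ T(4, 2) @ np.kron(I2, F(2)) @ L(4, 2)
assert np.allclose(F4, F(4))
print("F_4 = (F_2 x I_2) T^4_2 (I_2 x F_2) L^4_2 verified")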

Page 4: Matrix Factorizations and Algorithms

function y = fft(x)
  n = length(x);
  if n == 1
    y = x;
  else
    % [x0 x1] = L^n_2 x
    x0 = x(1:2:n-1);  x1 = x(2:2:n);
    % [t0 t1] = (I_2 tensor F_m) [x0 x1]
    t0 = fft(x0);  t1 = fft(x1);
    % w = W_m(omega_n)
    w = exp((2*pi*i/n)*(0:n/2-1));
    % y = [y0 y1] = (F_2 tensor I_m) T^n_m [t0 t1]
    y0 = t0 + w.*t1;  y1 = t0 - w.*t1;
    y = [y0 y1];
  end
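For readers without MATLAB, here is a direct Python transliteration of the routine above (a sketch of mine, not part of the slides). It keeps the slide's omega_n = exp(2*pi*i/n) sign convention, under which the result equals the unnormalized inverse DFT, so it is checked against n * numpy.fft.ifft.

import numpy as np

def fft_rec(x):
    n = len(x)
    if n == 1:
        return x.copy()
    x0, x1 = x[0::2], x[1::2]            # [x0 x1] = L^n_2 x
    t0, t1 = fft_rec(x0), fft_rec(x1)    # (I_2 tensor F_m) [x0 x1]
    w = np.exp(2j * np.pi / n * np.arange(n // 2))   # W_m(omega_n)
    return np.concatenate([t0 + w * t1, t0 - w * t1])  # (F_2 tensor I_m) T^n_m

x = np.random.rand(16) + 1j * np.random.rand(16)
assert np.allclose(fft_rec(x), 16 * np.fft.ifft(x))
print("recursive radix-2 FFT matches the (unnormalized) inverse DFT")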

Page 5: Rewrite Rules

R1(M,N): $F_{MN} \rightarrow (F_M \otimes I_N)\, T^{MN}_N\, (I_M \otimes F_N)\, L^{MN}_M$

R2(N): $F_N \rightarrow F_N$

R3(M,N,P): $L^{MNP}_P \rightarrow (L^{MP}_P \otimes I_N)(I_M \otimes L^{NP}_P)$

R4(M,N): $A \otimes B \rightarrow L^{MN}_M\,(B \otimes A)\, L^{MN}_N$

R5: $A \otimes (B \otimes C) \rightarrow (A \otimes B) \otimes C$

R6: $(A \otimes B)(C \otimes D) \rightarrow AC \otimes BD$

R7: $(A \otimes B)^T \rightarrow A^T \otimes B^T$, $\quad (L^{MN}_M)^T = L^{MN}_N$
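A quick numerical spot-check (mine, not from the slides) of two of the rules with random matrices; the stride-permutation helper L is an assumed definition.

import numpy as np

def L(N, S):
    # stride permutation L^N_S
    m = N // S
    perm = [i * S + j for j in range(S) for i in range(m)]
    return np.eye(N)[perm]

M, N, P = 2, 3, 4
A, B = np.random.rand(M, M), np.random.rand(N, N)

# R4 (commutation theorem): A (x) B = L^{MN}_M (B (x) A) L^{MN}_N
assert np.allclose(np.kron(A, B), L(M * N, M) @ np.kron(B, A) @ L(M * N, N))

# R3 (stride factorization): L^{MNP}_P = (L^{MP}_P (x) I_N)(I_M (x) L^{NP}_P)
lhs = L(M * N * P, P)
rhs = np.kron(L(M * P, P), np.eye(N)) @ np.kron(np.eye(M), L(N * P, P))
assert np.allclose(lhs, rhs)
print("rules R3 and R4 verified for M, N, P =", M, N, P)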

Page 6: FFT Variants

• Cooley-Tukey

• Recursive FFT

• Iterative FFT

• Vector FFT (Stockham)

• Vector FFT (Korn-Lambiotte)

• Parallel FFT (Pease)

Applying the rules to F_8 gives, for example, the recursive factorization

$$F_8 = (F_2 \otimes I_4)\, T^8_4\, (I_2 \otimes F_4)\, L^8_2
     = (F_2 \otimes I_4)\, T^8_4\, \bigl(I_2 \otimes (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2\bigr)\, L^8_2$$

and the iterative factorization

$$F_8 = (F_2 \otimes I_4)\, T^8_4\, (I_2 \otimes F_2 \otimes I_2)(I_2 \otimes T^4_2)(I_4 \otimes F_2)\, R_8,
\qquad R_8 = (I_2 \otimes L^4_2)\, L^8_2,$$

where R_8 is the bit-reversal permutation. Commuting the butterflies past the stride permutations with rule R4 yields the vector (Stockham, Korn-Lambiotte) and parallel (Pease) variants.

Page 7: Example TPL Programs

; Recursive 8-point FFT
(compose (tensor (F 2) (I 4)) (T 8 4)
         (tensor (I 2)
                 (compose (tensor (F 2) (I 2)) (T 4 2)
                          (tensor (I 2) (F 2)) (L 4 2)))
         (L 8 2))

; Iterative 8-point FFT
(compose (tensor (F 2) (I 4)) (T 8 4)
         (tensor (I 2) (F 2) (I 2)) (tensor (I 2) (T 4 2))
         (tensor (I 4) (F 2))
         (tensor (I 2) (L 4 2)) (L 8 2))
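Both programs can be checked numerically by reading compose as a matrix product and tensor as a Kronecker product. The sketch below is mine, not part of the slides; the helper names and the omega_n = exp(2*pi*i/n) sign convention are assumptions. It verifies that both expressions equal F_8.

import numpy as np

def F(n):
    w = np.exp(2j * np.pi / n)
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return w ** (j * k)

def I(n):
    return np.eye(n)

def L(N, S):
    m = N // S
    return np.eye(N)[[i * S + j for j in range(S) for i in range(m)]]

def T(N, S):
    w = np.exp(2j * np.pi / N)
    return np.diag([w ** (i * j) for i in range(N // S) for j in range(S)])

def compose(*mats):
    out = mats[0]
    for m in mats[1:]:
        out = out @ m          # product taken left to right
    return out

def tensor(*mats):
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

recursive = compose(tensor(F(2), I(4)), T(8, 4),
                    tensor(I(2), compose(tensor(F(2), I(2)), T(4, 2),
                                         tensor(I(2), F(2)), L(4, 2))),
                    L(8, 2))
iterative = compose(tensor(F(2), I(4)), T(8, 4),
                    tensor(I(2), F(2), I(2)), tensor(I(2), T(4, 2)),
                    tensor(I(4), F(2)),
                    tensor(I(2), L(4, 2)), L(8, 2))
assert np.allclose(recursive, F(8)) and np.allclose(iterative, F(8))
print("both TPL programs equal F_8")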

Page 8: FFT Dataflow

• Different formulas for the FFT have different dataflow (memory access patterns).

• The dataflow in a class of FFT algorithms can be described by a sequence of permutations.

• An “FFT dataflow” is a sequence of permutations that can be modified with the insertion of butterfly computations (with appropriate twiddle factors) to form a factorization of the Fourier matrix.

• FFT dataflows can be classified with respect to cost, and used to find “good” FFT implementations.

Page 9: Distributed FFT Algorithm

• Experiment with different dataflow and locality properties by changing radix and permutations

$$F_N = P \prod_{i=1}^{t} \left(I_{N/n_i} \otimes F_{n_i}\right) T_i\, R_i,
\qquad N = n_1 n_2 \cdots n_t,$$

where the $R_i$ are tensor permutations, the $T_i$ are generalized twiddle factor matrices, and $P$ is a tensor permutation.

Page 10: Cooley-Tukey Dataflow

[Dataflow diagram for a Cooley-Tukey factorization of F_8, written as a product of (I_4 ⊗ F_2) butterfly stages with twiddle factors and intervening tensor permutations.]

Page 11: Pease Dataflow

[Dataflow diagram for the Pease factorization
$$F_8 = L^8_2 (I_4 \otimes F_2)\, T^{(3)} \cdot L^8_2 (I_4 \otimes F_2)\, T^{(2)} \cdot L^8_2 (I_4 \otimes F_2)\, T^{(1)} \cdot R_8,$$
where the $T^{(i)}$ are diagonal twiddle factor matrices; every stage has the same constant-geometry dataflow.]

Page 12: Tensor Permutations

• A natural class of permutations compatible with the FFT. Let σ be a permutation of {1, …, t}.

• Mixed-radix counting permutation of vector indices:

$$P_\sigma : v_1 \otimes v_2 \otimes \cdots \otimes v_t \mapsto v_{\sigma(1)} \otimes v_{\sigma(2)} \otimes \cdots \otimes v_{\sigma(t)}$$

• Well-known examples are stride permutations and bit reversal.
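For two-power sizes a tensor permutation is simply a permutation of the bits of each index. The sketch below is mine (the orientation of sigma is an assumption); it reproduces the stride-permutation and bit-reversal mappings shown on the next two slides.

import numpy as np

def tensor_perm(t, sigma):
    # Permutation on 2^t points: destination bit d receives source bit sigma[d].
    N = 1 << t
    out = np.empty(N, dtype=int)
    for x in range(N):
        bits = [(x >> d) & 1 for d in range(t)]
        out[x] = sum(bits[sigma[d]] << d for d in range(t))
    return out            # out[x] = image of index x

t = 3
stride = tensor_perm(t, [1, 2, 0])   # L^8_2: (b2 b1 b0) -> (b0 b2 b1)
bitrev = tensor_perm(t, [2, 1, 0])   # R_8:   (b2 b1 b0) -> (b0 b1 b2)
for x in range(8):
    print(f"{x:03b} -> {stride[x]:03b}    {x:03b} -> {bitrev[x]:03b}")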

Page 13: Example (Stride Permutation)

Writing indices in binary, $L^8_2$ sends input address $(b_2 b_1 b_0)$ to output address $(b_0 b_2 b_1)$:

000 → 000        100 → 010
001 → 100        101 → 110
010 → 001        110 → 011
011 → 101        111 → 111

so that $L^8_2\,(x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7)^T = (x_0, x_2, x_4, x_6, x_1, x_3, x_5, x_7)^T$.

Page 14: Example (Bit Reversal)

Bit reversal $R_8$ sends input address $(b_2 b_1 b_0)$ to output address $(b_0 b_1 b_2)$:

000 → 000        100 → 001
001 → 100        101 → 101
010 → 010        110 → 011
011 → 110        111 → 111

so that $R_8\,(x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7)^T = (x_0, x_4, x_2, x_6, x_1, x_5, x_3, x_7)^T$.

Page 15: Twiddle Factor Matrix

• Diagonal matrix containing roots of unity

• Generalized Twiddle (compatible with tensor permutations)

$$T^{mn}_n\,(e^m_i \otimes e^n_j) = \omega_{mn}^{\,ij}\,(e^m_i \otimes e^n_j)$$

Generalized twiddle: $T^{n_1,\ldots,n_t}(I, J)$, with $I, J \subseteq \{1, \ldots, t\}$ and $I \cap J = \emptyset$. It transforms under tensor permutations by

$$P_\sigma\, T^{n_1,\ldots,n_t}(I, J)\, P_\sigma^{-1} = T^{n_{\sigma(1)},\ldots,n_{\sigma(t)}}(\sigma(I), \sigma(J)).$$

Page 16: Distributed Computation

• Allocate equal-sized segments of the vector to each processor, and index the distributed vector with a pid and a local offset.

• Interpret tensor product operations with this addressing scheme.

Global index:  b_{k+l-1} … b_l | b_{l-1} … b_1 b_0,  where the high k bits (b_{k+l-1} … b_l) are the pid and the low l bits (b_{l-1} … b_0) are the offset.
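A minimal sketch of this addressing scheme (mine, assuming power-of-two sizes): with P = 2^k processors and M = 2^l elements per processor, the high k bits of a global index give the pid and the low l bits give the local offset.

def split_index(g, k, l):
    # global index -> (pid, offset)
    return g >> l, g & ((1 << l) - 1)

def global_index(pid, offset, l):
    # (pid, offset) -> global index
    return (pid << l) | offset

k, l = 3, 5               # P = 8 processors, M = 32 elements each
for g in (0, 37, 255):
    pid, off = split_index(g, k, l)
    assert global_index(pid, off, l) == g
    print(f"global {g:3d} = pid {pid}, offset {off}")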

Page 17: Distributed Tensor Product and Twiddle Factors

• Assume P processors

• $I_n \otimes A$ becomes a parallel do over all processors when $n \geq P$.

• Twiddle factors are computed independently on each processor from the pid and offset. The necessary bits are determined from I, J, and $(n_1, \ldots, n_t)$ in the generalized twiddle notation.
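The first bullet can be illustrated with a small numpy sketch (mine, not from the slides): applying I_P ⊗ A to a block-distributed vector needs no communication, since it reduces to each processor applying A to its own block.

import numpy as np

P, M = 4, 8
A = np.random.rand(M, M)
x = np.random.rand(P * M)

serial = np.kron(np.eye(P), A) @ x                  # I_P (x) A acting globally
blocks = x.reshape(P, M)                            # block p lives on PE p
parallel = np.concatenate([A @ blocks[p] for p in range(P)])  # local "parallel do"
assert np.allclose(serial, parallel)
print("I_P (x) A = independent local applications of A")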

Page 18: Distributed Tensor Permutations

A tensor permutation $P_\sigma$ permutes the bits of the global index:

b_{k+l-1} … b_l | b_{l-1} … b_1 b_0   →   b_{σ(k+l-1)} … b_{σ(l)} | b_{σ(l-1)} … b_{σ(1)} b_{σ(0)}

where the high k bits are the pid and the low l bits are the offset.

Page 19: Classes of Distributed Tensor Permutations

1. Local (pid is fixed by σ): only permutes elements locally within each processor.

2. Global (offset is fixed by σ): permutes the entire local arrays amongst the processors.

3. Global*Local (bits in the pid and bits in the offset are moved by σ, but no bits cross the pid/offset boundary): a local permutation followed by a global permutation.

4. Mixed (at least one offset bit and one pid bit are exchanged): elements from a processor are sent to/received from more than one processor.
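The sketch below (mine; the bit-numbering convention and helper name are assumptions) classifies a bit permutation sigma into these four classes, taking positions 0..l-1 as offset bits and positions l..k+l-1 as pid bits, with sigma[d] the source position feeding destination position d.

def classify(sigma, k, l):
    pid = set(range(l, k + l))
    off = set(range(l))
    pid_fixed = all(sigma[d] == d for d in pid)
    off_fixed = all(sigma[d] == d for d in off)
    crosses = any((d in pid) != (sigma[d] in pid) for d in range(k + l))
    if pid_fixed:
        return "local"
    if off_fixed:
        return "global"
    if not crosses:
        return "global*local"
    return "mixed"

k, l = 3, 3
print(classify([0, 1, 2, 4, 5, 3], k, l))   # rotates pid bits only    -> global
print(classify([1, 2, 0, 3, 4, 5], k, l))   # rotates offset bits only -> local
print(classify([1, 2, 0, 4, 5, 3], k, l))   # both, no crossing        -> global*local
print(classify([5, 4, 3, 2, 1, 0], k, l))   # full bit reversal        -> mixed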

Page 20: Distributed Stride Permutation

L^64_2 on 8 processors (pid | offset bit shown):

000|0 → 000|0        100|0 → 010|0
000|1 → 100|0        100|1 → 110|0
001|0 → 000|1        101|0 → 010|1
001|1 → 100|1        101|1 → 110|1
010|0 → 001|0        110|0 → 011|0
010|1 → 101|0        110|1 → 111|0
011|0 → 001|1        111|0 → 011|1
011|1 → 101|1        111|1 → 111|1

Page 21: Communication Pattern

[Diagram: communication pattern for L^64_2 on PEs 0–7; each PE splits its local data into the strided slices X(0:2:6) and X(1:2:7) and sends them to contiguous destination slices on two different PEs.]

Page 22: Communication Pattern

[Diagram: communication pattern for L^64_2 between source PEs 0–7 and destination PEs 0–7.]

Each PE sends 1/2 of its data to 2 different PEs.

Page 23: Communication Pattern

[Diagram: communication pattern for L^64_4 between source PEs 0–7 and destination PEs 0–7.]

Each PE sends 1/4 of its data to 4 different PEs.

Page 24: Communication Pattern

[Diagram: communication pattern for L^64_8 between source PEs 0–7 and destination PEs 0–7.]

Each PE sends 1/8 of its data to 8 different PEs.

Page 25: Implementation of Distributed Stride Permutation

D_Stride(Y, N, t, P, k, M, l, S, j, X)
// Compute Y = L^N_S X
// Inputs:
//   X, Y  distributed vectors of size N = 2^t,
//         with M = 2^l elements per processor
//   P = 2^k   number of processors
//   S = 2^j, 0 <= j <= k,  the stride
// Output:
//   Y = L^N_S X

p = pid
for i = 0, ..., 2^j - 1 do
    put x(i : S : i + S*(M/S - 1))
    in  y((M/S)*(p mod S) : (M/S)*(p mod S) + M/S - 1)
    on PE p/2^j + i*2^(k-j)
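The routine can be checked with a sequential simulation (a sketch of mine, not the original implementation): model the distributed vector as a list of per-PE blocks, perform the puts as array copies, and compare against a direct computation of L^N_S.

import numpy as np

def d_stride(X_blocks, S):
    # Compute Y = L^N_S X on a block-distributed vector.
    # X_blocks[p] holds the M = N/P contiguous elements owned by PE p;
    # S = 2^j with S <= P and S <= M.
    P = len(X_blocks)
    M = len(X_blocks[0])
    Y_blocks = [np.empty(M, dtype=X_blocks[0].dtype) for _ in range(P)]
    for p in range(P):                      # "parallel do" over PEs
        for i in range(S):
            dest = p // S + i * (P // S)    # PE p/2^j + i*2^(k-j)
            start = (M // S) * (p % S)
            # put x(i:S:...) into y(start : start + M/S - 1) on PE dest
            Y_blocks[dest][start:start + M // S] = X_blocks[p][i::S]
    return Y_blocks

def stride_perm(x, S):
    # direct (sequential) L^N_S for reference: gather at stride S
    N = len(x)
    return x.reshape(N // S, S).T.reshape(-1)

N, P = 64, 8
x = np.arange(N)
for S in (2, 4, 8):
    X_blocks = [x[p * (N // P):(p + 1) * (N // P)].copy() for p in range(P)]
    Y_blocks = d_stride(X_blocks, S)
    assert np.array_equal(np.concatenate(Y_blocks), stride_perm(x, S))
print("D_Stride matches L^N_S for S = 2, 4, 8")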

Page 26: Cyclic Scheduling

[Diagram: cyclic scheduling of the L^64_4 communication pattern between source PEs 0–7 and destination PEs 0–7.]

Each PE sends 1/4 of its data to 4 different PEs.

Page 27: Distributed Bit Reversal Permutation

• Mixed tensor permutation:
  $R_N\,(e^2_{b_{t-1}} \otimes \cdots \otimes e^2_{b_0}) = e^2_{b_0} \otimes \cdots \otimes e^2_{b_{t-1}}$, with $N = 2^t$.

• Implement using the factorization
  $R_N = (R_P \otimes R_{N/P})\, L^N_P = L^N_P\, (R_{N/P} \otimes R_P)$

For N = 2^8 on P = 2^3 processors:

b7 b6 b5 | b4 b3 b2 b1 b0  →  b5 b6 b7 | b0 b1 b2 b3 b4   (bit reversal within pid and within offset)
b7 b6 b5 | b4 b3 b2 b1 b0  →  b0 b1 b2 | b3 b4 b5 b6 b7   (full bit reversal R_N)
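A short numpy sketch (mine, not from the slides) verifying the factorization above for N = 2^8 and P = 2^3; the helper names are assumptions.

import numpy as np

def L(N, S):
    # stride permutation matrix L^N_S
    m = N // S
    perm = [i * S + j for j in range(S) for i in range(m)]
    return np.eye(N, dtype=int)[perm]

def R(N):
    # bit-reversal permutation matrix R_N, N = 2^t
    t = N.bit_length() - 1
    rev = [int(format(i, f"0{t}b")[::-1], 2) for i in range(N)]
    return np.eye(N, dtype=int)[rev]

N, P = 256, 8
M = N // P
lhs = R(N)
f1 = np.kron(R(P), R(M)) @ L(N, P)   # (R_P (x) R_{N/P}) L^N_P
f2 = L(N, P) @ np.kron(R(M), R(P))   # L^N_P (R_{N/P} (x) R_P)
assert np.array_equal(lhs, f1) and np.array_equal(lhs, f2)
print("R_N = (R_P x R_{N/P}) L^N_P = L^N_P (R_{N/P} x R_P) verified")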

Page 28: Experiments on the CRAY T3E

• All experiments were performed on a 240-node (8x4x8 with partial plane) T3E, using 128 processors (300 MHz) with 128 MB of memory.
  – Task 1 (pairwise communication): implemented with shmem_get, shmem_put, and mpi_sendrecv.
  – Task 2 (all 7! = 5040 global tensor permutations): implemented with shmem_get, shmem_put, and mpi_sendrecv.
  – Task 3 (local tensor permutations of the form I ⊗ L ⊗ I on vectors of size 2^22 words; only run on a single node): implemented using streams on/off and cache bypass.
  – Task 4 (distributed stride permutations): implemented using shmem_iput, shmem_iget, and mpi_sendrecv.

Page 29: Task 1 Performance Data

Page 30: Task 2 Performance Data

[Histogram: distribution of global tensor permutations on 128 processors (shmem_put); x-axis: bandwidth in MB/sec/node, y-axis: number of tensor permutations.]

Page 31: Task 3 Performance Data

[Plot: local tensor permutations (bit rotations) for 2^22 words with cache bypass; x-axis: permutation number, y-axis: MB/sec.]

Page 32: Task 4 Performance Data

[Plot: performance of distributed stride permutations on 128 processors with 2^22 words; x-axis: stride = 2^j, y-axis: MB/sec.]

Page 33: Network Simulator

• An idealized simulator for the T3E was developed (with C. Grassl from Cray Research) in order to study contention.
  – Specify the processor layout, route table, and number of virtual processors with a given start node.
  – Each processor can simultaneously issue a single send.

• Contention is measured as the maximum number of messages across any edge/node.

• The simulator was used to study global and mixed tensor permutations.

Page 34: Task 2 Grid Simulation Analysis

[Histogram: distribution of global tensor permutations on an 8x8x2 grid without wrap-around; x-axis: max messages through a node, y-axis: number of tensor permutations.]

Page 35: Task 2 Grid Simulation Analysis

[Histogram: distribution of global tensor permutations on an 8x8x2 grid without wrap-around; x-axis: max messages through an edge, y-axis: number of tensor permutations.]

Page 36: Task 2 Torus Simulation Analysis

[Histogram: distribution of global tensor permutations on an 8x8x2 grid with wrap-around; x-axis: max messages through a node, y-axis: number of tensor permutations.]

Page 37: Task 2 Torus Simulation Analysis

[Histogram: distribution of global tensor permutations on an 8x8x2 grid with wrap-around; x-axis: max messages through an edge, y-axis: number of tensor permutations.]