Mar. 1, 2001 Parallel Processing 1
Parallel Processing (CS 730)
Lecture 9: Distributed Memory FFTs*
Jeremy R. Johnson
Wed. Mar. 1, 2001
*Parts of this lecture were derived from material from Johnson, Johnson, and Pryor.
Mar. 1, 2001 Parallel Processing 2
Introduction
• Objective: To derive and implement a distributed-memory parallel program for computing the fast Fourier transform (FFT).
• Topics
  – Derivation of the FFT
    • Iterative version
    • Pease Algorithm & Generalizations
    • Tensor permutations
  – Distributed implementation of tensor permutations
    • stride permutation
    • bit reversal
  – Distributed FFT
Mar. 1, 2001 Parallel Processing 3
FFT as a Matrix Factorization
Compute y = F_n x, where F_n is the n-point Fourier matrix.

F_4 = (F_2 ⊗ I_2) T^4_2 (I_2 ⊗ F_2) L^4_2

[ 1  1  1  1 ]   [ 1 0  1  0 ] [ 1 0 0 0 ] [ 1  1 0  0 ] [ 1 0 0 0 ]
[ 1  i -1 -i ] = [ 0 1  0  1 ] [ 0 1 0 0 ] [ 1 -1 0  0 ] [ 0 0 1 0 ]
[ 1 -1  1 -1 ]   [ 1 0 -1  0 ] [ 0 0 1 0 ] [ 0  0 1  1 ] [ 0 1 0 0 ]
[ 1 -i -1  i ]   [ 0 1  0 -1 ] [ 0 0 0 i ] [ 0  0 1 -1 ] [ 0 0 0 1 ]

In general, for n = 2m:

F_{2m} = (F_2 ⊗ I_m) T^{2m}_m (I_2 ⊗ F_m) L^{2m}_2,
where T^{2m}_m = I_m ⊕ W_m and W_m = diag(1, ω_{2m}, ω_{2m}^2, …, ω_{2m}^{m-1}).
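As a sanity check, here is a small numpy sketch (not part of the original slides) that builds these factors explicitly and verifies the 4-point factorization, using the ω_n = e^{2πi/n} convention of the code on the next slide:

import numpy as np

# Sketch only: verify F_4 = (F_2 x I_2) T^4_2 (I_2 x F_2) L^4_2 numerically.
def fourier(n):
    # n-point Fourier matrix with entries omega_n^(j*k), omega_n = exp(2*pi*i/n)
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(2j * np.pi * j * k / n)

def stride_perm(n, s):
    # L^n_s: gathers the elements of x at stride s: (x0, xs, x2s, ..., x1, ...)
    idx = np.concatenate([np.arange(i, n, s) for i in range(s)])
    return np.eye(n)[idx]

F2, I2 = fourier(2), np.eye(2)
T42 = np.diag([1, 1, 1, np.exp(2j * np.pi / 4)])   # diag(1, 1, 1, omega_4)
F4 = np.kron(F2, I2) @ T42 @ np.kron(I2, F2) @ stride_perm(4, 2)
assert np.allclose(F4, fourier(4))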
Mar. 1, 2001 Parallel Processing 4
Matrix Factorizations and Algorithms
function y = fft(x)
n = length(x)
  if n == 1
    y = x;
  else
    % [x0 x1] = L^n_2 x
    x0 = x(1:2:n-1); x1 = x(2:2:n);
    % [t0 t1] = (I_2 tensor F_m)[x0 x1]
    t0 = fft(x0); t1 = fft(x1);
    % w = W_m(omega_n)
    w = exp((2*pi*i/n)*(0:n/2-1));
    % y = [y0 y1] = (F_2 tensor I_m) T^n_m [t0 t1]
    y0 = t0 + w.*t1; y1 = t0 - w.*t1;
    y = [y0 y1];
  end
Mar. 1, 2001 Parallel Processing 5
Rewrite Rules
R1(M,N): F_{MN} → (F_M ⊗ I_N) T^{MN}_N (I_M ⊗ F_N) L^{MN}_M
R2(N): F_N → F_N
R3(M,N,P): L^{MNP}_P → (L^{MP}_P ⊗ I_N)(I_M ⊗ L^{NP}_P)
R4(M,N): A_M ⊗ B_N → L^{MN}_M (B_N ⊗ A_M) L^{MN}_N
R5: (A ⊗ B) ⊗ C → A ⊗ (B ⊗ C)
R6: (A ⊗ B)(C ⊗ D) → AC ⊗ BD
R7: (AB)^T → B^T A^T
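For instance, rule R4 (the commutation theorem) can be checked numerically; the following numpy sketch is illustrative only and not part of the original slides:

import numpy as np

# Sketch only: check R4(M,N):  A_M (x) B_N = L^{MN}_M (B_N (x) A_M) L^{MN}_N
def stride_perm(n, s):
    # L^n_s as a permutation matrix
    idx = np.concatenate([np.arange(i, n, s) for i in range(s)])
    return np.eye(n)[idx]

M, N = 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((M, M))
B = rng.standard_normal((N, N))
lhs = np.kron(A, B)
rhs = stride_perm(M * N, M) @ np.kron(B, A) @ stride_perm(M * N, N)
assert np.allclose(lhs, rhs)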
Mar. 1, 2001 Parallel Processing 6
FFT Variants
• Cooley-Tukey
• Recursive FFT
• Iterative FFT
• Vector FFT (Stockham)
• Vector FFT (Korn-Lambiotte)
• Parallel FFT (Pease)
F_8 = (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_4) L^8_2

F_8 = (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_2 ⊗ I_2)(I_2 ⊗ T^4_2)(I_2 ⊗ I_2 ⊗ F_2)(I_2 ⊗ L^4_2) L^8_2

F_8 = (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_2 ⊗ I_2)(I_2 ⊗ T^4_2)(I_4 ⊗ F_2) R_8

[The slide also lists the corresponding factorizations of F_8 for the Stockham, Korn-Lambiotte, and Pease variants.]
Mar. 1, 2001 Parallel Processing 7
Example TPL Programs
; Recursive 8-point FFT
(compose (tensor (F 2) (I 4)) (T 8 4)
(tensor (I 2)
(compose (tensor (F 2) (I 2)) (T 4 2)
(tensor (I 2) (F 2)) (L 4 2)))
(L 8 2))
; Iterative 8-point FFT
(compose (tensor (F 2) (I 4)) (T 8 4)
(tensor (I 2) (F 2) (I 2)) (tensor (I 2) (T 4 2))
(tensor (I 4) (F 2))
(tensor (I 2) (L 4 2))
(L 8 2))
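These operators are easy to evaluate as explicit matrices. The following Python sketch (an illustration, not the actual TPL interpreter) defines F, I, T, L, tensor, and compose and checks that the recursive program above equals the 8-point Fourier matrix; the iterative program can be checked the same way:

import numpy as np
from functools import reduce

def F(n):                               # n-point Fourier matrix, omega_n = exp(2*pi*i/n)
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(2j * np.pi * j * k / n)

def I(n):                               # identity matrix
    return np.eye(n)

def L(mn, s):                           # stride permutation L^{mn}_s
    idx = np.concatenate([np.arange(i, mn, s) for i in range(s)])
    return np.eye(mn)[idx]

def T(mn, n):                           # twiddle matrix T^{mn}_n = diag(omega_{mn}^(i*j))
    m, w = mn // n, np.exp(2j * np.pi / mn)
    return np.diag([w ** (i * j) for i in range(m) for j in range(n)])

def tensor(*ops):
    return reduce(np.kron, ops)

def compose(*ops):
    return reduce(np.matmul, ops)

recursive8 = compose(tensor(F(2), I(4)), T(8, 4),
                     tensor(I(2), compose(tensor(F(2), I(2)), T(4, 2),
                                          tensor(I(2), F(2)), L(4, 2))),
                     L(8, 2))
assert np.allclose(recursive8, F(8))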
Mar. 1, 2001 Parallel Processing 8
FFT Dataflow
• Different formulas for the FFT have different dataflow (memory access patterns).
• The dataflow in a class of FFT algorithms can be described by a sequence of permutations.
• An “FFT dataflow” is a sequence of permutations that can be modified with the insertion of butterfly computations (with appropriate twiddle factors) to form a factorization of the Fourier matrix.
• FFT dataflows can be classified with respect to cost, and used to find “good” FFT implementations.
Mar. 1, 2001 Parallel Processing 9
Distributed FFT Algorithm
• Experiment with different dataflow and locality properties by changing radix and permutations
F_N = [ ∏_{i=1}^{t} P_i (I_{N/n_i} ⊗ F_{n_i}) T_i ] R,   N = n_1 n_2 ⋯ n_t

where the P_i are tensor permutations, the T_i are generalized twiddle matrices, and R is a digit-reversal permutation.
Mar. 1, 2001 Parallel Processing 10
Cooley-Tukey Dataflow
F_8 = (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_2 ⊗ I_2)(I_2 ⊗ T^4_2)(I_4 ⊗ F_2) R_8
Mar. 1, 2001 Parallel Processing 11
Pease Dataflow
F_8 = L^8_2 (I_4 ⊗ F_2) T_3 · L^8_2 (I_4 ⊗ F_2) T_2 · L^8_2 (I_4 ⊗ F_2) T_1 · R_8

where the T_i are generalized twiddle matrices.
Mar. 1, 2001 Parallel Processing 12
Tensor Permutations
• A natural class of permutations compatible with the FFT. Let σ be a permutation of {1,…,t}.
• Mixed-radix counting permutation of vector indices:
  P_σ : e_{v_1} ⊗ e_{v_2} ⊗ ⋯ ⊗ e_{v_t} → e_{v_σ(1)} ⊗ e_{v_σ(2)} ⊗ ⋯ ⊗ e_{v_σ(t)}
• Well-known examples are stride permutations and bit-reversal.
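A small Python sketch (not from the slides) of this mixed-radix counting permutation; taking all radices equal to 2 and σ a cyclic shift reproduces the stride permutation L^8_2 of the next slide:

import numpy as np
from itertools import product

def tensor_perm(radices, sigma):
    # P_sigma: the point with digits (v_1,...,v_t) in radices (n_1,...,n_t) is sent
    # to the point with digits (v_sigma(1),...,v_sigma(t)) in radices (n_sigma(1),...).
    t, N = len(radices), int(np.prod(radices))
    out_radices = [radices[sigma[i]] for i in range(t)]
    P = np.zeros((N, N))
    for digits in product(*[range(n) for n in radices]):
        src = 0
        for n, d in zip(radices, digits):
            src = src * n + d
        dst = 0
        for n, d in zip(out_radices, [digits[sigma[i]] for i in range(t)]):
            dst = dst * n + d
        P[dst, src] = 1
    return P

# With radices (2,2,2) and this cyclic sigma, P_sigma is the stride permutation L^8_2.
assert np.array_equal(tensor_perm([2, 2, 2], [2, 0, 1]),
                      np.eye(8)[[0, 2, 4, 6, 1, 3, 5, 7]])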
Mar. 1, 2001 Parallel Processing 13
Example (Stride Permutation)
• 000 → 000
• 001 → 100
• 010 → 001
• 011 → 101
• 100 → 010
• 101 → 110
• 110 → 011
• 111 → 111
y = L^8_2 x :  (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7) → (x_0, x_2, x_4, x_6, x_1, x_3, x_5, x_7)

L^8_2 =
[ 1 0 0 0 0 0 0 0 ]
[ 0 0 1 0 0 0 0 0 ]
[ 0 0 0 0 1 0 0 0 ]
[ 0 0 0 0 0 0 1 0 ]
[ 0 1 0 0 0 0 0 0 ]
[ 0 0 0 1 0 0 0 0 ]
[ 0 0 0 0 0 1 0 0 ]
[ 0 0 0 0 0 0 0 1 ]
Mar. 1, 2001 Parallel Processing 14
Example (Bit Reversal)
• 000 → 000
• 001 → 100
• 010 → 010
• 011 → 110
• 100 → 001
• 101 → 101
• 110 → 011
• 111 → 111
y = R_8 x :  (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7) → (x_0, x_4, x_2, x_6, x_1, x_5, x_3, x_7)

R_8 =
[ 1 0 0 0 0 0 0 0 ]
[ 0 0 0 0 1 0 0 0 ]
[ 0 0 1 0 0 0 0 0 ]
[ 0 0 0 0 0 0 1 0 ]
[ 0 1 0 0 0 0 0 0 ]
[ 0 0 0 0 0 1 0 0 ]
[ 0 0 0 1 0 0 0 0 ]
[ 0 0 0 0 0 0 0 1 ]
Mar. 1, 2001 Parallel Processing 15
Twiddle Factor Matrix
• Diagonal matrix containing roots of unity
• Generalized Twiddle (compatible with tensor permutations)
T^{n_1,…,n_t}_{I,J},   I, J ⊆ {1,…,t},  I ∩ J = ∅

T^{mn}_n (e^m_i ⊗ e^n_j) = ω_{mn}^{ij} (e^m_i ⊗ e^n_j)

P_σ T^{n_1,…,n_t}_{I,J} P_σ^{-1} = T^{n_σ(1),…,n_σ(t)}_{σ(I),σ(J)}
Mar. 1, 2001 Parallel Processing 16
Distributed Computation
• Allocate equal-sized segments of vector to each processor, and index distributed vector with pid and local offset.
• Interpret tensor product operations with this addressing scheme
b_{k+l-1} … b_l | b_{l-1} … b_1 b_0
     pid        |      offset
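In other words, with P = 2^k processors and M = 2^l points per processor, the pid is the high k bits of the global index and the offset is the low l bits. A minimal Python sketch (illustrative, not from the slides):

# Split a global index into (pid, offset) under the block distribution above.
def split_index(x, k, l):
    return x >> l, x & ((1 << l) - 1)          # (high k bits, low l bits)

def join_index(pid, offset, l):
    return (pid << l) | offset

# Example: N = 64, P = 8 (k = 3), M = 8 (l = 3): index 45 = 0b101101 -> pid 5, offset 5
assert split_index(45, 3, 3) == (5, 5)
assert join_index(5, 5, 3) == 45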
Mar. 1, 2001 Parallel Processing 17
Distributed Tensor Product and Twiddle Factors
• Assume P processors
• I_n ⊗ A becomes a parallel do over all processors when n ≥ P.
• Twiddle factors are determined independently on each processor from the pid and offset. The necessary bits are determined from I, J, and (n_1,…,n_t) in the generalized twiddle notation.
Mar. 1, 2001 Parallel Processing 18
Distributed Tensor Permutations
b_{k+l-1} … b_l | b_{l-1} … b_1 b_0     (pid | offset)
        ↓
b_{σ(k+l-1)} … b_{σ(l)} | b_{σ(l-1)} … b_{σ(1)} b_{σ(0)}
Mar. 1, 2001 Parallel Processing 19
Classes of Distributed Tensor Permutations
1 Local (pid is fixed by σ)
  Only permute elements locally within each processor.
2 Global (offset is fixed by σ)
  Permute the entire local arrays amongst the processors.
3 Global*Local (bits in pid and bits in offset are moved by σ, but no bits cross the pid/offset boundary)
  Permute elements locally, followed by a global permutation.
4 Mixed (at least one offset and pid bit are exchanged)
  Elements from a processor are sent/received to/from more than one processor.
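The class of a given permutation can be read off from which index bits σ moves. A rough Python sketch of such a classifier (an illustration under the pid/offset bit convention above, not code from the slides):

def classify(sigma, k, l):
    # sigma[i] is the source bit that lands in destination bit i; bits l..k+l-1 are the pid.
    pid = set(range(l, k + l))
    moved = [i for i in range(k + l) if sigma[i] != i]
    if any((i in pid) != (sigma[i] in pid) for i in moved):
        return "mixed"                        # some bit crosses the pid/offset boundary
    if all(i not in pid and sigma[i] not in pid for i in moved):
        return "local"                        # only offset bits move
    if all(i in pid and sigma[i] in pid for i in moved):
        return "global"                       # only pid bits move
    return "global*local"                     # both move, but none cross the boundary

# Examples with k = 3 pid bits and l = 3 offset bits:
assert classify([1, 2, 0, 3, 4, 5], 3, 3) == "local"        # rotate the offset bits
assert classify([0, 1, 2, 4, 5, 3], 3, 3) == "global"       # rotate the pid bits
assert classify([3, 4, 5, 0, 1, 2], 3, 3) == "mixed"        # swap pid and offset fields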
Mar. 1, 2001 Parallel Processing 20
Distributed Stride Permutation
Source pid|offset → destination pid|offset under L^{64}_2 (64 points on 8 PEs, 8 per PE; only local offsets 000 and 001 are shown):
• 000|000 → 000|000    000|001 → 100|000
• 001|000 → 000|100    001|001 → 100|100
• 010|000 → 001|000    010|001 → 101|000
• 011|000 → 001|100    011|001 → 101|100
• 100|000 → 010|000    100|001 → 110|000
• 101|000 → 010|100    101|001 → 110|100
• 110|000 → 011|000    110|001 → 111|000
• 111|000 → 011|100    111|001 → 111|100
Mar. 1, 2001 Parallel Processing 21
Communication Pattern
[Diagram: PEs 0–7 exchanging data for L^{64}_2. Each PE splits its local array into the even-stride part X(0:2:6) and the odd-stride part X(1:2:7) and sends each part to a different PE, where the pieces land in contiguous halves of Y.]
Mar. 1, 2001 Parallel Processing 22
Communication Pattern
Each PE sends 1/2 of its data to 2 different PEs (L^{64}_2).
Mar. 1, 2001 Parallel Processing 23
Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs (L^{64}_4).
Mar. 1, 2001 Parallel Processing 24
Communication Pattern
Each PE sends 1/8 of its data to 8 different PEs (L^{64}_8).
Mar. 1, 2001 Parallel Processing 25
Implementation of Distributed Stride Permutation
D_Stride(Y, N, t, P, k, M, l, S, j, X)
// Compute Y = L^N_S X
// Inputs:
//   Y, X: distributed vectors of size N = 2^t,
//         with M = 2^l elements per processor
//   P = 2^k: number of processors
//   S = 2^j, 0 <= j <= k: the stride
// Output:
//   Y = L^N_S X
p = pid
for i = 0, ..., 2^j - 1 do
  put x(i : S : i + S*(M/S - 1))
  in  y((M/S)*(p mod S) : (M/S)*(p mod S) + M/S - 1)
  on  PE p/2^j + i*2^{k-j}
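The communication pattern this routine produces can be simulated directly. The Python sketch below (an illustration, not the slides' D_Stride) applies Y = L^N_S X to a block-distributed index space and confirms that each PE sends 1/S of its data to S = 2^j distinct PEs, as in the diagrams above:

def stride_destinations(t, k, j):
    # N = 2^t points, P = 2^k processors, stride S = 2^j (j <= k)
    N, P, S = 1 << t, 1 << k, 1 << j
    dests = {p: set() for p in range(P)}
    for x in range(N):
        src_pid = x >> (t - k)                       # high k bits of the source index
        y = (x // S) + (x % S) * (N // S)            # position of element x under L^N_S
        dests[src_pid].add(y >> (t - k))             # pid of the destination
    return dests

# 64 points on 8 PEs: stride 2 -> 2 destination PEs each, stride 4 -> 4, stride 8 -> 8
for j in (1, 2, 3):
    assert all(len(d) == 1 << j for d in stride_destinations(6, 3, j).values())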
Mar. 1, 2001 Parallel Processing 26
Cyclic Scheduling
Each PE sends 1/4 of its data to 4 different PEs (L^{64}_4).
Mar. 1, 2001 Parallel Processing 27
Distributed Bit Reversal Permutation
• Mixed tensor permutation
• Implement using factorization
R_N : e_{i_1} ⊗ e_{i_2} ⊗ ⋯ ⊗ e_{i_t} → e_{i_t} ⊗ ⋯ ⊗ e_{i_2} ⊗ e_{i_1}     (N = 2^t)

R_N = L^N_{N/P} (R_P ⊗ R_{N/P}) = (R_P ⊗ R_{N/P}) L^N_P

b7 b6 b5 | b4 b3 b2 b1 b0  →  b5 b6 b7 | b0 b1 b2 b3 b4     (R_P ⊗ R_{N/P})
b7 b6 b5 | b4 b3 b2 b1 b0  →  b0 b1 b2 | b3 b4 b5 b6 b7     (R_N)
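A numpy sketch (illustrative, not from the slides) checking this factorization for N = 256 points on P = 8 processors:

import numpy as np

def bitrev(n):
    # R_n: bit-reversal permutation matrix on n = 2^t points
    t = n.bit_length() - 1
    idx = [int(format(i, f"0{t}b")[::-1], 2) for i in range(n)]
    return np.eye(n)[idx]

def stride_perm(n, s):
    # L^n_s: stride permutation matrix
    idx = np.concatenate([np.arange(i, n, s) for i in range(s)])
    return np.eye(n)[idx]

N, P = 256, 8
lhs = bitrev(N)
rhs = np.kron(bitrev(P), bitrev(N // P)) @ stride_perm(N, P)
assert np.allclose(lhs, rhs)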
Mar. 1, 2001 Parallel Processing 28
Experiments on the CRAY T3E
• All experiments were performed on a 240-node (8x4x8 with partial plane) T3E using 128 processors (300 MHz) with 128 MB of memory.
  – Task 1 (pairwise communication; see the sketch below)
    Implemented with shmem_get, shmem_put, and mpi_sendrecv
  – Task 2 (all 7! = 5040 global tensor permutations)
    Implemented with shmem_get, shmem_put, and mpi_sendrecv
  – Task 3 (local tensor permutations of the form I ⊗ L ⊗ I on vectors of size 2^22 words; only run on a single node)
    Implemented using streams on/off, cache bypass
  – Task 4 (distributed stride permutations)
    Implemented using shmem_iput, shmem_iget, and mpi_sendrecv
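For reference, a pairwise exchange in the style of Task 1 can be written with mpi4py as follows; this is only an illustrative sketch (assumed buffer size and pairing), not the original benchmark code, which used Cray SHMEM and MPI directly:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
partner = rank ^ 1                          # pair up ranks (0,1), (2,3), ...
send = np.full(1 << 20, rank, dtype=np.float64)
recv = np.empty_like(send)
# Simultaneous send and receive with the partner, as with mpi_sendrecv
comm.Sendrecv(send, dest=partner, recvbuf=recv, source=partner)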
Mar. 1, 2001 Parallel Processing 29
Task 1 Performance Data
Mar. 1, 2001 Parallel Processing 30
Task 2 Performance Data
[Histogram: Distribution of Global Tensor Permutations on 128 Processors (shmem_put); x-axis: bandwidth in MB/sec/node, y-axis: number of tensor permutations.]
Mar. 1, 2001 Parallel Processing 31
Task 3 Performance Data
[Plot: Local Tensor Permutations (Bit Rotations) for 2^22 Words with Cache Bypass; x-axis: permutation number, y-axis: MB/sec.]
Mar. 1, 2001 Parallel Processing 32
Task 4 Performance Data
[Plot: Performance of Distributed Stride Permutations on 128 Processors with 2^22 Words; x-axis: stride = 2^j, y-axis: MB/sec.]
Mar. 1, 2001 Parallel Processing 33
Network Simulator
• An idealized simulator for the T3E was developed (with C. Grassl from Cray Research) in order to study contention.
  – Specify the processor layout, the route table, and the number of virtual processors with a given start node.
  – Each processor can simultaneously issue a single send.
• Contention is measured as the maximum number of messages across any edge/node.
• The simulator was used to study global and mixed tensor permutations.
Mar. 1, 2001 Parallel Processing 34
Task 2 Grid Simulation Analysis
[Histogram: Distribution of Global Tensor Permutations on 8x8x2 Grid w/o Wrap Around; x-axis: max messages through a node, y-axis: number of tensor permutations.]
Mar. 1, 2001 Parallel Processing 35
Task 2 Grid Simulation Analysis
[Histogram: Distribution of Global Tensor Permutations on 8x8x2 Grid w/o Wrap Around; x-axis: max messages through an edge, y-axis: number of tensor permutations.]
Mar. 1, 2001 Parallel Processing 36
Task 2 Torus Simulation Analysis

[Histogram: Distribution of Global Tensor Permutations on 8x8x2 Grid with Wrap Around; x-axis: max messages through a node, y-axis: number of tensor permutations.]
Mar. 1, 2001 Parallel Processing 37
Task 2 Torus Simulation Analysis
[Histogram: Distribution of Global Tensor Permutations on 8x8x2 Grid with Wrap Around; x-axis: max messages through an edge, y-axis: number of tensor permutations.]