NCSU 2/24/06 1
Array Dependence Analysis with the Chains of Recurrences Framework for Loop Optimization
Robert van Engelen
Florida State University
Also thanks to J. Birch, Y. Shou, and K. Gallivan
Outline
- Motivation
- Restructuring compilers
- Chains of recurrences algebra and associated algorithms for the GCC and Polaris compilers
- Nonlinear array dependence testing for loop restructuring and vectorization
- Experimental results
- Conclusions
Motivation
- Intel CTO: “the increased power requirements of newer chips will lead to CPUs that are hotter than the surface of the sun by 2010”
- Enter multi-core CPUs
  - Increase the overall system speed by adding CPU cores
  - Speed up multi-threaded applications
  - Can effectively lower the power consumption
- Enter (more?) multi-media extensions
  - Vector-like instruction sets: MMX, SSE, AltiVec
  - Speed up multi-media codes, such as JPEG, MPEG
Code Optimization by Hand or Automatic?
- Rewriting applications by hand to exploit parallelism is doable, if:
  - Tasks can be identified that run independently, such as a Web browser’s rendering and communications tasks
  - Coarse-grain parallelism: tasks must have sufficient work
- Rewriting applications by hand to exploit lots of fine-grain parallelism is not doable
  - Thousands of read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) data dependences must be analyzed
Restructuring Compilers
- A restructuring compiler typically applies source-code transformations automatically to meet various performance enhancement criteria:
  - Exploit parallelism in loops by reordering the loop structure to run loop iterations in parallel
  - Find small loops to replace with vector instructions
  - Optimize data locality by reordering code to change memory access order and cache behavior
- All code changes are safe as long as RAW, WAR, and WAW data dependences are preserved!
Example: Loop Fission
- Loop fission splits a single loop into multiple loops
- Allows vectorization and parallelization of the new loops when the original loop was sequential
- Loop fission must preserve all dependence relations of the original loop

Original loop nest, with dependence S3 (=,<) S4:

    S1  DO I = 1, 10
    S2    DO J = 1, 10
    S3      A(I,J) = B(I,J) + C(I,J)
    S4      D(I,J) = A(I,J-1) * 2.0
    S5    ENDDO
    S6  ENDDO

After fission of the J-loop, S3 (=,<) S4 is preserved:

    S1  DO I = 1, 10
    S2    DO J = 1, 10
    S3      A(I,J) = B(I,J) + C(I,J)
    Sx    ENDDO
    Sy    DO J = 1, 10
    S4      D(I,J) = A(I,J-1) * 2.0
    S5    ENDDO
    S6  ENDDO

After vectorization and parallelization, S3 (=,<) S4 is preserved:

    S1  PARALLEL DO I = 1, 10
    S3    A(I,1:10)=B(I,1:10)+C(I,1:10)
    S4    D(I,1:10)=A(I,0:9) * 2.0
    S6  ENDDO
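The fission example above can be checked with a small Python sketch that runs both versions on the same data. The array contents and the helper names (`make_inputs`, `original`, `fissioned`) are assumptions made for illustration; index 0 holds the values of A(I,0) that S4 reads when J = 1.

```python
def make_inputs():
    # Hypothetical test data; indices 1..10 are used, index 0 is the halo for A(I,0).
    B = [[i + j for j in range(11)] for i in range(11)]
    C = [[i * j for j in range(11)] for i in range(11)]
    A0 = [[-1.0] * 11 for _ in range(11)]
    return B, C, A0

def original(B, C, A0):
    A = [row[:] for row in A0]
    D = [[0.0] * 11 for _ in range(11)]
    for i in range(1, 11):
        for j in range(1, 11):
            A[i][j] = B[i][j] + C[i][j]      # S3
            D[i][j] = A[i][j - 1] * 2.0      # S4 reads what S3 wrote one J-iteration earlier
    return A, D

def fissioned(B, C, A0):
    A = [row[:] for row in A0]
    D = [[0.0] * 11 for _ in range(11)]
    for i in range(1, 11):
        for j in range(1, 11):               # first loop: all of S3
            A[i][j] = B[i][j] + C[i][j]
        for j in range(1, 11):               # second loop: all of S4
            D[i][j] = A[i][j - 1] * 2.0
    return A, D

same = original(*make_inputs()) == fissioned(*make_inputs())
```

Because the only dependence runs from S3 to S4, placing every S3 instance before every S4 instance within the same I-iteration keeps each read after its write.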
Loop Fission: Algorithm
Compute the acyclic condensation of the dependence graph to find a legal order of the loops

    S1  DO I = 1, 10
    S2    A(I) = A(I) + B(I-1)
    S3    B(I) = C(I-1)*X + Z
    S4    C(I) = 1/B(I)
    S5    D(I) = sqrt(C(I))
    S6  ENDDO

Dependence graph (edges labeled with the dependence distance):

    S3 (<) S2    distance 1
    S4 (<) S3    distance 1
    S3 (=) S4    distance 0
    S4 (=) S5    distance 0

Acyclic condensation: the cycle {S3, S4} precedes S2 and S5

    S1  DO I = 1, 10
    S3    B(I) = C(I-1)*X + Z
    S4    C(I) = 1/B(I)
    Sx  ENDDO
    S2  A(1:10) = A(1:10) + B(0:9)
    S5  D(1:10) = sqrt(C(1:10))
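As a rough illustration of the condensation step, the sketch below finds the strongly connected components of the dependence graph above and lists them in a legal order. It uses Tarjan's algorithm, which is one common choice and not necessarily what a given compiler uses; the names `strongly_connected_components`, `deps` and `order` are illustrative.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. Returns SCCs in reverse topological order (sinks first)."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:               # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(frozenset(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

# Edges point from the source of a dependence to its sink.
deps = {"S2": [], "S3": ["S2", "S4"], "S4": ["S3", "S5"], "S5": []}

# Reversing Tarjan's output gives a topological order of the condensation.
order = list(reversed(strongly_connected_components(deps)))
```

The component {S3, S4} is a cycle and must stay a sequential loop; S2 and S5 are singletons that can be placed after it in either order and vectorized.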
Example: Loop Interchange
- Changes the loop nesting order
- Allows vectorization of an outer loop and more effective parallelization of an inner loop
- Can be used to improve spatial locality
- Loop interchange must preserve all dependence relations of the original loop

Original loop nest, with dependence S3 (=,<) S3:

    S1  DO I = 1, N
    S2    DO J = 1, M
    S3      A(I,J) = A(I,J-1) + B(I,J)
    S4    ENDDO
    S5  ENDDO

After interchange, the dependence becomes S3 (<,=) S3:

    S2  DO J = 1, M
    S1    DO I = 1, N
    S3      A(I,J) = A(I,J-1) + B(I,J)
    S4    ENDDO
    S5  ENDDO

After vectorization of the inner loop, S3 (<,=) S3:

    S2  DO J = 1, M
    S3    A(1:N,J)=A(1:N,J-1)+B(1:N,J)
    S5  ENDDO
Loop Interchange: Algorithm
Compute the direction matrix and find which columns (and therefore which loops) can be permuted without violating dependence relations in the original loop nest

    S1  DO I = 1, N
    S2    DO J = 1, M
    S3      DO K = 1, L
    S4        A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
    S5      ENDDO
    S6    ENDDO
    S7  ENDDO

Dependences: S4 (<,<,=) S4 and S4 (<,=,>) S4

Direction matrix (columns I, J, K):

    <  <  =
    <  =  >

Columns permuted to (J, K, I): invalid, because the second row now starts with (=, >)

    <  =  <
    =  >  <

Columns permuted to (J, I, K): valid

    <  <  =
    =  <  >
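The legality check on the direction matrix can be sketched in a few lines of Python: a permutation is accepted when, in every permuted row, the first entry that is not "=" is "<". The function name `is_legal` and the tuple encoding of rows are assumptions for this sketch.

```python
from itertools import permutations

def is_legal(direction_matrix, perm):
    """perm[k] is the original column placed at position k."""
    for row in direction_matrix:
        for d in (row[k] for k in perm):
            if d == "<":
                break                # leading "<": this dependence is preserved
            if d == ">":
                return False         # leading ">": the dependence would be reversed
    return True

loops = ("I", "J", "K")
matrix = [("<", "<", "="), ("<", "=", ">")]

legal_orders = [
    tuple(loops[k] for k in perm)
    for perm in permutations(range(3))
    if is_legal(matrix, perm)
]
```

For the matrix above this accepts the orders (I,J,K), (I,K,J) and (J,I,K) and rejects (J,K,I), in line with the valid and invalid cases shown.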
Complications
- Loop restructuring is complicated by:
  - The presence of several induction variables
  - Nonlinear and symbolic array index expressions
  - The use of pointer arithmetic instead of arrays in C
  - Non-unit loop strides and unstructured loops
  - Control flow
- Need loop normalization and preprocessing
  - Apply induction variable substitution
  - Convert pointer dereferences to array accesses
  - Normalize the loop iteration space
Induction Variable Substitution
Example loop:

    I = 0
    J = 1
    while (I<N)
      I = I+1
      … = A[J]
      J = J+2
      K = 2*I
      A[K] = …
    endwhile

After IV substitution (IVS) (note the affine indexes):

    for i=0 to N-1
      S1: … = A[2*i+1]
      S2: A[2*i+2] = …
    endfor

After the dependence test and parallelization:

    forall (i=0,N-1)
      … = A[2*i+1]
      A[2*i+2] = …
    endforall

GCD test to solve the dependence equation $2 i_d - 2 i_u = -1$: since 2 does not divide 1, there is no data dependence.

[Figure: layout of A[] with alternating written (W) and read (R) elements for A[2*i+2] and A[2*i+1]]
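The GCD test used above can be written as a short Python sketch: a linear equation with integer coefficients has an integer solution only if the gcd of the coefficients divides the right-hand side. The function name `gcd_test` and its argument layout are illustrative, and loop bounds are ignored, as in the plain GCD test.

```python
from math import gcd

def gcd_test(coeffs, rhs):
    """Return True if sum(coeffs[k] * x[k]) == rhs may have an integer solution."""
    g = 0
    for c in coeffs:
        g = gcd(g, abs(c))
    if g == 0:                # all coefficients are zero
        return rhs == 0
    return rhs % g == 0

# Dependence equation of the example: 2*i_d - 2*i_u = -1
may_depend = gcd_test([2, -2], -1)
```

A result of `False` proves independence; `True` only means the test could not rule a dependence out.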
IV Recognition on SSA Forms

    I1 = 3
    M1 = 0
    do
      I2 = φ(I1,I3)
      J1 = φ(?,J3)
      K1 = φ(?,K2)
      L1 = φ(?,L2)
      M2 = φ(M1,M3)
      J2 = 3
      I3 = I2+1
      L2 = M2+1
      M3 = L2+2
      J3 = I3+J2
      K2 = 2*J3
    while (…)

Recognized IVs:

    I2(i) = 3+i
    J1(i) = 7+i
    L2(i) = 1+3i
    K1(i) = 14+2i
    M2(i) = 3i

Spanning tree
[Cytron91, Wolfe92]
Symbolic Differencing

    do
      x = x+z
      y = z+1
      z = y+1
    while (…)

    Iteration | x      | diff | diff | y   | diff | z   | diff
    1         | x+z    |      |      | z+1 |      | z   |
    2         | x+2z+2 | z+2  |      | z+3 | 2    | z+2 | 2
    3         | x+3z+6 | z+4  | 2    | z+5 | 2    | z+4 | 2

Use abstract interpretation to evaluate loop iterations and construct a symbolic difference table of the IV values

$x(i) = x_0 + z_0 i + (i^2 - i)$
$y(i) = z_0 + 2i + 1$
$z(i) = z_0 + 2i$
[Haghighat95]
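The closed forms obtained by symbolic differencing can be sanity-checked by running the loop on concrete values. The sketch below is only a numeric check, not the symbolic method itself; `simulate` and the sample start values are made up for the illustration, and x(i) and z(i) are taken as the values at entry to iteration i (counting from 0).

```python
def simulate(x0, z0, iterations):
    """Run the loop body and record (x at entry, y computed, z at entry)."""
    x, z = x0, z0
    trace = []
    for _ in range(iterations):
        x_in, z_in = x, z
        x = x + z
        y = z + 1
        z = y + 1
        trace.append((x_in, y, z_in))
    return trace

x0, z0 = 5, 3                      # arbitrary sample start values
trace = simulate(x0, z0, 8)

matches = all(
    x == x0 + z0 * i + (i * i - i) and y == z0 + 2 * i + 1 and z == z0 + 2 * i
    for i, (x, y, z) in enumerate(trace)
)
```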
Pointer-to-Array Conversion
Lsp_az speech codec segment from ETSI with pointer updates:

    f += 2;
    lsp += 2;
    for (i = 2; i <= 5; i++)
    {
      *f = f[-2];
      for (j = 1; j < i; j++, f--)
        *f += f[-2]-2*(*lsp)*f[-1];
      *f -= 2*(*lsp);
      f += i;
      lsp += 2;
    }

Lsp_az speech codec segment after pointer-to-array conversion:

    for (i = 0; i <= 3; i++)
    {
      f[i+2] = f[i];
      for (j = 0; j <= i; j++)
        f[i-j+2] += f[i-j]-2*lsp[2*i+2]*f[i-j+1];
      f[1] -= 2*lsp[2*i+2];
    }

Note that all array index expressions are affine.
[vanEngelen01, Franke01]
Control-Flow Issues
Conditional array accesses and conditionally updated induction variables present problems:

    do {
      K = 3;
      K = K+J;
      if (…)
        J = K;
      else
        J = J+3;
      A[J] = …
    } while (J<N)

Extensive analysis reveals that J:=J+3

    DO I=1,10
      IF …
        J = J+2
      ELSE
        J = I
      ENDIF
      A(J) = …
    ENDDO

Problem: J has no single recurrence form

    for (…) {
      if (…)
        A[I] = …
      else
        … = A[J]
    }

Assume RAW and WAR dependences
Chains of Recurrences for Compiler Optimization
- Chains of recurrences forms and algebra can be used to:
  - Detect (non)linear coupled IVs
  - Analyze pointer arithmetic
  - Effectively handle control flow
  - Implement array dependence testing
Chains of Recurrences
A chain of recurrences (CR) represents a polynomial or exponential function, or a mix of both, evaluated over a unit-distance grid [Zima92]

Basic form: {init, ⊙, stride}, where ⊙ is either + or *

    Iteration | {init, ⊙, stride}               | f(i) = 2i+1 = {1, +, 2} | f(i) = 2^i = {1, *, 2}
    i = 0     | init                            | 1                       | 1
    i = 1     | init ⊙ stride                   | 3                       | 2
    i = 2     | init ⊙ stride ⊙ stride          | 5                       | 4
    i = 3     | init ⊙ stride ⊙ stride ⊙ stride | 7                       | 8
Chains of Recurrences: General Formulation

The key idea is to represent a non-constant CR stride in CR form itself, thereby forming a chain of recurrences

Example: f(i) = i^2 = {0, +, s(i-1)} = {0, +, 1, +, 2} where s(i-1) = {1, +, 2}

    Iteration | {init, ⊙, s(i-1)}         | s(i) = {1, +, 2} | f(i) = {0, +, s(i-1)}
    i = 0     | init                      | 1                | 0
    i = 1     | init ⊙ s(0)               | 3                | 1
    i = 2     | init ⊙ s(0) ⊙ s(1)        | 5                | 4
    i = 3     | init ⊙ s(0) ⊙ s(1) ⊙ s(2) | 7                | 9
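A flattened CR such as {0, +, 1, +, 2} can be evaluated over the grid with one update per coefficient per iteration. The following Python sketch assumes a list encoding `[c0, op1, c1, op2, c2, ...]` with operators given as the strings "+" and "*"; the encoding and the name `cr_values` are choices made for this illustration.

```python
def cr_values(cr, n):
    """Return the first n values of a flattened chain of recurrences."""
    coeffs = list(cr[0::2])
    ops = list(cr[1::2])
    out = []
    for _ in range(n):
        out.append(coeffs[0])
        # Update front to back so each coefficient uses the old value of its stride.
        for k, op in enumerate(ops):
            if op == "+":
                coeffs[k] = coeffs[k] + coeffs[k + 1]
            else:
                coeffs[k] = coeffs[k] * coeffs[k + 1]
    return out

linear = cr_values([1, "+", 2], 4)             # 2i+1
exponential = cr_values([1, "*", 2], 4)        # 2^i
square = cr_values([0, "+", 1, "+", 2], 4)     # i^2
```

Each new value costs one addition or multiplication per chain element, which is what makes CR forms attractive for evaluating functions on grids.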
CRs for Expediting Function Evaluations on Grids

- Suppose f(i) = a + b·i + c·i^2 = {a, +, {b+c, +, 2c}}
- We have two IVs x and y:
  - f(i) = x = {x0, +, y} with x0 = a
  - s(i) = y = {y0, +, 2c} with y0 = b+c
- Implement a loop to update x and y for efficient evaluation of f(i) over a unit-distance grid i = 0, …, n:

    x = a
    y = b+c
    for i=0 to n
      f[i] = x
      x = x+y
      y = y+2*c
    endfor

[Figure: plot of s(i) against Iteration (1 to 10), vertical axis 0 to 30]
Multi-Dimensional Example
Let f(i,j) = i^2 + i·j + 1

1. Create IV k for f(i,j) in the j-loop:
   f(i,j) = k_j = {p_i, +, r_i}_j with p_i = i^2 + 1 and r_i = i
2. Create IVs for p_i and r_i in the i-loop:
   p_i = {p0, +, q_i}_i with p0 = 1
   q_i = {q0, +, 2}_i with q0 = 1
   r_i = {r0, +, 1}_i with r0 = 0
3. Implement k, p, q, and r in the i-j-loop nest:

    p = 1
    q = 1
    r = 0
    for i = 0 to n
      k = p
      for j = 0 to m
        f[i,j] = k
        k = k+r
      endfor
      p = p+q
      q = q+2
      r = r+1
    endfor
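The loop nest above translates directly to Python, which makes it easy to confirm that the additions-only version reproduces f(i,j) = i^2 + i·j + 1 at every grid point. The function name `fill` and the nested-list layout are assumptions of this sketch.

```python
def fill(n, m):
    """Evaluate f(i,j) = i*i + i*j + 1 on the grid using only additions."""
    f = [[0] * (m + 1) for _ in range(n + 1)]
    p, q, r = 1, 1, 0
    for i in range(n + 1):
        k = p                      # k restarts at p_i = i*i + 1 for each row
        for j in range(m + 1):
            f[i][j] = k
            k = k + r              # stride r_i = i in the j direction
        p = p + q                  # p follows {1, +, q}
        q = q + 2                  # q follows {1, +, 2}
        r = r + 1                  # r follows {0, +, 1}
    return f

grid = fill(4, 5)
```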
CR Construction with the CR Algebra

To construct the CR form of a symbolic function f(i):

1. Replace i with CR {0, +, 1}
2. Apply CR algebra rewrite rules (selected rules shown):

    {x, +, y} + c          ⇒  {x+c, +, y}
    c{x, +, y}             ⇒  {c·x, +, c·y}
    {x, +, y} + {u, +, v}  ⇒  {x+u, +, y+v}
    {x, +, y} * {u, +, v}  ⇒  {x·u, +, y{u, +, v}+v{x, +, y}+y·v}

Example: f(i) = c·(i+a) = c·({0, +, 1}+a) = c{a, +, 1} = {c·a, +, c}
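The four rules shown can be sketched for numeric, add-only CRs stored as tuples. For the product rule the sketch flattens the stride y{u,+,v} + v{x,+,y} + y·v into {y·u + v·x + y·v, +, 2·y·v} by applying the first three rules; the function names and the restriction to constant x, y, u, v are assumptions of this illustration.

```python
def add_const(cr, c):
    """{x, +, y} + c  =>  {x+c, +, y}"""
    return (cr[0] + c,) + tuple(cr[1:])

def scale(c, cr):
    """c{x, +, y}  =>  {c*x, +, c*y}"""
    return tuple(c * t for t in cr)

def add(cr1, cr2):
    """{x, +, y} + {u, +, v}  =>  {x+u, +, y+v}"""
    (x, y), (u, v) = cr1, cr2
    return (x + u, y + v)

def mul(cr1, cr2):
    """{x, +, y} * {u, +, v} with the stride flattened to a second-order CR."""
    (x, y), (u, v) = cr1, cr2
    stride = add_const(add(scale(y, cr2), scale(v, cr1)), y * v)
    return (x * u,) + stride

def values(cr, n):
    """First n values of an add-only CR {c0, +, c1, +, ...}."""
    c = list(cr)
    out = []
    for _ in range(n):
        out.append(c[0])
        for k in range(len(c) - 1):
            c[k] += c[k + 1]
    return out

# Example from the slide with sample numbers c = 3, a = 4: c*(i+a) = {c*a, +, c}
example = scale(3, add_const((0, 1), 4))
product = mul((1, 2), (0, 3))      # (1+2i) * (3i)
```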
Loop Analysis with CR Forms
The basic idea:
- Scan the loop to detect IV updates
- Construct the CR form for each IV using the CR algebra

    do
      J = J+I
      I = I+3
      P = 2*P
    while (…)

    J = {J0, +, I} = {J0, +, {I0, +, 3}}
    I = {I0, +, 3}
    P = {P0, *, 2}
[vanEngelen01]
Algorithm 1: Find Recurrences
Input: Loop L with live variable information
Output: Set S of recurrence relations of IVs

1. Start with set S = { ⟨v, v⟩ | v is live at loop header }
2. Search L from bottom to top:
   for each assignment v = x of expression x to scalar variable v,
   update tuples ⟨u, y⟩ in S by replacing v in y with x

    Loop L          | Step | Changes to S = {⟨H, H⟩, ⟨I, I⟩, ⟨J, J⟩, ⟨K, K⟩}
    do              |      |
      M = 2         | 5    | S5 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J-H+2⟩, ⟨K, K+2*I⟩}
      L = J-H       | 4    | S4 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J-H+M⟩, ⟨K, K+M*I⟩}
      J = L+M       | 3    | S3 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, L+M⟩, ⟨K, K+M*I⟩}
      K = K+M*I     | 2    | S2 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J⟩, ⟨K, K+M*I⟩}
      I = I+1       | 1    | S1 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J⟩, ⟨K, K⟩}
    while (…)       |      |
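The bottom-to-top substitution of Algorithm 1 can be imitated with plain string rewriting. This is a minimal sketch under strong assumptions: straight-line loop bodies only, expressions kept as strings, and substitution done with a word-boundary regular expression instead of a real expression tree. The names `find_recurrences`, `LOOP` and `LIVE` are hypothetical.

```python
import re

def find_recurrences(assignments, live):
    """assignments: list of (variable, expression) in program order."""
    S = {v: v for v in live}                      # start with <v, v>
    for v, x in reversed(assignments):            # search bottom to top
        pattern = re.compile(r"\b" + re.escape(v) + r"\b")
        for u in S:
            # Replace v in y with (x); the lambda avoids escape processing.
            S[u] = pattern.sub(lambda _m: "(" + x + ")", S[u])
    return S

LOOP = [("M", "2"), ("L", "J-H"), ("J", "L+M"), ("K", "K+M*I"), ("I", "I+1")]
LIVE = ["H", "I", "J", "K"]

S = find_recurrences(LOOP, LIVE)

# Numeric check: evaluating each relation on entry values gives the values
# after one trip through the loop body.
entry = {"H": 5, "I": 3, "J": 7, "K": 11}
after_one_trip = {v: eval(expr, {}, dict(entry)) for v, expr in S.items()}
```

The resulting strings carry extra parentheses, for example `((J-H)+(2))` for J, but denote the same relations as S5 in the table.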
Algorithm 2: Compute CR Forms
Input: Set S with recurrence relations
Output: CR forms for IVs in S

1. For each relation ⟨v, x⟩ in S do:
   if x is of the form v then v = v0 (v is loop invariant)
   if x is of the form v + y then v = {v0, +, y}
   if x is of the form v * y then v = {v0, *, y}
   if x does not contain v then v = {v0, #, y} (v is wrap around)
2. Simplify the CR forms with the CR algebra rewrite rules

    Recurrence relation in S | CR form          | Simplified CR form
    ⟨H, H⟩                   | H = H0           | H = H0
    ⟨I, I+1⟩                 | I = {I0, +, 1}   | I = {I0, +, 1}
    ⟨J, J-H+2⟩               | J = {J0, +, 2-H} | J = {J0, +, 2-H0}
    ⟨K, K+2*I⟩               | K = {K0, +, 2*I} | K = {K0, +, 2I0, +, 2}
Algorithm 3: Solve
Input: CR forms for IVs
Output: Closed-form solutions for IVs (when possible)

1. For each CR form of v apply the CR inverse algebra, assuming the loop is normalized for i = 0, …, n
2. Certain “exotic” mixed non-polynomial and non-exponential CR forms may not have closed forms

Loop L:

    do
      M = 2
      L = J-H
      J = L+M
      K = K+M*I
      I = I+1
    while (…)

    Simplified CR form     | Closed form
    J = {J0, +, 2-H0}      | $J(i) = J_0 + (2 - H_0)\,i$
    K = {K0, +, 2I0, +, 2} | $K(i) = K_0 + i^2 + (2 I_0 - 1)\,i$
    I = {I0, +, 1}         | $I(i) = I_0 + i$
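For add-only CRs with constant coefficients, a closed form follows from Newton's forward-difference formula: {c0, +, c1, +, …, ck} evaluated at i equals the sum of c_j·C(i, j). The sketch below uses that identity as a stand-in for the CR inverse algebra (an assumption; the slides do not spell out the inverse rules) and compares it with a direct simulation of loop L on sample values.

```python
from math import comb

def closed_form(coeffs, i):
    """Value at i of the add-only CR {c0, +, c1, +, ..., ck}."""
    return sum(c * comb(i, k) for k, c in enumerate(coeffs))

def simulate(H, I, J, K, trips):
    """Values of (I, J, K) at entry to each iteration of loop L."""
    trace = []
    for _ in range(trips):
        trace.append((I, J, K))
        M = 2
        L = J - H
        J = L + M
        K = K + M * I
        I = I + 1
    return trace

H0, I0, J0, K0 = 5, 3, 7, 11       # arbitrary sample entry values
trace = simulate(H0, I0, J0, K0, 8)

agrees = all(
    I == closed_form([I0, 1], i) == I0 + i
    and J == closed_form([J0, 2 - H0], i) == J0 + (2 - H0) * i
    and K == closed_form([K0, 2 * I0, 2], i) == K0 + i * i + (2 * I0 - 1) * i
    for i, (I, J, K) in enumerate(trace)
)
```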
Example 1
    Loop L            | Step | S = {⟨x, x⟩, ⟨z, z⟩}
    x = 2             |      |
    z = 0             |      |
    do                |      |
      A(x) = A(z)     |      |
      x = x+z         | 3    | S3 = {⟨x, x+z⟩, ⟨z, z+2⟩}
      y = z+1         | 2    | S2 = {⟨x, x⟩, ⟨z, z+2⟩}
      z = y+1         | 1    | S1 = {⟨x, x⟩, ⟨z, y+1⟩}
    while (z<N)       |      |

CR form:

    x = {x0, +, z}
    z = {z0, +, 2}

Closed form:

$x(i) = x_0 + z_0 i + i^2 - i$
$z(i) = z_0 + 2i$

Resulting loop:

    do i=0,2*N-2
      A(i*i-i+2) = A(2*i)
    end do
Example 2
TRFD code segment from Perfect Benchmark with IV updates:

    DO I=1,M
      DO J=1,I
        ij = ij+1
        ijkl = ijkl+I-J+1
        DO K=I+1,M
          DO L=1,K
            ijkl = ijkl+1
            xijkl[ijkl]=xkl[L]
          ENDDO
        ENDDO
        ijkl = ijkl+ij+left
      ENDDO
    ENDDO

TRFD after aggressive induction variable substitution (IVS):

    DO I=0,M-1
      DO J=0,I
        DO K=0,M-I-2
          DO L=0,I+K+1
            tmp = ijkl+L+I*(K+(M+M*M+2*left+6)/4)+J*(left+(M+M*M)/2)+((I*I*M*M)+2*(K*K+3*K+I*I*(left+1))+M*I*I)/4+2
            xijkl[tmp] = xkl[L+1]
          ENDDO
        ENDDO
      ENDDO
    ENDDO
Example 3 (SSA)
Source:

    a = 1;
    while (a<10) {
      x = a+2;
      a = a+1;
    }

SSA form:

        a0 = 1
        if (a0>=10) goto L2
    L1: a1 = φ(a0, a2)
        x0 = a1 + 2
        a2 = a1+1
        if (a2<10) goto L1
    L2:

[Figure: SSA graph linking a0, a1, a2 and x0 through the + operations with constants 1 and 2]

a1 = {1,+,1}

GCC 4.x uses our approach applied to SSA form.

Note: GCC developers refer to CRs as “scalar evolutions”
Example 4 (SSA)
Source:

    x = 0;
    i = 1;
    while (i<10) {
      x = x+i;
      i = i+1;
    }

SSA form:

        x0 = 0
        i0 = 1
        if (i0>=10) goto L2
    L1: x1 = φ(x0, x2)
        i1 = φ(i0, i2)
        x2 = x1+i1
        i2 = i1+1
        if (i2<10) goto L1
    L2:

[Figure: SSA graph linking i0, i1, i2 and x0, x1, x2 through the + operations]

i1 = {1,+,1}
x1 = {0,+,i1} = {0,+,1,+,1}
Example 5 (SSA)
Source:

    j = 0;
    i = 1;
    while (i<10) {
      if (p)
        j = j+2;
      else
        j = j+3;
      i = i+1;
    }

SSA form:

        j0 = 0
        i0 = 1
        if (i0>=10) goto L2
    L1: i1 = φ(i0, i2)
        j1 = φ(j0, j4)
        if (!p) goto L3
        j2 = j1+2
        goto L4
    L3: j3 = j1+3
    L4: j4 = φ(j2, j3)
        i2 = i1+1
        if (i2<10) goto L1
    L2:

[Figure: SSA graph linking j0, j1, j2, j3 and j4 through the + operations with constants 2 and 3]

{0,+,2} ≤ j1 ≤ {0,+,3}
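The bounding CRs for the conditionally updated IV can be checked empirically: whatever the branch outcomes are, the value of j at entry to iteration i lies between 2i and 3i. The sketch below draws pseudo-random branch outcomes with a fixed seed; `j_trace` and the seed are arbitrary choices for the illustration, and the check is a test on samples, not a proof.

```python
import random

def j_trace(branches):
    """Values of j at entry to each iteration, given the outcomes of p."""
    j = 0
    trace = []
    for p in branches:
        trace.append(j)
        j = j + 2 if p else j + 3
    return trace

rng = random.Random(0)
within_bounds = True
for _ in range(200):
    branches = [rng.random() < 0.5 for _ in range(9)]
    trace = j_trace(branches)
    # Lower bound {0,+,2} gives 2*i, upper bound {0,+,3} gives 3*i.
    within_bounds &= all(2 * i <= j <= 3 * i for i, j in enumerate(trace))
```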
Recognizing Mixed Functional Forms and Reductions
Factorial:

    I = 1
    do
      F = F*I
      I = I+1
    while (…)

    Simplified CR form: F = {F0, *, 1, +, 1}, I = {1, +, 1}
    Closed form:        F = F0 * i!

Reduction:

    I = 0; S = 0
    do
      S = S+A[I]
      I = I+2
    while (…)

    Simplified CR form: S = {0, +, A[{0, +, 2}]}, I = {0, +, 2}
    Closed form:        S = ∑ A[2i]
Pointer Access Descriptions of Pointer and Array References
- A pointer access description (PAD) [vanEngelen01] is a CR form of a pointer or array reference in a loop nest
- PADs are computed with the CR-based IV algorithms

    short a[…], *p;
    int i;
    p = a;
    for(i=0;…;i++){
      /* loop code from the table */
    }

    Loop Code    | PAD               | Sequence
    a[i]         | {a, +, 1}         | a[0],a[1],a[2],a[3]
    a[2*i+1]     | {a+1, +, 2}       | a[1],a[3],a[5],a[7]
    a[(i*i-i)/2] | {a, +, 0, +, 1}   | a[0],a[0],a[1],a[3]
    a[1<<i]      | {a+1, +, 1, *, 2} | a[1],a[2],a[4],a[8]
    p++          | {a, +, 1}         | a[0],a[1],a[2],a[3]
    p+=i         | {a, +, 0, +, 1}   | a[0],a[0],a[1],a[3]
CR-Enhanced Array Dependence Testing
- Basic idea: construct dependence equations in CR form for both pointer and array accesses
- Determine the solution intervals by computing the value ranges of the equations in CR form
- If the solution space is empty, there is no dependence
Example
    float a[…], *p, *q;
    p = a;
    q = a+2*n;
    for (i=0; i<n; i++) {
      t = *p;
      S: *p++ = *q;
      *q-- = t;
    }

Dependence equation:

    {a, +, 1}_id = {a+2n, +, -1}_iu

Constraints:

    0 ≤ i_d ≤ n-1
    0 ≤ i_u ≤ n-1

Rewrite the dependence equation:

    {a, +, 1}_id - {a+2n, +, -1}_iu = 0
    {{-2n, +, 1}_iu, +, 1}_id = 0

Compute the solution interval:

    Low[{{-2n, +, 1}_iu, +, 1}_id] = Low[{-2n, +, 1}_iu] = -2n
    Up[{{-2n, +, 1}_iu, +, 1}_id] = Up[{-2n, +, 1}_iu + n-1] = Up[-2n + 2n - 2] = -2

The interval [-2n, -2] does not contain 0: no dependence
S *
p = {a, +, 1}
q = {a+2n, +, -1}
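For this example the difference of the two access functions is affine in i_d and i_u, so its value range follows from evaluating each term at the ends of its interval. The Python sketch below computes that range and cross-checks it by enumerating all iteration pairs; `affine_range`, `may_depend` and `brute_force` are illustrative names, and only the affine case is handled.

```python
def affine_range(const, terms):
    """Range of const + sum(coef * var) where each var lies in [lo, hi]."""
    low = high = const
    for coef, lo, hi in terms:
        low += min(coef * lo, coef * hi)
        high += max(coef * lo, coef * hi)
    return low, high

def may_depend(n):
    # {a,+,1}_id - {a+2n,+,-1}_iu  =  -2n + i_u + i_d  with 0 <= i_d, i_u <= n-1
    low, high = affine_range(-2 * n, [(1, 0, n - 1), (1, 0, n - 1)])
    return low <= 0 <= high

def brute_force(n):
    """True if some pair of iterations touches the same element."""
    return any(i_d == 2 * n - i_u for i_d in range(n) for i_u in range(n))

interval = affine_range(-2 * 10, [(1, 0, 9), (1, 0, 9)])
```

For n = 10 the interval is [-20, -2], which excludes 0, so the accesses through p and q never overlap.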
Determining the Value Range of a CR Form
- Suppose x(i) = {x0, +, s(i-1)} for i = 0, …, n
  - If s(i-1) > 0 then x(i) is monotonically increasing
  - If s(i-1) < 0 then x(i) is monotonically decreasing
- If a function is monotonic on its domain, then it is trivial to find its exact value range
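A simple way to apply the monotonicity argument to an add-only CR with numeric coefficients is sketched below. It uses a sufficient condition that is an assumption of this sketch, not a rule from the slides: if all coefficients after the first are non-negative the stride never becomes negative, so the range is given by the first and last values (and symmetrically for non-positive coefficients).

```python
def cr_value(coeffs, i):
    """Value at i of the add-only CR {c0, +, c1, +, ...}, by stepping i times."""
    c = list(coeffs)
    for _ in range(i):
        for k in range(len(c) - 1):
            c[k] += c[k + 1]
    return c[0]

def monotonic_range(coeffs, n):
    """Exact (low, high) over i = 0..n when monotonicity is evident, else None."""
    strides = coeffs[1:]
    last = cr_value(coeffs, n)
    if all(c >= 0 for c in strides):
        return (coeffs[0], last)       # non-decreasing
    if all(c <= 0 for c in strides):
        return (last, coeffs[0])       # non-increasing
    return None                        # sign of the stride is not evident

triangular = monotonic_range([0, 0, 1], 3)    # values 0, 0, 1, 3
```

When the function returns `None`, a more careful analysis of the stride's own CR form would be needed before a range can be claimed.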
Example: Nonlinear and Symbolic Dependence Testing
    float a[…], *p, *q;
    p = q = a;
    for (i=0; i<n; i++)
    {
      for (j=0; j<=i; j++)
        *q += *++p;
      q++;
    }

    p = {{a+1, +, 1, +, 1}_i, +, 1}_j = a[(i^2+i)/2+j+1]
    q = {a, +, 1}_i = a[i]

CR dep. test disproves flow dependence (<, <)

    DO i = 1, M+1
      S1: A[I*N+10] = ...
      S2: ... = A[2*I+K]
      K = 2*K+N
    ENDDO

    S1: A[{N+10, +, N}_i]
    S2: A[{K0+2N, +, K0+N+2, *, 2}_i]

CR range test disproves dependence when K+N > 10 and K > 2
Results
- Implemented a CR-enhanced trapezoidal Banerjee test
  - Relatively simple test
  - Enhanced with support for nonlinear forms
  - Enhanced with support for conditional flow
  - Construct dependence equations in CR form
- Implementation based on the Polaris compiler
  - Pros: can compare to powerful dependence tests such as the Omega and Range tests
  - Cons: Fortran only
Additional Independences Filtered over Omega Test
[Figure: chart with a 0% to 100% axis over DYFESM, MDG, OCEAN, QCD, TRFD (Perfect Benchmark) and GEP, NEP, SEP (LAPACK); series: CR-EVT, Omega]
Additional Independences Filtered over Range Test
[Figure: chart with a 0% to 100% axis over DYFESM, MDG, OCEAN, QCD, TRFD, GEP, NEP, SEP; series: CR-EVT, Range]
Additional Independences Filtered over Omega+Range
[Figure: chart with a 0% to 100% axis over DYFESM, MDG, OCEAN, QCD, TRFD, GEP, NEP, SEP; series: CR-EVT, Omega+Range]
Percentage of Conditional IVs w/o Closed Forms in LAPACK
[Figure: chart with a 0% to 100% axis over GEP, NEP, SEP; series: Conditional IVs, Other IVs]
Timing Comparison: Perf Bench.
[Figure: chart of Time (s), axis 0 to 10, over DYFESM, MDG, OCEAN, QCD, TRFD; series: Range, Omega, CR-EVT, CR-EVT (opt)]
Timing Comparison: LAPACK
[Figure: chart of Time (s), axis 0 to 70, over GEP, NEP, SEP; series: Range, Omega, CR-EVT, CR-EVT (opt)]
Conclusions
- A CR-based compiler framework has advantages:
  - Applicable to CFG, AST, and SSA forms
  - Handles conditional flow
  - Handles nonlinear and symbolic induction variable expressions
  - Allows array and pointer-based dependence testing to be applied directly to the CR forms without induction variable substitution
- Future work:
  - Improve GCC implementation
  - Enhance other dependence tests with CR forms
Further Reading
- Robert van Engelen, Johnnie Birch, Yixin Shou, Burt Walsh, and Kyle Gallivan, “A Unified Framework for Nonlinear Dependence Testing and Symbolic Analysis”, in proceedings of the ACM International Conference on Supercomputing (ICS), 2004, pages 106-115.
- Robert van Engelen, Johnnie Birch, and Kyle Gallivan, “Array Dependence Testing with the Chains of Recurrences Algebra”, in proceedings of the IEEE International Workshop on Innovative Architectures for Future Generation High-Performance Processors and Systems (IWIA), January 2004, pages 70-81.
- Robert van Engelen and Kyle Gallivan, “An Efficient Algorithm for Pointer-to-Array Access Conversion for Compiling and Optimizing DSP Applications”, in proceedings of the 2001 International Workshop on Innovative Architectures for Future Generation High-Performance Processors and Systems (IWIA), January 2001, pages 80-89.
- Robert van Engelen, “Efficient Symbolic Analysis for Optimizing Compilers”, in proceedings of the International Conference on Compiler Construction, ETAPS 2001, LNCS 2027, pages 118-132.
The End