Vector-Parallel Modernization
Tim Prince, PhD (ME)
Intel® Black Belt Software Developer
Rev. Dec. 17, 2015

Introduction

This presentation shows how to optimize some difficult loops with Intel® compilers for Fortran*, C, and C++. The examples come from the classic netlib.org vector benchmark: cases that current Intel compilers do not optimize automatically, but that show good vector or parallel performance after modernization. The excerpts are shown in more than one language to bring out the relationships among the Fortran, C, and C++ versions.

*Other names and brands may be claimed as the property of others.

https://software.intel.com/en-us/intel-compilers

Resolve suspected anti-dependence by OpenMP* 4.0

equivalence(array(64),x(1))
! this example has true anti-dependence: stores 64 elements beyond load
! illustrates a pitfall of equivalence and equivalent C pointer overlaps
#if _OPENMP >= 201307
!$omp simd safelen(32)
! this allows 2 cache lines of read ahead for 32-bit data type
#endif
do i= 1,n-1
  x(i+1)= array(i)+a(i)
enddo


Anti-dependence example in C and C++

#define x ((real *)&cdata_1 + 63)
i2 = *n - 1;
#if _OPENMP >= 201307
#pragma omp simd safelen(32)
#endif
for (int i = 1; i <= i2; ++i)
    cdata_1.array[i] = x[i - 1] + a[i];
// C++ :
// no pragma needed by g++ (with __restrict pointers)
#pragma ivdep
transform(&x[0],&x[i2],&a[1],&cdata_1.array[1],plus<float>());
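The effect of safelen can be checked with a small C sketch. The names DIST, shift_add_ref, and shift_add_simd and the zero-based layout are assumptions made for illustration; they are not from the benchmark source. The store lands DIST = 64 elements ahead of the load, so iterations that are fewer than 64 apart never touch the same element, and capping the vector length at 32 keeps the loop safe.

```c
#include <stddef.h>

enum { DIST = 64 };  /* hypothetical name: the store is DIST elements ahead of the load */

/* Scalar reference. arr must hold n + DIST elements, a must hold n. */
void shift_add_ref(float *arr, const float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        arr[i + DIST] = arr[i] + a[i];
}

/* Same loop, with the vector length capped below the dependence distance. */
void shift_add_simd(float *arr, const float *a, size_t n)
{
#if _OPENMP >= 201307
    #pragma omp simd safelen(32)
#endif
    for (size_t i = 0; i < n; ++i)
        arr[i + DIST] = arr[i] + a[i];
}
```

Both functions should produce identical results, whether or not OpenMP is enabled at compile time.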

False suspected anti-dependence

do i= 1,n-1
  if(a(i) < 0.)then
    ! order as in netlib public source
    ! if(b(i) < 0.) a(i)= a(i)+c(i)*d(i)
    ! b(i+1)= c(i)+d(i)*e(i)
    ! switch order to remove vector dependence
    b(i+1)= c(i)+d(i)*e(i)
    if(b(i) < 0.) a(i)= a(i)+c(i)*d(i)
  endif
enddo

C false anti-dependence avoided

// no pragma needed if pointers are qualified by __restrict
for (int i = 1; i <= i2; ++i)
    if (a[i] < 0.f) {
        b[i + 1] = c[i] + d[i] * e[i];
        if (b[i] < 0.f)
            a[i] += c[i] * d[i];
    }

Optimize circular loop carried dependency

! a(:n)= (b(:n)+cshift(b(:n),-1)+cshift(b(:n),-2))*.333
x= b(n)
y= b(n-1)
!$omp simd
do i= 1,n
  a(i)= (b(i)+x+y)*.333
  y= x
  x= b(i)
enddo

C pragma circular dependency optimization

x = b[*n];
y = b[*n - 1];
i2 = *n;
// OpenMP 4.5 may require change in this usage
#pragma omp simd
for (int i = 1; i <= i2; ++i) {
    a[i] = (b[i] + x + y) * .333f;
    y = x;
    x = b[i];
}
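An alternative that avoids the carried scalars altogether is to peel the two wrap-around iterations and index b directly in the remaining loop. The sketch below is an assumption-laden illustration, not the author's code: the function names and zero-based indexing are hypothetical, and a and b are assumed not to overlap.

```c
#include <stddef.h>

/* Reference: rotating scalars x and y carry b[i-1] and b[i-2] across iterations. */
void smooth3_carried(float *a, const float *b, size_t n)
{
    if (n < 2) return;
    float x = b[n - 1], y = b[n - 2];
    for (size_t i = 0; i < n; ++i) {
        a[i] = (b[i] + x + y) * .333f;
        y = x;
        x = b[i];
    }
}

/* Peeled form: the first two iterations handle the wrap-around,
   so the main loop has no loop-carried scalar at all. */
void smooth3_peeled(float *restrict a, const float *restrict b, size_t n)
{
    if (n < 2) return;
    a[0] = (b[0] + b[n - 1] + b[n - 2]) * .333f;
    a[1] = (b[1] + b[0] + b[n - 1]) * .333f;
    #pragma omp simd
    for (size_t i = 2; i < n; ++i)
        a[i] = (b[i] + b[i - 1] + b[i - 2]) * .333f;
}
```

The peeled loop reads each element of b up to three times, which a vectorizing compiler can usually serve from overlapping loads.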

Partial read after write dependency

! do i= 1,n ! Hidden partial dependency
!   x= a(n-i+1)+b(i)*c(i)
!   a(i)= x-1.0
!   b(i)= x
! enddo
! resolve by separating the dependencies
b(:(n+1)/2)= a(n:n/2+1:-1)+b(:(n+1)/2)*c(:(n+1)/2)
! ifort fuses here at -O3
a(:(n+1)/2)= b(:(n+1)/2)-1.0
! fusion is not valid here due to read after write
b((n+3)/2:n)= a(n/2:1:-1)+b((n+3)/2:n)*c((n+3)/2:n)
a((n+3)/2:n)= b((n+3)/2:n)-1.0

Resolve false assumed WAR by C OpenMP* 4.0 (explicit fusion creates false WAR dependence)

#pragma omp simd
for (int i= 1; i <= (i2+1)/2; ++i) // VS2015 CL prefers separate assignments
    a[i] = (b[i] = a[i2 - i + 1] + b[i] * c[i])- 1.f;
#pragma omp simd
for (int i= (i2+3)/2; i <= i2; ++i)
    a[i] = (b[i] = a[i2 - i + 1] + b[i] * c[i])- 1.f;
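Why the midpoint split is valid can be checked against the original serial loop. The following is a minimal sketch with hypothetical function names and zero-based indexing: in the first half, every read of a comes from the upper half (or from the element being written in the same iteration), and in the second half every read comes from the already completed lower half.

```c
#include <stddef.h>

/* Original order: late iterations read elements of a written by early ones. */
void update_serial(float *a, float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float x = a[n - 1 - i] + b[i] * c[i];
        a[i] = x - 1.0f;
        b[i] = x;
    }
}

/* Split at the midpoint: within each half, no iteration reads
   an element of a that another iteration of that half writes. */
void update_split(float *a, float *b, const float *c, size_t n)
{
    size_t half = (n + 1) / 2;
    #pragma omp simd
    for (size_t i = 0; i < half; ++i) {
        float x = a[n - 1 - i] + b[i] * c[i];
        a[i] = x - 1.0f;
        b[i] = x;
    }
    #pragma omp simd
    for (size_t i = half; i < n; ++i) {
        float x = a[n - 1 - i] + b[i] * c[i];
        a[i] = x - 1.0f;
        b[i] = x;
    }
}
```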


Vectorize by splitting search and compute

! i= 1
! do while (a(i) >= 0.) ! Not vectorized
!   a(i)= a(i)+b(i)*c(i)
!   i= i+1
! enddo
! no more old-fashioned explicit masking
do i= 1,n
  if(a(i) < 0) exit
enddo
a(:i-1)= a(:i-1)+b(:i-1)*c(:i-1)

C vectorized linear search and compute

i2 = *n;
// first i has scope outside for
for (i = 1; i <= i2; ++i)
    if (a[i] < 0.f) break;
i2 = i - 1;
// this one needs * __restrict a or pragma
for (int i = 1; i <= i2; ++i)
    a[i] += b[i] * c[i];
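The split works because each update touches only a[i] itself, so searching the original values finds the same stopping point as the interleaved loop. A self-contained sketch of the two phases follows; the function names and zero-based indexing are assumptions, and the arrays are assumed not to overlap.

```c
#include <stddef.h>

/* Phase 1: count the leading non-negative elements
   (the index of the first negative element, or n if there is none). */
size_t leading_nonneg(const float *a, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        if (a[i] < 0.f) break;
    return i;
}

/* Phase 2: a plain countable loop over the prefix found in phase 1. */
size_t update_prefix(float *restrict a, const float *restrict b,
                     const float *restrict c, size_t n)
{
    size_t len = leading_nonneg(a, n);
    for (size_t i = 0; i < len; ++i)
        a[i] += b[i] * c[i];
    return len;
}
```

Unlike the original do while, this form also stops at n when no negative element exists.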

Overcome “protects against exception” by moving the arithmetic outside the branches and adding a directive

! do i= 1,n
!   if(d(i) < 0)then
!     a(i)= a(i)+b(i)*c(i)
!   else
!     if(d(i).ne.0)then
!       a(i)= a(i)+c(i)*c(i)
!     else
!       a(i)= a(i)+b(i)*b(i)
!     endif
!   endif
! enddo
!dir$ vector aligned
a(:n)= a(:n)+merge(b(:n),c(:n),d(:n)<=0)*merge(c(:n),b(:n),d(:n)/=0)

C avoidance of “protects against exception”

#pragma vector aligned
// using __restrict (or another pragma)
for (int i = 1; i <= i2; ++i)
    a[i] +=(d[i] <= 0.f?b[i]:c[i]) * (d[i]==0.f?b[i]:c[i]);
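The equivalence of the three-way branch and the two selects can be verified case by case: d < 0 gives b*c, d == 0 gives b*b, and d > 0 gives c*c in both forms. A small sketch with hypothetical function names and zero-based indexing, assuming the arrays do not overlap:

```c
#include <stddef.h>

/* Branchy form: the arithmetic sits inside the conditionals. */
void update_branchy(float *a, const float *b, const float *c,
                    const float *d, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (d[i] < 0.f)
            a[i] += b[i] * c[i];
        else if (d[i] != 0.f)
            a[i] += c[i] * c[i];
        else
            a[i] += b[i] * b[i];
    }
}

/* Select form: the conditions only choose operands,
   and a single multiply-add runs unconditionally. */
void update_select(float *restrict a, const float *restrict b,
                   const float *restrict c, const float *restrict d, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float lhs = d[i] <= 0.f ? b[i] : c[i];
        float rhs = d[i] == 0.f ? b[i] : c[i];
        a[i] += lhs * rhs;
    }
}
```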

Linear search not optimized
Intel® Fortran Compiler (ifort) doesn’t resolve 2-level reduction

max= aa(1,1)
xindex= 1
yindex= 1
do j= 1,n
  do i= 1,n
    if(aa(i,j) > max)then
      max= aa(i,j)
      xindex= i
      yindex= j
    endif
  enddo
enddo

https://software.intel.com/en-us/fortran-compilers

Parallel-vector linear search

max_= aa(1,1)
xindex=1
yindex=1
!$omp parallel do private(ml) if(n>103) reduction(max: max_) &
!$omp& lastprivate(xindex,yindex)
do j=1,n
  ml= maxloc(aa(:n,j),dim=1) ! Ifort still needs old_maxminloc
  if(aa(ml,j)>max_ .or. aa(ml,j)==max_ .and. j<yindex)then
    xindex= ml
    yindex= j
    max_=aa(ml,j)
  endif
enddo

C parallel-vector linear search

max__ = aa[aa_dim1 + 1];
xindex = yindex = 1;
i2 = i3 = *n;
#pragma omp parallel for if(i2 > 103) reduction(max: max__) lastprivate(xindex,yindex)
for (int j = 1; j <= i2; ++j) {
    int indxj=0;
    float maxj=max__;
#pragma omp simd reduction(max: maxj) lastprivate(indxj)
    for (int i = 1; i <= i3; ++i)
        if (aa[i + j * aa_dim1] > maxj){
            maxj = aa[i + j * aa_dim1];
            indxj = i;
        }
    if(maxj > max__) { // not dealing with potential ties
        max__= maxj;
        xindex=indxj;
        yindex=j;
    }
}
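One common way to keep a maximum and its indices consistent across threads is to let each thread track a private (value, row, column) triple and merge the triples in a critical section. The sketch below shows that pattern; it is a substitute technique, not the code from the slides. The type maxloc_t, the function name, the zero-based column-major layout, and the tie rule (lowest column, then lowest row) are all assumptions for illustration; the 103 threshold is copied from the slides.

```c
#include <stddef.h>

typedef struct { float val; int row; int col; } maxloc_t;  /* hypothetical helper type */

/* aa is column-major with leading dimension ld; indices are zero-based. */
maxloc_t maxloc2d(const float *aa, int n, int ld)
{
    maxloc_t best = { aa[0], 0, 0 };
    #pragma omp parallel if(n > 103)
    {
        /* Each thread starts from element (0,0) and scans its own columns. */
        maxloc_t local = { aa[0], 0, 0 };
        #pragma omp for schedule(static) nowait
        for (int j = 0; j < n; ++j) {
            const float *col = aa + (size_t)j * (size_t)ld;
            for (int i = 0; i < n; ++i)
                if (col[i] > local.val) {
                    local.val = col[i];
                    local.row = i;
                    local.col = j;
                }
        }
        /* Merge: larger value wins; ties go to the earliest column, then row. */
        #pragma omp critical
        {
            if (local.val > best.val ||
                (local.val == best.val &&
                 (local.col < best.col ||
                  (local.col == best.col && local.row < best.row))))
                best = local;
        }
    }
    return best;
}
```

The critical section runs once per thread, so its cost is negligible next to the scan.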

Parallel vector convolution

#if __INTEL_COMPILER && _OPENMP
!$omp parallel do if(n>103)
do i= 1,m
  a(i)= a(i)+dot_product(b(i:i+m-1),c(m:1:-1))
#else
! single thread version or array reduction (slightly less accurate)
do j= 1,m
  a(1:m)= a(1:m)+b(1+m-j:m+m-j)*c(j)
#endif
enddo

C parallel vector convolution

#pragma omp parallel for if(i3 > 103)
for (int i = 1; i <= i3; ++i) {
    float sum = 0;
#pragma omp simd reduction(+: sum)
    for (int j = 1; j <= i2; ++j)
        sum += b[i + j - 1] * c[i2 - j + 1];
    a[i] += sum;
}

C++ parallel vector convolution

// and here's a C++ version, which shouldn't need Intel® AVX2 to optimize
// reverse the vector which is used repeatedly
// It won't optimize with /Qprotect-parens (investigation requested)
vector<float> Cr(m);
reverse_copy(&c[1],&c[i3]+1,Cr.begin());
#pragma omp parallel for if(i3 > 103)
for (int i = 1; i <= i3; ++i)
    a[i] += inner_product(Cr.begin(),Cr.end(),&b[i],0.f);
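The same reverse-once idea can be written in plain C, so that the inner reduction reads both operands with unit stride. This is a sketch under assumptions: the function name, the caller-provided scratch buffer cr, and zero-based indexing are hypothetical, and b is assumed to hold 2*m-1 elements.

```c
#include <stddef.h>

/* a[i] += sum over j of b[i+j] * c[m-1-j], for 0 <= i < m.
   cr is scratch space for m floats (hypothetical parameter). */
void conv_reversed(float *restrict a, const float *restrict b,
                   const float *restrict c, float *restrict cr, size_t m)
{
    /* Reverse c once; it is reused by every outer iteration. */
    for (size_t j = 0; j < m; ++j)
        cr[j] = c[m - 1 - j];

    #pragma omp parallel for if(m > 103)
    for (size_t i = 0; i < m; ++i) {
        float sum = 0.f;
        #pragma omp simd reduction(+: sum)
        for (size_t j = 0; j < m; ++j)
            sum += b[i + j] * cr[j];
        a[i] += sum;
    }
}
```

The simd reduction may reorder the additions, so results can differ from a strictly sequential sum in the last bits.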

False indexing dependency, no optimization

      k = 1
      do 10 i = 1,n
        do 20 j = 2,n
          bb(i,j) = bb(i,j-1) + array(k) * cc(i,j)
          k = k + 1
  20    continue
        k = k + 1
  10  continue

Optimize by making inner loops independent

!$omp parallel do private(k) if(n>103)
do i= 1,n
  k= i*n+1-n
  do j= 2,n
    bb(i,j)= bb(i,j-1)+array(k)*cc(i,j)
    k= k+1
  enddo
enddo
! version for single core
! do j= 2,n
!   bb(:n,j)= bb(:n,j-1)+array(j-1:n*n+j-1:n)*cc(:n,j)
! enddo
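The key step is replacing the running counter with its closed form: k advances by n per outer iteration, so row i starts at a value that depends only on i. The C sketch below compares the two forms; function names, zero-based indexing, and an n-by-n column-major layout with leading dimension n are assumptions for illustration.

```c
#include <stddef.h>

/* Carried counter: k depends on every earlier iteration of both loops. */
void sweep_carried(float *bb, const float *cc, const float *array, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 1; j < n; ++j) {
            bb[i + j * n] = bb[i + (j - 1) * n] + array[k] * cc[i + j * n];
            ++k;
        }
        ++k;
    }
}

/* Closed form: row i starts at k = i * n, so rows are independent. */
void sweep_independent(float *restrict bb, const float *restrict cc,
                       const float *restrict array, size_t n)
{
    #pragma omp parallel for if(n > 103)
    for (size_t i = 0; i < n; ++i) {
        size_t k = i * n;
        for (size_t j = 1; j < n; ++j, ++k)
            bb[i + j * n] = bb[i + (j - 1) * n] + array[k] * cc[i + j * n];
    }
}
```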

Loop nesting not corrected due to indexing

      do 10 i = 1,n
        k = i*(i-1)/2+i
        do 20 j = i,n
          array(k) = array(k) + bb(i,j)
          k = k + j
  20    continue
  10  continue
! swap loops for inner loop data locality
do j= 1,n
  k= j*(j-1)/2
  array(k+1:k+j)= array(k+1:k+j)+bb(:j,j)
enddo
! That's good enough for single CPU auto-parallel auto-vectorizer
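The swap is valid because the running index has a closed form: element (i,j) of the upper triangle always maps to packed position j*(j-1)/2 + i, and each packed element is updated exactly once. A C sketch of both loop orders follows, with hypothetical names, zero-based indexing (so the position becomes j*(j+1)/2 + i), and a column-major n-by-n matrix.

```c
#include <stddef.h>

/* Row order: strided accesses to both packed and bb in the inner loop. */
void tri_add_rows(float *packed, const float *bb, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        size_t k = i * (i + 1) / 2 + i;
        for (size_t j = i; j < n; ++j) {
            packed[k] += bb[i + j * n];
            k += j + 1;
        }
    }
}

/* Column order: the inner loop is unit stride in both arrays. */
void tri_add_cols(float *restrict packed, const float *restrict bb, size_t n)
{
    for (size_t j = 0; j < n; ++j) {
        size_t base = j * (j + 1) / 2;
        for (size_t i = 0; i <= j; ++i)
            packed[base + i] += bb[i + j * n];
    }
}
```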

Another auto-renesting failure

!     do 30 i = 2,n
!       do 20 j = 2,n
!         aa(i,j) = aa(i,j-1) + cc(i,j)
! 20    continue
!       do 30 j = 2,n
!         bb(i,j) = bb(i-1,j) + cc(i,j)
! 30  continue
do j= 2,n
  do i= 2,n
    aa(i,j)= aa(i,j-1)+cc(i,j)
    bb(i,j)= bb(i-1,j)+cc(i,j)
  enddo
enddo

Explicit parallel C code

#pragma omp parallel if(i2 > 53)
{
#pragma omp for nowait
// setting up to proceed to the next loop when some cores finish here
#pragma novector
    for (int j = 2; j <= i3; ++j)
        for (int i = 2; i <= i2; ++i)
            bb[i + j * bb_dim1] = bb[i - 1 + j*bb_dim1] + cc[i + j * cc_dim1];
#pragma omp for simd
// many-core outer loop vectorized parallel version
    for (int i = 2; i <= i2; ++i)
        for (int j = 2; j <= i3; ++j)
            aa[i + j * aa_dim1] = aa[i + (j - 1) * aa_dim1] + cc[i + j * cc_dim1];
}

Fallacy: compilers always optimize final assignment out of loop

!     do 10 i = 1,n-1
!       a(i) = b(i) + c(i) * d(i)
!       b(i) = c(i) + b(i)
!       a(i+1) = b(i) + a(i+1) * d(i)
! 10  continue
do i= 1,n-1
  a(i)= b(i)+c(i)*d(i)
  b(i)= c(i)+b(i)
enddo
a(n)= b(n-1)+a(n)*d(n-1)
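The peeling is valid because every store to a(i+1) except the last is overwritten by the first statement of the next iteration, so only the final one survives. A C sketch comparing the two forms, with hypothetical function names, zero-based indexing, and non-overlapping arrays assumed:

```c
#include <stddef.h>

/* Original form: the a[i+1] store is dead in all but the last iteration. */
void update_inloop(float *a, float *b, const float *c, const float *d, size_t n)
{
    for (size_t i = 0; i + 1 < n; ++i) {
        a[i] = b[i] + c[i] * d[i];
        b[i] = c[i] + b[i];
        a[i + 1] = b[i] + a[i + 1] * d[i];
    }
}

/* Peeled form: the loop is dependence-free and the surviving
   store is done once, after the loop. */
void update_peeled(float *restrict a, float *restrict b,
                   const float *restrict c, const float *restrict d, size_t n)
{
    if (n < 2) return;
    for (size_t i = 0; i + 1 < n; ++i) {
        a[i] = b[i] + c[i] * d[i];
        b[i] = c[i] + b[i];
    }
    a[n - 1] = b[n - 2] + a[n - 1] * d[n - 2];
}
```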

Repeated update of 1D array in 2D loop

      do 10 i = 1,n
        do 20 j = 2,n
          a(j) = aa(i,j) - a(j-1)
          aa(i,j) = a(j) + bb(i,j)
  20    continue
  10  continue
! store only the final values of a(:):
do j= 2,n
  aa(:n,j)= aa(:n,j)+bb(:n,j)-a(j-1)
  a(j)=aa(n,j)-bb(n,j)
enddo

Non-vectorizable 1D array hoisting

! loop swap should have been obvious (?)
      do 10 i = 2,n
        do 20 j = 1,n
          a(i) = aa(i,j) - a(i-1)
          aa(i,j) = a(i) + bb(i,j)
  20    continue
  10  continue

It’s parallelizable, relatively tedious

!$omp parallel if(n>103)
!$omp do private(tmp)
do j= 1,n-1
  tmp= a(1)
  do i= 2,n
    tmp= aa(i,j)-tmp
    aa(i,j)= tmp+bb(i,j)
  enddo
enddo
!$omp end do nowait
!$omp single
do i= 2,n
  a(i)= aa(i,n)-a(i-1)
  aa(i,n)= a(i)+bb(i,n)
enddo
!$omp end single
!$omp end parallel

Showing the C with the final loop first

#pragma omp parallel if(i2 > 103)
{
#pragma omp single
    for (int i = 2; i <= i2; ++i) {
        a[i] = aa[i + i3 * aa_dim1] - a[i - 1];
        aa[i + i3 * aa_dim1] = a[i] + bb[i + i3 * bb_dim1];
    }
#pragma omp for nowait
    for (int j = 1; j < i3; ++j){
        float tmp= a[1];
        for (int i = 2; i <= i2; ++i) {
            tmp = aa[i + j * aa_dim1] - tmp;
            aa[i + j * aa_dim1] = tmp + bb[i + j * bb_dim1];
        }
    }
}

Save the embarrassing one for last

      do 10 i = 1,n
        a(i) = a(1)
  10  continue
! don't over-write the rhs in a potentially recursive manner
! (even by same value) unless by array assignment;
a(:n)= a(1)

In case you wondered, it’s the same between C and CEAN (C Extensions for Array Notation):

a[1:i2] = a[1]; // is OK even though over-writing a[1]

Conclusions

Auto-vectorization often needs help.
Parallelization often needs more explicit “modernization.”
Use tools to identify where to “modernize”:
Intel® VTune™ Amplifier XE
Intel® Advisor XE

http://software.intel.com/modern-code

https://software.intel.com/en-us/modern-code/tools
