Optimizing compiler. Vectorization

Software & Services Group, Developer Products DivisionCopyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Optimizing compiler.Vectorization.


The trend in the microprocessor development

Sequential instruction execution

Parallel instruction execution

Different kinds of parallel:

Pipeline

Superscalar

Vector operations

Multi-core and multiprocessor tasks

An optimizing compiler is a tool which translates source code into an executable module and optimize source code for better performance. Parallelization is a transformation of the sequential program into multi-threaded or vectorized (or both) to utilize multiple execution units simultaneously without breaking the correctness of the program.


Vectorization is an example of data parallelism (SIMD)

C[1] = A[1]+B[1]C[2] = A[2]+B[2]C[3] = A[3]+B[3]C[4] = A[4]+B[4]

A[1]A[2]

A[3]A[4]

A[1]A[2]A[3]A[4]

Vector operation

B[1]B[2]

B[3]B[4]

B[1]B[2]B[3]B[4]

+

=

C[1] C[2] C[3] C[4]

C[1] C[2] C[3] C[4]

Software & Services Group, Developer Products DivisionCopyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 10/17/10

for(i=0;i<100;i++) p[i]=b[i]

0 p[0]=b[0]

1 p[1]=b[1]

2 p[2]=b[2]

3 p[2]=b[2]

…

97 p[97]=b[97]

98 p[98]=b[98]

99 p[99]=b[99]

100 operations

0 p[0]=b[0]

1 p[1]=b[1]

0 p[98]=b[98]

1 p[99]=b[99]

0 p[2:9]=b[2:9]

…

12 p[90:97]=b[90:97]

16 operations

The approximate scheme of the loop vectorization

Non-aligned starting elements passed

Vector operationsVector register - 8 elements

Tail loop


A typical vector instruction is an operation on two vectors in memory or in fixed length registers. These vectors can be loaded from memory by a single or multiple operations.

Vectorization is a compiler optimization that inserts vector instructions instead of scalar. This optimization "wraps" the data into vector; scalar operations are replaced by operations with these vectors (packets).

Such optimization can be also performed manually by developer.A (1: n: k) - section of the array in Fortran is very convenient for vector register

representation.

for(i=0;i<U,i++) { S1: lhs1[i] = rhs1[i]; … Sn: lhsn[i] = rhsn[i];}

for(i=0;i<U,i+=vl) { S1: lhs1[i:i+vl-1:1] = rhs1[i:i+vl-1:1]; … Sn: lhsn[i:i+vl-1:1] = rhsn[i:i+vl-1:1];}

A(1:10:3)


MMX,SSE vector instruction setsMMX is a single instruction, multiple data (SIMD) instruction set designed by

Intel, introduced in 1996 for P5-based Pentium line of microprocessors, known as "Pentium with MMX Technology”.

MMX (Multimedia Extensions) is a set of instructions perform specific actions for streaming audio/video encoding and decoding.

MMX is: MMM0-MMM7 64 bit registers (were aliased with existing 80 bit FP stack

registers) the concept of packages (each register can store one 64 bit integer or 2 - 32

bit or 4 - 16 bit or 8 - 8-bit) 57 instructions are divided into groups: data movement, arithmetic,

comparison, conversion, logical, shift, shufle, unpack, cache control and prefetch state management.

MMX provides only integer operations. Simultaneous operations with floats and packed integers are not possible.


Streaming SIMD Extensions (SSE) is a SIMD instruction set extension of the x86 architecture, designed by Intel and introduced with Pentium III series processors in 1999. SIMD instructions can greatly increase the performance when exactly the same operations are performed on the multiple data objects. Typical applications are digital signal processing and computer graphics.

SSE is: 8 128-bit registers (xmm0 to xmm7) set of instructions for operations with scalar and packed data types 70 new instructions for single precision floating point data mainlySSE2, SSE3, SSEE3, SSE4 are further extensions of SSE.SSE2 has packed data type with double precision floating point. Advanced Vector Extensions (AVX) is an extension of the x86 instruction set architecture

for Intel and AMD microprocessors proposed by Intel in March 2008. Firstly supported by Intel Westmere processor in Q1 2011 and by AMD Bulldozer processor in Q3 2011.

AVX provides new features, new instructions and a new coding scheme.The size of vector registers is increased from 128 to 256 bits (YMM0-YMM15).The existing 128-bit instructions use the lower half of YMM registers.


Different data types can be packed in vector registers as follows

Packed data type Vectorlength

Bits per element

Data type range

signed bytes 16 8 -2**7 to 2**7-1unsigned bytes 16 8 0 до 2**8-1signed words 8 16 -2**15 to 2**15-1

unsigned words 8 16 0 до 2**16signed doublewords 4 32 -2**31 to 2**31-1

unsigned doublewords 4 32 0 до 2**32-1signed quadwords 2 64 -2**63 to 2*63-1

unsigned quadwords 2 64 0 до 2**64-1single-precision fps 4 32 2**-126 to 2**127

double-precision fps 2 64 2**-1022 to 2**1023

Selecting the appropriate data type for calculations can significantly affect application performance.

Software & Services Group, Developer Products DivisionCopyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 9

Optimization with SwitchesSIMD – SSE, SSE2, SSE3, SSE4.2 Support

16x bytes16x bytes

8x words8x words

4x dwords4x dwords

2x qwords2x qwords

1x dqword1x dqword

4x floats 4x floats

2x doubles2x doubles

MMX*MMX*

SSESSE

SSE4.2SSE3SSE2

SSE4.2SSE3SSE2

* MMX actually used the x87 Floating Point Registers - SSE, SSE2, and SSE3 use the new SSE registers


Instruction groupsData movement instructions :

Instruction Suffix Description

movdqa move double quadword aligned

movdqu move double quadword unaligned

mova [ ps, pd ] move floating-point aligned

movu [ ps, pd ] move floating-point unaligned

movhl [ ps ] move packed floating-point high to low

movlh [ ps ] move packed floating-point low to high

movh [ ps, pd ] move high packed floating-point

movl [ ps, pd ] move low packed floating-point

mov [ d, q, ss, sd ] move scalar data

lddqu load double quadword unaligned

mov<d/sh/sl>dup move and duplicate

pextr [ w ] extract word

pinstr [ w ] insert word

pmovmsk [ b ] move mask

movmsk [ ps, pd ] move mask

An aligned data movement instruction cannot be applied to the memory location which is not aligned by 16 (bytes).

10/17/10


Intel arithmetic instructions :


padd [ b, w, d, q ] packed addition (signed and unsigned)

psub [ b, w, d, q ] packed subtraction (signed and unsigned)

padds [ b, w ] packed addition with saturation (signed)

paddus [ b, w ] packed addition with saturation (unsigned)

psubs [ b, w ] packed subtraction with saturation (signed)

psubus [ b, w ] packed subtraction with saturation (unsigned)

pmins [ w ] packed minimum (signed)

pminu [ b ] packed minimum (unsigned)

pmaxs [ w ] packed maximum (signed)

pmaxu [ b ] packed maximum (unsigned)

10/17/10


Floating-point arithmetic instructions : Instruction Suffix Description

add [ ss, ps, sd, pd ] addition

div [ ss, ps, sd, pd ] division

min [ ss, ps, sd, pd ] minimum

max [ ss, ps, sd, pd ] maximum

mul [ ss, ps, sd, pd ] multiplication

sqrt [ ss, ps, sd, pd ] square root

sub [ ss, ps, sd, pd ] subtraction

rcp [ ss, ps] approximated reciprocal

rsqrt [ ss, ps] approximated reciprocal square root

Idiomatic arithmetic instructions :


pang [ b, w ] packed average with rounding (unsigned)

pmulh/pmulhu/pmull [ w ] packed multiplication

psad [ bw ] packed sum of absolute differences (unsigned)

pmadd [ wd ] packed multiplication and addition (signed)

addsub [ ps, pd ] floating-point addition/subtraction

hadd [ ps, pd ] floating-point horizontal addition

hsub [ ps, pd ] floating-point horizontal subtraction

10/17/10


Logical instructions :


pand bitwise logical AND

pandn bitwise logical AND-NOT

por bitwise logical OR

pxor bitwise logical XOR

and [ ps, pd ] bitwise logical AND

andn [ ps, pd ] bitwise logical AND-NOT

or [ ps, pd ] bitwise logical OR

xor [ ps, pd ] bitwise logical XOR

Comparison instructions :


pcmp<cc> [ b, w, d ] packed compare

cmp<cc> [ ss, ps, sd, pd ] floating-point compare

<cc> defines comparison operation.

lt – less, gt – greater, eq - equal

10/17/10


Conversion instructions :


packss [wb, dw] pack with saturation (signed)

paсkus [wb] pack with saturation (unsigned)

cvt<s2d> conversion

cvtt<s2d> conversion with truncation

Shift instructions :


psll [ w, d, q, dq ] shift left logical (zero in)

psra [w, d] shift right arithmetic (sign in)

psrl [ w, d, q, dq ] shift right logical (zero in)

Shuffle instructions :


pshuf [ w, d ] packed shuffle

pshufh [w] packed shuffle high

pshufl [w] packed shuffle low

ырга [ ps, pd ] shuffle

10/17/10


Unpack instructions :


punpckh [bw, wd, dq, qdq] unpack high

punpckl [bw, wd, dq, qdq] unpack low

unpckh [ps, pd] unpack high

unpckl [ps, pd] unpack low

Cacheability control and prefetch instructions :


movnt [ ps, pd, q, dq ] move aligned non-temporal

prefetch<hint> prefetch with hint

State management instructions

These instructions are commonly used by operating system.

10/17/10

Software & Services Group, Developer Products DivisionCopyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 16

Three Sets of Switches to Enable Processor-specific Extensions

1. Switches –x<EXT> like –xSSE4_1– Imply an Intel processor check– Run-time error message at program start when launched on processor without

<EXT>2. Switches –m<EXT> like –mSSE3

– No processor check – Illegal instruction fault when launched on processor without <EXT>

3. Switches –ax<EXT> like –axSSE4_2– Automatic processor dispatch – multiple code paths– Processor check only available on Intel processors– Non-Intel processors take default path– Default path is –mSSE2 but can be modified again by another –x<EXT> switch


Simple estimation of vectorization profitabilityA typical vector instruction is an operation on two vectors in memory or in registers of

fixed length. These vectors can be loaded from memory in a single operation or in part.

Description of the foundations of SSE technology can be found in CHAPTER 10 PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) of the document «Intel 64 and IA-32 Intel Architecture Software Developer's Manual» (Volume 1)

Microsoft Visual Studio supports a set of SSE intrinsics that allows you to use SSE instructions directly from C/C++ code. You need to include xmmintrin.h, which defines the vector type __m128 and vector operations.

For example, we need to vectorize manually the following loop: for(i=0;i<N;i++) C[i]=A[i]*B[i];To do this: 1.) organize / fill the vector variables 2.) use multiply intrinsic for the vector variables 3.) write the results of calculations back to memory


#include <stdio.h>#include <xmmintrin.h>#define N 40int main() {float a[N][N][N],b[N][N][N],c[N][N][N];int i,j,k,rep; __m128 *xa,*xb,*xc;for(i=0;i<N;i++) for(j=0;j<N;j++) for(k=0;k<N;k++) { a[i][j][k]=1.0; b[i][j][k]=2.0; }for(rep=0;rep<10000;rep++) {#ifdef PERFfor(i=0;i<N;i++) for(j=0;j<N;j++) for(k=0;k<N;k+=4) { xa=(__m128*)&(a[i][j][k]); xb=(__m128*)&(b[i][j][k]);

xc=(__m128*)&(c[i][j][k]); *xc=_mm_mul_ps(*xa,*xb); }#elsefor(i=0;i<N;i++) for(j=0;j<N;j++) for(k=0;k<N;k++) c[i][j][k]=a[i][j][k]*b[i][j][k];#endif}printf("%f\n ",c[21][11][18]);}

An example illustrating the vectorization with SSE intrinsics

icl -Od test.c -Fetest1.exeicl -Od test.c -DPERF -Fetest_opt.exetime test1.exe2.000000 CPU time for command: 'test1.exe'real 3.406 secuser 3.391 secsystem 0.000 sectime test_opt.exe2.000000 CPU time for command: 'test_opt.exe'real 1.281 secuser 1.250 secsystem 0.000 sec

Intel compiler 12.0 was used for this experiment


The resulting speedup is 2.7x.We used aligned by 16 memory access instructions in this example.

Alignment was matched accidently, in the real case, you need to worry about it.

Test-optimized compiler shows the following result: icl test.c-Qvec_report3-Fetest_intel_opt.exe time test_intel_opt.exe 2.000000CPU time for command: 'test_intel_opt.exe' real 0.328 sec user 0.313 sec system 0.000 sec


Admissibility of vectorizationVectorization is a permutation optimization. Initial execution order is changed during

vectorization.

Permutation optimization is acceptable, if it preserves the order of dependencies. Thus we have a criterion for the admissibility of vectorization in terms of dependencies.

The simplest case when there are no dependencies inside the processed loop.

In more complicated case there are dependences inside the vectorized loop but its order is the same as inside the initial scalar loop.

Let’s recall the description of the dependency in a loop:

There is a loop dependency between the statements S1 and S2 in the set of nested loops, if and only if 1) there are two loop nest iteration vectors i and j such that i <j or i = j and there is a path from S1 to S2 inside the loop 2) statement S1 for iteration i and statement S2 for iteration j refer to the same memory area. 3) One of these statements writes to this memory.


Option for vectorization control /Qvec-report[n]

control amount of vectorizer diagnostic information

n=0 no diagnostic information

n=1 indicate vectorized loops (DEFAULT)

n=2 indicate vectorized/non-vectorized loops

n=3 indicate vectorized/non-vectorized loops and prohibiting

data dependence information

n=4 indicate non-vectorized loops

n=5 indicate non-vectorized loops and prohibiting data

dependence information

Usage: icl -c -Qvec_report3 loop.c

Diagnostic examples:

C:\loops\loop1.c(5) (col. 1): remark: LOOP WAS VECTORIZED.

C:\loops\loop3.c(5) (col. 1): remark: loop was not vectorized: vectorization possible but seems inefficient.

C:\loops\loop6.c(5) (col. 1): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

10/17/10


Simple criteria of vectorization admissibilityLet’s write vectorization of loop with usage of fortran array sections.A good criterion for vectorization is the fact that the introduction section of the array

does not create dependency.DO I=1,N A(I)=A(I)END DO

DO I=1,N/VL A(I:I+VL)=A(I:I+VL)END DO

DO I=1,N A(I+1)=A(I)END DO

DO I=1,N/VL A(I+1:I+1+VL)=A(I:I+VL)END DO

There is dependency because A(I+1:I+1+VL) on iteration I and A(I+VL:I+2*VL) for I+1 are intersected.

DO I=1,N A(I-1)=A(I)END DO

DO I=1,N/VL A(I-1:I-1+VL)=A(I:I+VL)END DO

There is no dependency because A(I-1:I-1+VL) on iteration I and A(I+VL:I+2*VL) for I+1 aren’t intersected.


Let’s check an assumption: Loop can be vectorized, if the dependence distance

greater or equal to number of array elements within the vector register.

Check this with compiler: ifort test.F90 -o a.out –vec_report3 echo ------------------------------------- ifort test.F90 -DPERF -o b.out –vec_report3 ./build.shtest.F90(11): (col. 1) remark: loop was not

vectorized: existence of vector dependence.-------------------------------------test.F90(11): (col. 1) remark: LOOP WAS

VECTORIZED.

PROGRAM TEST_VECINTEGER,PARAMETER :: N=1000#ifdef PERFINTEGER,PARAMETER :: P=4#elseINTEGER,PARAMETER :: P=3#endifINTEGER A(N) DO I=1,N-P A(I+P)=A(I)END DO

PRINT *,A(50)END


Dependency analysis and directives

There are two tasks which compiler should perform for dependency evaluation:1. Alias analysis (pointers which can address the same memory

should be detected)2. Definition-use chains analysisCompiler should prove that there are not aliased objects and precisely calculate the dependencies. It is hard task and sometimes compiler isn’t able to solve it.There are methods of providing additional information to the compiler:- Option –ansi_alias (the pointers can refer only to the objects of the same or compatible type).- restrict attributes for pointer arguments (C/C++).- #pragma ivdep says that there are not dependencies in the following loop. (C/C++)- !DEC$ IVDEP Fortran analogue of #pragma ivdep


Some performance issues for the vectorized code

INTEGER :: A(1000),B(1000)INTEGER I,KINTEGER, PARAMETER :: REP = 500000

A = 2DO K=1,REP CALL ADD(A,B)END DOPRINT *,SHIFT,B(101)

CONTAINS SUBROUTINE ADD(A,B) INTEGER A(1000),B(1000) INTEGER I!DEC$ UNROLL(0) DO I=1,1000-SHIFT B(I) = A(I+SHIFT)+1 END DO END SUBROUTINEEND

Let’s consider some simple test with a assignment which is appropriate for vectorization. Let us obtain vectorized code with usage of Intel Fortran compiler for different values of SHIFT macro. /fpp – option for preprocessorIntel compiler makes vectorization if level of optimization is 2 or 3. (-O2 or –O3)Option –Ob0 is used to forbid inlining.


Experiment results ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 -Fea.exe -Qvec_report >a.out 2>&1 ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 -Feb.exe -Qvec_report >b.out 2>&1

time.exe a.exe 0 3 CPU time for command: 'a.exe' real 0.125 sec user 0.094 sec system 0.000 sec

time.exe b.exe 1 3 CPU time for command: 'b.exe' real 0.297 sec user 0.281 sec system 0.000 sec


ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 /Fas -Ob0 -S –Fafast.s

.B2.5: ; Preds .B2.5 .B2.4$LN83:;;; B(I) = A(I+SHIFT)+1 movdqa xmm1, XMMWORD PTR [eax+ecx*4] ;17.11$LN84: paddd xmm1, xmm0 ;17.4$LN85: movdqa XMMWORD PTR [edx+ecx*4], xmm1 ;17.4$LN86: add ecx, 4 ;16.3$LN87: cmp ecx, 1000 ;16.3$LN88: jb .B2.5 ; Prob 99% ;16.3

fast.s


ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 /Fas -Ob0 -S –Faslow.s

.B2.5: ; Preds .B2.5 .B2.4$LN81:;;; B(I) = A(I+SHIFT)+1 movdqu xmm1, XMMWORD PTR [4+eax+ecx*4] ;17.11$LN82: paddd xmm1, xmm0 ;17.4$LN83: movdqa XMMWORD PTR [edx+ecx*4], xmm1 ;17.4$LN84: add ecx, 4 ;16.3$LN85: cmp ecx, 996 ;16.3$LN86: jb .B2.5 ; Prob 99% ;16.3

slow.s

CONCLUSION:

MOVDQA—Move Aligned Double QuadwordMOVDQU—Move Unaligned Double Quadword

In fast version aligned instructions are used and vector registers are filled faster.Unaligned instructions are slower. For latest architectures they shows the same performance as aligned instructions if applied to the aligned data.


Performance of vectorized loop depends on the memory location of the objects used. The important aspect of program performance is the memory alignment of the data.

Data Structure Alignment is computer memory data placement. This concept includes two distinct but related issues: alignment of the data (Data alignment) and data structure filling (Data structure padding).

Data alignment specifies how certain data is located relative to the boundaries of memory. This property is usually associated with a data type.

Filling data structures involves insertion of unnamed fields into the data structure in order to preserve the relative alignment of structure fields.


Data alignmentInformation about the alignment can be obtained with intrinsic __alignof__. The size and the default alignment of the variable of a type may depend on the compiler. (ia32 or intel64) printf("int: sizeof=%d align=%d\n",sizeof(a),__alignof__(a));Alignment for ia32 Intel C++ compiler:

bool sizeof = 1 alignof = 1wchar_t sizeof = 2 alignof = 2short int sizeof = 2 alignof = 2int sizeof = 4 alignof = 4long int sizeof = 4 alignof = 4long long int sizeof = 8 alignof = 8float sizeof = 4 alignof = 4double sizeof = 8 alignof = 8long double sizeof = 8 alignof = 8void* sizeof = 4 alignof = 4

The same rules are used for array alignment. There is the possibility to force the compiler to align object in a certain way: __declspec(align(16)) float x[N];


Data Structure Alignment

struct foo { bool a; short b; long long c; bool d; };

struct foo { bool a; char pad1[1]; short b; char pad2[4]; long long c; bool d; char pad3[7]; };

The order of fields in the structure affects the size of the object of a derived type. To reduce the size of the object structure fields should be sorted by descending of its size. You can use __declspec to align structure fields.

typedef struct aStuct{ __declspec(align(16)) float x[N]; __declspec(align(16)) float y[N]; __declspec(align(16)) float z[N];};


for(i=0;i<100;i++) p[i]=b[i]

0 p[0]=b[0]

1 p[1]=b[1]

2 p[2]=b[2]

3 p[2]=b[2]

…

97 p[97]=b[97]

98 p[98]=b[98]

99 p[99]=b[99]

0 p[0]=b[0]

1 p[1]=b[1]

0 p[98]=b[98]

1 p[99]=b[99]

0 p[2:9]=b[2:9]

…

12 p[90:97]=b[90:97]

The approximate scheme of the loop vectorization

Non-aligned starting elements passed

Vector operationsVector register - 8 elements

Tail loop

Loop vectorization usually produces three loops: loop for non-aligned staring elements, vectorized loop and tail. Vectorization of loop with small number of iterations can be unprofitable.


Additional vectorization example

10/17/10

void Calculate(float * a,float * b, float * c , int n) {int i; for(i=0;i<n;i++) { a[i] = a[i]+b[i]+c[i]; } return;}

#include <stdio.h>#define N 1000extern void Calculate(float *,float *, float *,int);int main() {float x[N],y[N],z[N];int i,rep;for(i=0;i<N;i++) { x[i] = 1;y[i] = 0; z[i] = 1;}for(rep=0;rep<10000000;rep++) { Calculate(&x[1],&y[0],&z[0],N-1);}printf("x[1]=%f\n",x[1]);}

Vector.c Main.c

First argument alignment differs

icl main.c vec.c -O1 –FeA

time a.exe 12.6 s.


Compiler makes auto vectorization for –O2 or –O3. Option -Qvec_report informs about vectorized loops. icl main.c vec.c –O2 –Qvec_report –Febvec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.time b.exe 3.67 s.Vectorization is possible because the compiler inserts run-time check for vectorizing when some of the pointers may be not aliased. The application size is enlarged.

1)

void Calculate(float * resrtict a,float * restrict b, float * restrict c , int n) {2)

To restrict align attribute we need to add option –Qstd=c99 icl main.c vec.c –Qstd=c99 –O2 –Qvec_report –Fecvec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.time c.exe 3.55 s.Small improvement because of avoiding run-time check

Useful fact: For modern calculation systems performance of aligned and unaligned instructions almost the same when applied to aligned objects.


int main() { __declspec(align(16)) float x[N]; __declspec(align(16)) float y[N]; __declspec(align(16)) float z[N];

void Calculate(float * resrtict a,float * restrict b, float * restrict c , int n) {Int n;__assume_aligned(a,16);__assume_aligned(b,16);__assume_aligned(c,16);

3)

icl main.c vec.c –Qstd=c99 –O2 –Qvec_report –Fedvec.c(3): (col. 3) remark: LOOP WAS VECTORIZED.time d.exe 3.20 s.This update demonstrates improvement because of the better alignment of vectorized objects. Arrays in main are aligned to 16. With this update all argument pointers are well aligned and the compiler is informed by __assume_aligned directive. It allows to remove the first scalar loop.

Calculate(&x[0],&y[0],&z[0],N-1);


Data alignment Good array data alignment for SSE: 16B for AVX: 32B

Data alignment directives:

C/C++ Windows : __declspec(align(16)) float X[N]; Linux/MacOS : float X[N] __attribute__ ((aligned (16));Fortran !DIR$ ATTRIBUTES ALIGN: 16:: A

Aligned malloc_aligned_malloc()_mm_malloc()

Data alignment assertion (16B example)

C/C++: __assume_aligned(p,16);

Fortran: !DIR$ ASSUME_ALIGNED A(1):16

Aligned loop assertion

C/C++: #pragma vector aligned

Fortran: !DIR$ VECTOR ALIGNED

10/17/10


Non-unit stride and unit stride access

10/17/10

Well aligned data is better for vectorization because in this case vector register is filled by the single operation. In case with non-unit stride access to array, register filling is more complicated task and vectorization is less profitable. Auto vectorizer cooperates with loop optimizations for improving access to objects. There are compiler directives which recommend to make vectorization in case when compiler doesn’t make it because it looks unprofitable.

C/C++#pragma vector{aligned|unaligned|always} #pragma novector Fortran!DEC$ VECTOR ALWAYS!DEC$ NOVECTOR


Vectorization of outer loopUsually auto vectorizer processes the nested loop. Vectorization of the outer

loop can be done using “simd” directive.

#define N 200#include <stdio.h>

int main() {int A[N][N],B[N][N],C[N][N];int i,j,rep;

for(i=0;i<N;i++) for(j=0;j<N;j++) { A[i][j]=i+j; B[i][j]=2*j-i; C[i][j]=0; }

for(rep=0;rep<10000000;rep++) { #pragma simd for(i=0;i<N;i++) { j=0; while(A[i][j]<=B[i][j] && j<N) { C[i][j]=C[i][j]+B[j][i]-A[j][i]; j++; } }}

printf("%d\n",C[0][2]);}

icl vec.c -O3 -Qvec- -Fea (Qvec- disable vectorization) 20.7 sicl vec.c -O3 -Qvec_report -Feb 17.svec.c(17): (col. 3) remark: SIMD LOOP WAS VECTORIZED.


Thank you.

Documents

Optimizing compiler. Vectorization