Developing Code for Cell - SIMD

IBM Systems & Technology Group

Cell Programming Workshop 04/19/23 © 2007 IBM Corporation1

Developing Code for Cell - SIMD

Cell Programming Workshop

IBM Systems & Technology Group –

© 2007 IBM CorporationCell Programming Workshop 04/19/232

Class Agenda

Vector Programming (SIMD)

– Data types for vector programming

– Application partitioning

SPU Intrinsics

Trademarks - Cell Broadband Engine ™ is a trademark of Sony Computer Entertainment, Inc.



PPE vs. SPE Comparison



PPE vs SPE

Both PPE and SPE execute SIMD instructions

– PPE processes SIMD operations in the VXU within its PPU

– SPEs process SIMD operations in their SPU

Both processors execute different instruction sets

Programs written for the PPE and SPEs must be compiled by different compilers



PPE and SPE Language-Extension Differences Both operate on 128-bit SIMD vectors

Only the Vector/SIMD Multimedia Extension instruction set supports pixel vectors

Only the SPU instruction set supports doubleword vectors



Register Layout of Data Types and Preferred Slot

When instructions use or produce scalar operands or addresses, the values are in the preferred scalar slot

The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot



SIMD Architecture

SIMD = “single-instruction multiple-data”

SIMD exploits data-level parallelism

– a single instruction can apply the same operation to multiple data elements in parallel

SIMD units employ “vector registers”

– each register holds multiple data elements

SIMD is pervasive in the BE

– PPE includes VMX (SIMD extensions to PPC architecture)

– SPE is a native SIMD architecture (VMX-like)

SIMD in VMX and SPE

– 128bit-wide datapath

– 128bit-wide registers

– 4-wide fullwords, 8-wide halfwords, 16-wide bytes

– SPE includes support for 2-wide doublewords



SIMD (Single Instruction Multiple Data) processing

Scalar code

read the next instruction and decode it

get this number

get that number

add them

put the result here


get this number

get that number

add them

put the result here


get this number

get that number

add them

put the result here.


get this number

get that number

add them

put the result there

Vector code

read instruction and decode it

get these 4 numbers

get those 4 numbers

add them

put the results here

• Data-level parallelism / operate on multiple data elements simultaneously

• (Significantly) smaller amount of code => improved execution efficiency• Number of elements processed in parallel = (size of vector / size of element)

Result 1

+

=

Instr. Decode

Number 1

Number 2Repeat4 times

Do Once

Result 3

+

=

Instr. Decode

Number 1c

Number 2c

Result 4

+

=

Number 1d

Number 2d

Result 1

+

=

Number 1a

Number 2a

Result 2

+

=

Number 1b

Number 2bMultiple Data

Single Instruction



In both the PPE and SPEs, vector registers hold multiple data elements as a single vector. The data paths and registers supporting SIMD operations are 128 bits wide, corresponding to four full 32-bit words.

This means that four 32-bit words can be loaded into a single register, and, for example, added to four other words in a different register in a single operation.

Similar operations can be performed on vector operands containing 16 bytes, 8 halfwords, or 2 doublewords.

I - Example of a SIMD operation



SIMD “Cross-Element” Instructions

VMX and SPE architectures include “cross-element” instructions

– shifts and rotates

– permutes / shuffles

Permute / Shuffle

– selects bytes from two source registers and places selected bytes in a target register

– byte selection and placement controlled by a “control vector” in a third source register

extremely useful for reorganizing data in the vector register file



PowerPC Processor Element

The bytes selected for the shuffle from the source registers, VA and VB, are based on byte entries in the control vector, VC, in which a 0 specifies VA and a 1 specifies VB. The result of the shuffle is placed in register VT. The shuffle operation is extremely powerful and finds its way into many applications in which data reordering, selection, or merging is required .

byte-shuffle (permute) operation

II - Example of a SIMD operation



PowerPC Processor Element

A vector is an instruction operand containing a set of data elements packed into a one-dimensional array

The elements can be fixed-point or floating-point values

SIMD processing exploits data-level parallelism:

Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel (SIMD)

Support for SIMD

operations:

SIMD Vectorization

By the vector/SIMD multimedia

extension instructions

By the SPU instruction setIn the SPEs

In the PPE



Basic SIMD Operations – A PPU program

/*

* declares vector variables which points to scalar arrays

*/

__vector signed int *va = (__vector signed int *) a;

__vector signed int *vb = (__vector signed int *) b;

__vector signed int *vc = (__vector signed int *) c;

/*

* adds four signed integers at once

*/

*vc = vec_add(*va, *vb); // 1 + 2, 3 + 4, 5 + 6, 7 + 8

/*

* output results

*/

printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);

return 0;

}

#include <stdio.h>

#include <altivec.h> --- not required in IBM SDK for XLC

/*

* declares input/output scalar variables

*/

int a[4] __attribute__((aligned(16))) = { 1, 3, 5, 7 };

int b[4] __attribute__((aligned(16))) = { 2, 4, 6, 8 };

int c[4] __attribute__((aligned(16)));

int main(int argc, char **argv)

{

Use altivec header file

Use altivec vector intrinsic instruction

Source: PS3 Programming



Basic SIMD Operations – An SPU program

/*

* declares vector variables which points to scalar arrays

*/

__vector signed int *va = (__vector signed int *) a;

__vector signed int *vb = (__vector signed int *) b;

__vector signed int *vc = (__vector signed int *) c;

/*

* adds four signed intergers at once

*/

*vc = spu_add(*va, *vb); // 1 + 2, 3 + 4, 5 + 6, 7 + 8

// use of spu_add instead of vec_add

/*

* output results

*/

printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);

return 0;

}

#include <stdio.h>

#include <spu_intrinsics.h> //the SPU intrinsics functions header file

/*

* declares input/output scalar variables

*/

int a[4] __attribute__((aligned(16))) = { 1, 3, 5, 7 };

int b[4] __attribute__((aligned(16))) = { 2, 4, 6, 8 };

int c[4] __attribute__((aligned(16)));

int main(int argc, char **argv)

{

Use spu_intrinsics header file

Use spu intrinsics function

Source: PS3 Programming



Generic SPU Intrinsics

Generic

– Constant formation (spu_splats)

– Conversion (spu_convtf, spu_convts, ..)

– Arithmetic (spu_add, spu_madd, spu_nmadd, ...)

– Byte operations (spu_absd, spu_avg,...)

– Compare and branch (spu_cmpeq, spu_cmpgt,...)

– Bits and masks (spu_shuffle, spu_sel,...)

– Logical (spu_and, spu_or, ...)

– Shift and rotate (spu_rlqwbyte, spu_rlqw,...)

– Control (spu_stop, spu_ienable, spu_idisable, ...)

– Channel Control (spu_readch, spu_writech,...)

– Scalar (spu_insert, spu_extract, spu_promote)



Vectorizing a Loop – A Simple Example

Loop does term-by-term multiply of two arrays

– arrays assumed here to remain scalar outside of subroutine If array size is not a multiple of 4, extra work is necessary

/* scalar version */

int mult1(float *in1, float *in2, float *out, int N){ int i;

for (i = 0, i < N, i++) {

out[i] = in1[i] * in2[i]; }

return 0;

/* vectorized version */

int vmult1(float *in1, float *in2, float *out, int N){ /* assumes that arrays are quadword aligned */ /* assumes that N is divisible by 4 */

int i, Nv; Nv = N >> 2; /* N/4 vectors */ vector float *vin1 = (vector float *) (in1); vector float *vin2 = (vector float *) (in2); vector float *vout = (vector float *) (out);

for (i = 0, i < Nv, i++) {

vout[i] = spu_mul(vin1[i],vin2[i]); }

return 0;



Another SIMD Example – data incrementing

void load_data(int *dest)

{

int i;

for (i=0; i<4096; ++i)

{

++dest[i];

}

}

void load_data_SIMD(int *dest)

{

int i;

vector unsigned int *vdest;

vector unsigned int v1 = (vector unsigned int) {1, 1, 1, 1};

vdest = (vector unsigned int *) dest;

for (i=0; i<1024; ++i)

{

vdest[i] = spu_add(vdest[i], v1);

}

}



Complex Multiplication SPE - Shuffle Vectors

a1 b1 a2 b2A1 a3 b3 a4 b4A2

c1 d1 c2 d2B1 c3 d3 c4 d4B2

0-3 8-11 16-19 24-27I P

4-7 12-15 20-23 28-31Q PInput Shuffle patterns

I1 = spu_shuffle(A1, A2, I_Perm_Vector); a1 b1 a2 b2A1 a3 b3 a4 b4A2

0-3 8-11 16-19 24-27I P

a1 a2 a3 a4I1

I2 = spu_shuffle(B1, B2, I_Perm_Vector); c1 c2 c3 c4I2

Q1 = spu_shuffle(A1, A2, Q_Perm_Vector); a1 b1 a2 b2A1 a3 b3 a4 b4A2

b1 b2 b3 b4Q1

d1 d2 d3 d4Q2

4-7 12-15 20-23 28-31Q P

Q2 = spu_shuffle(B1, B2, Q_Perm_Vector);

Z1 Z2

By analogy

By analogy

Z3 Z4

W1 W2 W3 W4



Complex Multiplication

A1 = spu_nmsub(Q1, Q2, v_zero);

A2 = spu_madd(Q1, I2, v_zero);

Q1 = spu_madd(I1, Q2, A2);

I1 = spu_madd(I1, I2, A1);

b1 b2 b3 b4Q1

d1 d2 d3 d4Q2

0 0 0 0Z

*

-

-(b1*d1)A1 -(b2*d2) -(b3*d3) -(b4*d4)

b1 b2 b3 b4Q1

*

+

b1*c1A2 b2*c2 b3*c3 b4*c4

c1 c2 c3 c4I2

0 0 0 0Z

*

+a1*d1+b1*c1Q1

a1 a2 a3 a4I1

d1 d2 d3 d4Q2

b1*c1A2 b2*c2 b3*c3 b4*c4

a2*d2+b2*c2

a3*d3+b3*c3

a4*d4+b4*c4

*

+a1*c1-b1*d1I1

a1 a2 a3 a4I1

a2*c2-b2*d2

a3*c3-b3*d3

a4*c4-b4*d4

c1 c2 c3 c4I2

-(b1*d1)A1 -(b2*d2) -(b3*d3) -(b4*d4)



Complex multiplication – SPE - Summary vector float A1, A2, B1, B2, I1, I2, Q1, Q2, D1, D2;

/* in-phase (real), quadrature (imag), temp, and output vectors*/

vector float v_zero = (vector float)(0,0,0,0);

vector unsigned char I_Perm_Vector = (vector unsigned char)(0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27);

vector unsigned char Q_Perm_Vector = (vector unsigned char)(4,5,6,7,12,13,14,15,20,21,22,23,28,29,30,31);

vector unsigned char vcvmrgh = (vector unsigned char) (0,1,2,3,16,17,18,19,4,5,6,7,20,21,22,23);

vector unsigned char vcvmrgl = (vector unsigned char) (8,9,10,11,24,25,26,27,12,13,14,15,28,29,30,31);

/* input vectors are in interleaved form in A1,A2 and B1,B2 with each input vector

representing 2 complex numbers and thus this loop would repeat for N/4 iterations

*/

I1 = spu_shuffle(A1, A2, I_Perm_Vector); /* pulls out 1st and 3rd 4-byte element from vectors A1 and A2 */

I2 = spu_shuffle(B1, B2, I_Perm_Vector); /* pulls out 1st and 3rd 4-byte element from vectors B1 and B2 */

Q1 = spu_shuffle(A1, A2, Q_Perm_Vector); /* pulls out 2nd and 4th 4-byte element from vectors A1 and A2 */

Q2 = spu_shuffle(B1, B2, Q_Perm_Vector); /* pulls out 3rd and 4th 4-byte element from vectors B1 and B2 */

A1 = spu_nmsub(Q1, Q2, v_zero); /* calculates –(bd – 0) for all four elements */

A2 = spu_madd(Q1, I2, v_zero); /* calculates (bc + 0) for all four elements */

Q1 = spu_madd(I1, Q2, A2); /* calculates ad + bc for all four elements */

I1 = spu_madd(I1, I2, A1); /* calculates ac – bd for all four elements */

D1 = spu_shuffle(I1, Q1, vcvmrgh); /* spreads the results back into interleaved format */

D2 = spu_shuffle(I1, Q1, vcvmrgl); /* spreads the results back into interleaved format */

IBM Systems & Technology Group

Cell Programming Workshop 04/19/23 © 2007 IBM Corporation21

Hands-on SIMD

Cell Programming WorkshopCell/Quasar Ecosystem & Solutions Enablement



Class Agenda

Example #1 - Develop and run a simple PPU scalar program

Example #2 – Converting to an SPU program with data alignment

Example #3 – Static profiling with spu_timing

Learn how to SIMDize scalar code on CellBE

Coding examples

/opt/cell_class/Hands-on-30/SIMD



Coding Strategy

Develop 2 functions

– mult1.c to multiply two arrays of SP floats together. This program will be used to characterize the SIMD features

– main.c is a non-vector driver program



example1 - SIMD#include <stdio.h>

#define N 16

int mult1(float *in1, float *in2, float *out, int num);

float a[N] = { 1.1, 2.2, 4.4, 5.5,

6.6, 7.7, 8.8, 9.9,

2.2, 3.3, 3.3, 2.2,

5.5, 6.6, 6.6, 5.5};

float b[N] = { 1.1, 2.2, 4.4, 5.5,

5.5, 6.6, 6.6, 5.5,

2.2, 3.3, 3.3, 2.2,

6.6, 7.7, 8.8, 9.9};

float c[N];

int main()

{

int num = N;

int i;

mult1(a, b, c, num);

for (i=0;i<N;i+=4)

printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]);

return 0;

}

#include <stdio.h>

int mult1(float *in1, float *in2, float *out, int N)

{

int i;

for (i=0;i<N;i++)

{

out[i] = in1[i] * in2[i];

}

return 0;

}

mult1.c

main.c



Example1.b - SIMD#include <stdio.h>

#define N 16


float a[N] __attribute__ ((aligned(16)))

= { 1.1, 2.2, 4.4, 5.5,

6.6, 7.7, 8.8, 9.9,

2.2, 3.3, 3.3, 2.2,

5.5, 6.6, 6.6, 5.5};

float b[N] __attribute__ ((aligned(16)))

= { 1.1, 2.2, 4.4, 5.5,

5.5, 6.6, 6.6, 5.5,

2.2, 3.3, 3.3, 2.2,

6.6, 7.7, 8.8, 9.9};

float c[N] __attribute__ ((aligned(16)));

int main()

{

int num = N;

int i;


for (i=0;i<N;i+=4)


return 0;

}

#include <stdio.h>


{

int i;

for (i=0;i<N;i++)

{

out[i] = in1[i] * in2[i];

}

return 0;

}

mult1.c

main.c



#include <stdio.h>

#define N 16



= { 1.1, 2.2, 4.4, 5.5,

6.6, 7.7, 8.8, 9.9,

2.2, 3.3, 3.3, 2.2,

5.5, 6.6, 6.6, 5.5};


= { 1.1, 2.2, 4.4, 5.5,

5.5, 6.6, 6.6, 5.5,

2.2, 3.3, 3.3, 2.2,

6.6, 7.7, 8.8, 9.9};


int main()

{

int num = N;

int i;


for (i=0;i<N;i+=4)


return 0;

}

#include <stdio.h>

#include <spu_intrinsics.h>


{

int i;

vector float *a = (vector float *) in1;

vector float *b = (vector float *) in2;

vector float *c = (vector float *) out;

int Nv = N/4;

// NV=4

for (i=0;i<Nv;i++)

{

c[i] = spu_mul(a[i], b[i]);

}

return 0;

}

mult1.c

main.c

Example #2 – Converting to an SPU !



Example #3 – Static profiling with spu_timing



What is the SPU timing tool

Annotates an assembly source file with static analysis of instruction timing assuming linear (branchless) execution.

Annotated

assembly source

(example.s.timing)

C source code

(example.c)

Assembly source

(example.s)

spu-gcc –S example.cor

spuxlc –S example.c

spu_timing example.s



Sample - assembly source (example.s)

.file "example.c“.text

.align 3

.global saxpy

.type saxpy, @functionsaxpy:

ila $2,66051shlqbyi $7,$3,0cgti $3,$3,0shufb $8,$4,$4,$2nop $127biz $3,$lrori $4,$7,0hbra .L8,.L4il $7,0lnop

.L4:ai $4,$4,-1lqx $2,$7,$5lqx $3,$7,$6fma $2,$8,$2,$3stqx $2,$7,$6ai $7,$7,16

.L8:brnz $4,.L4bi $lr



Example .timing

.file "mult1.c" .text .align 3 .global mult1 .type mult1, @function mult1:0D 01 cgti $2,$6,-11D 0123 shlqbyi $10,$3,00D 12 ai $3,$6,31D 1234 shlqbyi $9,$4,00 -34 selb $3,$3,$6,$20 -5678 rotmai $8,$3,-20d ---90 cgti $2,$8,01d --1234 brnz $2,.L8 .L2:0D 23 il $3,01D 2345 bi $lr .L8:0D 34 il $7,01D 345678 hbra .L9,.L40D 45 il $6,01D 4 lnop .L4:0d 56 ai $7,$7,11d -678901 lqx $2,$10,$61 789012 lqx $3,$9,$60 89 ceq $4,$8,$70d ----345678 fm $2,$2,$31d ------901234 stqx $2,$5,$60D 01 ai $6,$6,16 .L9:1D 0123 brz $4,.L40D 1 nop $1271D 1234 br .L2 .size mult1, .-mult1 .ident "GCC: (GNU) 4.0.2 (CELL 3-2, Apr 11 2006)"

example.s.timing

Multiplication loopspu_mul stalling

on load



Sample – annotated source (example.s.timing)

.file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr

running-count – cycle count for which each instruction starts. Useful for determining the cycles in a loop. For example, our loop is 17 cycles (23-6).

Inner Loop





Execution pipeline – the pipeline in which the instruction is issued. Either 0 (even pipeline) or 1 (odd pipeline).





dual-issue status – blank indicates single issue.

D – indicates the pair of instructions will be dual-issued.

d – indicates that for the pair of instructions, dual-issue is possible





Instruction clock cycle occupancy – A digit (0-9) is displayed for every clock cycle the instruction executes. Operand dependency stalls are flagged by a dash (“-”) for every clock cycle the instruction is expect to stall.



SIMD - Example 5 in /opt/cell_class/Hands-on-30/SIMD

By setting SPU_TIMING = 1 as either an environment variable or in the SPU Makefile and executing:

make <spu_source>.s

a static timing report is generated with the extension .s.timing

Another way is to issue the command

SPU_TIMING=1 make <spu_source>.s



example3 – SIMD spu_timing #include <stdio.h>

#define N 16



= { 1.1, 2.2, 4.4, 5.5,

6.6, 7.7, 8.8, 9.9,

2.2, 3.3, 3.3, 2.2,

5.5, 6.6, 6.6, 5.5};


= { 1.1, 2.2, 4.4, 5.5,

5.5, 6.6, 6.6, 5.5,

2.2, 3.3, 3.3, 2.2,

6.6, 7.7, 8.8, 9.9};


int main()

{

int num = N;

int i;


for (i=0;i<N;i+=4)


return 0;

}

#include <stdio.h>


{

int i;

vector float *a = (vector float *) in1;

vector float *b = (vector float *) in2;

vector float *c = (vector float *) out;

int Nv = N/4;

for (i=0;i<Nv;i++)

{

c[i] = spu_mul(a[i], b[i]);

}

return 0;

}

mult1.c

main.c

PROGRAM_spu = example5

SPU_TIMING = 1

include $(CELL_TOP)/buildutils/make.footer

Makefile

!

Documents

Developing Code for Cell - SIMD