Upload
enrique-rios
View
27
Download
2
Embed Size (px)
DESCRIPTION
Developing Code for Cell - SIMD. Cell Programming Workshop. Class Agenda. Vector Programming (SIMD) Data types for vector programming Application partitioning SPU Intrinsics. Trademarks - Cell Broadband Engine ™ is a trademark of Sony Computer Entertainment, Inc. PPE vs. SPE Comparison. - PowerPoint PPT Presentation
Citation preview
IBM Systems & Technology Group
Cell Programming Workshop 04/19/23 © 2007 IBM Corporation1
Developing Code for Cell - SIMD
Cell Programming Workshop
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/232
Class Agenda
Vector Programming (SIMD)
– Data types for vector programming
– Application partitioning
SPU Intrinsics
Trademarks - Cell Broadband Engine ™ is a trademark of Sony Computer Entertainment, Inc.
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/233
PPE vs. SPE Comparison
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/234
PPE vs SPE
Both PPE and SPE execute SIMD instructions
– PPE processes SIMD operations in the VXU within its PPU
– SPEs process SIMD operations in their SPU
Both processors execute different instruction sets
Programs written for the PPE and SPEs must be compiled by different compilers
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/235
PPE and SPE Language-Extension Differences Both operate on 128-bit SIMD vectors
Only the Vector/SIMD Multimedia Extension instruction set supports pixel vectors
Only the SPU instruction set supports doubleword vectors
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/236
Register Layout of Data Types and Preferred Slot
When instructions use or produce scalar operands or addresses, the values are in the preferred scalar slot
The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/237
SIMD Architecture
SIMD = “single-instruction multiple-data”
SIMD exploits data-level parallelism
– a single instruction can apply the same operation to multiple data elements in parallel
SIMD units employ “vector registers”
– each register holds multiple data elements
SIMD is pervasive in the BE
– PPE includes VMX (SIMD extensions to PPC architecture)
– SPE is a native SIMD architecture (VMX-like)
SIMD in VMX and SPE
– 128bit-wide datapath
– 128bit-wide registers
– 4-wide fullwords, 8-wide halfwords, 16-wide bytes
– SPE includes support for 2-wide doublewords
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/238
SIMD (Single Instruction Multiple Data) processing
Scalar code
read the next instruction and decode it
get this number
get that number
add them
put the result here
read the next instruction and decode it
get this number
get that number
add them
put the result here
read the next instruction and decode it
get this number
get that number
add them
put the result here.
read the next instruction and decode it
get this number
get that number
add them
put the result there
Vector code
read instruction and decode it
get these 4 numbers
get those 4 numbers
add them
put the results here
• Data-level parallelism / operate on multiple data elements simultaneously
• (Significantly) smaller amount of code => improved execution efficiency• Number of elements processed in parallel = (size of vector / size of element)
Result 1
+
=
Instr. Decode
Number 1
Number 2Repeat4 times
Do Once
Result 3
+
=
Instr. Decode
Number 1c
Number 2c
Result 4
+
=
Number 1d
Number 2d
Result 1
+
=
Number 1a
Number 2a
Result 2
+
=
Number 1b
Number 2bMultiple Data
Single Instruction
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/239
In both the PPE and SPEs, vector registers hold multiple data elements as a single vector. The data paths and registers supporting SIMD operations are 128 bits wide, corresponding to four full 32-bit words.
This means that four 32-bit words can be loaded into a single register, and, for example, added to four other words in a different register in a single operation.
Similar operations can be performed on vector operands containing 16 bytes, 8 halfwords, or 2 doublewords.
I - Example of a SIMD operation
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2310
SIMD “Cross-Element” Instructions
VMX and SPE architectures include “cross-element” instructions
– shifts and rotates
– permutes / shuffles
Permute / Shuffle
– selects bytes from two source registers and places selected bytes in a target register
– byte selection and placement controlled by a “control vector” in a third source register
extremely useful for reorganizing data in the vector register file
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2311
PowerPC Processor Element
The bytes selected for the shuffle from the source registers, VA and VB, are based on byte entries in the control vector, VC, in which a 0 specifies VA and a 1 specifies VB. The result of the shuffle is placed in register VT. The shuffle operation is extremely powerful and finds its way into many applications in which data reordering, selection, or merging is required .
byte-shuffle (permute) operation
II - Example of a SIMD operation
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2312
PowerPC Processor Element
A vector is an instruction operand containing a set of data elements packed into a one-dimensional array
The elements can be fixed-point or floating-point values
SIMD processing exploits data-level parallelism:
Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel (SIMD)
Support for SIMD
operations:
SIMD Vectorization
By the vector/SIMD multimedia
extension instructions
By the SPU instruction setIn the SPEs
In the PPE
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2313
Basic SIMD Operations – A PPU program
/*
* declares vector variables which points to scalar arrays
*/
__vector signed int *va = (__vector signed int *) a;
__vector signed int *vb = (__vector signed int *) b;
__vector signed int *vc = (__vector signed int *) c;
/*
* adds four signed integers at once
*/
*vc = vec_add(*va, *vb); // 1 + 2, 3 + 4, 5 + 6, 7 + 8
/*
* output results
*/
printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);
return 0;
}
#include <stdio.h>
#include <altivec.h> --- not required in IBM SDK for XLC
/*
* declares input/output scalar variables
*/
int a[4] __attribute__((aligned(16))) = { 1, 3, 5, 7 };
int b[4] __attribute__((aligned(16))) = { 2, 4, 6, 8 };
int c[4] __attribute__((aligned(16)));
int main(int argc, char **argv)
{
Use altivec header file
Use altivec vector intrinsic instruction
Source: PS3 Programming
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2314
Basic SIMD Operations – An SPU program
/*
* declares vector variables which points to scalar arrays
*/
__vector signed int *va = (__vector signed int *) a;
__vector signed int *vb = (__vector signed int *) b;
__vector signed int *vc = (__vector signed int *) c;
/*
* adds four signed intergers at once
*/
*vc = spu_add(*va, *vb); // 1 + 2, 3 + 4, 5 + 6, 7 + 8
// use of spu_add instead of vec_add
/*
* output results
*/
printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);
return 0;
}
#include <stdio.h>
#include <spu_intrinsics.h> //the SPU intrinsics functions header file
/*
* declares input/output scalar variables
*/
int a[4] __attribute__((aligned(16))) = { 1, 3, 5, 7 };
int b[4] __attribute__((aligned(16))) = { 2, 4, 6, 8 };
int c[4] __attribute__((aligned(16)));
int main(int argc, char **argv)
{
Use spu_intrinsics header file
Use spu intrinsics function
Source: PS3 Programming
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2315
Generic SPU Intrinsics
Generic
– Constant formation (spu_splats)
– Conversion (spu_convtf, spu_convts, ..)
– Arithmetic (spu_add, spu_madd, spu_nmadd, ...)
– Byte operations (spu_absd, spu_avg,...)
– Compare and branch (spu_cmpeq, spu_cmpgt,...)
– Bits and masks (spu_shuffle, spu_sel,...)
– Logical (spu_and, spu_or, ...)
– Shift and rotate (spu_rlqwbyte, spu_rlqw,...)
– Control (spu_stop, spu_ienable, spu_idisable, ...)
– Channel Control (spu_readch, spu_writech,...)
– Scalar (spu_insert, spu_extract, spu_promote)
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2316
Vectorizing a Loop – A Simple Example
Loop does term-by-term multiply of two arrays
– arrays assumed here to remain scalar outside of subroutine If array size is not a multiple of 4, extra work is necessary
/* scalar version */
int mult1(float *in1, float *in2, float *out, int N){ int i;
for (i = 0, i < N, i++) {
out[i] = in1[i] * in2[i]; }
return 0;
/* vectorized version */
int vmult1(float *in1, float *in2, float *out, int N){ /* assumes that arrays are quadword aligned */ /* assumes that N is divisible by 4 */
int i, Nv; Nv = N >> 2; /* N/4 vectors */ vector float *vin1 = (vector float *) (in1); vector float *vin2 = (vector float *) (in2); vector float *vout = (vector float *) (out);
for (i = 0, i < Nv, i++) {
vout[i] = spu_mul(vin1[i],vin2[i]); }
return 0;
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2317
Another SIMD Example – data incrementing
void load_data(int *dest)
{
int i;
for (i=0; i<4096; ++i)
{
++dest[i];
}
}
void load_data_SIMD(int *dest)
{
int i;
vector unsigned int *vdest;
vector unsigned int v1 = (vector unsigned int) {1, 1, 1, 1};
vdest = (vector unsigned int *) dest;
for (i=0; i<1024; ++i)
{
vdest[i] = spu_add(vdest[i], v1);
}
}
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2318
Complex Multiplication SPE - Shuffle Vectors
a1 b1 a2 b2A1 a3 b3 a4 b4A2
c1 d1 c2 d2B1 c3 d3 c4 d4B2
0-3 8-11 16-19 24-27I P
4-7 12-15 20-23 28-31Q PInput Shuffle patterns
I1 = spu_shuffle(A1, A2, I_Perm_Vector); a1 b1 a2 b2A1 a3 b3 a4 b4A2
0-3 8-11 16-19 24-27I P
a1 a2 a3 a4I1
I2 = spu_shuffle(B1, B2, I_Perm_Vector); c1 c2 c3 c4I2
Q1 = spu_shuffle(A1, A2, Q_Perm_Vector); a1 b1 a2 b2A1 a3 b3 a4 b4A2
b1 b2 b3 b4Q1
d1 d2 d3 d4Q2
4-7 12-15 20-23 28-31Q P
Q2 = spu_shuffle(B1, B2, Q_Perm_Vector);
Z1 Z2
By analogy
By analogy
Z3 Z4
W1 W2 W3 W4
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2319
Complex Multiplication
A1 = spu_nmsub(Q1, Q2, v_zero);
A2 = spu_madd(Q1, I2, v_zero);
Q1 = spu_madd(I1, Q2, A2);
I1 = spu_madd(I1, I2, A1);
b1 b2 b3 b4Q1
d1 d2 d3 d4Q2
0 0 0 0Z
*
-
-(b1*d1)A1 -(b2*d2) -(b3*d3) -(b4*d4)
b1 b2 b3 b4Q1
*
+
b1*c1A2 b2*c2 b3*c3 b4*c4
c1 c2 c3 c4I2
0 0 0 0Z
*
+a1*d1+b1*c1Q1
a1 a2 a3 a4I1
d1 d2 d3 d4Q2
b1*c1A2 b2*c2 b3*c3 b4*c4
a2*d2+b2*c2
a3*d3+b3*c3
a4*d4+b4*c4
*
+a1*c1-b1*d1I1
a1 a2 a3 a4I1
a2*c2-b2*d2
a3*c3-b3*d3
a4*c4-b4*d4
c1 c2 c3 c4I2
-(b1*d1)A1 -(b2*d2) -(b3*d3) -(b4*d4)
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2320
Complex multiplication – SPE - Summary vector float A1, A2, B1, B2, I1, I2, Q1, Q2, D1, D2;
/* in-phase (real), quadrature (imag), temp, and output vectors*/
vector float v_zero = (vector float)(0,0,0,0);
vector unsigned char I_Perm_Vector = (vector unsigned char)(0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27);
vector unsigned char Q_Perm_Vector = (vector unsigned char)(4,5,6,7,12,13,14,15,20,21,22,23,28,29,30,31);
vector unsigned char vcvmrgh = (vector unsigned char) (0,1,2,3,16,17,18,19,4,5,6,7,20,21,22,23);
vector unsigned char vcvmrgl = (vector unsigned char) (8,9,10,11,24,25,26,27,12,13,14,15,28,29,30,31);
/* input vectors are in interleaved form in A1,A2 and B1,B2 with each input vector
representing 2 complex numbers and thus this loop would repeat for N/4 iterations
*/
I1 = spu_shuffle(A1, A2, I_Perm_Vector); /* pulls out 1st and 3rd 4-byte element from vectors A1 and A2 */
I2 = spu_shuffle(B1, B2, I_Perm_Vector); /* pulls out 1st and 3rd 4-byte element from vectors B1 and B2 */
Q1 = spu_shuffle(A1, A2, Q_Perm_Vector); /* pulls out 2nd and 4th 4-byte element from vectors A1 and A2 */
Q2 = spu_shuffle(B1, B2, Q_Perm_Vector); /* pulls out 3rd and 4th 4-byte element from vectors B1 and B2 */
A1 = spu_nmsub(Q1, Q2, v_zero); /* calculates –(bd – 0) for all four elements */
A2 = spu_madd(Q1, I2, v_zero); /* calculates (bc + 0) for all four elements */
Q1 = spu_madd(I1, Q2, A2); /* calculates ad + bc for all four elements */
I1 = spu_madd(I1, I2, A1); /* calculates ac – bd for all four elements */
D1 = spu_shuffle(I1, Q1, vcvmrgh); /* spreads the results back into interleaved format */
D2 = spu_shuffle(I1, Q1, vcvmrgl); /* spreads the results back into interleaved format */
IBM Systems & Technology Group
Cell Programming Workshop 04/19/23 © 2007 IBM Corporation21
Hands-on SIMD
Cell Programming WorkshopCell/Quasar Ecosystem & Solutions Enablement
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2322
Class Agenda
Example #1 - Develop and run a simple PPU scalar program
Example #2 – Converting to an SPU program with data alignment
Example #3 – Static profiling with spu_timing
Learn how to SIMDize scalar code on CellBE
Coding examples
/opt/cell_class/Hands-on-30/SIMD
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2323
Coding Strategy
Develop 2 functions
– mult1.c to multiply two arrays of SP floats together. This program will be used to characterize the SIMD features
– main.c is a non-vector driver program
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2324
example1 - SIMD#include <stdio.h>
#define N 16
int mult1(float *in1, float *in2, float *out, int num);
float a[N] = { 1.1, 2.2, 4.4, 5.5,
6.6, 7.7, 8.8, 9.9,
2.2, 3.3, 3.3, 2.2,
5.5, 6.6, 6.6, 5.5};
float b[N] = { 1.1, 2.2, 4.4, 5.5,
5.5, 6.6, 6.6, 5.5,
2.2, 3.3, 3.3, 2.2,
6.6, 7.7, 8.8, 9.9};
float c[N];
int main()
{
int num = N;
int i;
mult1(a, b, c, num);
for (i=0;i<N;i+=4)
printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]);
return 0;
}
#include <stdio.h>
int mult1(float *in1, float *in2, float *out, int N)
{
int i;
for (i=0;i<N;i++)
{
out[i] = in1[i] * in2[i];
}
return 0;
}
mult1.c
main.c
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2325
Example1.b - SIMD#include <stdio.h>
#define N 16
int mult1(float *in1, float *in2, float *out, int num);
float a[N] __attribute__ ((aligned(16)))
= { 1.1, 2.2, 4.4, 5.5,
6.6, 7.7, 8.8, 9.9,
2.2, 3.3, 3.3, 2.2,
5.5, 6.6, 6.6, 5.5};
float b[N] __attribute__ ((aligned(16)))
= { 1.1, 2.2, 4.4, 5.5,
5.5, 6.6, 6.6, 5.5,
2.2, 3.3, 3.3, 2.2,
6.6, 7.7, 8.8, 9.9};
float c[N] __attribute__ ((aligned(16)));
int main()
{
int num = N;
int i;
mult1(a, b, c, num);
for (i=0;i<N;i+=4)
printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]);
return 0;
}
#include <stdio.h>
int mult1(float *in1, float *in2, float *out, int N)
{
int i;
for (i=0;i<N;i++)
{
out[i] = in1[i] * in2[i];
}
return 0;
}
mult1.c
main.c
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2326
#include <stdio.h>
#define N 16
int mult1(float *in1, float *in2, float *out, int num);
float a[N] __attribute__ ((aligned(16)))
= { 1.1, 2.2, 4.4, 5.5,
6.6, 7.7, 8.8, 9.9,
2.2, 3.3, 3.3, 2.2,
5.5, 6.6, 6.6, 5.5};
float b[N] __attribute__ ((aligned(16)))
= { 1.1, 2.2, 4.4, 5.5,
5.5, 6.6, 6.6, 5.5,
2.2, 3.3, 3.3, 2.2,
6.6, 7.7, 8.8, 9.9};
float c[N] __attribute__ ((aligned(16)));
int main()
{
int num = N;
int i;
mult1(a, b, c, num);
for (i=0;i<N;i+=4)
printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]);
return 0;
}
#include <stdio.h>
#include <spu_intrinsics.h>
int mult1(float *in1, float *in2, float *out, int N)
{
int i;
vector float *a = (vector float *) in1;
vector float *b = (vector float *) in2;
vector float *c = (vector float *) out;
int Nv = N/4;
// NV=4
for (i=0;i<Nv;i++)
{
c[i] = spu_mul(a[i], b[i]);
}
return 0;
}
mult1.c
main.c
Example #2 – Converting to an SPU !
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2327
Example #3 – Static profiling with spu_timing
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2328
What is the SPU timing tool
Annotates an assembly source file with static analysis of instruction timing assuming linear (branchless) execution.
Annotated
assembly source
(example.s.timing)
C source code
(example.c)
Assembly source
(example.s)
spu-gcc –S example.cor
spuxlc –S example.c
spu_timing example.s
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2329
Sample - assembly source (example.s)
.file "example.c“.text
.align 3
.global saxpy
.type saxpy, @functionsaxpy:
ila $2,66051shlqbyi $7,$3,0cgti $3,$3,0shufb $8,$4,$4,$2nop $127biz $3,$lrori $4,$7,0hbra .L8,.L4il $7,0lnop
.L4:ai $4,$4,-1lqx $2,$7,$5lqx $3,$7,$6fma $2,$8,$2,$3stqx $2,$7,$6ai $7,$7,16
.L8:brnz $4,.L4bi $lr
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2330
Example .timing
.file "mult1.c" .text .align 3 .global mult1 .type mult1, @function mult1:0D 01 cgti $2,$6,-11D 0123 shlqbyi $10,$3,00D 12 ai $3,$6,31D 1234 shlqbyi $9,$4,00 -34 selb $3,$3,$6,$20 -5678 rotmai $8,$3,-20d ---90 cgti $2,$8,01d --1234 brnz $2,.L8 .L2:0D 23 il $3,01D 2345 bi $lr .L8:0D 34 il $7,01D 345678 hbra .L9,.L40D 45 il $6,01D 4 lnop .L4:0d 56 ai $7,$7,11d -678901 lqx $2,$10,$61 789012 lqx $3,$9,$60 89 ceq $4,$8,$70d ----345678 fm $2,$2,$31d ------901234 stqx $2,$5,$60D 01 ai $6,$6,16 .L9:1D 0123 brz $4,.L40D 1 nop $1271D 1234 br .L2 .size mult1, .-mult1 .ident "GCC: (GNU) 4.0.2 (CELL 3-2, Apr 11 2006)"
example.s.timing
Multiplication loopspu_mul stalling
on load
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2331
Sample – annotated source (example.s.timing)
.file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr
running-count – cycle count for which each instruction starts. Useful for determining the cycles in a loop. For example, our loop is 17 cycles (23-6).
Inner Loop
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2332
Sample – annotated source (example.s.timing)
.file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr
Execution pipeline – the pipeline in which the instruction is issued. Either 0 (even pipeline) or 1 (odd pipeline).
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2333
Sample – annotated source (example.s.timing)
.file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr
dual-issue status – blank indicates single issue.
D – indicates the pair of instructions will be dual-issued.
d – indicates that for the pair of instructions, dual-issue is possible
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2334
Sample – annotated source (example.s.timing)
.file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr
Instruction clock cycle occupancy – A digit (0-9) is displayed for every clock cycle the instruction executes. Operand dependency stalls are flagged by a dash (“-”) for every clock cycle the instruction is expect to stall.
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2335
SIMD - Example 5 in /opt/cell_class/Hands-on-30/SIMD
By setting SPU_TIMING = 1 as either an environment variable or in the SPU Makefile and executing:
make <spu_source>.s
a static timing report is generated with the extension .s.timing
Another way is to issue the command
SPU_TIMING=1 make <spu_source>.s
IBM Systems & Technology Group –
© 2007 IBM CorporationCell Programming Workshop 04/19/2336
example3 – SIMD spu_timing #include <stdio.h>
#define N 16
int mult1(float *in1, float *in2, float *out, int num);
float a[N] __attribute__ ((aligned(16)))
= { 1.1, 2.2, 4.4, 5.5,
6.6, 7.7, 8.8, 9.9,
2.2, 3.3, 3.3, 2.2,
5.5, 6.6, 6.6, 5.5};
float b[N] __attribute__ ((aligned(16)))
= { 1.1, 2.2, 4.4, 5.5,
5.5, 6.6, 6.6, 5.5,
2.2, 3.3, 3.3, 2.2,
6.6, 7.7, 8.8, 9.9};
float c[N] __attribute__ ((aligned(16)));
int main()
{
int num = N;
int i;
mult1(a, b, c, num);
for (i=0;i<N;i+=4)
printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]);
return 0;
}
#include <stdio.h>
int mult1(float *in1, float *in2, float *out, int N)
{
int i;
vector float *a = (vector float *) in1;
vector float *b = (vector float *) in2;
vector float *c = (vector float *) out;
int Nv = N/4;
for (i=0;i<Nv;i++)
{
c[i] = spu_mul(a[i], b[i]);
}
return 0;
}
mult1.c
main.c
PROGRAM_spu = example5
SPU_TIMING = 1
include $(CELL_TOP)/buildutils/make.footer
Makefile
!