
Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184

Jakub Kurzak 1, Alfredo Buttari 1, Jack Dongarra 1,2

1 Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996

2 Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831

April 30, 2007

ABSTRACT: The STI CELL processor introduces pioneering solutions in processor architecture. At the same time it presents new challenges to the development of numerical algorithms. One is effective exploitation of the differential between the speed of single and double precision arithmetic; the other is efficient parallelization between the short vector SIMD cores. In this work the first challenge is addressed by utilizing a mixed-precision algorithm for the solution of a dense symmetric positive definite system of linear equations, which delivers double precision accuracy while performing the bulk of the work in single precision. The second challenge is approached by introducing much finer granularity of parallelization than has been used for other architectures, and by using lightweight decentralized synchronization. The implementation of the computationally intensive sections gets within 90% of peak floating point performance, while the implementation of the memory intensive sections reaches within 90% of peak memory bandwidth. On a single CELL processor the algorithm achieves over 170 Gflop/s when solving a symmetric positive definite system of linear equations in single precision and over 150 Gflop/s when delivering the result in double precision accuracy.

KEYWORDS: CELL BE, iterative refinement,mixed-precision algorithms, Cholesky factorization

1 Motivation

In numerical computing, there is a fundamental performance advantage to using the single precision floating point data format over the double precision data format, due to its more compact representation, thanks to which twice the number of single precision data elements can be stored at each stage of the memory hierarchy. Short vector SIMD processing provides yet more potential for performance gains from using single precision arithmetic over double precision. Since the goal is to process the entire vector in a single operation, the computation throughput can be doubled when the data representation is halved.

Most processor architectures available today have at some point been augmented with short vector SIMD extensions. Examples include the Streaming SIMD Extensions (SSE) for the AMD and Intel lines of processors, the PowerPC Velocity Engine / AltiVec / VMX, the Sparc Visual Instruction Set (VIS), Alpha Motion Video Instructions (MVI), PA-RISC Multimedia Acceleration eXtensions (MAX), the MIPS-3D Application Specific Extensions (ASE), the MIPS Digital Media eXtensions (MDMX), and ARM NEON. The different architectures exhibit big differences in their capabilities. The vector size is either 64 bits or, more commonly, 128 bits. The register file size ranges from just a few to as many as 128 registers.


Some extensions only support integer types, others also operate on single precision floating point numbers, and yet others also process double precision values.

Today the Synergistic Processing Element (SPE) of the STI CELL processor [1–3] can probably be considered the state of the art in short vector SIMD processing. Possessing 128-bit-wide registers and a fully pipelined fused multiply-add instruction, it is capable of completing 8 single precision floating point operations each clock cycle, which, combined with a register file of 128 registers, delivers close to peak performance on many common workloads. At the same time, built with multimedia and embedded applications in mind, the current incarnation of the CELL architecture does not implement double precision arithmetic on a par with single precision performance-wise, which makes the processor a very attractive target for exploring mixed-precision approaches.
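
For concreteness, a short worked number (assuming the 3.2 GHz clock rate of current CELL implementations, a figure not stated in this section): 4 vector lanes × 2 operations per fused multiply-add × 3.2 GHz = 25.6 Gflop/s of single precision peak per SPE, or 8 × 25.6 = 204.8 Gflop/s for the eight SPEs of the chip - the peak figures used throughout the rest of this paper.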

Another important phenomenon in recent years has been the gradual shift of focus in processor architecture from aggressive exploitation of instruction level parallelism towards thread-level parallelism, resulting in the introduction of chips with multiple processing units, commonly referred to as multi-core processors. The new architectures deliver the much desired improvement in performance, but at the same time challenge the scalability of existing algorithms and force programmers to seek more parallelism by going to much finer levels of problem granularity. In linear algebra this enforces a departure from the model relying on parallelism encapsulated at the level of the BLAS, and a shift to more flexible methods of scheduling work.

2 Related Work

Iterative refinement is a well known method for improving the solution of a linear system of equations of the form Ax = b. Typically, a dense system of linear equations is solved by applying a factorization to the coefficient matrix, followed by a back solve. Due to roundoff errors, the solution carries an error related to the condition number of the coefficient matrix. In order to improve the computed solution, an iterative refinement process can be applied, which produces a correction to the computed solution at each iteration. In principle, the algorithm can produce a solution correct to the working precision.

Iterative refinement is a fairly well understood concept, and was analyzed by Wilkinson [4], Moler [5] and Stewart [6]. Higham gives error bounds for both single and double precision iterative refinement algorithms, where the entire algorithm is implemented with the same precision (single or double, respectively) [7]. He also gives error bounds in single precision arithmetic with refinement performed in double precision arithmetic. Error analysis for the case described in this work, where the factorization is performed in single precision and the refinement in double precision, is given by Langou et al. [8].

The authors of this work have previously presented an initial implementation of the mixed-precision algorithm for the general, non-symmetric case, using LU factorization on the CELL processor [14]. Although respectable performance numbers were presented, both the factorization and the refinement steps relied on rather classic parallelization approaches. Also, only a somewhat general discussion of algorithmic and implementation details was presented. This work extends the previous presentation by introducing a novel scheme for parallelization of the computational components of the algorithm, and by describing in much more detail the implementation of both the computation-intensive and the memory-intensive operations.

3 Algorithm

The standard approach to solving symmetric positive definite systems of linear equations is to use the Cholesky factorization. The Cholesky factorization of a real symmetric positive definite matrix A has the form A = LL^T, where L is a real lower triangular matrix with positive diagonal elements. The system is solved by first solving Ly = b (forward substitution) and then solving L^T x = y (backward substitution). In order to improve the accuracy of the computed solution, an iterative refinement process is applied, which produces a correction to the computed solution, x, at each iteration.


Algorithm 1 Solution of a symmetric positive definite system of linear equations using mixed-precision iterative refinement based on Cholesky factorization.

1:  A_(32), b_(32) ← A, b
2:  L_(32), L^T_(32) ← SPOTRF(A_(32))                (SPOTRF is the LAPACK name for Cholesky factorization)
3:  x^(1)_(32) ← SPOTRS(L_(32), L^T_(32), b_(32))    (SPOTRS is the LAPACK name for the symmetric back solve)
4:  x^(1) ← x^(1)_(32)
5:  repeat
6:    r^(i) ← b − A x^(i)
7:    r^(i)_(32) ← r^(i)
8:    z^(i)_(32) ← SPOTRS(L_(32), L^T_(32), r^(i)_(32))
9:    z^(i) ← z^(i)_(32)
10:   x^(i+1) ← x^(i) + z^(i)
11: until x^(i) is accurate enough

64-bit representation is used in all cases where 32-bit representation is not indicated by a subscript.

The mixed-precision iterative refinement algorithm using Cholesky factorization is outlined in Algorithm 1. The factorization A = LL^T (line 2) and the solution of the triangular systems Ly = b and L^T x = y (lines 3 and 8) are computed using single precision arithmetic. The residual calculation (line 6) and the update of the solution (line 10) are computed using double precision arithmetic and the original double precision coefficients. The most computationally expensive operations, including the factorization of the coefficient matrix A and the forward and backward substitution, are performed in single precision arithmetic and take advantage of the single precision speed. The only operations executed in double precision are the residual calculation and the update of the solution.

It can be observed that all operations of O(n³) computational complexity are handled in single precision, and all operations performed in double precision are of at most O(n²) complexity. The coefficient matrix A is converted to single precision for the Cholesky factorization. At the same time, the original matrix in double precision is preserved for the residual calculation.

The algorithm described above, and shown in Algorithm 1, is analogous to the mixed-precision algorithm based on LU factorization that is available in the LAPACK 3.1 library and implemented by the routine DSGESV.
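
As an illustration, the following minimal C sketch shows how the steps of Algorithm 1 map onto standard CBLAS/LAPACK calls on a general-purpose processor. It is not the CELL implementation described below: the function name is hypothetical, the iteration count is fixed at the two refinement steps that proved sufficient in the experiments reported later, the convergence test of line 11 is omitted, and allocation checks are abbreviated.

    #include <stdlib.h>
    #include <string.h>
    #include <cblas.h>

    /* Single precision LAPACK Cholesky factorization and back solve
       (Fortran interfaces). */
    extern void spotrf_(const char *uplo, const int *n, float *a,
                        const int *lda, int *info);
    extern void spotrs_(const char *uplo, const int *n, const int *nrhs,
                        const float *a, const int *lda, float *b,
                        const int *ldb, int *info);

    /* Mixed-precision solve of A x = b for a symmetric positive definite A
       (column-major, order n), following Algorithm 1. */
    int mixed_precision_posv(int n, const double *A, const double *b, double *x)
    {
        float  *A32 = malloc((size_t)n * n * sizeof *A32);
        float  *v32 = malloc((size_t)n * sizeof *v32);
        double *r   = malloc((size_t)n * sizeof *r);
        int info, nrhs = 1, i, iter;

        for (i = 0; i < n * n; i++) A32[i] = (float)A[i];      /* A(32) <- A     */
        spotrf_("L", &n, A32, &n, &info);                      /* L(32) L(32)^T  */
        if (info == 0) {
            for (i = 0; i < n; i++) v32[i] = (float)b[i];      /* b(32) <- b     */
            spotrs_("L", &n, &nrhs, A32, &n, v32, &n, &info);  /* x(1)(32)       */
            for (i = 0; i < n; i++) x[i] = (double)v32[i];     /* x(1)           */

            for (iter = 0; iter < 2; iter++) {                 /* refinement     */
                memcpy(r, b, (size_t)n * sizeof *r);           /* r = b          */
                cblas_dsymv(CblasColMajor, CblasLower, n,      /* r = b - A x    */
                            -1.0, A, n, x, 1, 1.0, r, 1);
                for (i = 0; i < n; i++) v32[i] = (float)r[i];  /* r(32) <- r     */
                spotrs_("L", &n, &nrhs, A32, &n, v32, &n, &info);  /* z(32)      */
                for (i = 0; i < n; i++) x[i] += (double)v32[i];    /* x += z     */
            }
        }
        free(A32); free(v32); free(r);
        return info;
    }

On the CELL processor the same logical steps are carried out, but, as described next, both the single precision factorization and solves and the double precision residual and update are implemented as custom SPE kernels.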

4 Implementation

4.1 Essential Hardware Features

An extensive hardware overview would be beyond the scope of this publication. A vast amount of information is publicly available for both experienced programmers [9] and newcomers to the field [10, 11]. It is assumed that the reader has some familiarity with the architecture. Only the features which most influence the implementation presented here are mentioned.

The CELL is a multi-core chip including nine different processing elements. One core, the POWER Processing Element (PPE), represents a standard processor design implementing the PowerPC instruction set. The remaining eight cores, the Synergistic Processing Elements (SPEs), are short vector Single Instruction Multiple Data (SIMD) engines with big register files of 128 128-bit vector registers and 256 KB of local memory referred to as the local store (LS). Although standard C code can be compiled for execution on the SPEs, the SPEs do not execute scalar code efficiently. For efficient execution the code has to be vectorized in the SIMD sense, by using C language vector extensions (intrinsics) or by using assembly code. The system's main memory is accessible to the PPE through L1 and L2 caches and to the SPEs through DMA engines associated with them. The SPEs can only execute code residing in the local store and can only operate on data in the local store. All data has to be transferred in and out of the local store via DMA transfers. The theoretical computing power of a single SPE is 25.6 Gflop/s in single precision and roughly 1.8 Gflop/s in double precision. Floating point arithmetic follows the IEEE format, with double precision operations complying numerically with the standard and single precision not complying. The theoretical communication speed for a single SPE is 25.6 GB/s. The theoretical peak bandwidth of the main memory is 25.6 GB/s as well.
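
As a concrete illustration of this programming model (a minimal sketch, not code from the presented implementation), an SPE could fetch one 64×64 single precision tile from main memory as follows; the function name is hypothetical, and buffer alignment and tag management are left to the caller.

    #include <spu_mfcio.h>

    /* Fetch one 64x64 single precision tile (16 KB) from main memory into
       the local store and wait for completion.  tile_ea is the tile's
       effective address in main memory; buf must be suitably aligned and
       tag is a DMA tag group id managed by the caller. */
    static void get_tile(volatile float *buf, unsigned long long tile_ea,
                         unsigned int tag)
    {
        mfc_get(buf, tile_ea, 64 * 64 * sizeof(float), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);        /* select this tag group       */
        mfc_read_tag_status_all();           /* block until the DMA is done */
    }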


Figure 1: Top - steps of the left-looking Cholesky factorization. Bottom - tiling of operations: SSYRK (T ← T − A A^T), SPOTRF (T ← L L^T), SGEMM (C ← C − B A^T), STRSM (C ← C T^-T).

The size of the register file and the size of the local store dictate the size of the elementary operation subject to scheduling on the SPEs. The ratio of computing power to memory bandwidth dictates the overall problem decomposition for parallel execution.

4.2 Factorization

A few flavors of Cholesky factorization are known, in particular the right-looking variant, the left-looking variant and the Crout variant [12]. It has also been pointed out that those variants are borders of a continuous spectrum of possible execution paths [13].

Generally, the left-looking factorization is preferred for the following reasons. During the factorization most of the time is spent calculating a matrix-matrix product. In the case of the right-looking factorization this product involves a triangular matrix. In the case of the left-looking factorization this product only involves rectangular matrices. It is generally more efficient to implement a matrix-matrix product for rectangular matrices, and it is easier to balance the load in parallel execution. The implementation presented here is derived from the left-looking formulation of the Cholesky factorization, which follows the implementation of the LAPACK routine SPOTRF.

Due to the limited size of the local store, all numerical operations have to be decomposed into elementary operations which fit in the local store. The simplicity of implementing Cholesky factorization lies in the fact that it can be easily decomposed into tile operations - operations on fixed-size submatrices which take from one to three tiles as input and produce one tile as output. These elementary operations will further be referred to as tile kernels. Figure 1 presents the steps of the left-looking Cholesky factorization and how each step is broken down into tile operations.

4.2.1 Computational Kernels

Implementation of the tile kernels assumes a fixed size of the tiles. Smaller tiles (finer granularity) have a positive effect on scheduling for parallel execution and facilitate better load balance and higher parallel efficiency. Bigger tiles provide better performance in sequential execution on a single SPE.

In the case of the CELL chip the crossover point is rather simple to find for problems in dense linear algebra. From the standpoint of this work, the most important operation is matrix multiplication in single precision. It turns out that this operation can achieve within a few percent of peak performance on a single SPE for matrices of size 64×64. The fact that the peak performance can be achieved for a tile of such a small size has to be attributed to the big size of the register file and fast access to the local store, undisturbed by any intermediate levels of memory. Also, such a matrix occupies a 16 KB block of memory, which is the maximum size of a single DMA transfer. Eight such matrices fit in half of the local store, providing enough flexibility for multi-buffering and, at the same time, leaving enough room for the code. A discussion of tile size considerations was also presented before by the authors of this publication [14]. Table 1 represents the Cholesky tile kernels for a tile size of 64×64 as BLAS and LAPACK calls.

Operation: T ← T − A × A^T
    cblas_ssyrk(CblasRowMajor, CblasLower, CblasNoTrans,
                64, 64, -1.0, A, 64, 1.0, T, 64);

Operation: C ← C − B × A^T
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                64, 64, 64, -1.0, B, 64, A, 64, 1.0, C, 64);

Operation: B ← B × T^-T  (B = B / T^T in MATLAB notation)
    cblas_strsm(CblasRowMajor, CblasRight, CblasLower,
                CblasTrans, CblasNonUnit, 64, 64, 1.0, T, 64, B, 64);

Operation: T ← L × L^T
    lapack_spotrf(lapack_lower, 64, trans(T), 64, &info);
    (using the LAPACK C interface by R. Delmas and J. Langou,
     http://icl.cs.utk.edu/~delmas/lapwrapc.html)

Table 1: Single precision Cholesky tile kernels.
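
As a concrete illustration of how these calls assemble into one step of the factorization in Figure 1, the following sketch (a sequential reference with an assumed memory layout, not the SPE implementation) performs left-looking step k on an nt×nt grid of contiguous row-major 64×64 tiles; the modern LAPACKE interface is used here in place of the wrapper cited in Table 1.

    #include <cblas.h>
    #include <lapacke.h>

    enum { NB = 64 };
    /* a is an array of nt*nt pointers to contiguous row-major 64x64 tiles;
       only the lower triangle (i >= j) is referenced. */
    #define TILE(a, nt, i, j) ((a)[(size_t)(i) * (nt) + (j)])

    /* One left-looking step: update and factorize the diagonal tile (k,k),
       then update and triangular-solve the tiles (j,k) below it. */
    static int cholesky_step(int nt, float **a, int k)
    {
        for (int i = 0; i < k; i++)                              /* SSYRK  */
            cblas_ssyrk(CblasRowMajor, CblasLower, CblasNoTrans, NB, NB,
                        -1.0f, TILE(a, nt, k, i), NB, 1.0f, TILE(a, nt, k, k), NB);

        int info = LAPACKE_spotrf(LAPACK_ROW_MAJOR, 'L', NB,     /* SPOTRF */
                                  TILE(a, nt, k, k), NB);
        if (info != 0) return info;

        for (int j = k + 1; j < nt; j++) {
            for (int i = 0; i < k; i++)                          /* SGEMM  */
                cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, NB, NB, NB,
                            -1.0f, TILE(a, nt, j, i), NB, TILE(a, nt, k, i), NB,
                            1.0f, TILE(a, nt, j, k), NB);
            cblas_strsm(CblasRowMajor, CblasRight, CblasLower,   /* STRSM  */
                        CblasTrans, CblasNonUnit, NB, NB,
                        1.0f, TILE(a, nt, k, k), NB, TILE(a, nt, j, k), NB);
        }
        return 0;
    }

Looping k over 0 .. nt−1 factorizes the whole tiled matrix; the parallel implementation described below executes the same tile operations, but distributes and reorders them across the SPEs.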

It has already been pointed out that a few derivations of Cholesky factorization are known, in particular the right-looking variant, the left-looking variant and the Crout variant [12]. Dynamic scheduling of operations is another possibility. No matter, however, which static variant is used, or whether dynamic execution is used, the same set of tile kernels is needed. The change from one to another only alters the order in which the tile operations execute.

All the kernels were developed using SIMD C language extensions. Extra effort was invested in the optimization of the matrix multiplication (SGEMM) kernel performing the calculation C = C − B × A^T, since this operation dominates the execution time. (Throughout, tile kernels are referred to using the names of the BLAS and LAPACK routines implementing the same functionality.) All the tile kernels were written by consistently applying the idea of blocking with a block size of four. A short discussion of each case follows.


The SSYRK kernel applies a symmetric rank-k update to a tile. In other words, it performs the operation T ← T − A × A^T, where only the lower triangular part of T is modified. The SSYRK kernel consists of two nested loops, where in each inner iteration a 4×4 block of the output tile is produced and the body of the inner loop is completely unrolled. The #define and nested #define directives are used to create a single basic block - a block of straight-line code. Static offsets are used within the basic block and pointer arithmetic is used to advance the pointers between iterations.
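
The technique can be illustrated with a hypothetical fragment (not the actual kernel source): nested #define directives expand into one straight-line basic block, here 16 consecutive SIMD fused multiply-subtract operations (spu_nmsub(a, b, c) computes c − a·b on 4-element vectors).

    #include <spu_intrinsics.h>

    /* Nested macros expanding to a straight-line basic block with static
       offsets; a hypothetical illustration of the unrolling technique. */
    #define OP(k)     acc[k] = spu_nmsub(a[k], b[k], acc[k]);
    #define STEP4(k)  OP(k) OP(k + 1) OP(k + 2) OP(k + 3)
    #define STEP16(k) STEP4(k) STEP4(k + 4) STEP4(k + 8) STEP4(k + 12)

    static void unrolled_block(vec_float4 *acc, const vec_float4 *a,
                               const vec_float4 *b)
    {
        STEP16(0)   /* expands to 16 consecutive spu_nmsub operations */
    }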

Algorithm 2 SSYRK tile kernel T ← T − A × A^T.
1:  for j = 0 to 15 do
2:    for i = 0 to j − 1 do
3:      Compute block [j,i]
4:      Permute/reduce block [j,i]
5:    end for
6:    Compute block [j,j]
7:    Permute/reduce block [j,j]
8:  end for

block is a 4×4 submatrix of tile T.

The construction of the SSYRK tile kernel is presented in Algorithm 2. The unrolled code consists of four distinct segments. A computation segment (line 3) includes only multiply and add operations to calculate a product of two 4×64 blocks of tile A and produces 16 4-element vectors as output. A permutation/reduction segment (line 4) performs transpositions and reductions on these 16 vectors and delivers the 4 4-element vectors constituting the 4×4 block of the output tile. The two segments mentioned above handle the off-diagonal, square blocks (lines 3 and 4), and two more segments handle the triangular, diagonal blocks (lines 6 and 7). It is an elegant and compact, but suboptimal, design.

The SGEMM kernel performs the operation C ← C − B × A^T, which is very similar to the operation performed by the SSYRK kernel. One difference is that it updates the tile C with a product of two tiles, B and A^T, and not the tile A and its transpose.



Algorithm 3 SGEMM tile kernel C ← C − B × A^T.
1:  Compute blk 0
2:  for i = 1 to 127 do
3:    Permute/reduce blk 2i − 2 & compute blk 2i − 1
4:    Permute/reduce blk 2i − 1 & compute blk 2i
5:  end for
6:  Permute/reduce blk 254 & compute blk 255
7:  Permute/reduce blk 255

blk is a 4×4 submatrix of tile C.

The second difference is that the output is the entire tile and not just its lower triangular portion. Also, since the SGEMM kernel is performance-critical, it is subject to more optimizations compared to the SSYRK kernel.

The main idea behind further optimization of the SGEMM kernel comes from the fact that the SPE is a dual-issue architecture, where arithmetic operations can execute in parallel with permutations of vector elements. Thanks to this fact a pipelined execution can be implemented, where the operations of the permute/reduce segment from iteration k can be mixed with the operations of the compute segment from iteration k + 1. The two nested loops used for SSYRK are replaced with a single loop, where the 256 4×4 blocks of the output tile are produced in linear row-major order, which results in Algorithm 3.

Algorithm 4 STRSM tile kernel B ← B × T^-T.
1:  for j = 0 to 15 do
2:    for i = 0 to j − 1 do
3:      Apply block i towards block j
4:    end for
5:    Solve block j
6:  end for

block is a 64×4 submatrix of tile B.

The STRSM kernel computes a triangular solve with multiple right-hand sides, B ← B × T^-T. The computation is conceptually easiest to SIMD'ize if the same step is applied at the same time to different right-hand sides. This can easily be achieved if the memory layout of tile B is such that each 4-element vector contains elements of the same index of different right-hand sides. Since this is not the case here, each 4×4 block of the tile is transposed, in place, before and after the main loop implementing the solve. The operation introduces a minor overhead, but allows for a very efficient implementation of the solve - one which achieves a good ratio of the peak with small and simple code.

Algorithm 4 presents the structure of the code implementing the triangular solve, which is an application of the lazy variant of the algorithm. The choice of the lazy algorithm versus the aggressive algorithm is arbitrary. The code includes two segments of fully unrolled code, which both operate on 64×4 blocks of data. The outer loop segment (line 5) produces a 64×4 block j of the solution. The inner loop segment (line 3) applies the update of step i to block j.
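
For reference, the computation performed by this kernel can be written out in scalar form (a plain C sketch mirroring the lazy structure of Algorithm 4, with row-major tiles assumed; it is not the vectorized SPE code, which additionally transposes the 4×4 blocks so that each vector holds four right-hand sides).

    enum { TS = 64, BS = 4 };

    /* Scalar reference of the STRSM tile kernel B <- B * T^-T, where T is
       the lower triangular 64x64 diagonal tile.  Each row of B is a
       right-hand side; block columns of width 4 are produced lazily, as in
       Algorithm 4. */
    static void strsm_tile_ref(const float T[TS][TS], float B[TS][TS])
    {
        for (int j = 0; j < TS / BS; j++) {
            /* apply previously solved block columns i < j to block j */
            for (int i = 0; i < j; i++)
                for (int r = 0; r < TS; r++)
                    for (int a = 0; a < BS; a++)
                        for (int b = 0; b < BS; b++)
                            B[r][j * BS + a] -= B[r][i * BS + b] * T[j * BS + a][i * BS + b];
            /* solve block j against the 4x4 diagonal block of T */
            for (int r = 0; r < TS; r++)
                for (int a = 0; a < BS; a++) {
                    for (int b = 0; b < a; b++)
                        B[r][j * BS + a] -= B[r][j * BS + b] * T[j * BS + a][j * BS + b];
                    B[r][j * BS + a] /= T[j * BS + a][j * BS + a];
                }
        }
    }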

Algorithm 5 SPOTRF tile kernel T ← L × L^T.
1:  for k = 0 to 15 do
2:    for i = 0 to k − 1 do
3:      SSYRK (apply block [k,i] to block [k,k])
4:    end for
5:    SPOTF2 (factorize block [k,k])
6:    for j = k + 1 to 15 do
7:      for i = 0 to k − 1 do
8:        SGEMM (apply block [j,i] to block [j,k])
9:      end for
10:   end for
11:   for j = k + 1 to 15 do
12:     STRSM (apply block [k,k] to block [j,k])
13:   end for
14: end for

block is a 4×4 submatrix of tile T.

The SPOTRF kernel computes the Cholesky factorization, T ← L × L^T, of a 64×64 tile. It is the lazy variant of the algorithm, more commonly referred to as the left-looking factorization, where updates to the trailing submatrix do not immediately follow the panel factorization. Instead, updates are applied to a panel right before the factorization of that panel.

It could be expected that this kernel would be implemented using Level 2 BLAS operations, as this is the usual way of implementing panel factorizations. Such a solution would, however, lead to code that is difficult to SIMD'ize and would yield very poor performance. Instead, the implementation of this routine is simply an application of the idea of blocking with a block size equal to the SIMD vector size of 4. Algorithm 5 presents the structure of the SPOTRF tile kernel.

Kernel    Source Code [LOC]   Compilation        Object Code [KB]   Execution Time [µs]   Execution Rate [Gflop/s]   Fraction of Peak [%]
SSYRK     160                 spuxlc (a) -O3     4.7                13.23                 20.12                      78
SGEMM     330                 spu-gcc (b) -Os    9.0                22.78                 23.01                      90
STRSM     310                 spuxlc (a) -O3     8.2                16.47                 15.91                      62
SPOTRF    340                 spu-gcc (b) -O3    4.1                15.16                 5.84                       23

(a) version 1.0 (SDK 1.1)
(b) version 4.0.2 (toolchain 3.2 - SDK 1.1)

Table 2: Performance of single precision Cholesky factorization tile kernels.

Table 2 compares the tile kernels in terms of source and object code size and performance. Although performance is the most important metric, code size is not without meaning, due to the limited size of the local store. Despite the fact that code can be replaced in the local store at runtime, it is desirable that the entire code implementing the single precision factorization fits into the local store at the same time. Code motion during the factorization would both complicate the code and adversely affect performance.

Although the matrix multiplication achieves quite good performance - 23 Gflop/s, which is 90% of the peak - there is no doubt that yet better performance could be achieved by using assembly code instead of C language SIMD extensions. Performance in excess of 25 Gflop/s has been reported for similar, although not identical, SGEMM kernels [15]. It is remarkable that this performance can be achieved for operations of such small granularity, which has to be attributed to the unique features of the CELL architecture, especially the register file and memory organization.

It is worth noting that, although significant effort was required to optimize the SGEMM kernel (and yet more would be required to further optimize it), the other kernels involved a rather small programming effort in a higher level language to deliver satisfactory performance (execution time shorter than the execution time of the SGEMM kernel). This means that the Pareto principle (also known as the 80-20 rule) applies very well in this case. Only a small portion of the code requires strenuous optimizations for the application to deliver close to peak performance.

4.2.2 Parallelization

The presentation of the parallelization of the Cholesky factorization needs to be preceded by a discussion of memory bandwidth considerations.

The SGEMM kernel can potentially execute at a rate very close to 25.6 Gflop/s on a single SPE. In such a case it performs the 2 × 64³ = 524288 operations in 20.48 µs. The operation requires the transmission of three tiles from main memory to the local store (tiles A, B and C) and a transmission of one tile from the local store to main memory (the updated tile C), consuming a bandwidth equal to:

    (4 tiles × 64² × 4 B) / 20.48 µs = 3.2 GB/s.

This means that 8 SPEs performing the SGEMM operation at the same time will require a bandwidth of 8 × 3.2 GB/s = 25.6 GB/s, which is the theoretical peak main memory bandwidth.

It has been shown that arithmetic can execute almost at the theoretical peak rate on the SPE. At the same time it would not be realistic to expect theoretical peak bandwidth from the memory system. By the same token, data reuse has to be introduced into the algorithm to decrease the load on the main memory. A straightforward solution is to introduce 1D processing, where each SPE processes one row of tiles of the coefficient matrix at a time.

Please follow Figure 1 for the following discussion. In the SSYRK part, one SPE applies a row of tiles A to the diagonal tile T, followed by the SPOTRF operation (factorization) on the tile T. Tile T only needs to be read in at the beginning of this step and written back at the end. The only transfers taking place in between are reads of tiles A. Similarly, in the SGEMM part, one SPE applies a row of tiles A and a row of tiles B to tile C, followed by the STRSM operation (triangular solve) on tile C using the diagonal, triangular tile T. Tile C only needs to be read in at the beginning of this step and written back at the end. Tile T only needs to be read in right before the triangular solve. The only transfers taking place in between are reads of tiles A and B. Such work partitioning approximately halves the load on the memory system.

It may also be noted that 1D processing is a natural match for the left-looking factorization. In the right-looking factorization the update to the trailing submatrix can easily be partitioned in two dimensions. However, in the case of the left-looking factorization, 2D partitioning would not be feasible due to the write dependency on the panel blocks (tiles T and C).

1D partitioning introduces a load balancing problem. With work being distributed by block rows, in each step of the factorization a number of processors is going to be idle, equal to the number of block rows factorized in a particular step modulo the number of processors. Figure 2 shows three consecutive steps of a factorization and the processors being occupied and idle in these steps. Such behavior is going to put a harsh upper bound on achievable performance.

It can be observed, however, that at each step of the factorization a substantial amount of work can be scheduled, to the otherwise idle processors, from the upcoming steps of the factorization. The only operations which cannot be scheduled at a given point in time are those that involve panels that have not been factorized yet. This situation is illustrated in Figure 3. Of course, this kind of processing requires dependency tracking in two dimensions, but since all operations proceed at the granularity of tiles, this does not pose a problem.

Figure 2: Load imbalance caused by 1D processing.

Figure 3: Pipelining of factorization steps.

In the implementation presented here all SPEs follow the static schedule presented in Figure 3, with cyclic distribution of work from the steps of the factorization. In this case a static schedule works well, due to the fact that the performance of the SPEs is very deterministic (unaffected by any non-deterministic phenomena, like cache misses). This way the potential bottleneck of a centralized scheduling mechanism is avoided.

Figure 4 presents the execution chart (Gantt chart) of the factorization of a 1024×1024 matrix. Colors correspond to the ones in Figure 1. The two shades of gray distinguish the SGEMM operation in odd and even steps of the factorization. The yellow color marks the barrier operation, which corresponds to the load imbalance of the algorithm.

Figure 4: Execution chart of Cholesky factorization of a matrix of size 1024×1024. The color scheme follows the one from Figure 1. Different shades distinguish odd and even steps of the factorization.

It can be observed that the load imbalance is minimal (the yellow region), dependency stalls are minimal (the white gaps), and communication and computation overlapping is very good (the colored blocks represent purely computational blocks).

4.2.3 Synchronization

With the SPEs following a static schedule, synchronization is required, such that an SPE does not proceed if the data dependencies for an operation are not satisfied.

The following dependencies have to be tracked: the SSYRK and SGEMM operations cannot use as input tiles A and B that have not been factorized yet. The off-diagonal tiles A and B are factorized if the STRSM operation has completed on these tiles. In turn, the STRSM operation cannot use as input a diagonal tile T that has not been factorized yet. A diagonal tile T is factorized if the SPOTRF operation has completed on this tile.

Dependency tracking is implemented by means of a replicated progress table. The progress table is a 2D triangular array with each entry representing a tile of the coefficient matrix and specifying whether the tile has been factorized or not, which means completion of the SPOTRF operation for diagonal tiles and of the STRSM operation for off-diagonal tiles.

By replication of the progress table, the potential bottleneck of a centralized progress tracking mechanism is avoided. Each SPE can check dependencies by testing an entry in its local copy of the progress table. The SPE completing the factorization of a tile updates the progress tables of all other SPEs, which is done by a DMA transfer and introduces no overhead due to the non-blocking nature of these transfers. Since the smallest amount of data subject to a DMA transfer is one byte, the progress table entries are one byte in size. These transfers consume an insignificant amount of the bandwidth of the EIB, and their latency is irrelevant to the algorithm.
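
A minimal sketch of this mechanism follows; the table layout, the helper names and the way the peers' local store addresses are obtained (here assumed to be distributed by the PPE at startup through the memory-mapped local stores) are assumptions, not the actual code.

    #include <spu_mfcio.h>

    #define NSPES     8
    #define MAX_TILES 64                       /* maximum matrix size 4096 / 64 */

    /* Local copy of the progress table: progress[i][j] != 0 means tile (i,j)
       has been factorized (SPOTRF for diagonal, STRSM for off-diagonal).
       A full square array is used here for simplicity. */
    volatile unsigned char progress[MAX_TILES][MAX_TILES];

    /* Effective addresses of the peers' copies of the table, assumed to be
       gathered at startup; both copies are assumed identically aligned so
       that a one-byte transfer is legal. */
    unsigned long long peer_progress_ea[NSPES];

    static void mark_tile_done(int i, int j, int my_rank, unsigned int tag)
    {
        progress[i][j] = 1;                            /* local update        */
        for (int r = 0; r < NSPES; r++) {              /* broadcast to peers  */
            if (r == my_rank) continue;
            mfc_put(&progress[i][j],
                    peer_progress_ea[r] + (unsigned long long)(i * MAX_TILES + j),
                    1, tag, 0, 0);                     /* non-blocking 1-byte put */
        }
        /* no wait: completion latency is irrelevant to the algorithm */
    }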

4.2.4 Communication

The most important feature of the communication is double-buffering of data. With eight tile buffers available, each operation involved in the factorization can be double-buffered independently.

Thanks to this fact, double-buffering is implemented not only between operations of the same type, but also between different operations. In other words, data is always prefetched for the upcoming operation, no matter what operation it is. In the absence of dependency stalls, the SPEs never wait for data, which results in big portions of the execution chart without any gaps between computational blocks (Figure 5).

Figure 5: Magnification of a portion of the execution chart from Figure 4.


Tiles are never exchanged internally between local stores, but are always read from main memory and written to main memory. An attempt to do otherwise could tie up buffers in the local store and prevent asynchronous operation of the SPEs. At the same time, with the work partitioning implemented here, the memory system provides enough bandwidth to fully overlap communication and computation.

Reads of tiles involve dependency checks. When it comes to prefetching a tile, the dependency is tested and a DMA transfer is initiated if the dependency is satisfied. The DMA completion is tested right before processing of the tile. If the dependency is not satisfied in time for the prefetch, the prefetch is abandoned in order not to stall the execution. Instead, right before processing of the tile, the SPE busy-waits for the dependency and then transfers the tile in a blocking way (initiates the transfer and immediately waits for its completion).
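
The pattern can be sketched as follows (a simplified illustration; dependency_satisfied and tile_ea are assumed helpers standing in for the progress table test and the address computation, and buffer management is omitted).

    #include <spu_mfcio.h>

    extern int dependency_satisfied(int i, int j);      /* progress table test    */
    extern unsigned long long tile_ea(int i, int j);    /* tile address in memory */

    #define TILE_BYTES (64 * 64 * sizeof(float))

    /* Attempt a non-blocking prefetch; returns 1 if the DMA was issued. */
    static int prefetch_tile(volatile float *buf, int i, int j, unsigned int tag)
    {
        if (!dependency_satisfied(i, j))
            return 0;                                   /* abandon the prefetch */
        mfc_get(buf, tile_ea(i, j), TILE_BYTES, tag, 0, 0);
        return 1;
    }

    /* Called right before the tile is processed. */
    static void wait_for_tile(volatile float *buf, int i, int j,
                              unsigned int tag, int prefetched)
    {
        if (!prefetched) {
            while (!dependency_satisfied(i, j))
                ;                                       /* busy-wait on the dependency */
            mfc_get(buf, tile_ea(i, j), TILE_BYTES, tag, 0, 0);
        }
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                      /* block until the transfer completes */
    }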

4.2.5 Performance

Figure 6 shows the performance of the single precision Cholesky factorization, calculated as the ratio of the number of floating point operations, taken as N³/3, to the execution time, where N is the size of the input matrix.

Table 3 gives numerical performance values for selected matrix sizes in Gflop/s, and also as ratios relative to the peak of the processor of 204.8 Gflop/s and the peak of the SGEMM kernel of 8 × 23.01 = 184.1 Gflop/s.

Size    Gflop/s   % CELL Peak   % SGEMM Peak
512     92        45            50
640     113       55            62
1024    151       74            82
1280    160       78            87
1536    165       80            89
1664    166       81            90
2176    170       83            92
4096    175       85            95

Table 3: Selected performance points of single precision Cholesky factorization.

Figure 6: Performance of single precision Cholesky factorization (SPOTRF) as a function of matrix size. The single precision peak, the SGEMM kernel peak and the double precision peak are marked for reference.

The factorization achieves 95% of the peak of the SGEMM kernel, which means that the overheads of data communication, synchronization and load imbalance are minimal, and at this point the only inefficiency comes from the suboptimal performance of the SGEMM kernel. Hopefully in the future the kernel will be fully optimized, perhaps using hand-coded assembly.

4.3 Refinement

The two most expensive operations of the refinement are the back solve (Algorithm 1, steps 3 and 8) and the residual calculation (Algorithm 1, step 6).

The back solve consists of two triangular solves involving the entire coefficient matrix and a single right-hand side (BLAS STRSV operation). The residual calculation is a double precision matrix-vector product using a symmetric matrix (BLAS DSYMV operation).

Both operations are BLAS Level 2 operations and on most processors would be strictly memory-bound. The STRSV operation actually is a perfect example of a strictly memory-bound operation on the CELL processor. The DSYMV operation, however, is on the borderline of being memory-bound versus compute-bound, due to the very high speed of the memory system versus the relatively low performance of double precision arithmetic.

4.3.1 Triangular Solve

The triangular solve is a perfect example of a memory-bound workload, where the memory access rate sets the upper limit on achievable performance. The STRSV operation performs approximately two floating point operations per data element of four bytes, which means that the peak memory bandwidth of 25.6 GB/s allows for at most

    25.6 GB/s × (2 ops / 4 bytes) = 12.8 Gflop/s,

which is only 1/16, or 6.25%, of the single precision floating point peak of the processor. Owing to this fact, the goal in implementing memory-bound operations is to get close to the peak memory bandwidth, unlike for compute-bound operations, where the goal is to get close to the floating point peak. This task should be readily achievable, given that a single SPE possesses as much bandwidth as the main memory.

Practice shows, however, that a single SPE is not capable of generating enough traffic to fully exploit the bandwidth, and a few SPEs solving the problem in parallel should be used. An efficient parallel implementation of the STRSV operation has to pursue two goals: continuous generation of traffic in order to saturate the memory system, and aggressive pursuit of the algorithmic critical path in order to avoid dependency stalls. A related question is that of the desired level of parallelism - the optimal number of processing elements used. Since the triangular solve is rich in dependencies, increasing the number of SPEs increases the number of execution stalls caused by interprocessor dependencies. Obviously, there is a crossover point, a sweet spot, for the number of SPEs used for the operation.

As with the other routines in the code, the STRSV operation processes the coefficient matrix by 64×64 tiles. A triangular solve is performed on the diagonal tiles and a matrix-vector product (SGEMV equivalent) is performed on the off-diagonal tiles. Processing of the diagonal tiles constitutes the critical path of the algorithm. One SPE is solely devoted to processing of the diagonal tiles, while the goal of the others is to saturate the memory system with processing of the off-diagonal tiles. The number of SPEs used to process the off-diagonal tiles is a function of a few parameters. The efficiency of the computational kernels used is one of the factors. In this case the number four turned out to deliver the best results, with one SPE pursuing the critical path and three others fulfilling the task of memory saturation. Figure 7 presents the distribution of work in the triangular solve routines.

Figure 7: Distribution of work in the triangular solve routine.

The solve is done in place. The unknown/solution vector is read in its entirety by each SPE into its local store at the beginning and returned to the main memory at the end. As the computation proceeds, updated pieces of the vector are exchanged internally by means of direct, local store to local store, communication. At the end SPE 0 possesses the full solution vector and writes it back to the main memory. Synchronization is implemented analogously to the synchronization in the factorization, and is based on the replicated triangular progress table (the same data structure is reused).

Figure 8 shows the performance, in terms of GB/s, of the two triangular solve routines required in the solution/refinement step of the algorithm. The two routines perform slightly differently due to the different performance of their computational kernels. The figure shows clearly that the goal of saturating the memory system is achieved quite well. Performance as high as 23.77 GB/s is obtained, which is 93% of the peak.


Figure 8: Performance of the triangular solve routines as a function of matrix size (GB/s). The main memory peak bandwidth is marked for reference.

4.3.2 Matrix-Vector Product

For most hardware platforms the matrix-vector product would be a memory-bound operation, the same as the triangular solve. On the CELL processor, however, due to the relative slowness of the double precision arithmetic, the operation is on the border of being memory-bound and compute-bound, and even with the use of all eight SPEs, saturation of the memory is harder than in the case of the STRSV routine.

The DSYMV routine also operates on tiles. Here, however, the double precision representation of the coefficient matrix is used, with a tile size of 32×32, such that an entire tile can also be transferred with a single DMA request. The input vector is only read in once, at the beginning, in its entirety, and the output vector is written back after the computation is completed. Since the input matrix is symmetric, only the lower portion is accessed, and implicit transposition is used to reflect the effect of the upper portion - each tile is read in only once, but applied to the output vector twice (with and without transposition).
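
The implicit transposition can be illustrated with a scalar sketch (row-major 32×32 tiles assumed; this is not the vectorized SPE code): an off-diagonal tile of the lower triangle is read once and applied to the output twice, once directly and once transposed. In the residual calculation the resulting products enter with a negative sign (r = b − A x).

    enum { NB32 = 32 };   /* double precision tile size used by DSYMV */

    /* Apply one off-diagonal 32x32 tile of the lower triangle, located at
       block row i and block column j (i > j), to the output vector:
         y_i += A_tile   * x_j   (tile applied directly)
         y_j += A_tile^T * x_i   (tile applied transposed)
       x_i, x_j, y_i, y_j are the corresponding 32-element vector pieces. */
    static void dsymv_apply_tile(const double A_tile[NB32][NB32],
                                 const double x_i[NB32], const double x_j[NB32],
                                 double y_i[NB32], double y_j[NB32])
    {
        for (int r = 0; r < NB32; r++)
            for (int c = 0; c < NB32; c++) {
                y_i[r] += A_tile[r][c] * x_j[c];
                y_j[c] += A_tile[r][c] * x_i[r];
            }
    }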

Since load balance is a very important aspect of the implementation, work is split among the SPEs very evenly by applying the partitioning shown in Figure 9. Such a work distribution causes multiple write dependencies on the output vector, and in order to let each SPE proceed without stalls, the output vector is replicated on each SPE and followed by a reduction step. The reduction is also performed in parallel by all SPEs and introduces a very small overhead, compensated by the benefits of very good load balance.

Figure 9: Distribution of work in the matrix-vector product routine.

Figure 10 shows the performance, in terms of GB/s, of the double precision symmetric matrix-vector product routine. A performance of 20.93 GB/s is achieved, which is 82% of the peak. Although the DSYMV routine represents a Level 2 BLAS operation and is parallelized among all SPEs, it is still compute bound. Perhaps its computational components could be further optimized. Nevertheless, at this point the delivered level of performance is considered satisfactory.

Figure 10: Performance of the matrix-vector product routine as a function of matrix size (GB/s). The main memory peak bandwidth is marked for reference.

5 Limitations

The implementation presented here should be considered a proof of concept prototype with the purpose of establishing an upper bound on the performance achievable for mixed-precision dense linear algebra algorithms on the CELL processor. As such, it has a number of limitations. Only systems of sizes being multiples of 64 are accepted, and only systems with a single right hand side are supported. There are no tests for overflow during the conversions from double to single precision. There is no test for non-definiteness during the single precision factorization step. The current implementation is wasteful in its use of the main memory. The entire coefficient matrix is stored explicitly, without taking advantage of its symmetry, both in the single precision representation and the double precision representation. The maximum size of the coefficient matrix is set to 4096, which makes it possible to fit the progress table in each local store and also makes it possible to fit the entire unknown/solution vector in each local store, which facilitates internal, local store to local store, communication and is very beneficial for performance.

6 Results and Discussion

Figure 11 compares the performance of the single precision factorization (SPOTRF), the solution of the system in single precision (SPOSV), and the solution of the system in double precision using factorization in single precision and iterative refinement to double precision (DSPOSV). These results were obtained on an IBM CELL blade using one of the two available CELL processors. Huge memory pages (16 MB) were used for improved performance. The performance is calculated as the ratio of the number of floating point operations, set in all cases to N³/3, to the execution time.

Figure 11: Performance of the mixed-precision algorithm versus the single precision algorithm on the IBM CELL blade system. Gflop/s as a function of matrix size for SPOTRF, SPOSV and DSPOSV; the single precision peak, the SGEMM kernel peak and the double precision peak are marked for reference.

In all cases well conditioned input matrices were used, resulting in two steps of refinement delivering accuracy equal to or higher than the one delivered by the purely double precision algorithm.

At the maximum size of 4096 the factorization achieves 175 Gflop/s and the system solution runs at the relative speed of 171 Gflop/s. At the same time, the solution of the system in double precision using the refinement technique delivers a relative performance of 156 Gflop/s, which is an overhead of less than 9% compared to the solution of the system in single precision. It can also be pointed out that by using the mixed-precision algorithm, double precision results are delivered at a speed over 10 times greater than the double precision peak of the processor.

Figure 12 shows results obtained on the Sony PlayStation 3, using the six available SPEs and the 256 MB of available memory (1), also allocated using huge pages (16 MB). For the maximum problem size of 2048, a performance of 127 Gflop/s was achieved for the factorization, 122 Gflop/s for the single precision solution, and 104 Gflop/s for the double precision solution. In this case the double precision solution comes at the cost of roughly 15% overhead compared to the single precision solution.

(1) Only approximately 200 MB are available to the application.

Figure 12: Performance of the mixed-precision algorithm versus the single precision algorithm on the Sony PlayStation 3. Gflop/s as a function of matrix size for SPOTRF, SPOSV and DSPOSV; the single precision peak, the SGEMM kernel peak and the double precision peak are marked for reference.

7 Conclusions

The CELL Broadband Engine has a very high potential for dense linear algebra workloads, offering a very high peak floating point performance and a capability to deliver performance close to the peak for quite small problems. The same applies to the memory system of the CELL processor, which allows data transfer rates very close to the peak bandwidth to be achieved for memory-intensive workloads.

Although the double precision performance of the CELL processor is much lower than the single precision performance, mixed-precision algorithms make it possible to exploit the single precision speed while achieving full double precision accuracy.

8 Future Plans

One of the main considerations for the future is the application of the pipelined processing techniques to factorizations where the panel does not easily split into independent operations, like the factorizations where pivoting is used.

Another important question is that of replacing the static scheduling of operations with dynamic scheduling by an automated system, and also the impact of such mechanisms on programming ease and productivity.

9 Acknowledgements

The authors thank Gary Rancourt and Kirk Jordan at IBM for taking care of our hardware needs and arranging for partial financial support for this work. The authors are thankful to numerous IBM researchers for generously sharing their CELL expertise, in particular Sidney Manning, Daniel Brokenshire, Mike Kistler, Gordon Fossum, Thomas Chen and Michael Perrone.

References

[1] H. P. Hofstee. Power efficient processor architecture and the Cell processor. In Proceedings of the 11th Int'l Symposium on High-Performance Computer Architecture, 2005.

[2] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. Res. & Dev., 49(4/5):589–604, 2005.

[3] IBM. Cell Broadband Engine Architecture, Version 1.0, August 2005.

[4] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice-Hall, 1963.

[5] C. B. Moler. Iterative refinement in floating point. J. ACM, 14(2):316–321, 1967.

[6] G. W. Stewart. Introduction to Matrix Computations. Academic Press, 1973.

[7] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 1996.

[8] J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. J. Dongarra. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.

[9] IBM. Cell Broadband Engine Programming Handbook, Version 1.0, April 2006.

[10] IBM. Cell Broadband Engine Programming Tutorial, Version 2.0, December 2006.

[11] A. Buttari, P. Luszczek, J. Kurzak, J. J. Dongarra, and G. Bosilca. A rough guide to scientific computing on the PlayStation 3, version 1.0. Technical Report UT-CS-07-595, Computer Science Department, University of Tennessee, 2007. http://www.cs.utk.edu/library/TechReports/2007/ut-cs-07-595.pdf.

[12] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Numerical Linear Algebra for High-Performance Computers. SIAM, 1998.

[13] J. Kurzak and J. J. Dongarra. Implementing linear algebra routines on multi-core processors with pipelining and a look-ahead. In Proceedings of the 2006 Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA'06), Umea, Sweden, 2006. Lecture Notes in Computer Science series, Springer, 2007 (to appear).

[14] J. Kurzak and J. J. Dongarra. Implementation of mixed precision in solving systems of linear equations on the CELL processor. Concurrency Computat.: Pract. Exper., 2007. In press, DOI: 10.1002/cpe.1164.

[15] T. Chen, R. Raghavan, J. Dale, and E. Iwata. Cell Broadband Engine architecture and its first implementation, a performance view. http://www-128.ibm.com/developerworks/power/library/pa-cellperf/, November 2005.
