Special Course on Computer Architecture 2008

Toolkit ver1.0

This is based on “Cell Speed Challenge 2008” in the IPSJ workshop SACSIS (Symposium on Advanced Computing Systems and Infrastructures).


About this document

• Brief information about toolkit ver. 1.0
  – Tools for solving simultaneous linear equations with multiple SPEs
  – Please refer to the implementation guide of your homework
• This document explains the algorithm used to solve simultaneous linear equations

Summary of homework

• Your task is to write a parallel program for solving simultaneous linear equations.
• You will compete on the performance of the program, which obtains the solution vector matrix x from the constant matrix A and the right-hand vector matrix b.
• A is a matrix of N×N elements (each element is a float: 4 Byte)
• x and b are matrices of M×N elements
• Contact: [email protected]


b=Ax

Toolkit ver1.0

• Solves simultaneous linear equations with multiple SPEs
  – The number of SPEs can be modified. The default is 6 (maximum).
• Limitation
  – Sizes of matrices MUST BE a multiple of 32 (N=32n)
• Implement your program by modifying function spe_soleqs() in spe1.c
• Other modifications will be ignored in evaluation
• You can freely implement anything inside spe_soleqs().


Initial data distribution 1/2

• Matrices A, b, and x are distributed in main memory as follows:

buf
• The head address of the working memory that is available to users
• It must be aligned to 128 Byte
• Constant matrix A (N×N×4 Byte) is stored at buf.

b = buf+N×N×sizeof(float)
• The head address of the region that stores right-hand vector b (M×N×4 Byte).
• Notice: elements are ordered in the column direction (data are stored in the order (0,0),…,(0,N-1),(1,0),…,(1,N-1),…,(M-1,N-1))

x = buf+(N×N+M×N)×sizeof(float)
• The head address of solution vector x (M×N×4 Byte)
• Ordering of data is the same as b

[Figure: main memory holding A, b, x, and the blank region, accessed from the SPEs]
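The address arithmetic described above can be written down as a small sketch. The type region_offsets and the functions matrix_offsets and vec_elem_offset are hypothetical helpers introduced here for illustration; they are not part of the toolkit. The element offset applies only to b and x, whose ordering is stated in the text.

```c
#include <stddef.h>

/* Byte offsets of A, b, and x relative to buf.
   Hypothetical helper type, not part of the toolkit. */
typedef struct {
    size_t a_off;   /* A starts at buf itself                  */
    size_t b_off;   /* b = buf + N*N*sizeof(float)             */
    size_t x_off;   /* x = buf + (N*N + M*N)*sizeof(float)     */
} region_offsets;

static region_offsets matrix_offsets(size_t n, size_t m)
{
    region_offsets r;
    r.a_off = 0;
    r.b_off = n * n * sizeof(float);
    r.x_off = (n * n + m * n) * sizeof(float);
    return r;
}

/* Byte offset of component k (0..N-1) of vector v (0..M-1) inside b or x,
   which are stored as (0,0),...,(0,N-1),(1,0),...,(M-1,N-1). */
static size_t vec_elem_offset(size_t v, size_t k, size_t n)
{
    return (v * n + k) * sizeof(float);
}
```

For example, with N=32 and M=2, b would begin 4096 Byte after buf on a platform where sizeof(float) is 4.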


Initial data distribution 2/2

Blank region
• Head address: buf+N×(N+2M)×sizeof(float)
• It is allocated in the PPE program; you can use this region.
• Its size is the same as the total size of the matrices, N×(N+2M)×sizeof(float)

Mapped addresses for transferring between SPEs: ls_addr[5]
• Physical memory is not allocated
• Each of ls_addr[0] to ls_addr[4] is 256 KB
• You can transfer data to the local store of each SPE by accessing these regions directly.

• Memory allocation is kept below 80 MB
• Thus, the total size of matrices A, b, and x is guaranteed to be less than half of the allocated memory
• N is a multiple of 32

[Figure: main memory holding A, b, x, and the blank region, with the local stores of the SPEs mapped into the address space]
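A minimal sketch of the blank-region address and the size limits stated above. The function names are hypothetical, and two points are assumptions: that "80 MB" means 80×1024×1024 Byte, and that the allocated memory consists of the matrices plus an equally sized blank region.

```c
#include <stddef.h>

/* Total size in bytes of A (N*N), b (M*N), and x (M*N). */
static size_t matrices_bytes(size_t n, size_t m)
{
    return n * (n + 2 * m) * sizeof(float);
}

/* The blank region starts right after x, at buf + N*(N+2M)*sizeof(float). */
static size_t blank_offset(size_t n, size_t m)
{
    return matrices_bytes(n, m);
}

/* Returns 1 if N is a multiple of 32 and the matrices plus the blank
   region stay below the limit. ASSUMPTION: 80 MB = 80 * 1024 * 1024 Byte. */
static int fits_allocation(size_t n, size_t m)
{
    const size_t limit = (size_t)80 * 1024 * 1024;
    return n % 32 == 0 && 2 * matrices_bytes(n, m) < limit;
}
```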


Ordering elements in matrices

[Figure: ordering of the elements of matrix A, starting at buf (with address labels buf+4 and buf+N×4), and of matrix b, starting at buf+N×N×sizeof(float)]

• Notice: the distribution of elements is not the same between matrix A and the others


The algorithm adopted by toolkit

1. LU decomposition
  • pivoting
2. Forward substitution
3. Backward substitution

• Pivoting is always performed, regardless of the form and size of the matrices
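As a serial reference for the three stages listed above, here is a minimal sketch for a single right-hand vector. It is not the toolkit code: the function name solve_lu is hypothetical, the matrix is assumed to be stored row by row (a[r*n+c]), and the pivot is chosen by absolute value, which is the usual convention for partial pivoting.

```c
#include <stddef.h>

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Solves A x = b in place. On success b holds x and a holds the LU factors.
   ASSUMPTION: a is row-major, a[r*n + c]. Returns -1 if a pivot is zero. */
static int solve_lu(float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* 1a. Pivot selection: row with the largest |a[r][i]|, r >= i. */
        size_t maxj = i;
        for (size_t r = i + 1; r < n; r++)
            if (absf(a[r * n + i]) > absf(a[maxj * n + i]))
                maxj = r;
        if (a[maxj * n + i] == 0.0f)
            return -1;

        /* 1b. Swap row i and row maxj in A, and the same entries of b. */
        if (maxj != i) {
            for (size_t c = 0; c < n; c++) {
                float t = a[i * n + c];
                a[i * n + c] = a[maxj * n + c];
                a[maxj * n + c] = t;
            }
            float t = b[i]; b[i] = b[maxj]; b[maxj] = t;
        }

        /* 1c. Right-looking update of the remaining partial matrix. */
        for (size_t r = i + 1; r < n; r++) {
            float t1 = a[r * n + i] / a[i * n + i];
            a[r * n + i] = t1;                      /* entry of L */
            for (size_t c = i + 1; c < n; c++)
                a[r * n + c] -= t1 * a[i * n + c];
        }
    }

    /* 2. Forward substitution: L y = b (L has a unit diagonal). */
    for (size_t r = 1; r < n; r++)
        for (size_t c = 0; c < r; c++)
            b[r] -= a[r * n + c] * b[c];

    /* 3. Backward substitution: U x = y. */
    for (size_t r = n; r-- > 0; ) {
        for (size_t c = r + 1; c < n; c++)
            b[r] -= a[r * n + c] * b[c];
        b[r] /= a[r * n + r];
    }
    return 0;
}
```

A serial version like this can serve as a correctness check for a parallel implementation on small inputs.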


Subroutines for DMA transfer (1/2)

• Functions for DMA transfer
  – dmaget, dmaput: subroutines for DMA transfer in the toolkit

• void dmaget_burst(unsigned int ppe_addr, unsigned int spe_addr, unsigned int row, unsigned int col, unsigned int n)
  Reads 128 Byte at element (col, row) of the n×n matrix whose head address in main memory is ppe_addr (each element is a float), and stores the data in the LocalStore at head address spe_addr. The transfer address is aligned to 128 Byte, so a given element can be fetched with *(float*)(spe_addr+row%32*sizeof(float)).

[Figure: element (col, row) of the n×n matrix at ppe_addr in PPE main memory is transferred, from an address aligned to 128 Byte, to spe_addr in the SPE LocalStore]

• Please pay attention to how the location in the matrix is identified.
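The row%32 indexing can be illustrated with a host-side model. This is not the toolkit function: model_dmaget_burst and burst_element are hypothetical names, plain memcpy stands in for DMA, and the model assumes that element (col, row) sits at index col*n+row, which is an inference from the row%32 expression rather than something the slides state.

```c
#include <stddef.h>
#include <string.h>

#define BURST_FLOATS 32   /* 128 Byte / sizeof(float) */

/* Host-side model of a 128 Byte burst read (NOT the toolkit's dmaget_burst).
   Copies the aligned block of 32 floats that contains element (col, row).
   ASSUMPTION: element (col, row) is stored at matrix[col * n + row]. */
static void model_dmaget_burst(const float *matrix, float *ls,
                               unsigned row, unsigned col, unsigned n)
{
    size_t block_start = (size_t)col * n + (row - row % BURST_FLOATS);
    memcpy(ls, matrix + block_start, BURST_FLOATS * sizeof(float));
}

/* Picks the requested element out of the 32-float buffer. */
static float burst_element(const float *ls, unsigned row)
{
    return ls[row % BURST_FLOATS];
}
```

Because n is a multiple of 32 and buf is aligned to 128 Byte, every block start computed this way falls on a 128 Byte boundary.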


• float dmaget_value(unsigned int addr, unsigned int row, unsigned int col, unsigned int n)
  Reads one element from the n×n matrix, whose coordinates are given by (col, row). The matrix is stored in main memory, and its beginning address is addr.

• void dmaput_value(unsigned int addr, unsigned int row, unsigned int col, unsigned int n, float value)
  Writes value to the element of the n×n matrix whose coordinates are given by (col, row). The matrix is stored in main memory, and its beginning address is addr. Note that the value is NOT synchronized among SPEs.

[Figure: dmaget_value and dmaput_value transfer element (col, row) of the n×n matrix at addr between PPE main memory and an SPE]


Subroutines for DMA transfer (2/2)

Synchronization
• Subroutines for synchronization
• SPE0 is in charge of DMA synchronization among SPEs.
• void sync(UINT32 id,                     // ID number of the SPE
            UINT32* ppe_ls,                // array of the main-memory addresses to which the LocalStore of each SPE is mapped
            volatile struct spe_sync* sd,  // array with local addresses in the SPE
            UINT32 key)                    // a key used for synchronization

In function sync …
• SPE0 writes the value key to the variable start_flag of the other SPEs, whose address is given by sp. The SPEs other than SPE0 start their calculation once start_flag == key holds.
• SPE1 to SPE5 write the value key to SPE0’s variables (sd[id].end_flag). SPE0 treats the calculation of SPE1 to SPE5 as finished once end_flag == key holds for all of them.
• Users can set any value as the key, but be aware of conflicts with other sync calls.
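The flag handshake described above can be modelled on a single host thread as follows. This is only a sketch of the protocol: struct sync_model is a hypothetical stand-in for struct spe_sync (whose real layout is not shown here), all flags are kept in one array instead of being spread across local stores, and no DMA is involved.

```c
#include <stdint.h>

#define NUM_SPE 6

/* Hypothetical stand-in for struct spe_sync: one entry per SPE. */
struct sync_model {
    volatile uint32_t start_flag;  /* written by SPE0, read by the worker */
    volatile uint32_t end_flag;    /* written by the worker, read by SPE0 */
};

/* SPE0 side: release SPE1..SPE5 by writing key to their start_flag. */
static void release_workers(struct sync_model *sd, uint32_t key)
{
    for (uint32_t id = 1; id < NUM_SPE; id++)
        sd[id].start_flag = key;
}

/* Worker side: the calculation may start once start_flag equals key. */
static int may_start(const struct sync_model *sd, uint32_t id, uint32_t key)
{
    return sd[id].start_flag == key;
}

/* Worker side: report completion by writing key to its end_flag. */
static void report_done(struct sync_model *sd, uint32_t id, uint32_t key)
{
    sd[id].end_flag = key;
}

/* SPE0 side: true once every worker has reported with this key. */
static int all_done(const struct sync_model *sd, uint32_t key)
{
    for (uint32_t id = 1; id < NUM_SPE; id++)
        if (sd[id].end_flag != key)
            return 0;
    return 1;
}
```

Using a different key for each synchronization point keeps a stale flag from an earlier phase from being mistaken for the current one, which is the conflict the last bullet warns about.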


LU decomposition

• The following procedures are repeated N times (i = 0 to N-1):
1. Pivot selection (selection of the row with the largest element)
2. Row swapping
3. LU decomposition (right-looking method)

Partial matrices of the n×n matrix:
  i=0     N×N
  i=1     (N-1)×(N-1)
  i=2     (N-2)×(N-2)
  i=3     (N-3)×(N-3)
  …
  i=N-1   1×1


1. Pivot selection

• Pivoting function: searches for the row with the maximum i-th value
• Parallel task using 6 SPEs
  – Each SPE reads the i-th value of each of its rows (using the “dmaget_value” function)
  – It reports the row number maxj with the maximum value to SPE0 (using the “sync_collect” functions)
  – SPE0 selects the row with the maximum value among all the SPEs.

[Figure: rows of the (n-i)×(n-i) partial matrix assigned to SPE0 through SPE5. Each SPE handles a row if (row number)%6 is equal to its own ID, finds the row with the maximum i-th value, and reports the row number to SPE0.]
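The two-level search described above (a local maximum per SPE, then a final choice by SPE0) can be simulated serially. The names local_pivot and select_pivot are hypothetical; the sketch assumes row-by-row storage (a[row*n+col]) and compares absolute values, and a plain loop over IDs replaces the real dmaget_value and sync_collect calls.

```c
#include <stddef.h>

#define NUM_SPE 6

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Work of one SPE: among rows i..n-1 with row % NUM_SPE == id, return the
   row whose i-th value has the largest magnitude, or n if it owns no row.
   ASSUMPTION: a is row-major, a[row * n + col]. */
static size_t local_pivot(const float *a, size_t n, size_t i, unsigned id)
{
    size_t best = n;
    for (size_t row = i; row < n; row++) {
        if (row % NUM_SPE != id)
            continue;
        if (best == n || absf(a[row * n + i]) > absf(a[best * n + i]))
            best = row;
    }
    return best;
}

/* Work of SPE0: pick the best candidate among those reported by all SPEs. */
static size_t select_pivot(const float *a, size_t n, size_t i)
{
    size_t maxj = i;
    for (unsigned id = 0; id < NUM_SPE; id++) {
        size_t cand = local_pivot(a, n, i, id);
        if (cand < n && absf(a[cand * n + i]) > absf(a[maxj * n + i]))
            maxj = cand;
    }
    return maxj;
}
```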


2. Swapping of rows & columns

• “swap_row” function
  – Each SPE swaps the rows indicated in the arguments
  – Swaps the i-th row of matrix A and the “maxj” row
  – 32 elements are swapped at once (dmaget, dmaput)

[Figure: the i-th row and the “maxj” row of the n×n matrix, divided among SPE0 through SPE5]

• “swap_col” function
  – Swaps the i-th column of matrix b and the “maxj” column
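Swapping 32 elements at a time can be sketched on the host as below. The function swap_rows_chunked is a hypothetical name, memcpy stands in for the dmaget and dmaput transfers, and the matrix is assumed to be stored row by row. The distribution of chunks across SPEs is left out.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 32   /* floats per transfer: 32 * 4 Byte = 128 Byte */

/* Swaps row i and row maxj of an n x n matrix, 32 floats at a time.
   ASSUMPTIONS: a is row-major (a[row * n + col]); n is a multiple of 32. */
static void swap_rows_chunked(float *a, size_t n, size_t i, size_t maxj)
{
    float tmp_i[CHUNK], tmp_j[CHUNK];
    if (i == maxj)
        return;
    for (size_t c = 0; c < n; c += CHUNK) {
        /* Fetch one chunk of each row (stands in for dmaget). */
        memcpy(tmp_i, &a[i * n + c], sizeof tmp_i);
        memcpy(tmp_j, &a[maxj * n + c], sizeof tmp_j);
        /* Write them back crosswise (stands in for dmaput). */
        memcpy(&a[i * n + c], tmp_j, sizeof tmp_j);
        memcpy(&a[maxj * n + c], tmp_i, sizeof tmp_i);
    }
}
```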


3. LU decomposition with Right Looking Method

• lu_decomposition
  – Allots the partial matrix to multiple SPEs in units of rows.
  – Same procedure as pivot selection

1. The element (R1, R2) is stored in the variable diag (dmaget_value).
2. The elements of row R1 are stored in buf2, beginning from the second element of row R1 (for SPE0 to SPE5).
3. The value t1, the quotient diag/element of (i, row), is written back.
4. The elements of the i-th row are stored in buf2, beginning from the second element of the i-th row.
5. buf1 - buf2×t1 is calculated for each element and written back to the i-th row.
6. Repeat procedures 2, 4, and 5 until the last row is reached. (Use buf3 when needed.)

[Figure: rows of the partial matrix assigned to SPE0 through SPE5, with buffers buf1, buf2, and buf3 each holding a row]

Procedures 1-3 (pivot selection, swapping, and LU decomposition) must be repeated N times to decompose matrix A.
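One right-looking elimination step with rows divided among SPEs can be sketched serially as follows. The names eliminate_rows and lu_step are hypothetical, the matrix is assumed to be stored row by row, and the buffers buf1 to buf3 and the DMA calls are omitted. The multiplier t1 is computed here as the row's i-th element divided by diag, the usual convention for LU decomposition; that is an assumption about what the step list intends.

```c
#include <stddef.h>

#define NUM_SPE 6

/* Work of SPE `id` in step i: update every row below the pivot row that
   satisfies row % NUM_SPE == id.
   ASSUMPTION: a is row-major, a[row * n + col]; pivoting is already done. */
static void eliminate_rows(float *a, size_t n, size_t i, unsigned id)
{
    float diag = a[i * n + i];
    for (size_t row = i + 1; row < n; row++) {
        if (row % NUM_SPE != id)
            continue;
        float t1 = a[row * n + i] / diag;   /* multiplier for this row   */
        a[row * n + i] = t1;                /* written back as part of L */
        for (size_t c = i + 1; c < n; c++)
            a[row * n + c] -= t1 * a[i * n + c];
    }
}

/* Serial stand-in for one parallel step: run the share of each SPE in turn. */
static void lu_step(float *a, size_t n, size_t i)
{
    for (unsigned id = 0; id < NUM_SPE; id++)
        eliminate_rows(a, n, i, id);
}
```

Each row is updated independently of the other rows in the same step, which is why the rows can be handed to different SPEs without further coordination inside a step.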


Forward and backward substitution

• forward_substitution & backward_substitution functions
  – Refer to the source code for details
• Each SPE calculates in units of solution vectors
• When the number of solution vectors is less than 6, some SPEs may not have any work in these functions
• forward_substitution uses the “blank region” to store intermediate data
• The result of backward substitution is written to x in main memory


