Special Course on Computer Architecture 2008
Toolkit ver1.0
Special Course on Computer Architecture
This is based on the “Cell Speed Challenge 2008” held at the IPSJ workshop SACSIS (Symposium on Advanced Computing Systems and Infrastructures)
About this document
• Brief information about toolkit ver. 1.0
– Tools for solving simultaneous linear equations with multiple SPEs
• Please refer to the implementation guide for your homework
• This document explains the algorithm used to solve simultaneous linear equations
Summary of homework
• Your task is to write a parallel program for solving simultaneous linear equations.
• You will compete on the performance of a program that obtains the solution matrix x from the constant matrix A and the right-hand matrix b.
• A is a matrix of N×N elements (each element is a float: 4 bytes)
• x and b are matrices of M×N elements
• Contact : [email protected]
b=Ax
Toolkit ver1.0
• Solves simultaneous linear equations using multiple SPEs
– The number of SPEs can be modified. The default is 6 (the maximum).
• Limitation
– The sizes of the matrices MUST be multiples of 32 (N=32n)
• Implement your program by modifying the function spe_soleqs() in spe1.c
– Modifications made anywhere else are ignored in evaluation
– You may freely change any of the code inside spe_soleqs().
Initial data distribution 1/2
• Matrices A, b, and x are laid out in main memory as follows:
buf
• The head address of the working memory available to users
• It must be aligned to 128 bytes
• The constant matrix A (N×N×4 bytes) is stored at buf.
b = buf+N×N×sizeof(float)
• The head address of the region that stores the right-hand vectors b (M×N×4 bytes).
• Notice: elements are ordered in the column direction (data are stored in the order (0,0),…,(0,N-1),(1,0),…,(1,N-1),…,(M-1,N-1))
x = buf+(N×N+M×N)×sizeof(float)
• The head address of the solution vectors x (M×N×4 bytes)
• The ordering of the data is the same as for b
[Figure: main memory holding A, b, x, and the blank region, accessed by the SPEs]
Main memory
Blank region
• Head address: buf+N×(N+2M)×sizeof(float)
• It is allocated by the PPE program, and you can use this region freely.
• Its size is the same as the total size of the matrices, N×(N+2M)×sizeof(float)
Mapped addresses for transfers between SPEs: ls_addr[5]
• No physical memory is allocated for them
• Each of ls_addr[0] ~ ls_addr[4] is 256 KB
• You can transfer data to the local store of each SPE by accessing these regions directly.
• Memory allocation is kept below 80 MB
• Consequently, the total size of matrices A, b, and x is guaranteed to be less than half of the allocated memory
• N is a multiple of 32
[Figure: main memory holding A, b, and x, with the SPE local stores mapped at ls_addr]
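The address arithmetic on these two slides can be written down as a small helper. The following is only a sketch: the struct `region_offsets` and both function names are hypothetical and not part of the toolkit, and offsets are returned relative to buf rather than as real addresses.

```c
#include <stddef.h>

/* Byte offsets of each region relative to buf, following the slides.
   All names here are hypothetical; the toolkit does not define them. */
typedef struct {
    size_t a;      /* constant matrix A: N*N floats             */
    size_t b;      /* right-hand vectors b: M*N floats           */
    size_t x;      /* solution vectors x: M*N floats             */
    size_t blank;  /* blank region, same size as A + b + x       */
    size_t total;  /* bytes covered by matrices and blank region */
} region_offsets;

/* Fills *out for the given N and M.
   Returns 0 on success, -1 if N is not a positive multiple of 32. */
static int compute_offsets(size_t n, size_t m, region_offsets *out)
{
    if (n == 0 || n % 32 != 0)
        return -1;
    out->a = 0;
    out->b = n * n * sizeof(float);
    out->x = (n * n + m * n) * sizeof(float);
    out->blank = n * (n + 2 * m) * sizeof(float);
    out->total = 2 * out->blank;   /* blank region is as large as the matrices */
    return 0;
}

/* Byte offset of element k of vector v inside b or x, using the order
   (0,0),...,(0,N-1),(1,0),... given on the slide. */
static size_t vector_elem_offset(size_t n, size_t v, size_t k)
{
    return (v * n + k) * sizeof(float);
}
```

For N=32 and M=2, for example, b starts 4096 bytes after buf and the blank region starts 4608 bytes after buf. The ordering of elements inside A is deliberately left out, since it differs from that of b and x.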
Ordering elements in matrices
[Figure: element ordering of matrix A (N×N, starting at buf, with address labels buf+4 and buf+N×4) and of matrix b (M×N, starting at buf+N×N×sizeof(float))]
• Notice: the ordering of elements in matrix A is not the same as in the other matrices
The algorithm adopted by the toolkit
1. LU decomposition
• with pivoting
2. Forward substitution
3. Backward substitution
• Pivoting is always performed, regardless of the form and size of the matrices
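In matrix notation, the three steps can be summarized as follows. The symbols P (the row permutation produced by pivoting) and y (the intermediate vector) are introduced here only for illustration and do not appear in the toolkit. With M right-hand vectors, the two substitutions are applied to each vector separately.

```latex
PA = LU, \qquad
Ly = Pb \quad \text{(forward substitution)}, \qquad
Ux = y \quad \text{(backward substitution)}
```

Since Ax = b implies PAx = Pb, substituting PA = LU gives L(Ux) = Pb, which is why solving the two triangular systems in turn yields x.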
Subroutines for DMA transfer (1/2)
• Functions for DMA transfer
• dmaget, dmaput : subroutines for DMA transfer provided by the toolkit
• void dmaget_burst(unsigned int ppe_addr, unsigned int spe_addr, unsigned int row, unsigned int col, unsigned int n)
Reads 128 bytes starting from element (col, row) of the n×n matrix whose head address in main memory is ppe_addr (each element is a float) and stores the data in the LocalStore at head address spe_addr (a particular element can be fetched with *(float*)(spe_addr+row%32*sizeof(float)) )
[Figure: the 128-byte-aligned address containing element (col, row) of the n×n matrix at ppe_addr in PPE main memory is transferred to spe_addr in the SPE LocalStore]
• Please pay attention to how the location within the matrix is specified
• float dmaget_value(unsigned int addr, unsigned int row, unsigned int col, unsigned int n)
Reads one element, at coordinates (col, row), from an n×n matrix. The matrix is stored in main memory, and its head address is addr.
• void dmaput_value(unsigned int addr, unsigned int row, unsigned int col, unsigned int n, float value)
Writes value to the element at coordinates (col, row) of an n×n matrix. The matrix is stored in main memory, and its head address is addr. Note that the written value is NOT synchronized among SPEs.
[Figure: dmaget_value and dmaput_value move element (col, row) of the n×n matrix at addr between PPE main memory and an SPE]
Subroutines for DMA transfer (2/2)
Synchronization
• Subroutines for synchronization
• SPE0 is in charge of DMA synchronization among the SPEs.
• void sync(UINT32 id, // ID number of the SPE
UINT32* ppe_ls, // array of main-memory addresses to which the LocalStore of each SPE is mapped
volatile struct spe_sync* sd, // array of local addresses in the SPE
UINT32 key) // a key used for synchronization
In function sync …
• SPE0 writes the value key to the variable start_flag of each of the other SPEs, whose address is given by sp. The SPEs other than SPE0 start their calculation once start_flag == key holds.
• SPE1 ~ 5 each write the value key to SPE0’s variable sd[id].end_flag. SPE0 treats the calculation of SPE1 ~ 5 as finished once end_flag == key holds for all of them.
• Users can set the key to any value, but must take care that it does not conflict with the keys used by other sync calls.
LU decomposition
• The following procedure is repeated N times (i=0 ~ N-1)
1. Pivot selection (selection of the row with the largest element)
2. Row swapping
3. LU decomposition (right-looking method)
Partial matrices of the N×N matrix processed at each step:
i=0 N×N
i=1 (N-1)×(N-1)
i=2 (N-2)×(N-2)
i=3 (N-3)×(N-3)
i=N-1 1×1
1. Pivot selection
• Pivoting function: searches for the row with the maximum i-th value
• Parallel task using 6 SPEs
• Each SPE reads the i-th value of each of its rows (using the “dmaget_value” function)
• It reports the row number maxj with the maximum value to SPE0 (using the “sync_collect” functions)
• SPE0 selects the row with the maximum value among all the SPEs.
[Figure: (n-i)×(n-i) partial matrix with rows assigned to SPE0 ~ SPE5. Each SPE processes a row if (row number)%6 equals its own ID, finds the row with the maximum i-th value, and reports the row number to SPE0]
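The pivot search described above can be simulated on a single processor. This is a minimal sketch under several assumptions: the matrix is an ordinary row-major C array rather than main memory reached through dmaget_value, the comparison uses absolute values (the usual choice for partial pivoting, which the slide does not state), and the function names `local_pivot` and `select_pivot` are hypothetical.

```c
#include <stddef.h>

#define NUM_SPE 6  /* default number of SPEs in the toolkit */

/* Absolute value without pulling in the math library. */
static float absf(float v)
{
    return v < 0.0f ? -v : v;
}

/* Work of one SPE: scan rows i..n-1 with row % NUM_SPE == id and return
   the row whose i-th value has the largest magnitude.
   Returns -1 if this SPE owns no row in that range. */
static int local_pivot(const float *a, int n, int i, int id)
{
    int best = -1;
    float best_val = -1.0f;
    for (int row = i; row < n; row++) {
        if (row % NUM_SPE != id)
            continue;                          /* row belongs to another SPE */
        float v = absf(a[(size_t)row * n + i]);
        if (v > best_val) {
            best_val = v;
            best = row;
        }
    }
    return best;
}

/* Work of SPE0: choose the best among the candidates of all SPEs.
   In the toolkit the candidates would arrive through sync_collect. */
static int select_pivot(const float *a, int n, int i)
{
    int maxj = -1;
    float max_val = -1.0f;
    for (int id = 0; id < NUM_SPE; id++) {
        int cand = local_pivot(a, n, i, id);
        if (cand < 0)
            continue;                          /* this SPE had nothing to report */
        float v = absf(a[(size_t)cand * n + i]);
        if (v > max_val) {
            max_val = v;
            maxj = cand;
        }
    }
    return maxj;
}
```

Splitting the search into a per-SPE scan and a final reduction mirrors the two phases on the slide; only the reduction needs communication between SPEs.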
2. Swapping of rows & columns
• “swap_row” function
• Each SPE swaps the rows indicated in the arguments
• Swaps the i-th row of matrix A with the “maxj” row
• 32 elements are swapped at once (dmaget, dmaput)
• “swap_col” function
• Swaps the i-th column of matrix b with the “maxj” column
[Figure: the i-th row and the “maxj” row of the n×n matrix, divided among SPE0 ~ SPE5]
3. LU decomposition with the right-looking method
• lu_decomposition
• Allots the partial matrix to multiple SPEs in units of rows.
• Same assignment as in pivot selection
1. The element (R1, R2) is stored in the variable diag (dmaget_value)
2. The elements of row R1, beginning from the second element of the row, are stored in buf2 (for SPE0 ~ 5)
3. The value t1, the quotient diag/element (i, row), is written back
4. The elements of the i-th row, beginning from the second element of the row, are stored in buf2
5. buf1 - buf2×t1 is calculated for each element and written back to the i-th row
6. Procedures 2, 4, and 5 are repeated until the last row is reached (use buf3 when needed)
Procedures 1-3 must be repeated N times to decompose matrix A.
[Figure: rows of the partial matrix assigned to SPE0 ~ SPE5, with the buffers buf1, buf2, and buf3]
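For reference, the whole decomposition (pivot selection, row swapping, and the right-looking update) can be sketched serially as below. This is not the toolkit's lu_decomposition: it assumes a row-major C array, uses no DMA or buffers, and computes the multiplier in the conventional way as the row's leading element divided by the pivot, so the definition of t1 should be checked against the toolkit source. The function name `lu_decompose` and the array `perm` are hypothetical.

```c
#include <stddef.h>

/* In-place LU decomposition with partial pivoting (right-looking).
   a    : n x n matrix, row-major (an assumption of this sketch)
   perm : on return, perm[i] is the original index of the row now at i
   On success the strict lower triangle holds L (unit diagonal implied)
   and the upper triangle holds U.
   Returns 0 on success, -1 if a zero pivot is met (singular matrix). */
static int lu_decompose(float *a, int n, int *perm)
{
    for (int i = 0; i < n; i++)
        perm[i] = i;

    for (int i = 0; i < n; i++) {
        /* 1. Pivot selection: largest magnitude in column i, rows i..n-1 */
        int maxj = i;
        float max_val = a[(size_t)i * n + i];
        if (max_val < 0.0f) max_val = -max_val;
        for (int r = i + 1; r < n; r++) {
            float v = a[(size_t)r * n + i];
            if (v < 0.0f) v = -v;
            if (v > max_val) { max_val = v; maxj = r; }
        }
        if (max_val == 0.0f)
            return -1;

        /* 2. Row swapping: exchange row i and row maxj */
        if (maxj != i) {
            for (int c = 0; c < n; c++) {
                float tmp = a[(size_t)i * n + c];
                a[(size_t)i * n + c] = a[(size_t)maxj * n + c];
                a[(size_t)maxj * n + c] = tmp;
            }
            int t = perm[i]; perm[i] = perm[maxj]; perm[maxj] = t;
        }

        /* 3. Right-looking update of the trailing (n-i-1) x (n-i-1) block */
        float diag = a[(size_t)i * n + i];
        for (int r = i + 1; r < n; r++) {
            float t1 = a[(size_t)r * n + i] / diag;  /* multiplier */
            a[(size_t)r * n + i] = t1;               /* stored as entry of L */
            for (int c = i + 1; c < n; c++)
                a[(size_t)r * n + c] -= t1 * a[(size_t)i * n + c];
        }
    }
    return 0;
}
```

The loop over r in step 3 is the part that the slides distribute over the SPEs by rows, since each row of the trailing block is updated independently once the pivot row is known.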
Forward and backward substitution
• forward_substitution & backward_substitution functions
• Refer to the source code for details
• The work is divided among the SPEs by solution vector
• When the number of solution vectors is less than 6, some SPEs may have no work in these functions
• forward_substitution uses the “blank region” to store intermediate data
• The result of backward substitution is written to x in main memory
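The two substitutions can be sketched serially as follows. This is an illustration under assumptions, not the toolkit code: the factors are packed in one row-major array (unit lower triangle L below the diagonal, U on and above it), b has already been permuted in the same way as the rows of A, and the names `lu_solve`, `lu_solve_all`, and `scratch` are hypothetical. The scratch array plays the role of the intermediate data that the toolkit keeps in the blank region.

```c
#include <stddef.h>

/* Solves L*U*x = b for one right-hand vector of length n.
   lu : packed factors, row-major (L below the diagonal with an implied
        unit diagonal, U on and above the diagonal)
   y  : scratch vector of length n for the intermediate result */
static void lu_solve(const float *lu, int n, const float *b,
                     float *y, float *x)
{
    /* Forward substitution: solve L*y = b from the first row down */
    for (int r = 0; r < n; r++) {
        float s = b[r];
        for (int c = 0; c < r; c++)
            s -= lu[(size_t)r * n + c] * y[c];
        y[r] = s;
    }
    /* Backward substitution: solve U*x = y from the last row up */
    for (int r = n - 1; r >= 0; r--) {
        float s = y[r];
        for (int c = r + 1; c < n; c++)
            s -= lu[(size_t)r * n + c] * x[c];
        x[r] = s / lu[(size_t)r * n + r];
    }
}

/* Solves for m right-hand vectors stored one after another, as b and x
   are on the slides. Each iteration is independent of the others. */
static void lu_solve_all(const float *lu, int n, int m, const float *b,
                         float *scratch, float *x)
{
    for (int v = 0; v < m; v++)
        lu_solve(lu, n, b + (size_t)v * n, scratch, x + (size_t)v * n);
}
```

Because each right-hand vector is solved independently, the loop in `lu_solve_all` is the natural place to hand one vector to each SPE, which is also why fewer than 6 vectors leave some SPEs idle.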
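References
• Numerical Recipes in C, http://www.fizyka.umk.pl/nrbook/bookcpdf.html
– 2.3 LU Decomposition and Its Applications
• Wikipedia, “LU decomposition”, http://en.wikipedia.org/wiki/LU_decomposition
• Haruhiko Okumura, 「C言語による最新アルゴリズム辞典」 (Dictionary of the Latest Algorithms in C), Gijutsu-Hyohron
• Chikara Oguni (ed.), 「行列計算ソフトウエア: WS、スーパーコン、並列計算機」 (Matrix Computation Software: Workstations, Supercomputers, and Parallel Computers), Maruzen
• Hiroki Saito, Tomoyuki Hiroyasu, Mitsunori Miki, 「LU 分解の並列化について」 (On the Parallelization of LU Decomposition), http://mikilab.doshisha.ac.jp/dia/research/report/2002/0612/018/report20020612018.html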