Multi-threaded Algorithm 2


Michael Tsai, 2014/1/2


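Scheduling

- The scheduler's job: assign strands to processors for execution.
- On-line: the scheduler does not know in advance when a strand will spawn, or when a spawned strand will finish.
- Centralized scheduler: a single scheduler knows the global state and makes all scheduling decisions (easier to analyze).
- Distributed scheduler: the threads communicate and cooperate with each other to find the best schedule.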


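Greedy scheduler

- At every time step, assign as many strands to processors as possible.
- If at least P strands are ready to execute during a time step, the step is a complete step; otherwise it is an incomplete step.
- Lower bounds (even in the best case this much time is needed):
    work law: $T_P \ge T_1 / P$
    span law: $T_P \ge T_\infty$
- The proofs that follow show that the greedy scheduler is in fact a rather good scheduler.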


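Theorem: On an ideal parallel computer with P processors, a greedy scheduler executes a multithreaded computation with work $T_1$ and span $T_\infty$ in time $T_P \le T_1/P + T_\infty$.

Proof: We want to show that there are at most $\lfloor T_1/P \rfloor$ complete steps. By contradiction: suppose there are more than $\lfloor T_1/P \rfloor$ complete steps. Each complete step executes P strands, so the total work done by the complete steps is at least

$P \cdot (\lfloor T_1/P \rfloor + 1) = P \lfloor T_1/P \rfloor + P = T_1 - (T_1 \bmod P) + P > T_1$.

This is a contradiction, because the total work is only $T_1$.

Therefore there are at most $\lfloor T_1/P \rfloor$ complete steps.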


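Now consider the number of incomplete steps. Let G be the dag of the whole computation, G' the subgraph that has not yet been executed at the start of an incomplete step, and G'' the subgraph that remains after that step.

- A longest path in a dag must start at a vertex with in-degree 0.
- An incomplete step of a greedy scheduler executes every in-degree-0 vertex of G' (fewer than P strands are ready, so all of them run). None of those vertices remains in G''.
- The longest path in G'' is therefore exactly 1 shorter than the longest path in G'.
- In other words, every incomplete step shortens the longest path in the dag of not-yet-executed strands by 1, so the number of incomplete steps is at most the span $T_\infty$.

Finally, every step is either complete or incomplete. Therefore $T_P \le \lfloor T_1/P \rfloor + T_\infty \le T_1/P + T_\infty$.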

[Figure: the dags G, G', G'', ..., with an arrow labeled "Incomplete step".]
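The bound in the theorem can be checked on a small example. The sketch below is not from the slides: it assumes unit-time strands, a dag given as an edge list whose vertices are numbered in topological order, and the hypothetical helper names `greedy_schedule` and `span`. It simulates a greedy scheduler and compares the number of steps with $T_1/P + T_\infty$.

```python
from collections import defaultdict

def greedy_schedule(edges, num_nodes, P):
    """Simulate a greedy scheduler on a dag of unit-time strands; return T_P."""
    indeg = [0] * num_nodes
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = [v for v in range(num_nodes) if indeg[v] == 0]
    steps = 0
    while ready:
        # Greedy: run as many ready strands as there are processors.
        running, ready = ready[:P], ready[P:]
        steps += 1
        for u in running:
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps

def span(edges, num_nodes):
    """Longest path, counted in vertices. Assumes u < v for every edge (u, v)."""
    depth = [1] * num_nodes
    for u, v in sorted(edges):
        depth[v] = max(depth[v], depth[u] + 1)
    return max(depth)

# Fork-join dag: strand 0 forks six strands 1..6, which all join in strand 7.
edges = [(0, i) for i in range(1, 7)] + [(i, 7) for i in range(1, 7)]
T1, T_inf = 8, span(edges, 8)
for P in (1, 2, 3, 8):
    T_P = greedy_schedule(edges, 8, P)
    assert T_P <= T1 / P + T_inf
```

With P = 2 the first step is incomplete (only strand 0 is ready), the next three are complete, and the last is incomplete again.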


Corollary: The running time of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with P processors is within a factor of 2 of optimal.

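Proof: Let $T_P^*$ be the running time of an optimal scheduler on a machine with P processors, and let $T_1$ and $T_\infty$ be the work and span. By the work law and the span law, $T_P^* \ge \max(T_1/P, T_\infty)$. By the greedy-scheduler theorem, $T_P \le T_1/P + T_\infty \le 2 \max(T_1/P, T_\infty) \le 2 T_P^*$.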


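Corollary: Let $T_P$ be the running time of a multithreaded computation produced by a greedy scheduler on an ideal parallel computer with P processors, and let $T_1$ and $T_\infty$ be the work and span of the computation, respectively. Then, if $P \ll T_1/T_\infty$, we have $T_P \approx T_1/P$, or equivalently, a speedup of approximately P.

Proof: If $P \ll T_1/T_\infty$, then $T_\infty \ll T_1/P$. So $T_P \le T_1/P + T_\infty \approx T_1/P$. (Work law: $T_P \ge T_1/P$.) Or, the speedup is $T_1/T_P \approx P$.

When is $P \ll T_1/T_\infty$?

When the slackness $(T_1/T_\infty)/P = T_1/(P \, T_\infty)$ is large; a common rule of thumb is a slackness of at least 10.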
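To see why a slackness of about 10 is already enough, the greedy bound can be rearranged: $T_1/T_P \ge T_1/(T_1/P + T_\infty) = P/(1 + 1/s)$, where s is the slackness. The following sketch evaluates that guarantee; the function names and the sample numbers are illustrative assumptions, not taken from the slides.

```python
def slackness(t1, t_inf, p):
    """Slackness (T_1 / T_inf) / P: parallelism divided by processor count."""
    return (t1 / t_inf) / p

def guaranteed_speedup(t1, t_inf, p):
    """Lower bound on the speedup T_1 / T_P implied by T_P <= T_1/P + T_inf."""
    return t1 / (t1 / p + t_inf)

# Hypothetical computation: work 1000, span 10, so parallelism 100.
for p in (2, 10, 50, 100):
    s = slackness(1000, 10, p)
    bound = guaranteed_speedup(1000, 10, p)
    print(f"P={p:3d}  slackness={s:5.1f}  speedup >= {bound:6.2f}")
```

At slackness 10 the guaranteed speedup is P * 10/11, about 91% of perfect linear speedup; at slackness 1 the guarantee drops to P/2.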


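Back to P-FIB

    P-FIB(n)
        if n <= 1
            return n
        else
            x = spawn P-FIB(n-1)
            y = P-FIB(n-2)
            sync
            return x + y

Parallelism: $T_1(n)/T_\infty(n) = \Theta(\phi^n / n)$

Even on a very large parallel computer, a modest n is enough for the program to achieve near-perfect linear speedup. (The parallelism is much larger than P, so the slackness is large.)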
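How quickly the parallelism of P-FIB grows can be seen by evaluating the work and span recurrences directly. This is a rough sketch under a simplifying assumption that is mine, not the slides': every call costs one unit of time on top of its recursive calls.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def work(n):
    """T_1(n): both recursive calls are executed, plus one unit for this call."""
    if n <= 1:
        return 1
    return work(n - 1) + work(n - 2) + 1

@lru_cache(maxsize=None)
def span(n):
    """T_inf(n): the two calls run in parallel, so only the longer one counts."""
    if n <= 1:
        return 1
    return max(span(n - 1), span(n - 2)) + 1

for n in (10, 20, 30):
    print(n, work(n), span(n), work(n) / span(n))
```

The work grows exponentially while the span grows linearly, so the ratio `work(n) / span(n)` exceeds any realistic processor count for fairly small n.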


Parallel Loops

- Parallel loops: execute the iterations of a loop in parallel.
- Add "parallel" in front of the for keyword.
- The same thing can be written with spawn and sync, but this syntax is more convenient.


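Matrix-vector multiplication

- A: a matrix of size n x n
- x: a vector of length n
- We need to compute y = Ax, that is, $y_i = \sum_{j=1}^{n} a_{ij} x_j$ for i = 1, ..., n.

[Figure: y = A x, with row index i and column index j marked.]

    MAT-VEC(A, x)
        n = A.rows
        let y be a new vector of length n
        parallel for i = 1 to n
            y_i = 0
        parallel for i = 1 to n
            for j = 1 to n
                y_i = y_i + a_ij * x_j
        return y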
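One way to try MAT-VEC in an ordinary language is to map the outer parallel loop onto a thread pool. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the helper name `row_task` and the `max_workers` default are illustrative choices. In standard CPython builds the global interpreter lock prevents a real speedup for this kind of arithmetic, so the example shows the structure only.

```python
from concurrent.futures import ThreadPoolExecutor

def mat_vec(A, x, max_workers=4):
    """Compute y = A x with one task per row (0-based indices)."""
    n = len(A)
    y = [0] * n

    def row_task(i):
        # Each task writes only y[i], so the iterations are independent.
        total = 0
        for j in range(n):
            total += A[i][j] * x[j]
        y[i] = total

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) waits for every task and re-raises any exception.
        list(pool.map(row_task, range(n)))
    return y

print(mat_vec([[1, 2], [3, 4]], [1, 1]))
```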


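    MAT-VEC-MAIN-LOOP(A, x, y, n, i, i')
        if i == i'
            for j = 1 to n
                y_i = y_i + a_ij * x_j
        else
            mid = floor((i + i') / 2)
            spawn MAT-VEC-MAIN-LOOP(A, x, y, n, i, mid)
            MAT-VEC-MAIN-LOOP(A, x, y, n, mid+1, i')
            sync

- i: the first row to compute
- i': the last row to compute
- n: the size of A and x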
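The recursive splitting can be mimicked with plain threads, treating `Thread.start()` as spawn and `Thread.join()` as sync. This is only a sketch with 0-based row indices and a hypothetical wrapper `mat_vec_recursive`; it creates one thread per internal node of the recursion, which is acceptable for small n but is not how a real work-stealing runtime would behave.

```python
import threading

def mat_vec_main_loop(A, x, y, n, i, i_end):
    """Accumulate rows i..i_end (inclusive, 0-based) of A x into y."""
    if i == i_end:
        for j in range(n):
            y[i] += A[i][j] * x[j]
    else:
        mid = (i + i_end) // 2
        child = threading.Thread(
            target=mat_vec_main_loop, args=(A, x, y, n, i, mid)
        )
        child.start()                                   # spawn
        mat_vec_main_loop(A, x, y, n, mid + 1, i_end)   # parent continues
        child.join()                                    # sync

def mat_vec_recursive(A, x):
    n = len(A)
    y = [0] * n
    mat_vec_main_loop(A, x, y, n, 0, n - 1)
    return y

print(mat_vec_recursive([[1, 0], [0, 1]], [5, 6]))
```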


[Figure: y = A x with row index i and column index j; the rows are split recursively into the ranges (1,4) and (5,8).]


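Analyze MAT-VEC

- Work: compute the ordinary running time as if the parallel keywords were removed: $T_1(n) = \Theta(n^2)$ (dominated by the doubly nested loop).
- But note that spawn and parallel loops introduce extra overhead!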



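Recursion tree of a parallel loop with n iterations:

- n - 1 internal nodes
- n leaves
- Every internal node spawns once (the spawn overhead is constant).
- There are more leaves than internal nodes, so the cost of the spawns can be amortized against the leaves. The spawn overhead therefore does not raise the order of $T_1$ (but the constant becomes larger).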
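The node counts can be confirmed by walking the same midpoint recursion that the parallel loop uses. The helper `count_nodes` below is a hypothetical name introduced for this check.

```python
def count_nodes(lo, hi):
    """Return (internal_nodes, leaves) of the recursion tree for iterations lo..hi."""
    if lo == hi:
        return (0, 1)            # a leaf executes one iteration
    mid = (lo + hi) // 2
    left_internal, left_leaves = count_nodes(lo, mid)
    right_internal, right_leaves = count_nodes(mid + 1, hi)
    # This node is internal: it spawns the left half and runs the right half.
    return (left_internal + right_internal + 1, left_leaves + right_leaves)

for n in (1, 5, 8, 100):
    internal, leaves = count_nodes(1, n)
    print(n, internal, leaves)
```

For every n the tree has n leaves and n - 1 internal nodes, so the total number of spawns is linear in n.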


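Analyze MAT-VEC

- Now we compute the span. Here the spawn overhead has to be taken into account.
- The depth of recursive spawning is $\Theta(\lg n)$, because every level splits the range in two. Each spawn has constant overhead.
- Let $iter_\infty(i)$ denote the span of the i-th iteration. In general, the span of a parallel loop is

  $T_\infty(n) = \Theta(\lg n) + \max_{1 \le i \le n} iter_\infty(i)$

- For MAT-VEC each iteration of the main loop has span $\Theta(n)$, so $T_\infty(n) = \Theta(\lg n) + \Theta(n) = \Theta(n)$.

Parallelism = $T_1(n)/T_\infty(n) = \Theta(n^2)/\Theta(n) = \Theta(n)$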
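The span formula for a parallel loop can be turned into a small calculator. In this sketch the names `spawn_depth` and `parallel_loop_span`, and the unit `spawn_cost`, are assumptions made for illustration; the loop is assumed to be split at the midpoint as in MAT-VEC-MAIN-LOOP.

```python
def spawn_depth(lo, hi):
    """Depth of the midpoint-splitting recursion over iterations lo..hi."""
    if lo == hi:
        return 0
    mid = (lo + hi) // 2
    return 1 + max(spawn_depth(lo, mid), spawn_depth(mid + 1, hi))

def parallel_loop_span(iter_spans, spawn_cost=1):
    """Span = (spawn depth * cost per spawn) + span of the slowest iteration."""
    n = len(iter_spans)
    return spawn_cost * spawn_depth(1, n) + max(iter_spans)

# MAT-VEC with n = 8: every iteration has span n = 8 (the inner serial loop).
print(spawn_depth(1, 8), parallel_loop_span([8] * 8))
```

The depth equals the ceiling of lg n, so for MAT-VEC the logarithmic term is dominated by the linear span of a single iteration.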



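Race Conditions

- A multithreaded algorithm should be deterministic: it produces the same result on every run, no matter how the different parts of the program are scheduled onto processors/cores.
- When it "fails", the cause is usually a determinacy race.
- Determinacy race: two logically parallel instructions access the same memory location, and at least one of them is a write.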


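Determinacy Race Examples

Therac-25: a radiation therapy machine. Between 1985 and 1987, at least 6 patients received about 100 times the intended dose of radiation, resulting in death or severe radiation burns.

Further reading: Killer software. http://hiraman-sharma.blogspot.com/2010/07/killer-softwares.html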


Determinacy Race Examples

2003 Northeast Blackout: affected 10M people in Ontario and 45M people in 8 U.S. states. Cause: a software bug known as a race condition existed in General Electric Energy's Unix-based XA/21 energy management system.


Race Example

    RACE-EXAMPLE()
        x = 0
        parallel for i = 1 to 2
            x = x + 1
        print x

Note: most execution orders produce the correct result; only a few orders cause an error. That makes these bugs very hard to find!
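The race in RACE-EXAMPLE can be made visible by modelling `x = x + 1` as three separate instructions (load x into a register, increment the register, store it back) and enumerating every interleaving of two strands. The strand names "A" and "B" and the three-instruction model are assumptions of this sketch. Counting interleavings only shows which outcomes are possible; it says nothing about how often each one occurs on real hardware.

```python
from itertools import combinations

def run(schedule):
    """Execute one interleaving; schedule is a sequence of strand names."""
    x = 0
    reg = {"A": 0, "B": 0}   # private register of each strand
    pc = {"A": 0, "B": 0}    # next instruction of each strand
    for t in schedule:
        if pc[t] == 0:
            reg[t] = x       # load
        elif pc[t] == 1:
            reg[t] += 1      # increment
        else:
            x = reg[t]       # store
        pc[t] += 1
    return x

def all_outcomes():
    """Map every interleaving of the 3 + 3 instructions to the final x."""
    outcomes = {}
    for a_slots in combinations(range(6), 3):
        schedule = ["A" if k in a_slots else "B" for k in range(6)]
        outcomes["".join(schedule)] = run(schedule)
    return outcomes

results = all_outcomes()
print(sorted(set(results.values())))
```

The final value is 2 only when one strand stores before the other loads; whenever both loads happen before either store, one update is lost and the program prints 1.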


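How to avoid race conditions?

- There are many ways (mutexes and so on; these are covered in the OS course).
- A simple way: apply parallelism only to independent computations, that is, computations that do not depend on each other. A spawned child should be independent of its parent and of every other spawned child.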
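As a small illustration of the mutex approach, the sketch below protects a shared counter with Python's standard `threading.Lock`. The function name `locked_count` and its parameters are made up for this example; the point is only that the read-modify-write happens while the lock is held, so no update can be lost.

```python
import threading

def locked_count(num_threads, increments):
    """Increment a shared counter from several threads under a mutex."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:            # only one thread at a time updates counter
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(locked_count(4, 1000))
```

The lock serializes the updates, which removes the race but also removes the parallelism of that step; keeping the computations independent avoids both problems.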


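Socrates chess-playing program

Running times estimated as $T_P = T_1/P + T_\infty$ (the additive terms 1 and 8 are the spans):

- Original version: $T_1 = 2048$, $T_\infty = 1$
- "Optimized version": $T_1' = 1024$, $T_\infty' = 8$

- P = 32: $T_{32} = 2048/32 + 1 = 65$, $T_{32}' = 1024/32 + 8 = 40$
- P = 512: $T_{512} = 2048/512 + 1 = 5$, $T_{512}' = 1024/512 + 8 = 10$

The "optimized" version is faster on 32 processors but twice as slow on 512 processors.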
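The arithmetic above is easy to replay for other processor counts. This sketch assumes the estimate $T_P = T_1/P + T_\infty$ used on the slide; `estimate_tp` is a hypothetical helper name.

```python
def estimate_tp(t1, t_inf, p):
    """Estimated running time on p processors: T_1 / P + T_inf."""
    return t1 / p + t_inf

original = (2048, 1)     # (work, span) of the original version
optimized = (1024, 8)    # (work, span) of the "optimized" version

for p in (32, 128, 512):
    print(p, estimate_tp(*original, p), estimate_tp(*optimized, p))
```

Under this estimate the two versions break even where 2048/P + 1 = 1024/P + 8, that is, at P = 1024/7, roughly 146 processors; beyond that the larger span of the "optimized" version dominates.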


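Multithreaded matrix multiplication

    P-SQUARE-MATRIX-MULTIPLY(A, B)
        n = A.rows
        let C be a new n x n matrix
        parallel for i = 1 to n
            parallel for j = 1 to n
                c_ij = 0
                for k = 1 to n
                    c_ij = c_ij + a_ik * b_kj
        return C

Work: $T_1(n) = \Theta(n^3)$

Span: $T_\infty(n) = \Theta(\lg n) + \Theta(\lg n) + \Theta(n) = \Theta(n)$

Parallelism: $T_1(n)/T_\infty(n) = \Theta(n^3)/\Theta(n) = \Theta(n^2)$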
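A direct way to experiment with P-SQUARE-MATRIX-MULTIPLY is to flatten the two parallel loops into one task per entry of C and hand the tasks to a thread pool. The sketch uses Python's standard `concurrent.futures` and `itertools.product`; `cell_task` and `max_workers` are illustrative names, and in standard CPython builds the threads show the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def p_square_matrix_multiply(A, B, max_workers=4):
    """Compute C = A B for square matrices, one task per entry c_ij."""
    n = len(A)
    C = [[0] * n for _ in range(n)]

    def cell_task(ij):
        i, j = ij
        total = 0
        for k in range(n):        # the innermost loop stays serial
            total += A[i][k] * B[k][j]
        C[i][j] = total           # each task writes a distinct entry

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(cell_task, product(range(n), repeat=2)))
    return C

print(p_square_matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The k loop cannot simply be made parallel as well, because all of its iterations update the same entry and would race with each other.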


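Divide-and-conquer Multithreaded Algorithm for Matrix Multiplication (a long one)

$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \underbrace{\begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix}}_{C} + \underbrace{\begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix}}_{T}$$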


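    P-MATRIX-MULTIPLY-RECURSIVE(C, A, B)
        n = A.rows
        if n == 1
            c_11 = a_11 * b_11
        else
            let T be a new n x n matrix
            partition A, B, C, and T into n/2 x n/2 submatrices
                (A11, A12, A21, A22; B11, ..., B22; C11, ..., C22; T11, ..., T22)
            spawn P-MATRIX-MULTIPLY-RECURSIVE(C11, A11, B11)
            spawn P-MATRIX-MULTIPLY-RECURSIVE(C12, A11, B12)
            spawn P-MATRIX-MULTIPLY-RECURSIVE(C21, A21, B11)
            spawn P-MATRIX-MULTIPLY-RECURSIVE(C22, A21, B12)
            spawn P-MATRIX-MULTIPLY-RECURSIVE(T11, A12, B21)
            spawn P-MATRIX-MULTIPLY-RECURSIVE(T12, A12, B22)
            spawn P-MATRIX-MULTIPLY-RECURSIVE(T21, A22, B21)
            P-MATRIX-MULTIPLY-RECURSIVE(T22, A22, B22)
            sync
            parallel for i = 1 to n
                parallel for j = 1 to n
                    c_ij = c_ij + t_ij

Work: $M_1(n) = 8 M_1(n/2) + \Theta(n^2) = \Theta(n^3)$

Span: $M_\infty(n) = M_\infty(n/2) + \Theta(\lg n) + \Theta(\lg n) = \Theta(\lg^2 n)$

Parallelism: $M_1(n)/M_\infty(n) = \Theta(n^3)/\Theta(\lg^2 n) = \Theta(n^3/\lg^2 n)$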
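The recursive scheme can be sketched in Python with threads standing in for spawn and sync. Several choices here are assumptions of the sketch rather than part of the slides: submatrices are represented as hypothetical `(matrix, row_offset, col_offset)` views, n is assumed to be a power of 2, indices are 0-based, and the final addition of T into C is written as a serial loop for brevity.

```python
import threading

def p_mm_rec(C, A, B, n):
    """C = A * B on n x n views; each view is (matrix, row_offset, col_offset)."""
    (Cm, cr, cc), (Am, ar, ac), (Bm, br, bc) = C, A, B
    if n == 1:
        Cm[cr][cc] = Am[ar][ac] * Bm[br][bc]
        return
    h = n // 2
    Tm = [[0] * n for _ in range(n)]      # temporary matrix T
    T = (Tm, 0, 0)

    def sub(view, i, j):
        m, r, c = view
        return (m, r + i * h, c + j * h)  # quadrant (i, j) of a view

    calls = []
    for i in range(2):
        for j in range(2):
            calls.append((sub(C, i, j), sub(A, i, 0), sub(B, 0, j)))
            calls.append((sub(T, i, j), sub(A, i, 1), sub(B, 1, j)))
    # Spawn seven of the eight products; the parent runs the last one itself.
    threads = [threading.Thread(target=p_mm_rec, args=(c, a, b, h))
               for c, a, b in calls[:-1]]
    for t in threads:
        t.start()
    p_mm_rec(*calls[-1], h)
    for t in threads:
        t.join()                          # sync
    for i in range(n):                    # C = C + T
        for j in range(n):
            Cm[cr + i][cc + j] += Tm[i][j]

def multiply(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    p_mm_rec((C, 0, 0), (A, 0, 0), (B, 0, 0), n)
    return C

print(multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The eight products write to disjoint quadrants of C and T and only read A and B, so they are independent; that independence is what makes spawning them in parallel race-free.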


How about Strassen’s method? Reading assignment: p.795-796.

Parallelism: $\Theta(n^{\lg 7}/\lg^2 n)$, slightly less than the original recursive version!
