22
STREAM 1 Quad Core Opteron: ccNUMA Arch. AMD Quad Core Opteron Memory Memory 2.3GHz Quad Coreのソケット×4 1 L1 L1 L1 L1 L2 L2 L2 L2 L3 L1 L1 L1 L1 L2 L2 L2 L2 L3 ノー16コア各ソケットがローカルにメモリ Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core を持っている – NUMAN on-U niform Core Core Core Core L1 L1 L1 L1 Core Core Core Core L1 L1 L1 L1 M emory A ccess ローカルのメモリをアクセスして 計算するようなプログラミング L2 L2 L2 L2 L3 Memory L2 L2 L2 L2 L3 Memory 計算するようなプログラミングデータ配置,実行時制御 numactl)が必要 cccache-coherent キャッシュ同期:Thread並列

STREAM 1 Quad Core Opteron: ccNUMA Arch.STREAM 1 Quad Core Opteron: ccNUMA Arch. • AMD Quad Core Opteron Memory Memory2.3GHz – Quad Coreのソケット×4 ⇒1 ド( ) L1L1L1L1

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

STREAM 1

Quad Core Opteron: ccNUMA Arch.p• AMD Quad Core Opteron Memory MemoryMemoryMemoryMemoryMemory MemoryMemoryMemoryMemory

2.3GHz– Quad Coreのソケット×4 ⇒ 1

ド( )

L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3

ノード(16コア)

• 各ソケットがローカルにメモリ

Core Core Core Core Core Core Core CoreCore Core Core CoreCore Core Core CoreCore Core Core CoreCore Core Core Core Core Core Core CoreCore Core Core CoreCore Core Core CoreCore Core Core Core

を持っている

– NUMA:Non-Uniform Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Memory Access– ローカルのメモリをアクセスして

計算するようなプログラミング

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory計算するようなプログラミング,データ配置,実行時制御(numactl)が必要

y yyyy yyy

– cc: cache-coherent– キャッシュ同期:Thread並列

STREAM 2

ccNUMA ArchitectureccNUMA Architecture

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

Core L2 L2 L2 L2L3

MemoryCore

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

Memory

L2 L2 L2 L2L3

MemoryソケットSocket

コアCore

ノードNode

STREAM 3

numactl>$ numactl –-show

policy: defaultp ypreferred node: currentphyscpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 コアcpubind: 0 1 2 3 ソケットnodebind: 0 1 2 3 ソケットmembind: 0 1 2 3 メモリ

S k t #0 S k t #1

• NUMA(Non Uniform Memory Access) 向けのメ L2 L2 L2 L2

L3

#0

L2 L2 L2 L2L3

#1

N t k

Socket #0 Socket #1

y )モリ割付のためのコマンドライン:Linuxでサポート

L1 L1 L1 L1

0 1 2 3

L1 L1 L1 L1

4 5 6 7

Network

• 使うコアとメモリの関係指定:ローカルなメモリ上の

L1 L1 L1 L1L2 L2 L2 L2

3

L1 L1 L1 L1L2 L2 L2 L2

3 N t k

8 9 10 11 12 13 14 15

定:ロ カルなメモリ上のデータを使うのが得策

L3

#2

L3

#3

Network

Socket #2 Socket #3

STREAM 4

numactlの影響• T2K,有限要素法アプリケーション,Strong Scaling,Flat MPI• 128コア:1コアあたりの問題サイズは512コアの4倍ア アあたりの問題サイ は アの 倍

– メモリへの負担大

• Policy=0:何も指定しない場合• Policy=0:何も指定しない場合

• 相対性能:大きいほど良いー128ノードで影響大1.50

ance

128cores512cores

#@$-r HID-org#@$-q h08nkl32#@$-N 24#@$ J T161.00

ve P

erfo

rma #@$-J T16

#@$-e err#@$-o x384-40-1-a.lst#@$-lM 27GB

0.50

Rel

ativ #@$

#@$-lE 03:00:00#@$-s /bin/sh#@$d $PBS O WORKDIR0.00

0 1 2 3 4 5

numactl policy

cd $PBS_O_WORKDIRmpirun ./numarun.sh ./solexit

STREAM 5

numarun.shの中身1.50

ance

128cores512cores

Policy:1#!/bin/bashMYRANK=$MXMPI_ID MPIのプロセス番号MYVAL=$(expr $MYRANK / 4)SOCKET=$(expr $MYVAL % 4)

1.00

ve P

erfo

rma SOCKET=$(expr $MYVAL % 4)

numactl --cpunodebind=$SOCKET --interleave=all $@

Policy:2#!/bin/bashMYRANK=$MXMPI ID

0.50

Rel

ativ

$ _MYVAL=$(expr $MYRANK / 4)SOCKET=$(expr $MYVAL % 4)numactl --cpunodebind=$SOCKET --interleave=$SOCKET $@

Policy:30.00

0 1 2 3 4 5

numactl policy

#!/bin/bashMYRANK=$MXMPI_IDMYVAL=$(expr $MYRANK / 4)SOCKET=$(expr $MYVAL % 4)numactl --cpunodebind=$SOCKET --membind=$SOCKET $@

Socket #0 Socket #1

Policy:4#!/bin/bashMYRANK=$MXMPI_IDMYVAL=$(expr $MYRANK / 4)SOCKET=$(expr $MYVAL % 4)

L1 L1 L1 L1L2 L2 L2 L2

L3

#0

0 1 2 3

L1 L1 L1 L1L2 L2 L2 L2

L3

#1

4 5 6 7

Network

SOCKET $(expr $MYVAL % 4)numactl --cpunodebind=$SOCKET --localalloc $@

Policy:5#!/bin/bashMYRANK=$MXMPI ID

0 1 2 3

L1 L1 L1 L1

4 5 6 7

L1 L1 L1 L1

8 9 10 11 12 13 14 15_

MYVAL=$(expr $MYRANK / 4)SOCKET=$(expr $MYVAL % 4) numactl --localalloc $@

L1 L1 L1 L1L2 L2 L2 L2

L3

#2

L1 L1 L1 L1L2 L2 L2 L2

L3

#3

Network

Socket #2 Socket #3

STREAM 6

Policy-3の割り当てyPolicy:3#!/bin/bashMYRANK=$MXMPI_IDMYVAL=$(expr $MYRANK / 4)

#0 #1Socket #0 Socket #1

SOCKET=$(expr $MYVAL % 4)numactl --cpunodebind=$SOCKET --membind=$SOCKET $@

MPI process Socket Memory Core

0 0 0

L1 L1 L1 L1L2 L2 L2 L2

L3

0 1 2 3

L1 L1 L1 L1L2 L2 L2 L2

L3

4 5 6 7

Network

0 0 0

1 0 0

2 0 0

3 0 08 9 10 11 12 13 14 15

3 0 0

4 1 1

5 1 1

6 1 1

L1 L1 L1 L1L2 L2 L2 L2

L3

#2

L1 L1 L1 L1L2 L2 L2 L2

L3

#3

Network

6

7 1 1

8 2 2

9 2 2

Socket #2 Socket #3

10 2 2

11 2 2

12 3 3

13 3 3

14 3 3

15 3 3

STREAM 7

Cyclicな割り付けも可能y#!/bin/bashMYRANK=$MXMPI_IDSOCKET=$(expr $MYRANK % 4)numactl --cpunodebind=$SOCKET --membind=$SOCKET $@

#0 #1Socket #0 Socket #1

MPI process Socket Memory Core

0 0 0

L1 L1 L1 L1L2 L2 L2 L2

L3

0 1 2 3

L1 L1 L1 L1L2 L2 L2 L2

L3

4 5 6 7

Network

0 0 0

1 1 1

2 2 2

3 3 38 9 10 11 12 13 14 15

3 3 3

4 0 0

5 1 1

6 2 2

L1 L1 L1 L1L2 L2 L2 L2

L3

#2

L1 L1 L1 L1L2 L2 L2 L2

L3

#3

Network

6

7 3 3

8 0 0

9 1 1

Socket #2 Socket #3

10 2 2

11 3 3

12 0 0

13 1 1

14 2 2

15 3 3

STREAM 8

コアを指定することもできる#!/bin/bashMYRANK=$MXMPI_IDMYVAL=$(expr $MYRANK / 4)SOCKET=$(expr $MYVAL % 4)

#0 #1Socket #0 Socket #1

CORE=$(expr $MYRANK % 16)numactl --physcpubind=$CORE --membind=$SOCKET $@

MPI process Socket Memory Core

0 0 0

L1 L1 L1 L1L2 L2 L2 L2

L3

0 1 2 3

L1 L1 L1 L1L2 L2 L2 L2

L3

4 5 6 7

Network

0 0 0

1 0 1

2 0 2

3 0 38 9 10 11 12 13 14 15

3 0 3

4 1 4

5 1 5

6 1 6

L1 L1 L1 L1L2 L2 L2 L2

L3

#2

L1 L1 L1 L1L2 L2 L2 L2

L3

#3

Network

6 6

7 1 7

8 2 8

9 2 9

Socket #2 Socket #3

10 2 10

11 2 11

12 3 12

13 3 13

14 3 14

15 3 15

STREAM 9

8コア使う場合はどうする ?MPI process Socket Memory Core

0 0 0L3

#0

L3

#1Socket #0 Socket #1

1 0 2

2 1 4

3 1 6

L1 L1 L1 L1L2 L2 L2 L2

0 1 2 3

L1 L1 L1 L1L2 L2 L2 L2

4 5 6 7

Network

4 2 8

5 2 10

6 3 12L1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3 Network

8 9 10 11 12 13 14 15

MPI process Socket Memory Core MPI process Socket Memory Core

7 3 14L3

#2

L3

#3

Network

Socket #2 Socket #3

MPI process Socket Memory Core

0 0 0

1 0 1

2 0 2

MPI process Socket Memory Core

0 0 0

1 0 1

2 1 42 0 2

3 0 3

4 1 4

5 1 5

2 1 4

3 1 5

4 2 8

5 2 9

6 1 6

7 1 7

6 3 12

7 3 13

STREAM 10

STREAM ベンチマークhttp://www.streambench.org/

メモリバンド幅を測定するベンチ ク• メモリバンド幅を測定するベンチマーク

– Copy, Scale, Add, Triad(DAXPYと同じ)

----------------------------------------------Double precision appears to have 16 digits of accuracyA i 8 b t DOUBLE PRECISION dAssuming 8 bytes per DOUBLE PRECISION word

----------------------------------------------Number of processors = 16Array size = 2000000Offset = 0The total memory requirement is 732.4 MB ( 45.8MB/task)You are running each test 10 times--The *best* time for each test is used*EXCLUDING* the first and last iterationsEXCLUDING the first and last iterations--------------------------------------------------------------------------------------------------------

Function Rate (MB/s) Avg time Min time Max timeCopy: 18334.1898 0.0280 0.0279 0.0280Scale: 18035 1690 0 0284 0 0284 0 0285Scale: 18035.1690 0.0284 0.0284 0.0285Add: 18649.4455 0.0412 0.0412 0.0413Triad: 19603.8455 0.0394 0.0392 0.0398

STREAM 11

ファイル:今回はFORTRANのみ今回

>$ cd <$FVM>/stream>$ mpif90 –Oss –noparallel stream.f –o stream

今回はMPI版だが OpenMP版もある今回はMPI版だが,OpenMP版もある

STREAM 12

set-a0.sh:Policy-5相当

• プロセスが割り当てられたソケットのローカルメモリを使う

ソケ ト番号は基本的にサイクリ ク( d bi 的)に• ソケット番号は基本的にサイクリック(round-robin的)に

割り当てられるが,異なる場合もある(実行ごとに性能が変動)変動)

#@$-r stream#@$-q lecture #0 #1

Socket #0 Socket #1

#@$ q lecture#@$-N 1#@$-J T16#@$-e err

L1 L1 L1 L1L2 L2 L2 L2

L3

0 1 2 3

L1 L1 L1 L1L2 L2 L2 L2

L3

4 5 6 7

Network

#@$-o a0-16-1.lst#@$-lM 28GB#@$-lT 00:05:00#@$

0 1 2 3 4 5 6 7

8 9 10 11 12 13 14 15#@$

cd $PBS_O_WORKDIRL1 L1 L1 L1L2 L2 L2 L2

L3

L1 L1 L1 L1L2 L2 L2 L2

L3 Network

8 9 10 11 12 13 14 15

mpirun numactl --localalloc ./streamexit

#2 #3

Socket #2 Socket #3

STREAM 13

set-b1/b2.sh:Policy-3,4相当

• 各プロセスをソケット0番から順番に埋まるように割り当て 各ソケットのローカルメモリを使うて,各ソケットのローカルメモリを使う

• b1とb2ではあまり差は無い相当set-b1.sh

#@$-r stream#@$ q lecture

b1.sh:Policy-3相当

#!/bin/bashMYRANK=$MXMPI_IDMYVAL $(expr $MYRANK / 4)#@$-q lecture

#@$-N 1#@$-J T8#@$-e err

MYVAL=$(expr $MYRANK / 4)SOCKET=$(expr $MYVAL % 4)numactl --cpunodebind=$SOCKET --membind=$SOCKET $@

b2 sh:Policy 4相当#@$-o b1-08-1.lst#@$-lM 28GB#@$-lT 00:05:00#@$

b2.sh:Policy-4相当

#!/bin/bashMYRANK=$MXMPI_IDMYVAL=$(expr $MYRANK / 4)#@$

cd $PBS_O_WORKDIR

pSOCKET=$(expr $MYVAL % 4)numactl --cpunodebind=$SOCKET --localalloc $@

mpirun ./b1.sh ./streamexit

STREAM 14

set-x.sh• 各プロセスが割り当てられたソケットのメモリを使う

T2 0 1 T4 0 1 2 3• T2:0,1,T4:0,1,2,3• T8:0,0,1,1,2,2,3,3• T16:0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3

#@$-r stream#@$-q lecture#@$-N 1#@$-J T16#@$-e err#@$-e err#@$-o x1-16-1.lst#@$-lM 28GB#@$-lT 00:05:00#@$

cd $PBS_O_WORKDIR

mpirun numactl --cpunodebind=0,1,2,3 --localalloc ./streamexit

STREAM 15

Triadの結果:Policy-5@T1の性能を1.00結果 y @• Policy-5: set-a0,Policy-4: set-b2• 0,1,2,3: set-x• 16コア使った場合,メモリの性能は5倍強にしかなっていない16 ア使った場合,メモリの性能は5倍強にしかなっていない

5

6

th

Policy-5Policy-4

4

5

y B

andw

idt y

0,1,2,3

2

3

ve M

emor

y

1Rel

ativ

01 2 4 8 16

Core #

STREAM 16

Triadの性能:T2• Policy-4

– 1コアのときの2倍までは5

6

dwid

th

Policy-4

0,1,2,31コアのときの2倍までは行かないが性能は向上

– 2コアで1つのメモリを共 2

3

4

Mem

ory

Ban

d

ア リを共

有するため,多少の競合が生じている

0

1

2

Rel

ativ

e

• 0,1,2,3 Memory MemoryMemoryMemoryMemoryMemory MemoryMemoryMemoryMemory

01 2 4 8 16

Core #

Memory MemoryMemoryMemoryMemoryMemory MemoryMemoryMemoryMemory

– ほぼ100%近い性能向上

– 各コアでローカルメモリをCore

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

各 アで カル リを占有

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Memory MemoryMemoryMemoryMemory MemoryMemoryMemory Memory MemoryMemoryMemoryMemory MemoryMemoryMemory

PolicyPolicy--44 0,1,2,30,1,2,3

STREAM 17

Triadの性能:T4• Policy-4

– T2のときとほとんど変わ5

6

dwid

th

Policy-4

0,1,2,3T2のときとほとんど変わらない,つまり1コア当た

りのメモリ性能は半分に2

3

4

Mem

ory

Ban

d

落ちている

– 飽和状態0

1

2

Rel

ativ

e

Memory MemoryMemoryMemoryMemoryMemory MemoryMemoryMemoryMemory• 0,1,2,3 Memory MemoryMemoryMemoryMemoryMemory MemoryMemoryMemoryMemory

01 2 4 8 16

Core #

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

– T1の4倍までは行かないが順調にスケール

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

– Cache-Coherency– 通信の影響もあり

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Memory MemoryMemoryMemoryMemory MemoryMemoryMemoryMemory MemoryMemoryMemoryMemory MemoryMemoryMemory

PolicyPolicy--44 0,1,2,30,1,2,3

STREAM 18

Triadの性能:T8,T165

6

dwid

th

Policy-4

0,1,2,3

• Policy-4– T8

2

3

4

Mem

ory

Ban

dT8• T4の約2倍弱

– T16

0

1

2

Rel

ativ

e

• T8の約2倍弱

Memory MemoryMemoryMemoryMemoryMemory MemoryMemoryMemoryMemoryMemory MemoryMemoryMemoryMemoryMemory MemoryMemoryMemoryMemory

01 2 4 8 16

Core #• 0,1,2,3– T8

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

Core

L1

Core

L1

Core

L1

Core

L1L2 L2 L2 L2

L3

• T4の約2倍弱

– T16

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

Core Core Core Core

L1 L1 L1 L1L2 L2 L2 L2

L3

• T8とほぼ同じ

• Policy-4におけるT2→T4と同じ

Memory MemoryMemoryMemoryMemory MemoryMemoryMemoryMemory MemoryMemoryMemoryMemory MemoryMemoryMemory

PolicyPolicy--44 0,1,2,30,1,2,3

同じ

STREAM 19

T2Kの性能

• 実は余りノード内の性能について細かく測定したことは無い

– 1ノードが最低単位

• メモリの性能(の悪さ)に充分注意する必要がある

– Strong Scaling(同じ問題をコア数を変えて解く)の場合,基準を1ノーg gドあるいは1ソケット(Policy-2,3,4のようにする)にする

STREAM 20

疎行列ソルバーの性能:三次元弾性問題疎行列 性能 弾性問題ICCG法,T2K・SR11000 1ノード:メモリバンド幅が効く

Hit hi SR11000/J2 T2K/T kHitachi SR11000/J2Power 5+ 2.3GHz x 16147.2 GFLOPS/node

T2K/TokyoOpteron 2.3GHz x 16147.2 GFLOPS/node

100 GB/s for STREAM/TriadL3 cache: 18MB/core

20 GB/s for STREAM/TriadL3 cache: 0.5MB/core

20.0 20.0

15.0

tio (%

)

15.0

tio (%

)

Flat MPI.HB 4x4HB 8x2HB 16x1

10.0

man

ce R

at

Fl t MPI

10.0

man

ce R

a5.0

Perfo

rm Flat MPI.HB 4x4HB 8x2HB 16x1

5.0Pe

rform

0.01.E+04 1.E+05 1.E+06 1.E+07

DOF

0.01.E+04 1.E+05 1.E+06 1.E+07

DOF

STREAM 21

Hitachi SR11000/J2• IBM POWER5+ 2.3GHz (9.2GFLOPS)• Dual Core×4 ⇒ 「MCM(Multi Core Module)」×2 ⇒ 1node(16cores)Dual Core 4 MCM(Multi Core Module)」 2 1node(16cores)• based on NUMA architectures, but small latency of memory, huge cache

青木,中村,助川,齋藤,深川,中川,五百木(2005)スーパーテクニカルサーバーSR11000 モデルJ1のノードアーキテクチュアと性能評価情報処理学会論文誌:コンピュ ティングシステム 45 SIG12(ACS11) 27 36 より作成

C

Memory

L3 C C

Memory

L3 C C

Memory

L3 C C

Memory

L3 CC

Memory

L3 CC

Memory

L3 C C

Memory

L3 CC

Memory

L3 C C

Memory

L3 CC

Memory

L3 C C

Memory

L3 CC

Memory

L3 C

情報処理学会論文誌:コンピューティングシステム 45-SIG12(ACS11),27-36 より作成

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

Core

L1L2

L3 Core

L1

C

L1L2

L3 CPU

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

C

L1L2

L3 CPU

L1

C

L1L2

L3 CPU

L1

C

L1L2

L3 CPU

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

C

L1L2

L3 C

L1

Memory

CoreL3 CPU

Memory

CoreL3 Core

Memory

CoreL3 Core

Memory

CoreL3 Core

Memory

CoreL3 CPU

Memory

CoreL3 CPUCoreL3 CPU

Memory

CoreL3 Core

Memory

CoreL3 CoreCoreL3 Core

Memory

CoreL3 Core

Memory

CoreL3 CoreCoreL3 Core

Memory

CoreL3 Core

Memory

CoreL3 CoreCoreL3 Core

STREAM 22

台形積分:T2K(東大)における並列効果

• ◆:N=106,●:107,▲:10947 64

64.021.00E+02

ideal

• -:理想値

• 1コアにおける計測結果7 97

15.68

29.9247.64

1.00E+01

N=10^6N=10^7N=10^9

(sec.)からそれぞれ算出2.00

3.99

7.97

Spee

d-U

p

#@$-r test#@$-q lecture#@$-N 1#@$ J T4

1.001.00E+00

S

#@$-J T4#@$-e err#@$-o hello.lst#@$-lM 28GB

1.00E-011 10 100

$#@$-lT 00:05:00#@$

cd $PBS O WORKDIR

• 赤字部分の影響

1ノ ドより多いと変動あり

Core #

cd $PBS_O_WORKDIRmpirun numactl --localalloc ./a.out • 1ノードより多いと変動あり

• メモリに負担のかからない計算