Building an Evaluation Metric from a Benchmark Set to Estimate Application Performance

Miwako Tsuji, RIKEN AICS

2018/03/02

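The Role of Benchmarks in HPC

Benchmarks play an important role at many stages of the system development life cycle

- Design, development, implementation, maintenance, and so on

- Performance estimation, performance evaluation, performance verification, and so on

High Performance Linpack (HPL)

- Probably one of the most widely used benchmarks

- Top500

- Its divergence from application performance has been pointed out

High Performance Conjugate Gradient (HPCG)

- Proposed to bridge the gap between HPL and applications

- Depends on memory bandwidth

- Yet systems with a low B/F (bytes per flop) ratio do not necessarily perform poorly on applications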

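System Evaluation Using Applications and Mini-Apps

Performance evaluation through mini-applications can directly reflect system performance when running applications

Many mini-apps have been proposed

- Example: Fiber by RIKEN AICS

Sustained System Performance (SSP) metric [Kramer, 2005]

- A performance metric based on a set of applications

- Used in procurements by NERSC, NCSA, and others

Unlike benchmarks and kernels, evaluation with full or mini applications requires porting effort

Traditional benchmarks and (mini-)applications each have their strengths and weaknesses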

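Project Goal

Build a performance metric that estimates system performance when running applications by combining traditional benchmarks

[Figure: two weighted combinations producing the same output y. One takes inputs labeled mini-Applications and real-Applications with weights w1 to w4 and function f; the other takes inputs labeled Benchmarks with weights w1' to w4' and function f'.]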


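A metric that gives insight into system performance for real applications without running them:

- Can be computed from benchmark performance (easy to evaluate)

- Allows system performance when running real applications to be estimated

- Gives a better approximation than a single benchmark (such as HPL)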

Background (reference): Sustained System Performance (SSP) Metric

A metric for evaluating systems from the viewpoint of application throughput

Used in procurements at NERSC and NCSA

- Blue Waters

Designed to provide the following guidance during a system's lifetime and in the design of future systems

- Evaluation and/or selection of a system from among its competitors.

- Validating the selected system once a system is built and deployed.

- Assuring the system performance stays as expected throughout the system’s lifetime.

- Helping guide future system designs.

• Proportionality

• Reliability

• Consistency

• Independence

• Ease of use

• Avoid misleading conclusions

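Background: SSP metric calculation

A set of applications $I$ is considered

For each application $i \in I$, a set of data sets $J_i$

Per-processor performance of application $i$ executing data set $j$:

$$p_{i,j} = \frac{f_{i,j}}{m_{i,j} \times t_{i,j}}, \quad i \in I,\ j \in J_i$$

where $f_{i,j}$ is the reference operation count, $m_{i,j}$ is the concurrency, and $t_{i,j}$ is the (expected) execution time

The average per-processor performance over all applications is multiplied by the number of processors $N$ in the system:

$$\mathrm{SSP} = N \times \frac{1}{\sum_{i \in I} |J_i|} \sum_{i \in I} \sum_{j \in J_i} \frac{f_{i,j}}{m_{i,j} \times t_{i,j}}$$

SSP expresses the expected throughput of different applications, multiple data sets, and various concurrencies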
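The SSP calculation on this slide can be sketched in C as follows. This is a minimal illustration and not code from the slides. The `Run` struct, the function names, and the flattening of all (application, data set) pairs into one array are assumptions made for the example. The arithmetic mean follows the slide's wording "average per processor performance".

```c
#include <stddef.h>

/* One measured run: some application i on some data set j.
 * All (i, j) pairs are flattened into a single array (an assumption). */
typedef struct {
    double ref_ops;     /* f: reference operation count, in GFlop */
    double concurrency; /* m: number of processors used by the run */
    double seconds;     /* t: (expected) execution time */
} Run;

/* Per-processor performance p = f / (m * t). */
double per_processor_performance(const Run *r) {
    return r->ref_ops / (r->concurrency * r->seconds);
}

/* SSP = N * (arithmetic mean of p over all runs).
 * Returns 0 when no runs are given. */
double ssp(const Run *runs, size_t n_runs, double n_processors) {
    double sum = 0.0;
    if (n_runs == 0) return 0.0;
    for (size_t k = 0; k < n_runs; k++)
        sum += per_processor_performance(&runs[k]);
    return n_processors * sum / (double)n_runs;
}
```

For example, two hypothetical runs with (f, m, t) = (100, 4, 5) and (60, 2, 10) have per-processor performances of 5 and 3, so a 10-processor system gets SSP = 10 x 4 = 40.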

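Proposal: Simplified SSP (SSSP) metric

SSP gives direct insight into system performance when running applications

Computing SSP requires estimating or evaluating system performance for a variety of mini-applications and data sets

- This is laborious

We propose using a benchmark set instead of an application set

- Optimization know-how for simple benchmarks such as HPL is widely known, so they are easier to evaluate than applications

- We want to build a formula that approximates the SSP metric (performance when running applications) by combining and weighting multiple benchmarks

⇒ Building the approximation requires measuring per-processor performance on multiple systems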

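SSSP metric

$$\mathrm{SSSP} = N \times \frac{1}{\sum_{i \in I} |J_i|} \sum_{i \in I} \sum_{j \in J_i} \frac{f_{i,j}}{m_{i,j} \times t_{i,j}}$$

SSSP must be consistent with SSP:

if SSP(s) < SSP(s') then SSSP(s) < SSSP(s')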
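The consistency condition can be checked mechanically once both metrics are available for a set of systems. The following C function is a minimal sketch under that assumption. The function name and the array layout (one value per system, same order in both arrays) are illustrative and do not come from the slides.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if, for every ordered pair of systems (a, b),
 * ssp[a] < ssp[b] implies sssp[a] < sssp[b]. */
bool sssp_is_consistent(const double *ssp, const double *sssp, size_t n_sys) {
    for (size_t a = 0; a < n_sys; a++)
        for (size_t b = 0; b < n_sys; b++)
            if (ssp[a] < ssp[b] && !(sssp[a] < sssp[b]))
                return false;
    return true;
}
```

The check only looks at ordering, so an SSSP whose values differ from SSP by a large factor still passes as long as it ranks the systems the same way.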

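Further approximation by weighting

Weights are determined from execution results on multiple systems

The applications used in SSP are replaced with benchmarks

$$\min_{s \in S} \left| \mathrm{SSP}(s) - \mathrm{SSSP}(s) \right| = \min_{s \in S} \left| \mathrm{SSP}(s) - N \times \sum_{i \in I} \sum_{j \in J_i} w_{i,j} \frac{f_{i,j}}{m_{i,j} \times t_{i,j}} \right|$$

With an appropriately weighted SSSP, system performance when running applications can be estimated from benchmark runs alone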

Through FY2017: relatively small-scale experiments on six systems were used to validate SSSP

| | K | FX10 | FX100 | HA-PACS | Blue Waters | Oakforest-PACS |
|---|---|---|---|---|---|---|
| CPU | SPARC64 VIIIfx | SPARC64 IXfx | SPARC64 XIfx | Intel E5-2670 | AMD 6276 Interlagos | Intel Xeon Phi Knights Landing |
| Clock | 2 GHz | 1.65 GHz | 1.975 GHz | 2.6 GHz | 2.3 GHz | 1.4 GHz |
| Cores | 8 cores | 16 cores | 32+2 cores | 8 cores x 2 sockets | 16 Bulldozer cores x 2 | 68 cores |
| Theoretical peak | 128 GFlops | 211.2 GFlops | 1011.2 GFlops | 332.8 GFlops | 313.6 GFlops | 3046 GFlops |
| Peak relative to K | 1.00 | 1.65 | 7.90 | 2.60 | 2.25 | 23.8 |
| Memory type | DDR3 SDRAM | DDR3 SDRAM | HMC | DDR3 SDRAM | DDR3 SDRAM | MCDRAM + DDR4 |
| Memory capacity, bandwidth | 16 GB, 64 GB/s | 32 GB, 8 GB/s | 32 GB, 240R+240W GB/s | 128 GB, 102.8 GB/s | 64 GB | 16 GB + 96 GB |
| L1 instruction cache | 32 KB/core | 32 KB/core | 32 KB/core | 32 KB/core | 64 KB/2 cores | 32 KB/core |
| L1 data cache | 32 KB/core | 32 KB/core | 32 KB/core | 32 KB/core | 16 KB/core | 32 KB/core |
| L2 cache | 6 MB/node | 12 MB/node | 12 MB x 2/node | 256 KB/core | 2 MB/2 cores | 2 MB L2/2node |
| Additional cache | | | | 20 MB L2 cache/node | 8 MB L3/4 cores | |
| Network | Tofu Interconnect | Tofu Interconnect | Tofu Interconnect 2 | Fat-Tree | Cray Gemini torus interconnect | Intel Omni-Path |
| Network bandwidth | 5 GiB/s x 2 | 5 GiB/s x 2 | 12.5 GiB/s x 2 | 4 GiB/s x 2 | 9.6 GiB/s injection | 100 Gbps |

Mini-applications for our experiments

Fiber Miniapp Suite

- a suite of mini applications developed and maintained by RIKEN AICS

- extracted from the full applications discussed in the application working group of the feasibility study of future HPC infrastructures

- http://fiber-miniapp.github.io/

| Application | Area | Characteristics |
|---|---|---|
| CCS-QCD | Quantum chromodynamics | Structured grid Monte Carlo |
| FFVC | Thermo-fluid analysis | 3-dimensional cavity flow |
| NICAM-DC | Climate | Structured grid stencil |
| mVMC | Material science | Many-variable variational Monte Carlo |
| NGS-Analyzer | Genome sequence analysis | Multi-task workflow |
| NTChem | Quantum chemistry | Molecular orbital method |
| FFB | Thermo-fluid analysis | Finite element method, unstructured grid |

The suite covers various scientific fields and computational characteristics

Benchmark programs for our experiments

| Benchmark | Comments |
|---|---|
| HPL | Linear equations solver |
| Himeno BMTxp | Linear solver of the pressure Poisson equation using a point Jacobi method |
| FFTE | Fast Fourier Transform |
| HPCG | Conjugate gradients |
| Stream Triad | Measures sustainable memory bandwidth |
| NPB BT-IO | NAS Parallel Benchmarks BT for parallel I/O |

On the selection of benchmarks

- Well-known benchmarks: HPL, HPCG, Himeno

- Benchmarks specialized for one aspect of performance: BT-IO (I/O), Stream (memory), FFTE (network)

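Per Processor Performance for benchmarks

Unit: GFlops (HA: HA-PACS, BW: Blue Waters, OFP: Oakforest-PACS)

| Benchmark | Size | K | FX10 | FX100 | HA | BW | OFP |
|---|---|---|---|---|---|---|---|
| HIMENO | M | 19.22 | 23.22 | 46.56 | 14.46 | 4.57 | 84.44 |
| HIMENO | L | 7.56 | 9.71 | 48.25 | 13.39 | 4.23 | 128.24 |
| FFTE | 1024^3 | 2.74 | 2.93 | 13.42 | 5.68 | 5.75 | 24.17 |
| FFTE | 512^23 | 3.41 | 3.51 | 15.01 | 9.36 | 7.95 | 14.38 |
| HPL | 80000 | 96.52 | 149.46 | 544.72 | 296.67 | 235.46 | 1389.38 |
| HPL | 160000 | 97.21 | 150.40 | 547.18 | 292.16 | 234.72 | 1343.69 |
| HPCG | 256^3 | 1.79 | 2.48 | 10.57 | 6.82 | 5.84 | 7.09 |
| HPCG | 512^3 | 1.52 | 2.01 | 10.87 | 6.90 | 5.76 | 6.69 |
| Stream Triad | 2^15 | 3.86 | 3.64 | 13.11 | 13.11 | 2.26 | 10.92 |
| Stream Triad | 2^29 | 2.49 | 3.64 | 8.52 | 4.89 | 0.65 | 5.36 |
| NAS BTIO | C | 9.03 | 13.20 | 18.49 | 28.32 | 13.86 | 31.96 |
| NAS BTIO | D | 5.19 | 7.96 | 23.17 | 16.56 | 19.72 | 31.73 |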

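Per Processor Performance for applications

Unit: GFlops (HA: HA-PACS, BW: Blue Waters, OFP: Oakforest-PACS)

| Application | Data set | K | FX10 | FX100 | HA | BW | OFP |
|---|---|---|---|---|---|---|---|
| CCS-QCD | class1 | 18.43 | 24.69 | 23.15 | 29.95 | 17.83 | 25.44 |
| CCS-QCD | class2 | 11.04 | 10.99 | 33.68 | 39.36 | 15.86 | 38.88 |
| NICAM-DC | gl05rl00z40pe10 | 4.90 | 5.18 | 15.15 | 4.75 | 5.10 | 31.69 |
| NICAM-DC | gl05rl00z80pe5 | 5.87 | 7.11 | 18.95 | 3.53 | 3.97 | 774.87 |
| FFVC | 1024^3 | 12.72 | 17.41 | 34.81 | 67.29 | 52.82 | 227.91 |
| FFVC | 256^3 | 13.96 | 21.52 | 39.73 | 35.19 | 11.24 | 226.24 |
| NTChem | h2o | 10.91 | 12.16 | 57.29 | 76.80 | 43.08 | 64.30 |
| NTChem | taxol | 61.18 | 67.28 | 199.42 | 199.35 | 142.39 | 530.64 |
| FFB | sample | 5.72 | 6.19 | 21.15 | 4.89 | 10.71 | 5.82 |
| mVMC | job_middle | 19.11 | 25.19 | 91.21 | 23.56 | 29.81 | 67.14 |
| mVMC | job_tiny | 4.31 | 3.41 | 4.18 | 5.70 | 7.18 | 5.70 |
| NGS Analyzer | dummy | 0.01 | 0.01 | 0.01 | 0.05 | 0.02 | 0.01 |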

SSP/SSSP on each system

[Figure: SSP, SSSP, and HPL in GFlops for K, FX10, BW, HA, FX100, and OFP. Full-scale panel (0 to 1500) and enlarged panel (0 to 300).]

SSSP is consistent with SSP

SSSP gives a better projection of application performance (SSP) than HPL

Gap between SSP and SSSP

Performance of each benchmark and SSP/SSSP

[Figure: SSP, HPL, Himeno, FFTE, HPCG, Stream, BTIO, and SSSP. Full-scale panel (0 to 1600) and enlarged panel (0 to 160).]

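SSP and individual benchmarks on each system

[Figure: SSP and the individual benchmarks (legend: SSP, HPL, Himeno, FFTE). Full-scale panel (0 to 1600) and enlarged panel (0 to 160), with the FFTE, SSSP, HPL, and SSP curves labeled.]

HPL turned out to be consistent with the application performance metric, but its values diverge greatly from it

Many benchmarks often showed results that contradict the application performance metric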

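Weighting algorithm (learning algorithm)

A simple "learning algorithm" to find weights that minimize the difference between SSP and SSSP.

$$\min_{s \in S} \left| \mathrm{SSP}(s) - \mathrm{SSSP}(s) \right| = \min_{s \in S} \left| \mathrm{SSP}(s) - N \times \sum_{i \in I} \sum_{j \in J_i} w_{i,j} \frac{f_{i,j}}{m_{i,j} \times t_{i,j}} \right|$$

Repeat the following until convergence:

    for (itr = 0; itr < n_itr; itr++) {
        for (sys = 0; sys < n_sys; sys++) {
            /* Weighted SSSP of this system under the current weights. */
            psssp[sys] = calc(per_processor_performance[sys], weight);
            err = pssp[sys] - psssp[sys];
            if (fabs(err) > maxerr) {
                /* Move every weight in the direction that reduces the error. */
                for (i = 0; i < n_bench; i++) {
                    for (j = 0; j < n_data; j++) {
                        weight[i][j] += err * delta * per_processor_performance[sys][i][j];
                        if (weight[i][j] <= 0.0) goto done; /* "break all": leave every loop */
                    }
                }
            }
        }
    }
    done: ;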
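A self-contained version of this update loop might look like the sketch below. It is not the authors' implementation. The function names, the flattening of (benchmark, data set) pairs into one index `k`, and the early exit once no system exceeds `maxerr` are assumptions. Unlike the slide's "break all", this sketch skips any single update that would make a weight non-positive instead of stopping the whole loop. Convergence also depends on choosing `delta` small enough for the magnitude of the measurements.

```c
#include <stddef.h>

/* Absolute value without depending on libm. */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Weighted per-processor SSSP: sum over k of w[k] * p[k],
 * where k runs over all (benchmark, data set) pairs. */
double weighted_sssp(const double *p, const double *w, size_t n) {
    double sum = 0.0;
    for (size_t k = 0; k < n; k++) sum += w[k] * p[k];
    return sum;
}

/* Adjusts w in place. perf holds n_sys rows of n results each;
 * target[s] is the per-processor SSP of system s.
 * Returns the largest remaining |SSP - SSSP| over all systems. */
double fit_weights(const double *perf, const double *target, size_t n_sys,
                   size_t n, double *w, double delta, double maxerr, int n_itr) {
    for (int itr = 0; itr < n_itr; itr++) {
        int updated = 0;
        for (size_t s = 0; s < n_sys; s++) {
            const double *p = perf + s * n;
            double err = target[s] - weighted_sssp(p, w, n);
            if (abs_d(err) <= maxerr) continue;
            updated = 1;
            for (size_t k = 0; k < n; k++) {
                double next = w[k] + err * delta * p[k];
                /* Assumption: skip updates that would make a weight non-positive. */
                if (next > 0.0) w[k] = next;
            }
        }
        if (!updated) break; /* every system is within maxerr */
    }
    double worst = 0.0;
    for (size_t s = 0; s < n_sys; s++) {
        double e = abs_d(target[s] - weighted_sssp(perf + s * n, w, n));
        if (e > worst) worst = e;
    }
    return worst;
}
```

Updating after each system rather than after a full pass keeps the sketch close to the loop on the slide. The return value gives the caller a direct way to see whether the fit reached the requested tolerance within `n_itr` sweeps.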

Initial experiments on weighting

[Figure: SSP, SSSP, and weighted SSSP in GFlops (axis 0 to 300).]

Weighting the benchmarks enabled an even better approximation

(Reference) Obtained weights

$$w_i = \frac{1}{|J|} \sum_{j \in J} w_{i,j}$$

[Figure: bar chart of the obtained weights (axis 0 to 0.8) for HPL, HIMENO, HPCG, Stream, FFTE, and BTIO.]

Computation << memory, network, and I/O benchmarks

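Plan for FY2018

Build and validate SSSP using larger-scale systems

- Examine how performance requirements change with problem size and system size

- Example: FY2017 targeted CCS-QCD classes 1 and 2; for FY2018 we are considering the scale of classes 3 and 4 (20x20x20 processes)

- On OFP, runs on up to several hundred nodes for about 30 application/benchmark/data combinations

• 30 x 1000 nodes x several hours: 90,000 node-hours assumed

Other systems planned for use

- K computer @ RIKEN

- Irene (successor to Curie) @ CEA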
