
IEEE CloudCom 2014 Conference Report (Takano @ AIST, my assigned part)

• Session 2C: Virtualization I
• Session 3C, 4B: HPC on Cloud

2015-02-06, Grid Consortium Japan (グリッド協議会) 45th Workshop

所感 (Impressions)

• Strongly academic in character
• Many attendees from Asia
• Relative to the acceptance rate ...
• The field is maturing (cf. Gartner Hype Curve 2014)

International conferences with "Cloud" in the name (in no particular order):

  Conference        Rank   Publications   Citations   Accepted
  IEEE/ACM CCGrid    A         1454         10577       19%
  IEEE CLOUD         B          234           445       18%
  IEEE CloudCom      C           70           187       18%
  IEEE CloudNet      -            -             -       28%
  IEEE/ACM UCC       -            -             -       19%
  ACM SoCC           -            -             -       24%
  CLOSER             -            -             -       17%

  Rank: CORE computer science conference rankings
  Publications / Citations: Microsoft Academic Search

A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory

• Transcendent memory (tmem)
  – Memory whose size nobody knows, whose writes may fail, and whose data may already be gone by the time it is read back
  – A mechanism for managing a cache of clean pages
    • cleancache, frontswap
    • zcache, RAMster, Xen shim
  – Example application: memory over-provisioning in VM environments
• NEXTmem (aka Ex-Tmem)
  – Uses non-volatile memory to enlarge the cache
  – Memory hierarchies in cloud environments tend to grow deeper, so analytical models of them are an important line of research
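To make the tmem contract above concrete, here is a minimal conceptual sketch in C. It is not the actual Linux cleancache/frontswap interface; the struct and function names are invented for illustration only.

/* Conceptual sketch of the transcendent-memory contract described above.
 * NOT the real Linux cleancache/frontswap API; names and types are
 * illustrative. Key property: a put may be refused, and a page that was
 * put may have been dropped by the time it is read back. */
#include <stdbool.h>

struct tmem_ops {
    /* Offer a clean page to tmem; the backend may refuse (returns false). */
    bool (*put_page)(int pool, unsigned long index, const void *page);
    /* Try to read a page back; may fail even after a successful put. */
    bool (*get_page)(int pool, unsigned long index, void *page);
    /* Tell the backend the page is no longer needed. */
    void (*invalidate_page)(int pool, unsigned long index);
};

/* A caller must therefore always be able to re-read or recompute the data
 * from its backing store when get_page() misses. */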

参考 (Reference): Persistent memory

• Block device
  – NVMe driver
• File system
  – Remove the file-cache layer and access NVM directly
  – PMFS, DAX
• OpenNVM (SanDisk)
  – API: atomic write, atomic trim
  – NVMKV, NVMFS
• SNIA NVM Programming Technical WG
  – http://www.snia.org/forums/sssi/nvmp

PM = Linux terminology for non-volatile memory
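As a concrete illustration of "remove the file-cache layer and access NVM directly", the sketch below maps a file from a DAX-mounted filesystem and stores to it. The mount point /mnt/pmem is an assumption, and production persistent-memory code would typically add MAP_SYNC and explicit cache flushing instead of msync.

/* Minimal sketch: map a file on a DAX-mounted filesystem (e.g. ext4/xfs
 * mounted with -o dax) so loads and stores reach persistent memory
 * directly, bypassing the page cache. The path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = open("/mnt/pmem/example.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, len) != 0) { perror("open/ftruncate"); return 1; }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello, persistent memory");   /* store goes straight to the mapping */
    msync(p, len, MS_SYNC);                  /* conservative way to ensure durability */

    munmap(p, len);
    close(fd);
    return 0;
}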

HPC on Cloud (8 papers)

1. "Reliability Guided Resource Allocation for Large-Scale Systems",
   S. Umamaheshwaran and T. J. Hacker (Purdue U.)
2. "Energy-Efficient Scheduling of Urgent Bag-of-Tasks Applications in Clouds through DVFS",
   R. N. Calheiros and R. Buyya (U. Melbourne)
3. "A Framework for Measuring the Impact and Effectiveness of the NEES Cyberinfrastructure for Earthquake Engineering",
   T. Hacker and A. J. Magana (Purdue U.)
4. "Executing Bag of Distributed Tasks on the Cloud: Investigating the Trade-Offs between Performance and Cost",
   L. Thai, B. Varghese, and A. Barker (U. St Andrews)
5. "CPU Performance Coefficient (CPU-PC): A Novel Performance Metric Based on Real-Time CPU Resource Provisioning in Time-Shared Cloud Environments",
   T. Mastelic, I. Brandic, and J. Jasarevic (Vienna U. of Technology)
6. "Performance Analysis of Cloud Environments on Top of Energy-Efficient Platforms Featuring Low Power Processors",
   V. Plugaru, S. Varrette, and P. Bouvry (U. Luxembourg)
7. "Exploring the Performance Impact of Virtualization on an HPC Cloud",
   N. Chakthranont, P. Khunphet, R. Takano, and T. Ikegami (KMUTNB / AIST)
8. "GateCloud: An Integration of Gate Monte Carlo Simulation with a Cloud Computing Environment",
   B. A. Rowedder, H. Wang, and Y. Kuang (UNLV)

Keywords (キーワード)

• Goals
  – Fault tolerance [1], energy efficiency [2, 6], performance metrics [4, 5], high performance [6, 7]
• Systems
  – Resource provisioning / schedulers [1, 4, 5]
  – IaaS: OpenStack [6], CloudStack [7]
  – Workflow [8]
• Applications
  – MPI [6, 7]
  – Bag of Tasks [2], Bag of Distributed Tasks [4]
  – Web applications (FFmpeg, MongoDB, Ruby on Rails) [5]
  – Monte Carlo [8]
  – Earthquake engineering [3]

CPU Performance Coefficient (CPU-PC): A Novel Performance Metric Based on Real-time CPU Resource Provisioning in Time-shared Cloud Environments

• In a cloud environment, multiple VMs share a single physical server
• A performance metric is wanted that both cloud providers and cloud users can use
  – Response time fluctuates with interference from other VMs
• Proposes CPU-PC, a metric based on stolen time
• CPU-PC shows a very high correlation with response time
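For reference, "stolen time" is the time a runnable vCPU spends waiting while the hypervisor runs other VMs; a Linux guest exposes it as the 8th value on the cpu line of /proc/stat. The sketch below only samples that raw counter; it does not implement the CPU-PC formula from the paper.

/* Illustration only: read the "steal" jiffies a Linux guest reports in
 * /proc/stat (8th value on the aggregate "cpu" line). This is the raw
 * input a stolen-time metric builds on, not the paper's CPU-PC formula. */
#include <stdio.h>

int main(void)
{
    unsigned long long user, nice, sys, idle, iowait, irq, softirq, steal;
    FILE *f = fopen("/proc/stat", "r");
    if (!f) { perror("/proc/stat"); return 1; }

    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal) == 8) {
        unsigned long long busy = user + nice + sys + irq + softirq;
        printf("steal=%llu jiffies, busy=%llu jiffies\n", steal, busy);
        /* Sampling twice and differencing gives the share of time the
         * hypervisor ran other VMs instead of this one. */
    }
    fclose(f);
    return 0;
}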

Exploring the Performance Impact of Virtualization on an HPC Cloud (paper 7; the remaining slides summarize this paper)

ASGC Hardware Spec

Compute node:
  CPU          Intel Xeon E5-2680 v2, 2.8 GHz (10 cores) x 2 CPUs
  Memory       128 GB DDR3-1866
  InfiniBand   Mellanox ConnectX-3 (FDR)
  Ethernet     Intel X520-DA2 (10 GbE)
  Disk         Intel SSD DC S3500, 600 GB

• 155-node cluster built from Cray H2312 blade servers
• Theoretical peak performance: 69.44 TFLOPS
• Operation started in July 2014
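As a cross-check of the peak figure (my own arithmetic, not stated on the slide): each E5-2680 v2 core performs 8 double-precision flops per cycle with AVX, so

  Rpeak = 155 nodes × 2 sockets × 10 cores × 2.8 GHz × 8 flop/cycle = 69.44 TFLOPS

and the 128-node subset used for G-HPL below has Rpeak = 128 × 0.448 TFLOPS ≈ 57.3 TFLOPS.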

ASGC Software Stack

Management stack
  – CentOS 6.5 (QEMU/KVM 0.12.1.2)
  – Apache CloudStack 4.3 + our extensions
    • PCI passthrough / SR-IOV support (KVM only)
    • sgc-tools: virtual cluster construction utility
  – RADOS cluster storage

HPC stack (virtual cluster)
  – Intel Compiler / Math Kernel Library SP1 1.1.106
  – Open MPI 1.6.5
  – Mellanox OFED 2.1
  – Torque job scheduler

Benchmark Programs

Micro benchmark
  – Intel MPI Benchmarks (IMB) version 3.2.4

Application-level benchmarks
  – HPC Challenge (HPCC) version 1.4.3
    • G-HPL
    • EP-STREAM
    • G-RandomAccess
    • G-FFT
  – OpenMX version 3.7.4
  – Graph 500 version 2.1.4

MPI Point-to-Point Communication (IMB)

• Peak bandwidth: 5.85 GB/s (physical) vs. 5.69 GB/s (virtual)
• The overhead is less than 3% for large messages, though it reaches up to 25% for small messages
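For context, a point-to-point bandwidth number like this comes from a ping-pong measurement; the sketch below is a stand-in for the IMB PingPong test (not its actual code), with an arbitrary 4 MiB message size and repetition count.

/* Minimal MPI ping-pong bandwidth sketch (stand-in for IMB PingPong).
 * Run with two ranks, e.g.: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 100, bytes = 4 * 1024 * 1024;   /* 4 MiB messages */
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0)   /* two messages of `bytes` cross the link per iteration */
        printf("bandwidth: %.2f GB/s\n", 2.0 * reps * bytes / t / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}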

MPI Collectives (64 bytes, IMB)

[Figure: execution time (usec) vs. number of nodes (0 to 128) for Allgather, Allreduce, and Alltoall, physical vs. virtual cluster]

• The overhead becomes significant as the number of nodes increases: +77% (Allgather), +88% (Allreduce), +43% (Alltoall) ... attributed to load imbalance
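The collective results are measured in the same style; a minimal sketch of timing a 64-byte MPI_Allreduce (again a stand-in for IMB, not its code) also shows why stragglers matter: the reported time is effectively that of the slowest rank.

/* Minimal sketch: time a 64-byte MPI_Allreduce across all ranks
 * (stand-in for the IMB Allreduce test, not its actual code). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int reps = 1000;
    double in[8] = {0}, out[8];          /* 8 doubles = 64 bytes */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(in, out, 8, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - t0) / reps * 1e6;   /* usec per call */

    /* Report the slowest rank: stragglers (load imbalance) dominate here. */
    double worst = 0.0;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("Allreduce(64B): %.2f usec (max over ranks)\n", worst);

    MPI_Finalize();
    return 0;
}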

G-HPL (LINPACK, HPCC)

[Figure: performance (TFLOPS) vs. number of nodes (0 to 128), physical vs. virtual cluster]

• Performance degradation: 5.4% - 6.6%
• Efficiency(*) on 128 nodes: physical 90%, virtual 84%
  (*) efficiency = Rmax / Rpeak

EP-STREAM and G-FFT (HPCC)

[Figures: EP-STREAM performance (GB/s) and G-FFT performance (GFLOPS) vs. number of nodes (0 to 128), physical vs. virtual cluster]

• The overheads are negligible
  – EP-STREAM: memory intensive, no communication
  – G-FFT: all-to-all communication with large messages

Graph500 (replicated-csc, scale 26)

[Figure: performance (TEPS, log scale from 1.0E+07 to 1.0E+10) vs. number of nodes (0 to 64), physical vs. virtual cluster]

• Performance degradation: 2% (64 nodes)
• Graph500 is a hybrid parallel program (MPI + OpenMP); we used a combination of 2 MPI processes and 10 OpenMP threads
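The 2-process × 10-thread layout is presumably one MPI rank per socket on these 2 × 10-core nodes; the tiny illustrative program below (not the authors' code) verifies such a hybrid launch at runtime.

/* Illustrative check of a hybrid MPI+OpenMP layout such as the
 * "2 MPI processes x 10 OpenMP threads" configuration mentioned above.
 * Example launch (Open MPI, one node): OMP_NUM_THREADS=10 mpirun -np 2 ./hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        #pragma omp master
        printf("rank %d of %d: %d OpenMP threads\n",
               rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}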

Findings

• PCI passthrough is effective in improving the I/O performance; however, it still cannot match the low communication latency of a physical cluster, due to virtual interrupt injection
• VCPU pinning improves the performance of HPC applications
• Almost all MPI collectives suffer from the scalability issue
• The overhead of virtualization has less impact on actual applications than the micro-benchmark results suggest

  • ASGC Hardware Spec
  • ASGC Software Stack
  • Benchmark Programs
  • MPI Point-to-point communication
  • MPI Collectives (64bytes)
  • G-HPL (LINPACK)
  • EP-STREAM and G-FFT
  • Graph500 (replicated-csc scale 26)
  • Findings
Page 9: IEEE CloudCom 2014参加報告 高野@産総研 担当パート

ASGC Software Stack

Management Stack

ndash CentOS 65 (QEMUKVM 01212)

ndash Apache CloudStack 43 + our extensions bull PCI passthroughSR-IOV support (KVM only)

bull sgc-tools Virtual cluster construction utility

ndash RADOS cluster storage

HPC Stack (Virtual Cluster)

ndash Intel CompilerMath Kernel Library SP1 11106

ndash Open MPI 165

ndash Mellanox OFED 21

ndash Torque job scheduler

9

Exploring the Performance Impact of Virtualization on an HPC Cloud

Benchmark Programs

Micro benchmark

ndash Intel Micro Benchmark (IMB) version 324

Application-level benchmark

ndash HPC Challenge (HPCC) version 143 bull G-HPL

bull EP-STREAM

bull G-RandomAccess

bull G-FFT

ndash OpenMX version 374

ndash Graph 500 version 214

10

Exploring the Performance Impact of Virtualization on an HPC Cloud

MPI Point-to-point communication

11

585GBs 569GBs

The overhead is less than 3 with large message though it is up to 25 with small message

IMB Exploring the Performance Impact of Virtualization on an HPC Cloud

MPI Collectives (64bytes)

12

0

1000

2000

3000

4000

5000

0 32 64 96 128

Exe

cution T

ime (u

sec)

Number of Nodes

PhysicalCluster

0

200

400

600

800

1000

1200

0 32 64 96 128

Exe

cution T

ime (use

c)

Number of Nodes

PhysicalCluster

0

2000

4000

6000

0 32 64 96 128

Exe

cution T

ime (use

c)

Number of Nodes

PhysicalCluster

Allgather Allreduce

Alltoall

IMB

The overhead becomes significant as the number of nodes increases hellip load imbalance

+77 +88

+43

Exploring the Performance Impact of Virtualization on an HPC Cloud

G-HPL (LINPACK)

13

0

10

20

30

40

50

60

0 32 64 96 128

Perf

orm

ance (TFLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

Performance degradation 54 - 66

Efficiency on 128 nodes Physical 90 Virtual 84

) Rmax Rpeak

HPCC Exploring the Performance Impact of Virtualization on an HPC Cloud

EP-STREAM and G-FFT

14

0

2

4

6

0 32 64 96 128

Perf

orm

ance (G

Bs)

Number of Nodes

Physical Cluster

Virtual Cluster

0

40

80

120

160

0 32 64 96 128

Perf

orm

ance (G

FLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

EP-STREAM G-FFT

HPCC

The overheads are ignorable

memory intensive with no communication

all-to-all communication with large messages

Exploring the Performance Impact of Virtualization on an HPC Cloud

Graph500 (replicated-csc scale 26)

15

100E+07

100E+08

100E+09

100E+10

0 16 32 48 64

Perf

orm

ance (TEP

S)

Number of Nodes

Physical ClusterVirtual Cluster

Graph500

Performance degradation 2 (64node)

Graph500 is a Hybrid parallel program (MPI + OpenMP) We used a combination of 2 MPI processes and 10 OpenMP threads

Exploring the Performance Impact of Virtualization on an HPC Cloud

Findings

bull PCI passthrough is effective in improving the IO performance however it is still unable to achieve the low communication latency of a physical cluster due to a virtual interrupt injection

bull VCPU pinning improves the performance for HPC applications

bull Almost all MPI collectives suffer from the scalability issue

bull The overhead of virtualization has less impact on actual applications

16

Exploring the Performance Impact of Virtualization on an HPC Cloud

  • IEEE CloudCom 2014参加報告高野産総研 担当パート
  • 所感
  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory
  • 参考 Persistent memory
  • HPC on Cloud (8 papers)
  • キーワード
  • 111312511131261113127111312611131281113129111313011131311113129111313211131331113134111313511131281113125111313111131281113130111313611131351113137111312811131341113138111313911131251113126111312711131401113126111312511131411113142111314311131441113131111314511131281113146111312611131281113129111313011131311113129111313211131331113134111313511131281113147111312811131381113129111313711131351113148111313311131491113128111315011131311113134 1113151111312811131331113146111314011131381113137111313211131281113125111312611131271113151111312811131491113131111315211131291113135111312811131261113129111313111131451113137111314911131371113131111313411131371113134111315311131371113134111315411131371113132111312811131401113149CPU Performance Coefficient (CPU-PC) A Novel Performance Metric Based on Real-time CPU Resource Provisioning in Time-shared Cloud Environment
  • ASGC Hardware Spec
  • ASGC Software Stack
  • Benchmark Programs
  • MPI Point-to-point communication
  • MPI Collectives (64bytes)
  • G-HPL (LINPACK)
  • EP-STREAM and G-FFT
  • Graph500 (replicated-csc scale 26)
  • Findings
Page 10: IEEE CloudCom 2014参加報告 高野@産総研 担当パート

Benchmark Programs

Micro benchmark

ndash Intel Micro Benchmark (IMB) version 324

Application-level benchmark

ndash HPC Challenge (HPCC) version 143 bull G-HPL

bull EP-STREAM

bull G-RandomAccess

bull G-FFT

ndash OpenMX version 374

ndash Graph 500 version 214

10

Exploring the Performance Impact of Virtualization on an HPC Cloud

MPI Point-to-point communication

11

585GBs 569GBs

The overhead is less than 3 with large message though it is up to 25 with small message

IMB Exploring the Performance Impact of Virtualization on an HPC Cloud

MPI Collectives (64bytes)

12

0

1000

2000

3000

4000

5000

0 32 64 96 128

Exe

cution T

ime (u

sec)

Number of Nodes

PhysicalCluster

0

200

400

600

800

1000

1200

0 32 64 96 128

Exe

cution T

ime (use

c)

Number of Nodes

PhysicalCluster

0

2000

4000

6000

0 32 64 96 128

Exe

cution T

ime (use

c)

Number of Nodes

PhysicalCluster

Allgather Allreduce

Alltoall

IMB

The overhead becomes significant as the number of nodes increases hellip load imbalance

+77 +88

+43

Exploring the Performance Impact of Virtualization on an HPC Cloud

G-HPL (LINPACK)

13

0

10

20

30

40

50

60

0 32 64 96 128

Perf

orm

ance (TFLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

Performance degradation 54 - 66

Efficiency on 128 nodes Physical 90 Virtual 84

) Rmax Rpeak

HPCC Exploring the Performance Impact of Virtualization on an HPC Cloud

EP-STREAM and G-FFT

14

0

2

4

6

0 32 64 96 128

Perf

orm

ance (G

Bs)

Number of Nodes

Physical Cluster

Virtual Cluster

0

40

80

120

160

0 32 64 96 128

Perf

orm

ance (G

FLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

EP-STREAM G-FFT

HPCC

The overheads are ignorable

memory intensive with no communication

all-to-all communication with large messages

Exploring the Performance Impact of Virtualization on an HPC Cloud

Graph500 (replicated-csc scale 26)

15

100E+07

100E+08

100E+09

100E+10

0 16 32 48 64

Perf

orm

ance (TEP

S)

Number of Nodes

Physical ClusterVirtual Cluster

Graph500

Performance degradation 2 (64node)

Graph500 is a Hybrid parallel program (MPI + OpenMP) We used a combination of 2 MPI processes and 10 OpenMP threads

Exploring the Performance Impact of Virtualization on an HPC Cloud

Findings

bull PCI passthrough is effective in improving the IO performance however it is still unable to achieve the low communication latency of a physical cluster due to a virtual interrupt injection

bull VCPU pinning improves the performance for HPC applications

bull Almost all MPI collectives suffer from the scalability issue

bull The overhead of virtualization has less impact on actual applications

16

Exploring the Performance Impact of Virtualization on an HPC Cloud

  • IEEE CloudCom 2014参加報告高野産総研 担当パート
  • 所感
  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory
  • 参考 Persistent memory
  • HPC on Cloud (8 papers)
  • キーワード
  • 111312511131261113127111312611131281113129111313011131311113129111313211131331113134111313511131281113125111313111131281113130111313611131351113137111312811131341113138111313911131251113126111312711131401113126111312511131411113142111314311131441113131111314511131281113146111312611131281113129111313011131311113129111313211131331113134111313511131281113147111312811131381113129111313711131351113148111313311131491113128111315011131311113134 1113151111312811131331113146111314011131381113137111313211131281113125111312611131271113151111312811131491113131111315211131291113135111312811131261113129111313111131451113137111314911131371113131111313411131371113134111315311131371113134111315411131371113132111312811131401113149CPU Performance Coefficient (CPU-PC) A Novel Performance Metric Based on Real-time CPU Resource Provisioning in Time-shared Cloud Environment
  • ASGC Hardware Spec
  • ASGC Software Stack
  • Benchmark Programs
  • MPI Point-to-point communication
  • MPI Collectives (64bytes)
  • G-HPL (LINPACK)
  • EP-STREAM and G-FFT
  • Graph500 (replicated-csc scale 26)
  • Findings
Page 11: IEEE CloudCom 2014参加報告 高野@産総研 担当パート

MPI Point-to-point communication

11

585GBs 569GBs

The overhead is less than 3 with large message though it is up to 25 with small message

IMB Exploring the Performance Impact of Virtualization on an HPC Cloud

MPI Collectives (64bytes)

12

0

1000

2000

3000

4000

5000

0 32 64 96 128

Exe

cution T

ime (u

sec)

Number of Nodes

PhysicalCluster

0

200

400

600

800

1000

1200

0 32 64 96 128

Exe

cution T

ime (use

c)

Number of Nodes

PhysicalCluster

0

2000

4000

6000

0 32 64 96 128

Exe

cution T

ime (use

c)

Number of Nodes

PhysicalCluster

Allgather Allreduce

Alltoall

IMB

The overhead becomes significant as the number of nodes increases hellip load imbalance

+77 +88

+43

Exploring the Performance Impact of Virtualization on an HPC Cloud

G-HPL (LINPACK)

13

0

10

20

30

40

50

60

0 32 64 96 128

Perf

orm

ance (TFLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

Performance degradation 54 - 66

Efficiency on 128 nodes Physical 90 Virtual 84

) Rmax Rpeak

HPCC Exploring the Performance Impact of Virtualization on an HPC Cloud

EP-STREAM and G-FFT

14

0

2

4

6

0 32 64 96 128

Perf

orm

ance (G

Bs)

Number of Nodes

Physical Cluster

Virtual Cluster

0

40

80

120

160

0 32 64 96 128

Perf

orm

ance (G

FLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

EP-STREAM G-FFT

HPCC

The overheads are ignorable

memory intensive with no communication

all-to-all communication with large messages

Exploring the Performance Impact of Virtualization on an HPC Cloud

Graph500 (replicated-csc scale 26)

15

100E+07

100E+08

100E+09

100E+10

0 16 32 48 64

Perf

orm

ance (TEP

S)

Number of Nodes

Physical ClusterVirtual Cluster

Graph500

Performance degradation 2 (64node)

Graph500 is a Hybrid parallel program (MPI + OpenMP) We used a combination of 2 MPI processes and 10 OpenMP threads

Exploring the Performance Impact of Virtualization on an HPC Cloud

Findings

bull PCI passthrough is effective in improving the IO performance however it is still unable to achieve the low communication latency of a physical cluster due to a virtual interrupt injection

bull VCPU pinning improves the performance for HPC applications

bull Almost all MPI collectives suffer from the scalability issue

bull The overhead of virtualization has less impact on actual applications

16

Exploring the Performance Impact of Virtualization on an HPC Cloud

  • IEEE CloudCom 2014参加報告高野産総研 担当パート
  • 所感
  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory
  • 参考 Persistent memory
  • HPC on Cloud (8 papers)
  • キーワード
  • 111312511131261113127111312611131281113129111313011131311113129111313211131331113134111313511131281113125111313111131281113130111313611131351113137111312811131341113138111313911131251113126111312711131401113126111312511131411113142111314311131441113131111314511131281113146111312611131281113129111313011131311113129111313211131331113134111313511131281113147111312811131381113129111313711131351113148111313311131491113128111315011131311113134 1113151111312811131331113146111314011131381113137111313211131281113125111312611131271113151111312811131491113131111315211131291113135111312811131261113129111313111131451113137111314911131371113131111313411131371113134111315311131371113134111315411131371113132111312811131401113149CPU Performance Coefficient (CPU-PC) A Novel Performance Metric Based on Real-time CPU Resource Provisioning in Time-shared Cloud Environment
  • ASGC Hardware Spec
  • ASGC Software Stack
  • Benchmark Programs
  • MPI Point-to-point communication
  • MPI Collectives (64bytes)
  • G-HPL (LINPACK)
  • EP-STREAM and G-FFT
  • Graph500 (replicated-csc scale 26)
  • Findings
Page 12: IEEE CloudCom 2014参加報告 高野@産総研 担当パート

MPI Collectives (64bytes)

12

0

1000

2000

3000

4000

5000

0 32 64 96 128

Exe

cution T

ime (u

sec)

Number of Nodes

PhysicalCluster

0

200

400

600

800

1000

1200

0 32 64 96 128

Exe

cution T

ime (use

c)

Number of Nodes

PhysicalCluster

0

2000

4000

6000

0 32 64 96 128

Exe

cution T

ime (use

c)

Number of Nodes

PhysicalCluster

Allgather Allreduce

Alltoall

IMB

The overhead becomes significant as the number of nodes increases hellip load imbalance

+77 +88

+43

Exploring the Performance Impact of Virtualization on an HPC Cloud

G-HPL (LINPACK)

13

0

10

20

30

40

50

60

0 32 64 96 128

Perf

orm

ance (TFLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

Performance degradation 54 - 66

Efficiency on 128 nodes Physical 90 Virtual 84

) Rmax Rpeak

HPCC Exploring the Performance Impact of Virtualization on an HPC Cloud

EP-STREAM and G-FFT

14

0

2

4

6

0 32 64 96 128

Perf

orm

ance (G

Bs)

Number of Nodes

Physical Cluster

Virtual Cluster

0

40

80

120

160

0 32 64 96 128

Perf

orm

ance (G

FLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

EP-STREAM G-FFT

HPCC

The overheads are ignorable

memory intensive with no communication

all-to-all communication with large messages

Exploring the Performance Impact of Virtualization on an HPC Cloud

Graph500 (replicated-csc scale 26)

15

100E+07

100E+08

100E+09

100E+10

0 16 32 48 64

Perf

orm

ance (TEP

S)

Number of Nodes

Physical ClusterVirtual Cluster

Graph500

Performance degradation 2 (64node)

Graph500 is a Hybrid parallel program (MPI + OpenMP) We used a combination of 2 MPI processes and 10 OpenMP threads

Exploring the Performance Impact of Virtualization on an HPC Cloud

Findings

bull PCI passthrough is effective in improving the IO performance however it is still unable to achieve the low communication latency of a physical cluster due to a virtual interrupt injection

bull VCPU pinning improves the performance for HPC applications

bull Almost all MPI collectives suffer from the scalability issue

bull The overhead of virtualization has less impact on actual applications

16

Exploring the Performance Impact of Virtualization on an HPC Cloud

  • IEEE CloudCom 2014参加報告高野産総研 担当パート
  • 所感
  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory
  • 参考 Persistent memory
  • HPC on Cloud (8 papers)
  • キーワード
  • 111312511131261113127111312611131281113129111313011131311113129111313211131331113134111313511131281113125111313111131281113130111313611131351113137111312811131341113138111313911131251113126111312711131401113126111312511131411113142111314311131441113131111314511131281113146111312611131281113129111313011131311113129111313211131331113134111313511131281113147111312811131381113129111313711131351113148111313311131491113128111315011131311113134 1113151111312811131331113146111314011131381113137111313211131281113125111312611131271113151111312811131491113131111315211131291113135111312811131261113129111313111131451113137111314911131371113131111313411131371113134111315311131371113134111315411131371113132111312811131401113149CPU Performance Coefficient (CPU-PC) A Novel Performance Metric Based on Real-time CPU Resource Provisioning in Time-shared Cloud Environment
  • ASGC Hardware Spec
  • ASGC Software Stack
  • Benchmark Programs
  • MPI Point-to-point communication
  • MPI Collectives (64bytes)
  • G-HPL (LINPACK)
  • EP-STREAM and G-FFT
  • Graph500 (replicated-csc scale 26)
  • Findings
Page 13: IEEE CloudCom 2014参加報告 高野@産総研 担当パート

G-HPL (LINPACK)

13

0

10

20

30

40

50

60

0 32 64 96 128

Perf

orm

ance (TFLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

Performance degradation 54 - 66

Efficiency on 128 nodes Physical 90 Virtual 84

) Rmax Rpeak

HPCC Exploring the Performance Impact of Virtualization on an HPC Cloud

EP-STREAM and G-FFT

14

0

2

4

6

0 32 64 96 128

Perf

orm

ance (G

Bs)

Number of Nodes

Physical Cluster

Virtual Cluster

0

40

80

120

160

0 32 64 96 128

Perf

orm

ance (G

FLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

EP-STREAM G-FFT

HPCC

The overheads are ignorable

memory intensive with no communication

all-to-all communication with large messages

Exploring the Performance Impact of Virtualization on an HPC Cloud

Graph500 (replicated-csc scale 26)

15

100E+07

100E+08

100E+09

100E+10

0 16 32 48 64

Perf

orm

ance (TEP

S)

Number of Nodes

Physical ClusterVirtual Cluster

Graph500

Performance degradation 2 (64node)

Graph500 is a Hybrid parallel program (MPI + OpenMP) We used a combination of 2 MPI processes and 10 OpenMP threads

Exploring the Performance Impact of Virtualization on an HPC Cloud

Findings

bull PCI passthrough is effective in improving the IO performance however it is still unable to achieve the low communication latency of a physical cluster due to a virtual interrupt injection

bull VCPU pinning improves the performance for HPC applications

bull Almost all MPI collectives suffer from the scalability issue

bull The overhead of virtualization has less impact on actual applications

16

Exploring the Performance Impact of Virtualization on an HPC Cloud

  • IEEE CloudCom 2014参加報告高野産総研 担当パート
  • 所感
  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory
  • 参考 Persistent memory
  • HPC on Cloud (8 papers)
  • キーワード
  • 111312511131261113127111312611131281113129111313011131311113129111313211131331113134111313511131281113125111313111131281113130111313611131351113137111312811131341113138111313911131251113126111312711131401113126111312511131411113142111314311131441113131111314511131281113146111312611131281113129111313011131311113129111313211131331113134111313511131281113147111312811131381113129111313711131351113148111313311131491113128111315011131311113134 1113151111312811131331113146111314011131381113137111313211131281113125111312611131271113151111312811131491113131111315211131291113135111312811131261113129111313111131451113137111314911131371113131111313411131371113134111315311131371113134111315411131371113132111312811131401113149CPU Performance Coefficient (CPU-PC) A Novel Performance Metric Based on Real-time CPU Resource Provisioning in Time-shared Cloud Environment
  • ASGC Hardware Spec
  • ASGC Software Stack
  • Benchmark Programs
  • MPI Point-to-point communication
  • MPI Collectives (64bytes)
  • G-HPL (LINPACK)
  • EP-STREAM and G-FFT
  • Graph500 (replicated-csc scale 26)
  • Findings
Page 14: IEEE CloudCom 2014参加報告 高野@産総研 担当パート

EP-STREAM and G-FFT

14

0

2

4

6

0 32 64 96 128

Perf

orm

ance (G

Bs)

Number of Nodes

Physical Cluster

Virtual Cluster

0

40

80

120

160

0 32 64 96 128

Perf

orm

ance (G

FLO

PS)

Number of Nodes

Physical Cluster

Virtual Cluster

EP-STREAM G-FFT

HPCC

The overheads are ignorable

memory intensive with no communication

all-to-all communication with large messages

Exploring the Performance Impact of Virtualization on an HPC Cloud

Graph500 (replicated-csc scale 26)

15

100E+07

100E+08

100E+09

100E+10

0 16 32 48 64

Perf

orm

ance (TEP

S)

Number of Nodes

Physical ClusterVirtual Cluster

Graph500

Performance degradation 2 (64node)

Graph500 is a Hybrid parallel program (MPI + OpenMP) We used a combination of 2 MPI processes and 10 OpenMP threads

Exploring the Performance Impact of Virtualization on an HPC Cloud

Findings

bull PCI passthrough is effective in improving the IO performance however it is still unable to achieve the low communication latency of a physical cluster due to a virtual interrupt injection

bull VCPU pinning improves the performance for HPC applications

bull Almost all MPI collectives suffer from the scalability issue

bull The overhead of virtualization has less impact on actual applications

16

Exploring the Performance Impact of Virtualization on an HPC Cloud

  • IEEE CloudCom 2014参加報告高野産総研 担当パート
  • 所感
  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory
  • 参考 Persistent memory
  • HPC on Cloud (8 papers)
  • キーワード
  • 111312511131261113127111312611131281113129111313011131311113129111313211131331113134111313511131281113125111313111131281113130111313611131351113137111312811131341113138111313911131251113126111312711131401113126111312511131411113142111314311131441113131111314511131281113146111312611131281113129111313011131311113129111313211131331113134111313511131281113147111312811131381113129111313711131351113148111313311131491113128111315011131311113134 1113151111312811131331113146111314011131381113137111313211131281113125111312611131271113151111312811131491113131111315211131291113135111312811131261113129111313111131451113137111314911131371113131111313411131371113134111315311131371113134111315411131371113132111312811131401113149CPU Performance Coefficient (CPU-PC) A Novel Performance Metric Based on Real-time CPU Resource Provisioning in Time-shared Cloud Environment
  • ASGC Hardware Spec
  • ASGC Software Stack
  • Benchmark Programs
  • MPI Point-to-point communication
  • MPI Collectives (64bytes)
  • G-HPL (LINPACK)
  • EP-STREAM and G-FFT
  • Graph500 (replicated-csc scale 26)
  • Findings
Page 15: IEEE CloudCom 2014参加報告 高野@産総研 担当パート

Graph500 (replicated-csc scale 26)

15

100E+07

100E+08

100E+09

100E+10

0 16 32 48 64

Perf

orm

ance (TEP

S)

Number of Nodes

Physical ClusterVirtual Cluster

Graph500

Performance degradation 2 (64node)

Graph500 is a Hybrid parallel program (MPI + OpenMP) We used a combination of 2 MPI processes and 10 OpenMP threads

Exploring the Performance Impact of Virtualization on an HPC Cloud

Findings

bull PCI passthrough is effective in improving the IO performance however it is still unable to achieve the low communication latency of a physical cluster due to a virtual interrupt injection

bull VCPU pinning improves the performance for HPC applications

bull Almost all MPI collectives suffer from the scalability issue

bull The overhead of virtualization has less impact on actual applications

16

Exploring the Performance Impact of Virtualization on an HPC Cloud

  • IEEE CloudCom 2014参加報告高野産総研 担当パート
  • 所感
  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory
  • 参考 Persistent memory
  • HPC on Cloud (8 papers)
  • キーワード
  • 111312511131261113127111312611131281113129111313011131311113129111313211131331113134111313511131281113125111313111131281113130111313611131351113137111312811131341113138111313911131251113126111312711131401113126111312511131411113142111314311131441113131111314511131281113146111312611131281113129111313011131311113129111313211131331113134111313511131281113147111312811131381113129111313711131351113148111313311131491113128111315011131311113134 1113151111312811131331113146111314011131381113137111313211131281113125111312611131271113151111312811131491113131111315211131291113135111312811131261113129111313111131451113137111314911131371113131111313411131371113134111315311131371113134111315411131371113132111312811131401113149CPU Performance Coefficient (CPU-PC) A Novel Performance Metric Based on Real-time CPU Resource Provisioning in Time-shared Cloud Environment
  • ASGC Hardware Spec
  • ASGC Software Stack
  • Benchmark Programs
  • MPI Point-to-point communication
  • MPI Collectives (64bytes)
  • G-HPL (LINPACK)
  • EP-STREAM and G-FFT
  • Graph500 (replicated-csc scale 26)
  • Findings
Page 16: IEEE CloudCom 2014参加報告 高野@産総研 担当パート

Findings

bull PCI passthrough is effective in improving the IO performance however it is still unable to achieve the low communication latency of a physical cluster due to a virtual interrupt injection

bull VCPU pinning improves the performance for HPC applications

bull Almost all MPI collectives suffer from the scalability issue

bull The overhead of virtualization has less impact on actual applications

16

Exploring the Performance Impact of Virtualization on an HPC Cloud

  • IEEE CloudCom 2014参加報告高野産総研 担当パート
  • 所感
  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory
  • 参考 Persistent memory
  • HPC on Cloud (8 papers)
  • キーワード
  • 111312511131261113127111312611131281113129111313011131311113129111313211131331113134111313511131281113125111313111131281113130111313611131351113137111312811131341113138111313911131251113126111312711131401113126111312511131411113142111314311131441113131111314511131281113146111312611131281113129111313011131311113129111313211131331113134111313511131281113147111312811131381113129111313711131351113148111313311131491113128111315011131311113134 1113151111312811131331113146111314011131381113137111313211131281113125111312611131271113151111312811131491113131111315211131291113135111312811131261113129111313111131451113137111314911131371113131111313411131371113134111315311131371113134111315411131371113132111312811131401113149CPU Performance Coefficient (CPU-PC) A Novel Performance Metric Based on Real-time CPU Resource Provisioning in Time-shared Cloud Environment
  • ASGC Hardware Spec
  • ASGC Software Stack
  • Benchmark Programs
  • MPI Point-to-point communication
  • MPI Collectives (64bytes)
  • G-HPL (LINPACK)
  • EP-STREAM and G-FFT
  • Graph500 (replicated-csc scale 26)
  • Findings