IEEE CloudCom 2014 Conference Report (Takano, AIST: assigned part)
• Session 2C: Virtualization I
• Sessions 3C and 4B: HPC on Cloud
2015-02-06, 45th Grid Consortium Japan Workshop
Impressions
• Strong academic flavor
• Many Asian participants
• Considering the acceptance rate ...
• The field is maturing

International conferences with "Cloud" in the name (the order carries no particular meaning):

Conference        Rank  Publication  Citation  Accepted
IEEE/ACM CCGrid   A     1454         10577     19%
IEEE CLOUD        B     234          445       18%
IEEE CloudCom     C     70           187       18%
IEEE CloudNet     -     -            -         28%
IEEE/ACM UCC      -     -            -         19%
ACM SoCC          -     -            -         24%
CLOSER            -     -            -         17%

Rank: CORE computer science conference rankings. Publication / Citation: Microsoft Academic Search.
(cf. Gartner Hype Curve 2014)
A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory
• Transcendent memory (tmem)
  – Memory whose size nobody knows, whose writes may fail, and whose data may already be gone by the time it is read
  – Mechanisms for managing caches of clean pages
    • cleancache, frontswap
    • zcache, RAMster, Xen shim
  – Example application: memory overprovisioning in VM environments
• NEXTmem (a.k.a. Ex-Tmem)
  – Uses nonvolatile memory to increase the cache capacity
  – Memory hierarchies in cloud environments tend to grow deeper, so analytical models of them are an important line of research
Reference: Persistent memory (PM = Linux terminology for nonvolatile memory)
• Block device
  – NVMe driver
• File system
  – Remove the file-cache layer and access NVM directly
  – PMFS, DAX
• OpenNVM (SanDisk)
  – API: atomic write, atomic trim
  – NVMKV, NVMFS
• SNIA NVM Programming Technical WG
  – http://www.snia.org/forums/sssi/nvmp
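The slide only names the mechanisms; as a rough, hypothetical illustration of the "access NVM directly, without the file cache" idea behind PMFS and DAX, the C sketch below maps a file from a DAX-mounted file system and stores to it directly. The path and sizes are made up for the example; a real persistent-memory program would also care about cache flushing beyond what msync provides.

/* Minimal sketch (assumption, not from the slides): a file on a DAX mount
 * (e.g. ext4 mounted with -o dax) is mapped so that loads and stores reach
 * persistent memory without going through the page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* "/mnt/pmem/log.dat" is a hypothetical file on a DAX-mounted file system. */
    int fd = open("/mnt/pmem/log.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello, persistent memory");   /* store goes straight to the mapping */
    msync(p, len, MS_SYNC);                  /* request persistence of the range */

    munmap(p, len);
    close(fd);
    return 0;
}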
HPC on Cloud (8 papers)
1. "Reliability Guided Resource Allocation for Large-Scale Systems", S. Umamaheshwaran and T. J. Hacker (Purdue U.)
2. "Energy-Efficient Scheduling of Urgent Bag-of-Tasks Applications in Clouds through DVFS", R. N. Calheiros and R. Buyya (U. Melbourne)
3. "A Framework for Measuring the Impact and Effectiveness of the NEES Cyberinfrastructure for Earthquake Engineering", T. Hacker and A. J. Magana (Purdue U.)
4. "Executing Bag of Distributed Tasks on the Cloud: Investigating the Trade-Offs between Performance and Cost", L. Thai, B. Varghese and A. Barker (U. St Andrews)
5. "CPU Performance Coefficient (CPU-PC): A Novel Performance Metric Based on Real-Time CPU Resource Provisioning in Time-Shared Cloud Environments", T. Mastelic, I. Brandic and J. Jasarevic (Vienna U. of Technology)
6. "Performance Analysis of Cloud Environments on Top of Energy-Efficient Platforms Featuring Low Power Processors", V. Plugaru, S. Varrette and P. Bouvry (U. Luxembourg)
7. "Exploring the Performance Impact of Virtualization on an HPC Cloud", N. Chakthranont, P. Khunphet, R. Takano and T. Ikegami (KMUTNB / AIST)
8. "GateCloud: An Integration of Gate Monte Carlo Simulation with a Cloud Computing Environment", B. A. Rowedder, H. Wang and Y. Kuang (UNLV)
Keywords
• Objectives
  – Fault tolerance [1], energy efficiency [2, 6], performance metrics [4, 5], high performance [6, 7]
• Systems
  – Resource provisioning / schedulers [1, 4, 5]
  – IaaS: OpenStack [6], CloudStack [7]
  – Workflow [8]
• Applications
  – MPI [6, 7]
  – Bag of Tasks [2], Bag of Distributed Tasks [4]
  – Web applications (FFmpeg, MongoDB, Ruby on Rails) [5]
  – Monte Carlo [8]
  – Earthquake Engineering [3]
CPU Performance Coefficient (CPU-PC): A Novel Performance Metric Based on Real-Time CPU Resource Provisioning in Time-Shared Cloud Environments
• In a cloud environment, multiple VMs share a single server.
• A performance metric that both cloud providers and users can rely on is needed.
  – Response time fluctuates under the influence of other VMs.
• Proposes CPU-PC, a metric that focuses on stolen time.
• CPU-PC correlates very strongly with response time.
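The slide does not give the CPU-PC formula itself; as a hedged illustration of the ingredient it builds on, the sketch below (my own, not the authors' code) samples the guest's cumulative "steal" time from /proc/stat over a one-second interval, which is the raw signal a stolen-time-based metric would start from.

/* Rough sketch (assumption, not the CPU-PC paper's implementation):
 * read the "steal" field of the aggregate "cpu" line in /proc/stat
 * inside a guest, twice, to see how much CPU time the hypervisor gave
 * to other VMs during the interval. */
#include <stdio.h>
#include <unistd.h>

static unsigned long long read_steal(void)
{
    unsigned long long user, nice, sys, idle, iowait, irq, softirq, steal = 0;
    FILE *f = fopen("/proc/stat", "r");
    if (!f) return 0;
    fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
           &user, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal);
    fclose(f);
    return steal;
}

int main(void)
{
    unsigned long long before = read_steal();
    sleep(1);                                   /* measurement interval */
    unsigned long long after = read_steal();
    printf("steal during interval: %llu jiffies\n", after - before);
    return 0;
}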
Exploring the Performance Impact of Virtualization on an HPC Cloud

ASGC Hardware Spec
Compute node:
  CPU         Intel Xeon E5-2680 v2, 2.8 GHz (10 cores) x 2 CPUs
  Memory      128 GB DDR3-1866
  InfiniBand  Mellanox ConnectX-3 (FDR)
  Ethernet    Intel X520-DA2 (10 GbE)
  Disk        Intel SSD DC S3500, 600 GB
• 155-node cluster built from Cray H2312 blade servers
• Theoretical peak performance: 69.44 TFLOPS
• Operation started in July 2014
ASGC Software Stack
Management Stack
  – CentOS 6.5 (QEMU/KVM 0.12.1.2)
  – Apache CloudStack 4.3 + our extensions
    • PCI passthrough / SR-IOV support (KVM only)
    • sgc-tools: virtual cluster construction utility
  – RADOS cluster storage
HPC Stack (Virtual Cluster)
  – Intel Compiler / Math Kernel Library SP1 1.1.106
  – Open MPI 1.6.5
  – Mellanox OFED 2.1
  – Torque job scheduler
Benchmark Programs
Micro benchmark
  – Intel MPI Benchmarks (IMB) version 3.2.4
Application-level benchmark
  – HPC Challenge (HPCC) version 1.4.3
    • G-HPL
    • EP-STREAM
    • G-RandomAccess
    • G-FFT
  – OpenMX version 3.7.4
  – Graph 500 version 2.1.4
MPI Point-to-point communication (IMB)
[Chart: bandwidth vs. message size; peak 5.85 GB/s on the physical cluster vs. 5.69 GB/s on the virtual cluster]
The overhead is less than 3% with large messages, though it is up to 25% with small messages.
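For reference, a minimal ping-pong loop of the kind IMB runs internally might look like the sketch below (an illustrative assumption, not IMB source). Bandwidth is the message length divided by the measured one-way time; run it with at least two ranks, e.g. mpirun -np 2.

/* Minimal ping-pong bandwidth sketch (assumption, not IMB itself). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int len = 4 * 1024 * 1024;         /* 4 MiB message */
    const int iters = 100;
    char *buf = malloc(len);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double one_way = (t1 - t0) / (2.0 * iters);   /* round trip / 2 */
        printf("bandwidth: %.2f GB/s\n", len / one_way / 1e9);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}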
MPI Collectives, 64-byte messages (IMB)
[Charts: execution time (usec) vs. number of nodes (0-128) for Allgather, Allreduce, and Alltoall; physical vs. virtual cluster; overheads of up to +77%, +88%, and +43%]
The overhead becomes significant as the number of nodes increases ... load imbalance.
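A 64-byte collective scan like the one on this slide can be reproduced with a simple timing loop; the sketch below (again an assumption, not IMB itself) times MPI_Allreduce on 8 doubles, i.e. a 64-byte payload, and reports the average latency per call.

/* Sketch of a 64-byte MPI_Allreduce latency measurement (assumption, not IMB). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double in[8], out[8];                  /* 8 doubles = 64 bytes */
    memset(in, 0, sizeof(in));

    const int iters = 1000;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(in, out, 8, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("%d ranks: %.1f usec per Allreduce\n",
               nprocs, elapsed / iters * 1e6);
    MPI_Finalize();
    return 0;
}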
G-HPL (LINPACK), HPCC
[Chart: performance (TFLOPS) vs. number of nodes (0-128); physical vs. virtual cluster]
Performance degradation: 5.4 - 6.6%
Efficiency*) on 128 nodes: Physical 90%, Virtual 84%
*) Rmax / Rpeak
EP-STREAM and G-FFT, HPCC
[Charts: EP-STREAM performance (GB/s) and G-FFT performance (GFLOPS) vs. number of nodes (0-128); physical vs. virtual cluster]
The overheads are negligible.
  – EP-STREAM: memory intensive, with no communication
  – G-FFT: all-to-all communication with large messages
Graph500 (replicated-csc, scale 26)
[Chart: performance (TEPS, 1.0E+07 to 1.0E+10, log scale) vs. number of nodes (0-64); physical vs. virtual cluster]
Performance degradation: 2% (64 nodes)
Graph500 is a hybrid parallel program (MPI + OpenMP); we used a combination of 2 MPI processes and 10 OpenMP threads.
Findings
• PCI passthrough is effective in improving the I/O performance; however, it is still unable to achieve the low communication latency of a physical cluster, due to virtual interrupt injection.
• VCPU pinning improves the performance of HPC applications.
• Almost all MPI collectives suffer from a scalability issue.
• The overhead of virtualization has little impact on actual applications.