AIST Super Green Cloud: Lessons Learned from the Operation and the Performance Evaluation of HPC Cloud

Ryousei Takano, Yusuke Tanimura, Akihiko Oota, Hiroki Oohashi, Keiichi Yusa, Yoshio Tanaka
National Institute of Advanced Industrial Science and Technology (AIST), Japan
ISGC 2015, Taipei, 20 March 2015
Introduction
• HPC Cloud is a promising HPC platform.
• Virtualization is a key technology.
  – Pro: a customized software environment, elasticity, etc.
  – Con: a large overhead that spoils I/O performance.
• VMM-bypass I/O technologies, e.g., PCI passthrough and SR-IOV, can significantly mitigate this overhead.
• "99% of HPC jobs running on US NSF computing centers fit into one rack." -- M. Norman, UCSD
• Current virtualization technologies are feasible enough to support such a scale.
LINPACK on ASGC
[Figure: LINPACK performance (TFLOPS) vs. number of nodes (0-128) for the physical cluster and the virtual cluster]
• Performance degradation: 5.4 - 6.6%
• Efficiency* on 128 nodes
  – Physical cluster: 90%
  – Virtual cluster: 84%
*) Efficiency = Rmax / Rpeak
(IEEE CloudCom 2014)
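As a rough cross-check (my own arithmetic, using the per-node peak of 448 GFLOPS that follows from the hardware specification later in the talk), these efficiencies correspond to

$R_{\mathrm{peak}}(128\ \text{nodes}) = 128 \times 0.448\ \text{TFLOPS} \approx 57.3\ \text{TFLOPS}$, hence $R_{\mathrm{max}} \approx 0.90 \times 57.3 \approx 51.6\ \text{TFLOPS}$ (physical) and $R_{\mathrm{max}} \approx 0.84 \times 57.3 \approx 48.2\ \text{TFLOPS}$ (virtual).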
Introduction (cont'd)
• HPC Clouds are heading for hybrid-cloud and multi-cloud systems, where users can execute their applications anytime and anywhere they want.
• Vision of AIST HPC Cloud: "Build once, run everywhere"
• AIST Super Green Cloud (ASGC): a fully virtualized HPC system
Outline
• AIST Super Green Cloud (ASGC) and HPC Cloud service
• Lessons learned from the first six months of operation
• Experiments
• Conclusion
Vision of AIST HPC Cloud: "Build Once, Run Everywhere"
[Figure: virtual cluster templates are deployed as virtual clusters onto an academic cloud, a private cloud, and a commercial cloud]
• Feature 1: Create a customized virtual cluster easily.
• Feature 2: Build a virtual cluster once, and run it everywhere on clouds.
Usage Model of AIST Cloud
Allow users to customize their virtual clusters and to launch virtual machines when necessary (ease of use).
1. Select a template of a virtual machine (Web apps, Big Data, HPC, ...).
2. Install the required software packages.
3. Save the user-customized template in the repository.
[Figure: VM template files are deployed as a virtual cluster; snapshots of a customized cluster are saved back as new templates]
Elastic Virtual Cluster
[Figure: sgc-tools on the login node asks the cloud controller to create a virtual cluster from a virtual cluster template in the image repository. On ASGC, the cluster consists of a frontend node and compute nodes over InfiniBand/Ethernet, with NFS and a job scheduler, and can scale in/scale out. The same template can be imported/exported to a public cloud, where the cluster runs over Ethernet. Jobs are submitted to either cluster.]
• Virtual cluster on ASGC: in operation
• Virtual cluster on a public cloud: under development
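Scaling in and out goes through the cloud controller's API. Below is a minimal sketch (my own illustration, not the actual sgc-tools implementation; the endpoint, keys, and IDs are placeholders) of adding one more compute node through CloudStack's signed HTTP API:

import base64, hashlib, hmac, urllib.parse, urllib.request

API_URL = "http://cloud.example.org:8080/client/api"   # placeholder endpoint
API_KEY, SECRET_KEY = "<api-key>", "<secret-key>"       # per-user CloudStack keys

def cloudstack_call(command, **params):
    """Issue a signed CloudStack API call and return the raw JSON response."""
    params.update(command=command, apiKey=API_KEY, response="json")
    # CloudStack signing: HMAC-SHA1 over the sorted, lower-cased query string.
    query = "&".join("%s=%s" % (k, urllib.parse.quote(str(v), safe=""))
                     for k, v in sorted(params.items()))
    digest = hmac.new(SECRET_KEY.encode(), query.lower().encode(), hashlib.sha1).digest()
    params["signature"] = base64.b64encode(digest).decode()
    return urllib.request.urlopen(API_URL + "?" + urllib.parse.urlencode(params)).read()

# Scale out: launch one more compute node from the cluster's VM template.
print(cloudstack_call("deployVirtualMachine",
                      serviceofferingid="<offering-id>",   # placeholder
                      templateid="<vm-template-id>",       # placeholder
                      zoneid="<zone-id>"))                 # placeholder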
ASGC Hardware Spec.
• 155-node cluster consisting of Cray H2312 blade servers
• Theoretical peak performance: 69.44 TFLOPS

Compute node
  CPU         Intel Xeon E5-2680v2 (2.8 GHz, 10 cores) x 2
  Memory      128 GB DDR3-1866
  InfiniBand  Mellanox ConnectX-3 (FDR)
  Ethernet    Intel X520-DA2 (10 GbE)
  Disk        Intel SSD DC S3500, 600 GB

Network switches
  InfiniBand  Mellanox SX6025
  Ethernet    Extreme BlackDiamond X8
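The quoted peak is consistent with the usual back-of-the-envelope arithmetic (my own check, assuming 8 double-precision FLOPs/cycle for the Ivy Bridge AVX units):

$R_{\mathrm{peak}} = 155\ \text{nodes} \times 2\ \text{CPUs} \times 10\ \text{cores} \times 2.8\ \text{GHz} \times 8\ \text{FLOPs/cycle} = 69.44\ \text{TFLOPS}$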
ASGC Software Stack
• Management stack
  – CentOS 6.5 (QEMU/KVM 0.12.1.2)
  – Apache CloudStack 4.3 + our extensions
    • PCI passthrough/SR-IOV support (KVM only)
    • sgc-tools: virtual cluster construction utility
  – RADOS cluster storage
• HPC stack (virtual cluster)
  – Intel Cluster Studio SP1 1.2.144
  – Mellanox OFED 2.1
  – TORQUE job scheduler 4.2.8
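At the KVM/libvirt level, the passthrough extension boils down to handing the host InfiniBand device (or one of its SR-IOV virtual functions) to the guest. A minimal sketch of that step (my own illustration using the libvirt Python bindings; the PCI address and domain name are placeholders, and the CloudStack extension performs the equivalent automatically):

import libvirt

# <hostdev> fragment for PCI passthrough of the device at 0000:83:00.0.
# An SR-IOV virtual function is passed the same way, using the VF's PCI address.
HOSTDEV_XML = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x83' slot='0x00' function='0x0'/>
  </source>
</hostdev>
"""

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("hpc-vm-0")   # placeholder VM name
dom.attachDeviceFlags(HOSTDEV_XML,
                      libvirt.VIR_DOMAIN_AFFECT_LIVE | libvirt.VIR_DOMAIN_AFFECT_CONFIG)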
Storage Architecture
[Figure: the 155 compute nodes are connected by a compute network (InfiniBand FDR), a data network (10 GbE), and a management network (10/1 GbE, BlackDiamond X8). Each VM uses the local SSD as primary storage, the VMDI storage as secondary storage, and user-attached storage for user data.]
• No shared storage/filesystem
• VMDI (Virtual Machine Disk Image) storage
  – RADOS storage cluster
  – RADOS gateway (RGW)
  – NFS secondary staging server
• User-attached storage
Outline
• AIST Super Green Cloud (ASGC) and HPC Cloud service
• Lessons learned from the first six months of operation
  – CloudStack on a supercomputer
  – Cloud service for HPC users
  – Utilization
• Experiments
• Conclusion
Overview of ASGC Operation
• Operation started in July 2014.
• Accounts: 30+
  – The main users are materials scientists and genome scientists.
• Utilization: < 70%
• 95% of the total usage time is spent running HPC VM instances.
• Hardware failures: 19 (memory, motherboard, power supply)
CloudStack on a Supercomputer
• A supercomputer is not designed for cloud computing.
  – Cluster management software is troublesome.
• We were able to launch a highly productive system in a short development time by leveraging open-source system software.
• Software maturity of CloudStack
  – Our storage architecture is slightly uncommon: local SSDs serve as primary storage and an S3-compatible object store as secondary storage.
  – We discovered and resolved several serious bugs.
Software Maturity

CloudStack issue | Our action | Status
cloudstack-agent (jsvc) gets too large a virtual memory space | Patch | Fixed
listUsageRecords generates NullPointerExceptions for expunging instances | Patch | Fixed
Duplicate usage records when listing a large number of records / small page sizes return duplicate results | Backporting | Fixed
Public key content is overridden by the template's metadata when creating an instance | Bug report | Fixed
Migration of a VM with volumes in local storage to another host in the same cluster fails | Backporting | Fixed
Negative ref_cnt of template(snapshot/volume)_store_ref results in an out-of-range error in MySQL | Patch (not merged) | Fixed
[S3] Parallel deployment makes the reference count of a cache in the NFS secondary staging store negative (-1) | Patch (not merged) | Unresolved
Can't create a proper template from a VM in an S3 secondary storage environment | Patch | Fixed
Fails to attach a volume (made from a snapshot) to a VM using local storage as primary storage | Bug report | Unresolved
Cloud service for HPC users
• SaaS is best if the target application is clear.
• IaaS is quite flexible; however, it is difficult for application users to manage an HPC environment from scratch.
• To bridge this gap, sgc-tools is introduced on top of the IaaS service.
• We believe it works well, although some minor problems remain.
• To improve the maintainability of VM templates, the idea of "Infrastructure as code" can help.
Utilization
• Efficient use of limited resources is required.
• A virtual cluster occupies its resources whether the user fully utilizes them or not.
• sgc-tools does not support system-wide queuing, so users need to check resource availability themselves.
• Introducing a global scheduler, e.g., the HTCondor VM universe, can be a solution to this problem (see the sketch below).
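A minimal sketch of what such a VM-universe submit description could look like (my own illustration; the image path, memory size, and exact keyword syntax are hypothetical and should be checked against the HTCondor manual):

from pathlib import Path

# Hypothetical HTCondor VM-universe job: the global scheduler, not the user,
# decides when and where this virtual cluster node boots.
SUBMIT = """\
universe      = vm
vm_type       = kvm
vm_memory     = 4096
vm_networking = true
vm_disk       = hpc-node.qcow2:vda:w
executable    = hpc-node-vm
log           = hpc-node.log
queue
"""

Path("hpc-node.sub").write_text(SUBMIT)
# The description would then be handed to condor_submit so that the VM is
# queued and scheduled like any other batch job.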
Outline
• AIST Super Green Cloud (ASGC) and HPC Cloud service
• Lessons learned from the first six months of operation
• Experiments
  – Deployment time
  – Performance evaluation of SR-IOV
• Conclusion
Virtual Cluster Deployment
[Figure: deployment time (seconds, up to ~1000) vs. number of nodes (up to 32) for three cases: no cache, cache on the NFS secondary staging (SS) server, and cache on the local node. The VM image flows from RADOS to the SS server and then to the compute node's local disk.]

Breakdown (seconds):
  Device attach (before OS boot)                 90
  OS boot                                        90
  FS creation (mkfs)                             90
  Transfer from RADOS to the SS server           (skipped when cached on the SS server)
  Transfer from the SS server to the local node  (skipped when cached locally)
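The three curves correspond to where the template image is found before the cluster boots. A minimal sketch of that lookup order (my own illustration; the paths, bucket name, and RADOS-gateway endpoint are placeholders, and the object store is reached through its S3-compatible API):

import os
import shutil
import boto3

LOCAL_CACHE = "/var/lib/vmdi-cache"          # placeholder: per-node SSD cache
NFS_SS      = "/mnt/secondary-staging"       # placeholder: NFS secondary staging
RGW         = "http://rgw.example.org:7480"  # placeholder: RADOS gateway endpoint

def fetch_template(image, bucket="vmdi-templates"):
    """Return a local path to the VM disk image, using the fastest source available."""
    local = os.path.join(LOCAL_CACHE, image)
    if os.path.exists(local):                 # 1. cached on the local node
        return local
    staged = os.path.join(NFS_SS, image)
    if not os.path.exists(staged):            # 2. cached on the NFS SS server
        # 3. worst case: pull from the RADOS object store via its S3 API
        s3 = boto3.client("s3", endpoint_url=RGW)
        s3.download_file(bucket, image, staged)
    shutil.copy(staged, local)
    return local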
Benchmark Programs
• Micro benchmark
  – Intel MPI Benchmarks (IMB) version 3.2.4
    • Point-to-point
    • Collectives: Allgather, Allreduce, Alltoall, Bcast, Reduce, Barrier
• Application-level benchmark
  – LAMMPS Molecular Dynamics Simulator, version 28 June 2014
    • EAM benchmark, 100x100x100 atoms
MPI Point-to-point Communication
[Figure: IMB point-to-point bandwidth (MB/s) and execution time (microseconds) vs. message size (10^1 - 10^6 bytes) for bare metal (BM), PCI passthrough, and SR-IOV; peak bandwidths are 6.00 GB/s, 5.72 GB/s, and 5.73 GB/s, respectively]
• The overhead is less than 5% with large messages, though it is up to 30% with small messages.
MPI Collectives
[Figure: IMB execution time (microseconds) vs. message size (10^1 - 10^6 bytes) for Allgather, Allreduce, Alltoall, Bcast, and Reduce, comparing bare metal (BM), PCI passthrough, and SR-IOV]

Barrier (time in microseconds):
  BM               6.87 (1.00)
  PCI passthrough  8.07 (1.17)
  SR-IOV           9.36 (1.36)

• The performance of SR-IOV is comparable to that of PCI passthrough, although unexpected performance degradation is sometimes observed.
LAMMPS: MD Simulator
• EAM benchmark:
  – Fixed problem size (1M atoms)
  – Number of processes: 20 - 160
• VCPU pinning reduces performance fluctuation.
• The performance overhead of PCI passthrough and SR-IOV is about 13%.
[Figure: execution time (seconds) vs. number of processes (20-160) for BM, PCI passthrough, PCI passthrough w/ vCPU pinning, SR-IOV, and SR-IOV w/ vCPU pinning]
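A minimal sketch of the kind of vCPU pinning referred to here (my own illustration using the libvirt Python bindings; the domain name is a placeholder and a simple 1:1 vCPU-to-core mapping is assumed):

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("hpc-vm-0")   # placeholder VM name

ncpus = conn.getInfo()[2]             # number of physical CPUs on the host
nvcpus = dom.maxVcpus()

# Pin vCPU i to physical core i so that the guest's MPI ranks do not migrate.
for vcpu in range(nvcpus):
    cpumap = tuple(core == vcpu for core in range(ncpus))
    dom.pinVcpu(vcpu, cpumap)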
Findings
• The performance of SR-IOV is comparable to that of PCI passthrough, although unexpected performance degradation is sometimes observed.
• VCPU pinning improves the performance of HPC applications.
Outline
• AIST Super Green Cloud (ASGC) and HPC Cloud service
• Lessons learned from the first six months of operation
• Experiments
• Conclusion
Conclusion and Future Work
• ASGC is a fully virtualized HPC system.
• We were able to launch a highly productive system in a short development time by leveraging state-of-the-art open-source system software.
  – Extensions: PCI passthrough/SR-IOV support, sgc-tools
  – Bug fixes
• Future research direction: data movement is key.
  – Efficient data management and transfer methods
  – Federated identity management