Achieving Higher Productivity on Abaqus Simulations for Small and Mid-size Enterprises with HPC and Clustering Technologies
Pak Lui
Application Performance Manager
SIMULIA COMMUNITY CONFERENCE, MAY 18-21, 2015 | BERLIN, GERMANY
2
The HPC Advisory Council
• World-wide HPC non-profit organization (400+ members)
• Bridges the gap between HPC usage and its potential
• Provides best practices and a support/development center
• Explores future technologies and developments
• Leading edge solutions and technology demonstrations
3
HPC Advisory Council Members
4
Special Interest Subgroups
• HPC|Scale Subgroup
– Explore usage of commodity HPC as a replacement for multi-million-dollar mainframes and proprietary supercomputers
• HPC|Cloud Subgroup
– Explore usage of HPC components as part of the creation of external/public/internal/private cloud computing environments
• HPC|Works Subgroup
– Provide best practices for building balanced and scalable HPC systems, performance tuning, and application guidelines
• HPC|Storage Subgroup
– Demonstrate how to build high-performance storage solutions and their effect on application performance and productivity
• HPC|GPU Subgroup
– Explore usage models of GPU components as part of next-generation compute environments and potential optimizations for GPU-based computing
• HPC|FSI Subgroup
– Explore the usage of high-performance computing solutions for low-latency trading, productive simulations, and overall more efficient financial services
• HPC|Music
– To enable HPC in music production and to develop HPC cluster solutions that further enable the future of music production
5
HPC Advisory Council HPC Center
[Cluster diagram: InfiniBand-based Lustre storage serving the Juniper, Heimdall, Plutus, Janus, Athena, Vesta, Thor, and Mala clusters, ranging from 80 to 896 cores per system, including one 16-GPU system]
6
130 Applications Best Practices Published
• Abaqus
• AcuSolve
• Amber
• AMG
• AMR
• ABySS
• ANSYS CFX
• ANSYS FLUENT
• ANSYS Mechanics
• BQCD
• CCSM
• CESM
• COSMO
• CP2K
• CPMD
• Dacapo
• Desmond
• DL-POLY
• Eclipse
• FLOW-3D
• GADGET-2
• GROMACS
• Himeno
• HOOMD-blue
• HYCOM
• ICON
• Lattice QCD
• LAMMPS
• LS-DYNA
• miniFE
• MILC
• MSC Nastran
• MR Bayes
• MM5
• MPQC
• NAMD
• Nekbone
• NEMO
• NWChem
• Octopus
• OpenAtom
• OpenFOAM
• OpenMX
• PARATEC
• PFA
• PFLOTRAN
• Quantum ESPRESSO
• RADIOSS
• SPECFEM3D
• WRF
For more information, visit: http://hpcadvisorycouncil.com/best_practices.php
7
University Award Program
• University award program
– Universities are encouraged to submit proposals for advanced research
– Once or twice a year, the HPC Advisory Council will select a few proposals
• Selected proposals will be provided with:
– Exclusive computation time on the HPC Advisory Council's Compute Center
– Invitation to present at one of the HPC Advisory Council's worldwide workshops
– Publication of the research results on the HPC Advisory Council website
• 2010 award winner is Dr. Xiangqian Hu, Duke University
– Topic: "Massively Parallel Quantum Mechanical Simulations for Liquid Water"
• 2011 award winner is Dr. Marco Aldinucci, University of Torino
– Topic: "Effective Streaming on Multi-core by Way of the FastFlow Framework"
• 2012 award winner is Jacob Nelson, University of Washington
– Topic: "Runtime Support for Sparse Graph Applications"
• 2013 award winner is Antonis Karalis
– Topic: "Music Production using HPC"
• To submit a proposal – please check the HPC Advisory Council web site
8
2015 ISC Student Cluster Competition
• University-based teams compete and demonstrate the incredible capabilities of state-of-the-art HPC systems and applications on the ISC'15 show floor
• The Student Cluster Challenge is designed to introduce the next generation of students to the high-performance computing world and community
9
ISC’15 – Student Cluster Competition Teams
10
HPCAC – 2015 ISC High Performance – Student Cluster Competition Teams
11
HPC Advisory Council
• HPC Advisory Council (HPCAC)
– 400+ members
– http://www.hpcadvisorycouncil.com/
– Application best practices, case studies (Over 150)
– Benchmarking center with remote access for users
– World-wide workshops
– Value add for your customers to stay up to date and in tune with the HPC market
• 2015 Workshops
– USA (Stanford University) – February 2015
– Switzerland – March 2015
– Brazil – August 2015
– Spain – September 2015
– China (HPC China) – Oct 2015
• For more information
– www.hpcadvisorycouncil.com
12
HPCAC Conferences
13
Note
• The following research was performed under the HPC Advisory Council activities
– Special thanks to HP and Mellanox
• For more information on the supporting vendors' solutions, please refer to:
– www.mellanox.com, http://www.hp.com/go/hpc
• For more information on the application:
– http://www.simulia.com
14
Abaqus by SIMULIA
• Abaqus Unified FEA product suite offers powerful and complete solutions for both routine and sophisticated engineering problems covering a vast spectrum of industrial applications
• The Abaqus analysis products listed below focus on:
– Nonlinear finite element analysis (FEA)
– Advanced linear and dynamics application problems
• Abaqus/Standard
– General-purpose FEA that includes a broad range of analysis capabilities
• Abaqus/Explicit
– Nonlinear, transient, dynamic analysis of solids and structures using explicit time integration
15
Objectives
• The presented research was done to provide best practices
– Abaqus performance benchmarking
– Interconnect performance comparisons
– CPU performance comparisons
– Understanding Abaqus communication patterns
• The presented results will demonstrate
– The scalability of the compute environment to provide nearly linear application scalability
16
Test Cluster Configuration
• HP Apollo 6000 "Heimdall" cluster
– HP ProLiant XL230a Gen9 10-node "Heimdall" cluster
– Processors: Dual-socket 14-core Intel Xeon E5-2697 v3 @ 2.6 GHz CPUs (Turbo Mode on; Home Snoop as default)
– Memory: 64GB per node, 2133MHz DDR4 DIMMs
– OS: RHEL 6 Update 6, OFED 2.4-1.0.0 InfiniBand SW stack
• Mellanox Connect-IB 56Gb/s FDR InfiniBand adapters
• Mellanox ConnectX-3 Pro 10GbE Ethernet adapters
• Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand and Ethernet VPI switch
• MPI: Platform MPI 9.1.2 and Intel MPI 4.1.3
• Application: Abaqus 6.14-2 (unless otherwise stated)
• Benchmark workloads (an example launch sketch follows below):
– Abaqus/Standard: S2: Flywheel with centrifugal load; S4: Cylinder head bolt-up
– Abaqus/Explicit: E6: Concentric Spheres
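Below is a minimal sketch of how a DMP benchmark run of this kind can be launched. The job=, input=, cpus=, mp_mode=mpi, and interactive options are standard Abaqus command-line arguments and mp_host_list is a standard abaqus_v6.env parameter, while the node names, input deck, and core counts shown are illustrative assumptions rather than the exact settings used in this study.

```python
# Minimal sketch: launching an Abaqus DMP benchmark run from Python.
# Assumes the `abaqus` driver is on PATH. The input-deck name and host names
# below are illustrative placeholders, not the benchmark's actual files.
import subprocess

def run_abaqus(job, input_deck, hosts):
    # Abaqus reads abaqus_v6.env from the working directory; mp_host_list
    # names the nodes and cores-per-node used for the DMP run.
    cores = sum(c for _, c in hosts)
    with open("abaqus_v6.env", "w") as env:
        env.write(f"mp_host_list = {[[h, c] for h, c in hosts]}\n")
    cmd = [
        "abaqus",
        f"job={job}",
        f"input={input_deck}",
        f"cpus={cores}",
        "mp_mode=mpi",   # distributed-memory parallel (MPI) execution
        "interactive",   # block until the analysis completes
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # e.g. 8 nodes x 28 cores = 224 cores, as in the scaling runs reported here
    run_abaqus("s4_bench", "s4.inp", [(f"node{i:02d}", 28) for i in range(1, 9)])
```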
17
HP ProLiant XL230a Gen9 Server
Processor: Two Intel® Xeon® E5-2600 v3 Series, 6/8/10/12/14/16 cores
Chipset: Intel Xeon E5-2600 v3 series
Memory: 512 GB (16 x 32 GB); 16 DIMM slots; DDR4 R-DIMM/LR-DIMM; 2,133 MHz
Max Memory: 512 GB
Internal Storage: 1 HP Dynamic Smart Array B140i SATA controller; HP H240 Host Bus Adapter
Networking: Network module supporting various FlexibleLOMs: 1GbE, 10GbE, and/or InfiniBand
Expansion Slots: 1 internal PCIe: 1 PCIe x16 Gen3, half-height
Ports: Front: (1) Management, (2) 1GbE, (1) Serial, (1) S.U.V port, (2) PCIe, and internal Micro SD card & Active Health
Power Supplies: HP 2,400 or 2,650 W Platinum hot-plug power supplies delivered by the HP Apollo 6000 Power Shelf
Integrated Management: HP iLO (Firmware: HP iLO 4); Option: HP Advanced Power Manager
Additional Features: Shared power & cooling and up to 8 nodes per 4U chassis, single GPU support, Fusion I/O support
Form Factor: 10 servers in 5U chassis
18
Abaqus/Explicit Performance – CPU Generation
(All measurements based on v6.12.2)
• Intel E5-2697 v3 (Haswell) cluster outperforms prior system generations
– Performs 113% higher vs the Westmere cluster, 36% vs the Sandy Bridge cluster, and 23% vs the Ivy Bridge cluster at 4 nodes
• System components used:
– Heimdall (HSW): 2-socket 14-core Intel E5-2697 v3 @ 2.6GHz, 2133MHz DIMMs, FDR IB, 1 HDD
– Athena (IVB): 2-socket 10-core Intel E5-2680 v2 @ 2.7GHz, 1600MHz DIMMs, FDR IB, 1 HDD
– Athena (SNB): 2-socket 8-core Intel E5-2680 @ 2.7GHz, 1600MHz DIMMs, FDR IB, 1 HDD
– Plutus (WSM): 2-socket 6-core Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR IB, 1 HDD
Performance rating: based on total runtime (a small conversion sketch follows below)
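A minimal sketch of that runtime-to-rating conversion, assuming the jobs-per-day normalization commonly used in HPC Advisory Council best-practice reports; the slides state only that the rating is based on total runtime, so the exact normalization is an assumption.

```python
# Minimal sketch: turning measured total runtimes into a "performance rating".
# Assumption: rating = jobs per day (86400 s / elapsed seconds), a normalization
# commonly used in HPC Advisory Council reports; the slides only state that the
# rating is based on total runtime.
SECONDS_PER_DAY = 86_400

def rating(elapsed_seconds: float) -> float:
    """Higher is better: how many such jobs could complete in 24 hours."""
    return SECONDS_PER_DAY / elapsed_seconds

def relative_gain(new_seconds: float, old_seconds: float) -> float:
    """Percentage improvement of the faster run over the slower one."""
    return (rating(new_seconds) / rating(old_seconds) - 1.0) * 100.0

if __name__ == "__main__":
    # Illustrative numbers only: a run finishing in 1200 s vs one in 2556 s
    print(f"rating: {rating(1200):.1f} jobs/day")
    print(f"gain:   {relative_gain(1200, 2556):.0f}%")   # ~113% faster
```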
19
Abaqus/Explicit Performance vs Power
• Modest dependency on CPU frequency for the Explicit case
– Approximately 23% performance gain going from 2000MHz to 2800MHz
– Tradeoff: about 51% more power is used going from 2.0GHz to 2.8GHz
• Enabling Turbo Mode makes little performance difference for Explicit
– Turbo Mode provides ~4% performance gain (for CPU cores running at 2800 MHz)
– It also results in 18% higher power usage
[Chart: Abaqus/Explicit performance and power consumption vs CPU frequency, measured on the Ivy Bridge cluster; higher is better]
20
Abaqus/Standard Performance vs Power
• Similar behavior regarding CPU frequency and Turbo Mode is seen with Standard
– Up to 25% improvement at 51% higher power utilization
– Small performance gain (5%) when using Turbo Mode, at 18% higher power usage
• The kernel "msr-tools" utilities were used to adjust Turbo Mode dynamically (see the sketch below)
– Allows Turbo Mode to be turned on/off dynamically at the OS level
– Turbo Mode boosts core frequency and consequently results in higher power consumption
[Chart: Abaqus/Standard performance and power consumption vs CPU frequency, measured on the Ivy Bridge cluster; higher is better]
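A minimal sketch of what the msr-tools approach does under the hood, assuming an Intel CPU where bit 38 of MSR IA32_MISC_ENABLE (0x1A0) is the Turbo disable control and the Linux msr driver is loaded (modprobe msr, root required); the study itself used the rdmsr/wrmsr utilities, so this Python equivalent is illustrative only.

```python
# Minimal sketch: toggling Intel Turbo Boost at runtime by flipping bit 38 of
# MSR IA32_MISC_ENABLE (0x1A0) through /dev/cpu/*/msr, mirroring what the
# msr-tools rdmsr/wrmsr utilities do. Register/bit layout assumed for
# Westmere..Haswell class CPUs; requires root and the msr kernel module.
import glob
import struct

MSR_IA32_MISC_ENABLE = 0x1A0
TURBO_DISABLE_BIT = 1 << 38   # "IDA/Turbo disable" control bit

def _rdmsr(dev, reg):
    with open(dev, "rb", buffering=0) as f:
        f.seek(reg)
        return struct.unpack("<Q", f.read(8))[0]

def _wrmsr(dev, reg, value):
    with open(dev, "wb", buffering=0) as f:
        f.seek(reg)
        f.write(struct.pack("<Q", value))

def set_turbo(enabled: bool) -> None:
    # Apply to every logical CPU exposed by the msr driver
    for dev in glob.glob("/dev/cpu/*/msr"):
        val = _rdmsr(dev, MSR_IA32_MISC_ENABLE)
        val = val & ~TURBO_DISABLE_BIT if enabled else val | TURBO_DISABLE_BIT
        _wrmsr(dev, MSR_IA32_MISC_ENABLE, val)

if __name__ == "__main__":
    set_turbo(False)   # e.g. hold cores at base frequency for a power-comparison run
```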
21
Abaqus/Explicit Performance - Interconnect
• FDR InfiniBand is the most efficient network interconnect for Abaqus/Explicit
– FDR IB outperforms 1GbE by 5.2x and 10GbE by 87% at 8 nodes
– InfiniBand reduces communication time, leaving more time for computation
– With RDMA, InfiniBand offloads communications from the CPU, so the CPU can focus on computation
22
Abaqus/Explicit Profiling – MPI Time Ratio
• FDR InfiniBand reduces the MPI communication time
– InfiniBand FDR consumes about 37% of total runtime at 4 nodes
– Ethernet solutions consume from 53% to 72% at 4 nodes
23
Abaqus/Explicit Profiling – Time Spent in MPI
• Abaqus/Explicit: More time spent on MPI collective operations:
– InfiniBand FDR: MPI_Gather(54%), MPI_Allreduce(9%), MPI_Scatterv(9%)
– 1GbE: MPI_Gather(43%), MPI_Irecv(13%), MPI_Scatterv(14%)
24
Abaqus Performance – Software Versions
• Abaqus/Standard performs faster than the previous version
– 6.14-2 outperforms 6.13-2 by 8% at 8 nodes / 224 cores
– MPI library versions have also changed
• The gain in Abaqus/Explicit for the newer version is not as dramatic
– Only a slight gain can be seen on the dataset tested
25
Abaqus Performance – Snoop Configuration
• Changing the QPI Snoop Configuration may improve performance for certain workloads
– Home Snoop provides high memory bandwidth in an average NUMA environment (default)
– Early Snoop may decrease memory latency but lowers bandwidth
– Cluster on Die may provide increased memory bandwidth in highly optimized NUMA workloads
– Only a slight improvement is seen using Early Snoop for Abaqus/Standard, and no difference in Explicit
26
Abaqus Performance – SRQ
• Abaqus/Explicit performs marginally faster with SRQ (Shared Receive Queue) enabled
– Enabling SRQ performs 3% faster at 8 nodes / 224 cores
27
Abaqus Performance – Turbo Mode
• Some improvement is observed by enabling Turbo Mode in Abaqus
– About 5% improvement seen for both Abaqus/Standard and Abaqus/Explicit
– The CPU runs at a higher clock frequency when Turbo is enabled
– Additional power use can occur when Turbo Mode is on
28
Abaqus Performance – MPI Libraries
• Abaqus supports Intel MPI starting in version 6.14
– Platform MPI delivers slightly higher performance than Intel MPI, about 3% at 8 nodes (a configuration sketch follows below)
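A minimal sketch of switching between the two MPI libraries through a per-job environment file; the mp_mpi_implementation parameter and its PMPI/IMPI values are assumptions to verify against the Abaqus 6.14 documentation, since the slides do not name the mechanism used.

```python
# Minimal sketch: writing a per-job abaqus_v6.env to pick the MPI library.
# Assumption: Abaqus 6.14 honors an mp_mpi_implementation setting with values
# PMPI (Platform MPI, the default) or IMPI (Intel MPI); verify against the
# release documentation, as these slides do not spell out the parameter name.
ENV_TEMPLATE = """\
mp_mode = MPI                   # DMP execution over MPI
mp_mpi_implementation = {impl}  # PMPI = Platform MPI, IMPI = Intel MPI
"""

def write_env(mpi_impl: str = "IMPI", path: str = "abaqus_v6.env") -> None:
    # Abaqus reads this file from the job's working directory at startup
    with open(path, "w") as env:
        env.write(ENV_TEMPLATE.format(impl=mpi_impl))

if __name__ == "__main__":
    write_env("IMPI")   # subsequent `abaqus job=... cpus=...` runs pick up this file
```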
29
Abaqus Profiling – MPI Calls
• Abaqus/Standard uses a wide range of MPI APIs
– MPI_Test dominates the MPI function calls (over 97%)
• Abaqus/Explicit shows high usage of calls that test non-blocking messages (see the sketch below)
– MPI_Iprobe (95%), MPI_Test (3%)
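For readers less familiar with these calls, here is a minimal mpi4py sketch of the pattern the profile points to: one rank posts a non-blocking send and polls with MPI_Test, while the peer polls with MPI_Iprobe before receiving. This only illustrates the MPI semantics; mpi4py and the message contents are not part of Abaqus.

```python
# Minimal sketch of the MPI pattern behind the profile above: a rank posts a
# non-blocking send and the peer polls with Iprobe/Test rather than blocking.
# Illustrative only (uses mpi4py); run with: mpirun -np 2 python probe_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(1000, dtype="d")
    req = comm.Isend(data, dest=1, tag=7)   # non-blocking send
    while not req.Test():                   # MPI_Test: poll for completion
        pass                                # real codes overlap compute here
else:
    # MPI_Iprobe: check for a pending message without receiving it yet
    while not comm.Iprobe(source=0, tag=7):
        pass                                # overlap other work while waiting
    buf = np.empty(1000, dtype="d")
    comm.Recv(buf, source=0, tag=7)
    print("rank 1 received", buf[:3], "...")
```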
30
Abaqus Profiling – Time Spent in MPI
• Abaqus/Standard shows high usage of non-blocking messages
• MPI_Gather dominates the MPI communication time in Abaqus/Explicit
31
Abaqus Profiling – Message Sizes
• Abaqus/Standard uses small and medium MPI message sizes
– Most messages fall in the 0B to 64B and 65B to 256B ranges
– Some concentration of medium-size messages in the 64KB to 256KB range
• Abaqus/Explicit shows a wide distribution of small message sizes
– Small messages peak in the 65B to 256B range
32
Abaqus Performance – GPU
• GPU Support for Abaqus/Standard since 2011
• Current supported version: Abaqus 6.14
– Direct Sparse solvers – symmetric & unsymmetric
– Multi-GPU/node; multi-node DMP clusters
– Flexibility to run jobs on specific GPUs
• Customer adoptions increasing across industry segments
• Abaqus GPU licensing is based on tokens
– Same token scheme for CPU cores and GPUs (a worked example follows below)
• Performance gains vary: 2-3x on average is common
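To make the shared token scheme concrete, here is a small sketch using the commonly cited Abaqus token formula, tokens = floor(5 × N^0.422), where N counts CPU cores plus GPUs; the formula is not stated on the slide, so treat it as an assumption to check against current SIMULIA licensing terms.

```python
# Minimal sketch of Abaqus analysis-token licensing when GPUs share the CPU
# token scheme. Assumption: tokens = floor(5 * N**0.422), with N = CPU cores
# plus GPUs; this widely cited formula is not confirmed by the slide itself.
import math

def abaqus_tokens(cpu_cores: int, gpus: int = 0) -> int:
    n = cpu_cores + gpus          # a GPU consumes tokens like one more core
    return math.floor(5 * n ** 0.422)

if __name__ == "__main__":
    print(abaqus_tokens(28))      # one 28-core Haswell node, CPU only
    print(abaqus_tokens(28, 2))   # same node plus 2 GPUs: slightly more tokens
```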
33
Abaqus Summary
• HP ProLiant Gen9 servers deliver superior compute power over their predecessors
– Intel E5-2600 v3 series (Haswell) processors and FDR IB outperform prior generations
– Performs 113% higher vs Westmere, 36% vs Sandy Bridge, and 23% vs Ivy Bridge
• FDR InfiniBand is the most efficient cluster interconnect for Abaqus
– FDR IB outperforms 1GbE by 5.2 times, 10GbE by 87% at 8 nodes
• Abaqus/Standard performs faster than the previous version
– Version 6.14-2 outperforms version 6.13-2 by 8% at 4 nodes / 80 cores
– Both Platform MPI and Intel MPI are included in version 6.14-2
– Enabling SRQ may improve performance at scale
34
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.