National Center for Supercomputing Applications
Engineering Breakthroughs at NCSA: XSEDE, Blue Waters, Industry
Seid Koric, Senior Technical Lead, Private Sector Program at NCSA
Adjunct Professor, Mechanical Science and Engineering Dept., University of Illinois
http://[email protected]
XSEDE ECSS Project: 3D Study of Elastic-Plastic Transition and Fractal Patterns in a 1-Million-Grain Cube of Grade 316 Steel (2010-2012)
(M. Ostoja-Starzewski, Jun Li, S. Koric, A. Saharan, Philosophical Magazine, 2012)
The largest nonhomogeneous FEA simulations to date
Each of the 1 million elements (grains) has a different material property
The fractal dimension can be used to estimate the level of plasticity for damage assessment of various structures (a box-counting sketch follows below)
We are aiming at (much) larger simulations on Blue Waters!
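For illustration only (this is not code from the study): one standard way to estimate a fractal dimension is box counting, covering the grain grid with boxes of decreasing edge length h, counting the boxes N(h) that contain at least one plastified grain, and fitting the slope of log N(h) versus log(1/h). The grid size and synthetic data below are hypothetical.

```c
#include <math.h>
#include <stdio.h>

#define NGRID 64  /* hypothetical 64x64x64 grain grid, not the resolution used in the study */

/* 1 if the grain at (i,j,k) has yielded, 0 otherwise (synthetic placeholder data) */
static int plastic[NGRID][NGRID][NGRID];

/* Count boxes of edge length h (in grains) that contain at least one plastic grain. */
static long count_boxes(int h)
{
    long n = 0;
    for (int i = 0; i < NGRID; i += h)
        for (int j = 0; j < NGRID; j += h)
            for (int k = 0; k < NGRID; k += h) {
                int hit = 0;
                for (int a = i; a < i + h && !hit; a++)
                    for (int b = j; b < j + h && !hit; b++)
                        for (int c = k; c < k + h && !hit; c++)
                            hit = plastic[a][b][c];
                n += hit;
            }
    return n;
}

int main(void)
{
    /* Synthetic plastic set: a spherical plastified region, just to exercise the code. */
    for (int i = 0; i < NGRID; i++)
        for (int j = 0; j < NGRID; j++)
            for (int k = 0; k < NGRID; k++) {
                double dx = i - NGRID / 2.0, dy = j - NGRID / 2.0, dz = k - NGRID / 2.0;
                plastic[i][j][k] = (dx * dx + dy * dy + dz * dz < (NGRID / 3.0) * (NGRID / 3.0));
            }

    /* Least-squares fit of log N(h) vs log(1/h); the slope estimates the fractal dimension. */
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    int npts = 0;
    for (int h = 1; h <= NGRID / 4; h *= 2) {
        double x = log(1.0 / h), y = log((double)count_boxes(h));
        sx += x; sy += y; sxx += x * x; sxy += x * y; npts++;
    }
    double dim = (npts * sxy - sx * sy) / (npts * sxx - sx * sx);
    printf("Estimated box-counting dimension: %.3f\n", dim);
    return 0;
}
```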
Blue Waters: sustained petascale system
• Cray System & Storage cabinets: >300
• Compute nodes: >25,000
• Usable Storage Bandwidth: >1 TB/s
• System Memory: >1.5 Petabytes
• Memory per core module: 4 GB
• Gemini Interconnect Topology: 3D Torus
• Usable Storage: >25 Petabytes
• Peak performance: >11.5 Petaflops
• Number of AMD processors: >49,000
• Number of AMD x86 core modules: >380,000
• Number of NVIDIA GPUs: >5,000
iForge: Industrial HPC resource at NCSA
                   Platform 1            Platform 2
x86 Cores          2048                  576
CPU Type           "Sandy Bridge"        "Abu Dhabi"
Clock              3.2 GHz               3.4 GHz
Cores/Node         16                    32
Memory/Node        128 GB, 1600 MHz      256 GB, 1600 MHz
Global RAMdisk     1.5 Terabytes
Total Memory       21 Terabytes
Storage            700 Terabytes
File system        GPFS
Interconnect       40 Gigabit QDR InfiniBand
MPI                Platform, Intel, MVAPICH2, OpenMP
Operating System   Red Hat Enterprise Linux 6.4
Evaluation of Massively Parallel Linear Solvers in Implicit FEA
• Implicit FEA codes spend 70-80% of their time solving large systems of linear equations, Ax = b, where A is sparse, i.e., most of its coefficients are zero (see the CSR sketch below)
• A wide range of applications: finite element solid mechanics, computational fluid dynamics, reservoir simulation, circuit design, linear programming, etc.
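A minimal illustration (not taken from the benchmark codes) of why sparsity matters: storing only the nonzeros in compressed sparse row (CSR) form lets a matrix-vector product cost O(NNZ) instead of O(N^2). The tiny matrix and names below are hypothetical.

```c
#include <stdio.h>

/* y = A*x for a sparse matrix stored in CSR form: only the NNZ nonzeros are kept. */
static void csr_matvec(int n, const int *row_ptr, const int *col_idx,
                       const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* 4x4 SPD tridiagonal example: 2 on the diagonal, -1 off-diagonal (10 nonzeros). */
    int    row_ptr[] = {0, 2, 5, 8, 10};
    int    col_idx[] = {0, 1, 0, 1, 2, 1, 2, 3, 2, 3};
    double val[]     = {2, -1, -1, 2, -1, -1, 2, -1, -1, 2};
    double x[]       = {1, 1, 1, 1};
    double y[4];

    csr_matvec(4, row_ptr, col_idx, val, x, y);
    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```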
Problem Specification (matrices)
• Originate either from in-house industrial and academic codes, or from a commercial FE code solving real-world engineering problems
• Mostly SPD with N = 1-20M, NNZ = 120-500M
• Condition numbers: 10^3-10^12
Problem Specification (solvers)
• WSMP: direct solver developed by IBM/Watson, based on the multifrontal algorithm, hybrid (MPI & pthreads), symmetric and nonsymmetric
• SuperLU: direct solver developed by LBNL, LU decomposition, MPI, nonsymmetric
• MUMPS: direct solver funded by CEC ESPRIT IV, multifrontal algorithm, MPI, symmetric and nonsymmetric
• Hypre: iterative solver, LLNL, Conjugate Gradient with AMG, IC, and SAI (Sparse Approximate Inverse) preconditioners, MPI, symmetric
• PETSc: iterative solver, ANL, Conjugate Gradients (CG), Bi-Conjugate Gradient Stabilized (BCGS), Conjugate Residual (CR) with Bjacobi, ASM (Additive Schwarz), and AMG (Multigrid) preconditioners, MPI, symmetric and nonsymmetric
• Commercial FEA codes (NDA)
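For context, a minimal sketch of how an iterative case such as the CG/Bjacobi PETSc combination is typically set up. This is illustrative, not the actual benchmark driver: it assembles a simple 1D Laplacian rather than reading the industrial matrices, omits error checking, and assumes a reasonably recent PETSc API.

```c
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat      A;
    Vec      x, b;
    KSP      ksp;
    PC       pc;
    PetscInt n = 1000, Istart, Iend;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Assemble a simple SPD tridiagonal (1D Laplacian) stand-in for the test matrices. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (PetscInt i = Istart; i < Iend; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    /* CG with block Jacobi preconditioning and relative tolerance 1e-5, as in the
       benchmark labels; other pairs can be chosen at run time via -ksp_type/-pc_type. */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPCG);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCBJACOBI);
    KSPSetTolerances(ksp, 1.0e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
    KSPSetFromOptions(ksp);

    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp);
    VecDestroy(&x);
    VecDestroy(&b);
    MatDestroy(&A);
    PetscFinalize();
    return 0;
}
```

Run under MPI with, e.g., `mpiexec -n 16 ./solve -ksp_type bcgs -pc_type asm` to exercise other PETSc solver/preconditioner combinations from the charts.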
Solver Work in Progress (iForge now)
[Chart] Solution time [sec] for Matrix 1M: SPD, N=1.5M, NNZ=63.6M, COND=6.9E4; lower = better. Solvers compared at 16, 32, 64, 128, and 256 cores: CG/Bjacobi (PETSc, Rconv=1.E-5), BCGS/Bjacobi (PETSc, Rconv=1.E-5), BCGS/ASM (PETSc, Rconv=1.E-5), CR/Bjacobi (PETSc, Rconv=1.E-5), PCG/ParaSails (Hypre, Rconv=1.E-5), MUMPS (SPD, direct), WSMP (SPD, direct), SuperLU (unsymmetric, direct).
An order of magnitude larger problem
[Chart] Solution time [sec] for Matrix 20M: SPD, N=20.05M, NNZ=827.49M, COND=~1.E7; lower = better. Solvers compared at 16-512 cores: CR/Bjacobi (PETSc, Rconv=1.0E-5), WSMP (SPD, direct), PCG/ParaSails (Hypre, Rconv=1.0E-5), MUMPS (SPD, direct).
WSMP Performance on iForge (Higher = Better)
[Chart] Watson Sparse Matrix Package hybrid (MPI/Pthreads) symmetric solver, N=2.8M, NNZ=107M: sparse factorization performance [TFlop/sec] vs. number of threads (128-960), comparing X5690/Westmere and XE5-2670/Sandy Bridge nodes.
ABAQUS model: 2,274,403 elements; 12,190,073 nodes; >30M DOFs
ABAQUS analysis job: cluster iForge; 24-196 cores; direct sparse solver; wall time reduced from 7 hours to 1 hour
ISV Implicit FEA Benchmark on iForge
[Chart] Wall clock time (sec) vs. number of cores (up to ~250).
Explicit FEA: LS-Dyna on Blue Waters
NCSA/PSP, the hardware vendor (Cray), the ISV (LSTC), and a PSP partner (NDA), all working together!
Real geometry, loads, and BCs; a highly nonlinear transient dynamic problem with difficult contact conditions
The MPP-Dyna solver is fully ported and optimized for Cray's Linux Environment and takes full advantage of the Gemini interconnect
LS-Dyna Breakthrough on Blue Waters
[Chart] LS-DYNA, 26.5M nodes, 80M DOFs: wall clock time in hours vs. CPU cores (512-8192), lower = better; iForge (MPI), Blue Waters (MPI), Blue Waters (Hybrid).
Highest known scaling of LS-DYNA to date!
Typical MPP-Dyna Profiling
As the number of cores increases, the communication cost increases rapidly!
[Charts] Computing vs. communication time breakdown at 64 cores and at 512 cores.
Dyna Work in progress
• Benchmarking even larger real problems
• Memory management is becoming a serious issue for DP (decomposition, distribution, MPMD, etc.)
• The hybrid (MPI/OpenMP) solver uses less memory and less communication (see the sketch below)
• Load balance in contact and rigid-body algorithms
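To illustrate the hybrid idea only (LS-DYNA's own hybrid implementation is proprietary): with one MPI rank per node and OpenMP threads inside the node, there are fewer subdomains, halo buffers, and MPI messages than in a flat-MPI run with one rank per core, which is where the memory and communication savings come from. The problem size and names below are hypothetical.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long n_local = 1 << 20;              /* hypothetical node-local problem size */
    double *u = malloc(n_local * sizeof *u);

    /* OpenMP threads share the node-local data; no MPI calls inside the parallel region. */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (long i = 0; i < n_local; i++) {
        u[i] = (double)(rank + 1) / (i + 1);
        local_sum += u[i];
    }

    /* Only one message per node participates, instead of one per core. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, global sum = %f\n",
               nranks, omp_get_max_threads(), global_sum);

    free(u);
    MPI_Finalize();
    return 0;
}
```

Launched, for example, with one rank per node and OMP_NUM_THREADS set to the number of cores per node.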
Star‐CCM+ Breakthrough on Blue Waters
Source: NCSA Private Sector Partner "B" (Confidential)
Code/Version: Star-CCM+ 7.6.9
Physics: Transient, turbulent, single-phase compressible flow
Mesh size: 21.4 million unstructured polyhedral cells
Complexity: Very complicated geometry, high-resolution mesh
A complex real-life production case: highly demanding both in terms of the mesh and the physics involved.
[Chart] Iterations per simulation hour vs. CPU cores (0-2048) on iForge and Blue Waters.
Scaling with Infiniband levels off at 256 cores
Highest known scaling of Star‐CCM+ to date…
…and we broke the code!
CD-adapco Star-CCM+ case from "Partner B": iterations per simulation hour, higher = better
Future of HPC: GPGPU with OpenACC?
[Chart] Laplace 2D wall clock [sec], lower = better: CPU only (1 OMP thread), CPU only (6 OMP threads), and GPU (OpenACC), on Blue Waters XK7 (Interlagos/Kepler) and KIDS (Westmere/Fermi). 14x speedup with the GPU!
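The Laplace 2D case in the chart follows the standard Jacobi-iteration example commonly used to introduce OpenACC. A minimal sketch of that pattern (the grid size, tolerance, and loop structure here are illustrative, not the exact benchmark code):

```c
#include <math.h>
#include <stdio.h>
#include <string.h>

#define NX 1024
#define NY 1024

static double A[NY][NX], Anew[NY][NX];

int main(void)
{
    const double tol = 1.0e-5;
    const int    iter_max = 1000;
    double       err = 1.0;
    int          iter = 0;

    /* Boundary condition: left edge held at 1.0, everything else starts at 0. */
    memset(A, 0, sizeof A);
    for (int j = 0; j < NY; j++) A[j][0] = 1.0;

    /* Keep both grids resident on the GPU for the whole iteration loop. */
    #pragma acc data copy(A) create(Anew)
    while (err > tol && iter < iter_max) {
        err = 0.0;

        /* Jacobi sweep: each interior point becomes the average of its four neighbors. */
        #pragma acc parallel loop reduction(max : err)
        for (int j = 1; j < NY - 1; j++) {
            #pragma acc loop
            for (int i = 1; i < NX - 1; i++) {
                Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                     A[j + 1][i] + A[j - 1][i]);
                err = fmax(err, fabs(Anew[j][i] - A[j][i]));
            }
        }

        /* Copy the update back into A for the next sweep. */
        #pragma acc parallel loop
        for (int j = 1; j < NY - 1; j++) {
            #pragma acc loop
            for (int i = 1; i < NX - 1; i++)
                A[j][i] = Anew[j][i];
        }
        iter++;
    }

    printf("err = %g after %d iterations\n", err, iter);
    return 0;
}
```

The same loops run on the CPU when the OpenACC directives are ignored, which is essentially what the "CPU only" bars in the chart measure.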
Inter-Nodal GPU Acceleration on Blue Waters with Abaqus
[Chart] Abaqus/Standard 6.11 in Cluster Compatibility Mode, S4B benchmark (5.23M DOFs): parallel speedup vs. cores (8-96), higher = better; Cray XE6 (CPU only) vs. Cray XK7 (CPU+GPU).
NDEMC Public-Private Partnership
• US OEMs have gained a competitive edge through the use of high performance computing (HPC) with modeling, simulation, and analysis (MS&A).
• The US Council on Competitiveness recognized that small and medium-sized enterprises (SMEs) are not able to take advantage of HPC.
• In the fall of 2011, a regional pilot program was launched in the Midwestern supply base.
Objective: Study the fatigue life of a charge air cooler under thermal stresses for the NDEMC project.
Description: Three-step sequentially coupled simulation
(1) CFD analysis of turbulent fluid flow through the CAC, coupled with advective heat transfer, provides thermal BCs for the FEA.
(2) Thermo-mechanical FEA provides the transient thermal stresses in the solid part during the thermal cycle for the fatigue analysis.
(3) The fatigue model uses the history of thermal stresses to estimate the cycle life at critical points. (15M-node mesh)
NDEMC: Multiphysics Simulation of Charge Air Cooler (CAC)