Considerations for Scalable CAE on the SGI ccNUMA Architecture


Stan Posey, Applications Market Development

Cheng Liao, Principal Scientist, FEA Applications

Christian Tanasescu, CAE Applications Manager

Topics of Discussion

Historical Trends of CAE

Current Status of Scalable CAE

Future Directions in Applications

[Figure labels: Mainframes; Workstations and Servers]

Economics: Physical prototyping costs continue increasing; engineer more expensive than simulation tools

[Figure: cost vs. years, 1960 to 2000; curves: cost of CAE simulation, cost of physical prototyping, cost of CAE engineer]

MSC/NASTRAN Simulation Costs (Source: General Motors): $30,000 in 1960; $0.02 in 1999

CAE Engineer vs. System Costs (Source: Detroit Big 3): Engineer $36/hr; System $1.5/hr

Motivation for CAE Technology

Computer Hardware Advances:
• Processors: Ability to “hide” system latency
• Architecture: ccNUMA: Crossbar switch replaces shared bus

Recent Technology Achievements

Rapid CAE Advancement from 1996 to 1999

Late 1980’s: Shared Memory Parallel
• Hardware: Bus-based shared memory parallel (SMP)
• Parallel Model: Compiler enabled loop level (SMP fine grain)
• Characteristics: Low scalability (2p to 6p) but easy to program
• Limitations: Expensive memory for vector architectures

Early 1990’s: Distributed Memory Parallel
• Hardware: MPP and cluster distributed memory parallel (DMP)
• Parallel Model: DMP coarse grain through explicit message passing
• Characteristics: High scalability (> 64p) but difficult to program
• Limitations: Commercial CAE applications generally unavailable

Late 1990’s: Distributed Shared Memory Parallel
• Hardware: Physically DMP but logically SMP ccNUMA
• Parallel Model: SMP fine grain, DMP and SMP coarse grain
• Characteristics: High scalability and easy to program

Recent History of Parallel Computing

Origin ccNUMA Architecture Basics

[Diagram: two nodes joined by a global switch interconnect; each node has two processors (Proc.) with caches, main memory with a directory (Dir), I/O, and a local switch]

Features of ccNUMA Multi-purpose Architecture

[Diagrams: Detail of Two Node (w/Router); Architecture (32p Topology) showing nodes and routers]

• Origin2000 ccNUMA available since 1996

• Non-blocking crossbar switch as interconnect fabric

• High levels of scalability over shared bus SMP

• Physical DMP but logical SMP (synchronized cache memories)

• 2 to 512 MIPS R12000/400 MHz processors with 8 MB L2 cache

• High memory bandwidth (1.6Gb/s) and I/O that is scalable

• Distributed and shared memory (fine and coarse) parallel models

Parallel Computing with ccNUMA

[Image: Origin2000/256]

Computer Hardware Advances:
• Processors: Ability to “hide” system latency
• Architecture: ccNUMA: Crossbar switch replaces shared bus

Application Software Advances:
• Implicit FEA: Sparse solvers increase performance by 10-fold
• Explicit FEA: Domain parallel increases performance by 10-fold
• CFD: Scalability increases performance by 100-fold
• Meshing: Automatic and robust “tetra” meshing

Recent Technology Achievements

Rapid CAE Advancement from 1996 to 1999

Characterization of CAE Applications

[Figure: degree of parallelism (low to high) vs. compute intensity in flops/word of memory traffic (0.1 to 1000, log scale); axis annotations: Memory BW, Cache-friendly; regions: MP SCALAR, VECTOR. Application classes: CFD; Explicit FEA; Implicit FEA (Statics); Implicit FEA (Modal Freq); Implicit FEA (Direct Freq). Codes plotted: FLUENT, STAR-CD, OVERFLOW, PAM-CRASH, LS-DYNA, RADIOSS, ABAQUS, ADINA, ANSYS, MARC, MSC.Nastran (101), MSC.Nastran (103 and 111), MSC.Nastran (108)]












Implicit FEA - ABAQUS, ANSYS, MSC.Marc, MSC.Nastran

Explicit FEA - LS-DYNA, PAM-CRASH, RADIOSS

General CFD - CFX, FLUENT, STAR-CD

Domain Parallel Example:

Compressible 2D flow over wedge, partitioned as 4 domains for parallel execution on 4 processors

[Figure: wedge flow domain split into partitions 1 to 4, each mapped to one of CPU1 to CPU4 of the system]

Scalable CAE: Domain Decomposition Parallel
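As a minimal sketch of the domain-parallel idea, the Python below splits a structured 2D grid into a 2 x 2 arrangement of four partitions and counts the cell faces that fall on partition boundaries, a rough stand-in for communication volume. The grid size, the helper names, and the block layout are illustrative assumptions, not taken from any of the codes listed here; production CAE codes typically partition unstructured meshes with dedicated graph partitioners.

```python
def partition_2d(nx, ny, px, py):
    """Assign each cell (i, j) of an nx-by-ny grid to one of px*py blocks."""
    owner = {}
    for i in range(nx):
        for j in range(ny):
            bi = i * px // nx  # block index along x
            bj = j * py // ny  # block index along y
            owner[(i, j)] = bj * px + bi
    return owner


def cells_per_domain(owner):
    """Count how many cells each domain owns (the computational load)."""
    counts = {}
    for domain in owner.values():
        counts[domain] = counts.get(domain, 0) + 1
    return counts


def interface_faces(owner, nx, ny):
    """Count faces between neighboring cells owned by different domains."""
    faces = 0
    for (i, j), domain in owner.items():
        if i + 1 < nx and owner[(i + 1, j)] != domain:
            faces += 1
        if j + 1 < ny and owner[(i, j + 1)] != domain:
            faces += 1
    return faces


# Hypothetical 40 x 20 grid split into 4 domains for 4 processors.
owner = partition_2d(nx=40, ny=20, px=2, py=2)
print(cells_per_domain(owner))
print(interface_faces(owner, 40, 20))
```

Equal cell counts per domain keep the processors evenly loaded, while a small interface count keeps the data exchanged between neighboring partitions low.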

Scalability Emerging for all CAE

Parallel Scalability in CAE

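[Figure: parallel scalability by application class, # CPUs on a log scale from 1 to 512; bars show usable parallel and peak parallel CPU counts for Nastran (SOL 101, 103, 108; SMP and DMP; V70.5 and V70.7), CFD codes, and crash codes]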

Sources that Inhibit Efficient Parallelism

Source                                                   Solution
Computational load imbalance                             Nearly equal sized partitions
Communication overhead between neighboring partitions    Minimize communication between adjacent cells on different CPUs
Data and process placement                               Enforce memory-process affinity
Message passing performance                              Latency and bandwidth awareness
MPICH latency: ~31 µs                                    SGI-MPI 3.1 latency: ~12 µs
Scaling to 16p only                                      Scaling to 64p!

Considerations for Scalable CAE
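To illustrate why nearly equal sized partitions matter, the sketch below uses the common bound that, when all processors synchronize each step, parallel efficiency cannot exceed mean load divided by maximum load. The partition sizes are made-up values for illustration, and the function name is hypothetical.

```python
def load_balance_efficiency(cells_per_partition):
    """Upper bound on parallel efficiency: mean load / max load.

    Assumes work is proportional to cell count and that every
    processor waits for the most heavily loaded one each step.
    """
    mean_load = sum(cells_per_partition) / len(cells_per_partition)
    return mean_load / max(cells_per_partition)


# Hypothetical 1000-cell model on 4 processors.
balanced = [250, 250, 250, 250]
imbalanced = [400, 200, 200, 200]

print(load_balance_efficiency(balanced))
print(load_balance_efficiency(imbalanced))
```

In the imbalanced case three processors idle while the fourth finishes, so adding processors without rebalancing yields diminishing returns.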

Processor-Memory Affinity (Data Placement)

[Diagram: router (R) and node (N) topology; labels: "Process + Data" and "Process migrates, data stays"]

Theory: the system will place data and execution threads together properly, and will migrate that data to follow the executing process.

Real life (32p Origin 2000): the process migrates, the data stays.

Considerations for Scalable CAE

Time per Iteration (seconds) and speed-up relative to 10 CPUs

CPUs   SSI time (s)   SSI speed-up   4 x 64 time (s)   4 x 64 speed-up
10     381            1.0            424               1.0
30     99             3.9            139               3.0
60     67             5.7            72                5.9
120    29             13.1           39                10.9
240    18             21.2           49                8.7

Software: FLUENT 5.1.1
CFD Model: External aerodynamics, 3D, , segregated incompressible, iso-thermal, 29M cells

FLUENT Scalability on ccNUMA

FLUENT Scalability Study of SSI vs. Cluster

Largest FLUENT automotive case achieved near ideal scaling on SGI 2800/256
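The speed-up figures for the FLUENT study can be recomputed from the time-per-iteration values, as in the sketch below. It takes the 10-CPU run as the baseline, as the slide does, and also reports parallel efficiency (speed-up divided by the ideal CPU ratio), which is an added metric not shown in the original table. Function and variable names are illustrative.

```python
cpus = [10, 30, 60, 120, 240]
time_ssi = [381, 99, 67, 29, 18]        # seconds per iteration, SSI
time_cluster = [424, 139, 72, 39, 49]   # seconds per iteration, 4 x 64


def speedup_and_efficiency(cpu_counts, times):
    """Return (cpus, speed-up, efficiency) rows relative to the first entry."""
    base_cpus, base_time = cpu_counts[0], times[0]
    rows = []
    for p, t in zip(cpu_counts, times):
        speedup = base_time / t
        ideal = p / base_cpus
        rows.append((p, speedup, speedup / ideal))
    return rows


for label, times in (("SSI", time_ssi), ("4 x 64", time_cluster)):
    for p, s, e in speedup_and_efficiency(cpus, times):
        print(f"{label:7s} {p:4d} CPUs  speed-up {s:5.1f}  efficiency {e:4.2f}")
```

Recomputed values can differ from the slide in the last digit because the published times are rounded. The data also show the cluster's time per iteration rising from 39 s to 49 s between 120 and 240 CPUs, while the SSI time keeps falling.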

Single System Image (SSI) Latency

CPUs   Shared Memory (ns)   MPI (ns)
8      528                  19 x 10^3
16     641                  23 x 10^3
32     710                  26 x 10^3
64     796                  29 x 10^3
128    903                  34 x 10^3
256    1200                 44 x 10^3

Cluster Configuration Latency

HIPPI osBYPASS (ns): 139 x 10^3

[Diagram: 256cpu SSI vs. 4 x 64 Cluster]

SSI Advantage for CFD with MPI
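A small sketch of the latency comparison follows, using the values from the latency table. It assumes the HIPPI osBYPASS figure is in nanoseconds like the MPI column; the variable and function names are illustrative.

```python
cpus = [8, 16, 32, 64, 128, 256]
shared_ns = [528, 641, 710, 796, 903, 1200]
mpi_ns = [19_000, 23_000, 26_000, 29_000, 34_000, 44_000]
hippi_ns = 139_000  # cluster interconnect latency, assumed to be in ns


def latency_ratios():
    """Return (cpus, MPI/shared-memory ratio, HIPPI/MPI ratio) per size."""
    return [
        (p, mpi / shared, hippi_ns / mpi)
        for p, shared, mpi in zip(cpus, shared_ns, mpi_ns)
    ]


for p, mpi_vs_shared, hippi_vs_mpi in latency_ratios():
    print(f"{p:4d} CPUs  MPI/shared {mpi_vs_shared:5.1f}x  "
          f"HIPPI/MPI {hippi_vs_mpi:4.1f}x")
```

Even at 256 CPUs, in-box MPI latency stays roughly three times lower than the cluster interconnect latency under this reading of the table.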

[Figure: OVERFLOW performance (GFLOP/s, 0 to 75) vs. number of CPUs (0 to 512); annotations: 60 GFLOPS, Oct 99; FY98 Milestone; C916/16 OVERFLOW Limit]

Problem: 35M points, 160 zones

NASA Ames Research Center

Largest model in NASA history, achieved 60 Gflops on SGI 2800/512 with linear scaling

OVERFLOW Complete Boeing 747 Aerodynamics Simulation

Boeing Commercial Aircraft

Grand Scale HPC: NASA and Boeing

Computational Requirements for MSC.Nastran

Compute Task           Memory Bandwidth   CPU Cycles
Sparse Direct Solver   7%                 93%
Lanczos Solver         60%                40%
Iterative Solver       83%                17%
I/O Activity           100%               0%

MSC/NASTRAN MPI Based Scalability for SOL 108:

• Independent frequency steps, naturally parallel

• File and memory space not shared

• Near linear parallel scalability

• Improved accuracy over SOL 111 with increasing frequency

• Released on SGI with v70.7 (Oct 99)

MSC/NASTRAN MPI Based Scalability for SOL 103, 111:

• Typical scalability - 2x to 3x on 8p, less for SOL 111

MSC.Nastran Scalability on ccNUMA

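Parallel Schematics

Parallel schemes for an excitation frequency of 200 Hz on a 4 CPU system

[Figure, Modes scheme: frequency axis marked 0 Hz, 100 Hz, 200 Hz, 300 Hz, 400 Hz, divided into four segments: CPU 1, 150 modes; CPU 2, 350 modes; CPU 3, 300 modes; CPU 4, 200 modes]

[Figure, Freqs scheme: frequency axis marked 0 Hz, 50 Hz, 100 Hz, 150 Hz, 200 Hz, divided into four segments: CPU 1, 1 - 50; CPU 2, 51 - 100; CPU 3, 101 - 150; CPU 4, 151 - 200]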
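The frequency-based scheme can be sketched as a simple block partition of independent frequency steps over MPI processes. The helper below is a hypothetical illustration, not MSC.Nastran's actual scheduling logic, and it assumes there are at least as many steps as processes.

```python
def split_steps(n_steps, n_procs):
    """Split 1-based step indices into contiguous, near-equal blocks.

    Returns a list of (first_step, last_step) tuples, one per process.
    Assumes n_procs <= n_steps.
    """
    base, extra = divmod(n_steps, n_procs)
    blocks = []
    start = 1
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)  # spread the remainder
        blocks.append((start, start + size - 1))
        start += size
    return blocks


# 200 frequency steps on 4 CPUs, as in the schematic.
print(split_steps(200, 4))
# 96 frequency steps on 32 processes gives 3 steps each.
print(split_steps(96, 32)[:3])
```

Because each frequency step is solved independently, the blocks need no communication during the solve, which is why this scheme scales nearly linearly.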


[Figure: bar chart of wall clock time in seconds (0 to 140000) for 1-way, 4-way, and 16-way runs]

CPUs   Elapsed Time (s)   Parallel Speed-up
1      120720             1.0
2      61680              2.0
4      32160              3.8
8      17387              6.9
16     10387              11.6 (*)

(*) measured on populated nodes

Cray T90 Baseline Results
SOL: 111
DOF: 525K
Eigensolution: 2714 modes
Freq Steps: 96
Elap Time: 31610 sec

SOL 108 Comparison with Conventional NVH (SOL 111 on T90)
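One way to read the elapsed times in the SOL 108 table is through the Karp-Flatt metric, which estimates the effective serial fraction from a measured speed-up. This analysis is an addition for illustration and was not part of the original study; the function names are hypothetical.

```python
elapsed_s = {1: 120720, 2: 61680, 4: 32160, 8: 17387, 16: 10387}


def speedup(p):
    """Measured speed-up on p CPUs relative to the 1-CPU run."""
    return elapsed_s[1] / elapsed_s[p]


def karp_flatt(s, p):
    """Experimentally determined serial fraction for speed-up s on p CPUs."""
    return (1.0 / s - 1.0 / p) / (1.0 - 1.0 / p)


for p in (2, 4, 8, 16):
    s = speedup(p)
    print(f"{p:2d} CPUs  speed-up {s:5.2f}  serial fraction {karp_flatt(s, p):.4f}")
```

A serial fraction that stays small and roughly constant as CPUs are added is consistent with the naturally parallel structure of independent frequency steps.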


CPUs   Elapsed Time (h)   Parallel Speed-up
1      31.7               1.0
8      4.1                7.8
16     2.2                14.2
32     1.4                22.6

Model Description
Model: BIW
SOL: 108
DOF: 536K
Freq Steps: 96

Run Statistics (per MPI Process)
Memory: 340 MB
FFIO Cache: 128 MB
Disk Space: 3.6 GB
Process/Node: 2

MSC.Nastran Parallel Scalability for Direct Frequency Response (SOL 108)


The Future of Automotive NVH Modeling

Higher excitation frequencies of interest will increase DOF and modal density beyond the practical limits of SOL 103 and 111

[Figure: elapsed time vs. frequency for direct frequency response (SOL 108) and modal frequency response (SOL 103, 111), with 199X and 200X model ranges marked]

Future Automotive NVH Modeling





[Figure: roadmap chart; labels: Capability/Features, General Availability, UNICOS/Vector, IRIX/MIPS SSI, Linux/IA-64 Clusters & SSI, Functionality Migration]

Economics of HPC Rapidly Changing

SGI Partnership with HPC Community on Technology Roadmap

• Bandwidth improvement of 2x over Origin2000

• System support for IRIX/MIPS or LINUX/IA-64

• Modular design allows subsystem upgrades without forklift

• Latency decrease by 50% over Origin2000

• Next Generation IRIX Features and Improvements

SN-MIPS: Features of Next Generation ccNUMA

• Shared memory to 512 processors and beyond
• RAS enhancements: Resiliency and Hot Swap
• Data center management: scheduling, accounting
• HPC clustering: GSN, CXFS shared file system

HPC Architecture Roadmap at SGI


[Figure: the CAE application characterization chart (degree of parallelism vs. compute intensity) repeated with overlay regions labeled "SN-MIPS Benefit" and "SN-IA Benefit"]

Current as of SEP 1999

[Figure: architecture mix in % (0 to 100) by region; legend: MP Scalar, Vector]

Year   USA           Europe        Japan
1997   49.8 / 50.2   31.1 / 68.9   78.3 / 21.7
1999   18.3 / 81.7   35.2 / 64.8   72.5 / 27.5

1999: 2.9 TFlops installed in Automotive OEMs world wide

1997: 1.1 TFlops installed in Automotive OEMs world wide

Architecture Mix for Automotive HPC

[Figure: installed GFLOPs (0 to 1400) per year, 1995 to 1999, for Europe, Japan, and US]

GM and DaimlerChrysler each grew capacity more than 2x over past year

Automotive Industry HPC Investments

Meta-Computing with Explicit FEA

Los Alamos and DOE Applied Engineering Analysis: “Stochastic Simulation of 18 CPU Years Completed in 3 Days on ASCI Blue Mtn”

USDOE supported research achieved “first-ever” full-scale ABAQUS/Explicit simulation of nuclear weapons impact response on Origin/6144 ASCI (Feb 00)

Ford Motor SRL and NASA Langley: Optimization of a vehicle body for NVH and crash, completed 9 CPU months of RADIOSS and MSC.Nastran overnight with response surface technique (Apr 00)

BMW Body Engineering: 672 MIPS CPUs dedicated to stochastic crash simulation with PAM-CRASH (Jan 00)

Non-deterministic methods for improved FEA simulation

Future Directions in CAE Applications

Meta-Computing with Explicit FEA

Objective:
• Manage design uncertainty from variability
  – Scatter in materials, loading, test conditions

Approach:
• Non-deterministic simulation of vehicle “population”
  – Meta-computing on SSI or large cluster

Insight:
• Improved design space exploration
  – Moving design towards target parameters

[Figure labels: Unlikely Performance; Most likely Performance]
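The non-deterministic approach can be sketched as a Monte Carlo loop: sample the scattered inputs, run one simulation per sample, and look at the distribution of results. In the Python below, `crash_response` is a hypothetical closed-form stand-in for a full explicit FEA run, and the input distributions (sheet thickness and yield stress) are invented for illustration only.

```python
import random
import statistics


def crash_response(thickness_mm, yield_mpa):
    """Hypothetical stand-in for one explicit FEA run (not a real model)."""
    return 1000.0 / (thickness_mm * yield_mpa ** 0.5)


def simulate_population(n_runs, seed=0):
    """Sample scattered inputs and evaluate one response per sample."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        thickness = rng.gauss(1.5, 0.05)     # assumed scatter in gauge, mm
        yield_stress = rng.gauss(300.0, 15.0)  # assumed material scatter, MPa
        results.append(crash_response(thickness, yield_stress))
    return results


results = simulate_population(1000)
print("mean response:", statistics.mean(results))
print("std deviation:", statistics.stdev(results))
```

Each sample is independent of the others, so the runs can be farmed out across the CPUs of an SSI system or a large cluster with no communication between them.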

NASA Langley Research Center

Achieved overnight BIP optimization on SGI 2800/256, with equivalent yield of 9 months CPU time

NVH & Crash Optimization of Vehicle Body Overnight

Ford Motor Scientific Research Labs

• Ford body-in-prime (BIP) model of 390K DOF

• MSC.Nastran for NVH, 30 design variables

• RADIOSS for crash, 20 design variables

• 10 design variables in common

• Sensitivity based Taylor approx. for NVH

• Polynomial response surface for crash

Grand Scale HPC: NASA and Ford
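A sensitivity-based Taylor approximation replaces expensive re-analysis with a first-order estimate around a baseline design. The sketch below uses forward finite differences for the sensitivities and a made-up two-variable `response` function; a real NVH study would obtain sensitivities from the solver, and all names here are hypothetical.

```python
def finite_difference_sensitivities(f, x0, h=1e-6):
    """Return f(x0) and forward-difference sensitivities df/dx_i at x0."""
    f0 = f(x0)
    grads = []
    for i in range(len(x0)):
        x = list(x0)
        x[i] += h
        grads.append((f(x) - f0) / h)
    return f0, grads


def taylor_approx(f0, grads, x0, x):
    """First-order Taylor estimate of f at x around the baseline x0."""
    return f0 + sum(g * (xi - x0i) for g, xi, x0i in zip(grads, x, x0))


def response(x):
    """Hypothetical response in two design variables (illustration only)."""
    return x[0] ** 2 + 3.0 * x[1]


baseline = [1.0, 2.0]
f0, grads = finite_difference_sensitivities(response, baseline)
candidate = [1.1, 2.1]
print("Taylor estimate:", taylor_approx(f0, grads, baseline, candidate))
print("exact value:    ", response(candidate))
```

The estimate is cheap to evaluate inside an optimization loop but is accurate only near the baseline design, which is one reason strongly nonlinear responses such as crash are handled with a fitted polynomial response surface instead.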

[Figure: growth index (1 to 100) from 1993 to 1999 for: crash model size (450000 elem.), NVH model size (2 Mil. DOF), CFD model size (>10 Mil cells), number of engineers, cost per CPU-hour, capacity in GFlops (#1: 564 Gflops), turnaround time for crash (SMP), and turnaround time for crash and CFD (MPP). Growth factors shown on the chart: x5, x6, x6, x6, x7, x40, x90, x90+]

Historical Growth of CAE Application

Source: Survey of major automotive developers

CAE to evolve into fully scalable, RISC-based technology
  – High resolution models: CFD today; Crash, FEA emerging

Deterministic CAE giving way to probability techniques
  – Deployment increases computational requirements 10-fold

Visual interaction with models beyond 3M cell/DOF
  – High resolution modeling will strain visualization technology

Multi-Discipline optimization (MDO) implementation in earnest
  – Coupling of structure, fluids, acoustics, electromagnetics

Future Directions of Scalable CAE

Conclusions

For small and medium-sized problems, a cluster can be a viable solution in the range of 8 to 16 CPUs.

For large and extremely large problems, the SSI architecture provides better parallel performance because of the superior characteristics of its in-box interconnect.

To increase single-CPU performance, developers should consider how the data structures and algorithms they use map onto the specific memory hierarchy.

A ccNUMA system allows various parallel programming paradigms to be coupled, which could benefit the performance of multiphysics applications.
