Page 1

Principles of Scalable HPC System Design

March 6, 2012

Sue Kelly
Sandia National Laboratories

http://www.sandia.gov/~smkelly

Abstract: Sandia National Laboratories has a long history of successfully applying high performance computing (HPC) technology to solve scientific problems. We drew upon our experiences with numerous architectural and design features when planning our most recent computer systems. This talk will present the key issues that were considered. Important principles are performance balance between the hardware components and scalability of the system software. The talk will conclude with lessons learned from the system deployments.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Page 2

Outline

• A definition of HPC for scientific applications
• Design Principles

– Partition Model

– Network Topology

– Balance of Hardware Components

– Scalable System Software

• Lessons Learned

Page 3

What is High Performance Computing?

• (n.) A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors. (http://www.webopedia.com/TERM/H/High_Performance_Computing.html)

• Will not talk about embarrassingly parallel applications

• The idea/premise of scientific parallel processing is not new (http://www.sandia.gov/ASC/news/stories.html#nineteen-twenty-two)
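As a minimal illustration of the definition above (not from the talk), the sketch below divides one large sum into pieces that separate worker processes execute simultaneously; the array size and worker count are arbitrary assumptions.

```python
# Minimal sketch (not from the talk): split one big sum into pieces
# that separate worker processes execute simultaneously.
from multiprocessing import Pool

def partial_sum(chunk):
    """Each worker computes its piece of the global sum."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))   # the "problem" (arbitrary size)
    n_workers = 4                   # assumed worker count
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]
    chunks[-1].extend(data[n_workers * size:])   # remainder goes to the last piece

    with Pool(n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))   # combine the pieces
    assert total == sum(data)
    print(total)
```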

Page 4

The Partition Model: Match the hardware & software to its function

[Diagram: the system divided into partitions – Users /home, Service, Compute Partition, Parallel I/O, and Net I/O]

• Applies to both hardware and software
• Physically and logically divide the system into functional units
• Compute hardware has a different configuration than the service & I/O hardware
• Only run the software necessary to perform the function
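As a sketch of the idea (illustrative only; the partition names and software lists below are assumptions loosely based on later slides, not the deck's exact configuration), each partition gets only the hardware and software its function needs:

```python
# Illustrative sketch: each partition runs only the software its function needs.
partitions = {
    "compute": {"hardware": "many identical nodes, minimal peripherals",
                "software": ["lightweight kernel", "application runtime"]},
    "service": {"hardware": "a few fully featured login/admin nodes",
                "software": ["full Linux", "login shells", "compilers", "batch system"]},
    "io":      {"hardware": "nodes attached to storage",
                "software": ["full Linux", "parallel file system servers"]},
}

for name, cfg in partitions.items():
    print(f"{name:8s} -> {', '.join(cfg['software'])}")
```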

Page 5

Usage Model: Partitions cooperate to appear as one system

[Diagram: Linux login (service) node, compute resource, and I/O partitions presented to the user as one system]

Page 6

Mesh/Torus topologies are scalable

[Diagram: 12,960-node compute mesh with X=27, Y=20, Z=24 and a torus interconnect in Z, plus 310 service & I/O nodes]
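One reason a mesh/torus scales is that each node needs links only to its nearest neighbors, which are trivial to compute from its coordinates. The sketch below is an illustration only, not the machine's routing code; it uses the slide's dimensions and assumes wraparound in Z with a plain mesh in X and Y, matching the slide's labels.

```python
# Illustrative sketch: nearest neighbors of a node in a 27 x 20 x 24 system
# that is a mesh in X and Y (no wraparound) and a torus in Z (wraps around).
X, Y, Z = 27, 20, 24   # dimensions from the slide

def neighbors(x, y, z):
    """Coordinates of the directly connected neighbors of node (x, y, z)."""
    result = []
    for dx, dy, dz in [(1, 0, 0), (-1, 0, 0),
                       (0, 1, 0), (0, -1, 0),
                       (0, 0, 1), (0, 0, -1)]:
        nx, ny = x + dx, y + dy
        nz = (z + dz) % Z                    # Z wraps (torus)
        if 0 <= nx < X and 0 <= ny < Y:      # X and Y do not (mesh)
            result.append((nx, ny, nz))
    return result

print(neighbors(0, 0, 0))          # corner node: 4 neighbors (Z still wraps)
print(len(neighbors(13, 10, 12)))  # interior node: all 6 neighbors
```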

Page 7

Minimize communication interference

• Jobs occupy disjoint regions simultaneously
• Example – red, green, and blue jobs:

[Diagram: red, green, and blue jobs occupying disjoint regions of the 12,960-node compute mesh (X=27, Y=20, Z=24)]
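A sketch of the placement rule (an illustration that assumes jobs are allocated axis-aligned boxes of nodes, which the slide does not state; the box coordinates below are made up): two jobs can interfere only if their boxes overlap in every dimension, so the allocator keeps the boxes disjoint.

```python
# Illustrative sketch: two axis-aligned node boxes interfere only if they
# overlap in all three dimensions.
def overlaps(a, b):
    """a and b are ((xlo, xhi), (ylo, yhi), (zlo, zhi)) with inclusive bounds."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

jobs = {                       # made-up allocations within a 27 x 20 x 24 mesh
    "red":   ((0, 8),   (0, 19), (0, 23)),
    "green": ((9, 17),  (0, 19), (0, 23)),
    "blue":  ((18, 26), (0, 19), (0, 23)),
}

for n1 in jobs:
    for n2 in jobs:
        if n1 < n2:
            status = "OVERLAP" if overlaps(jobs[n1], jobs[n2]) else "disjoint"
            print(n1, n2, status)
```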

Page 8

Hardware Performance Characteristics that Lead to a Balanced System

• Network bandwidth
  must balance with
• Processor speed and operations per second
  must balance with
• Memory bandwidth and capacity
  must balance with
• File system I/O bytes per second
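A common way to check this balance (a standard rule of thumb, not a figure from the talk) is bytes per flop: divide each bandwidth by the peak floating-point rate it must feed. All numbers below are made-up placeholders, not a real machine.

```python
# Back-of-envelope balance check (all numbers are made-up placeholders).
# A common figure of merit is bytes moved per floating-point operation.
peak_flops_per_node = 100e9    # 100 GF/s peak per node (assumed)
mem_bw_per_node     = 25e9     # 25 GB/s memory bandwidth (assumed)
net_bw_per_node     = 5e9      # 5 GB/s network injection bandwidth (assumed)
nodes               = 10_000
fs_bw_aggregate     = 50e9     # 50 GB/s total file system bandwidth (assumed)

print("memory  bytes/flop:", mem_bw_per_node / peak_flops_per_node)             # 0.25
print("network bytes/flop:", net_bw_per_node / peak_flops_per_node)             # 0.05
print("file I/O bytes/flop:", fs_bw_aggregate / (nodes * peak_flops_per_node))  # 5e-05
```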

Page 9

In Addition to Balanced Hardware, System Software must be Scalable

Page 10

Scalable System Software Concept #1

Do things in a hierarchical fashion

Page 11

Job Launch is Hierarchical

[Diagram: the user logs in to a Linux login node and starts the application; the login node's allocator and job launch work with a job scheduler node (batch server, scheduler, batch mom, job queues) and a database node (CPU inventory database), then fan the application out across the compute nodes]
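A minimal sketch of the fan-out idea (an illustration, not the actual launch software; the fan-out factor k and the tree numbering are assumptions): rather than the login node contacting every compute node itself, each node forwards the launch request to k children, so reaching N nodes takes about log_k(N) forwarding rounds instead of N messages from one place.

```python
# Illustrative sketch: launch fans out through a k-ary tree, so reaching
# N compute nodes takes ~log_k(N) rounds, not N point-to-point messages.
def children(rank, k, n):
    """Ranks that `rank` forwards the launch request to in a k-ary tree of n nodes."""
    return [c for c in range(rank * k + 1, rank * k + k + 1) if c < n]

def rounds_to_reach_all(n, k):
    """Simulate the fan-out and count forwarding rounds until all n nodes are reached."""
    reached, frontier, rounds = {0}, [0], 0
    while len(reached) < n:
        frontier = [c for r in frontier for c in children(r, k, n)]
        reached.update(frontier)
        rounds += 1
    return rounds

print(children(0, 4, 12_960))          # [1, 2, 3, 4]
print(rounds_to_reach_all(12_960, 4))  # 7 rounds for a 12,960-node machine
```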

Page 12

System monitoring is hierarchical

[Diagram: RSMS Ethernet tree – system management workstations (SMW) at the root, fanning out to L1 (cabinet) and L0 (board) controllers; the nodes' HyperTransport (HT) links and high-speed network (HSN) are shown separately]

Page 13

Scalable System Software Concept #2

Minimize Compute Node Operating System Overhead

Page 14

Operating System Interruptions Impede Progress of the Application

[Chart: "Interruptions of User Applications" – interruption time in ns (0–350,000) versus wall time in seconds (0–6), with separate series for Linux and Catamount]

Page 15

System monitoring is out of band and non-invasive

[Diagram: the same RSMS Ethernet tree – SMW to L1 (cabinet) and L0 (board) controllers – running out of band from the nodes' HyperTransport (HT) links and high-speed network (HSN)]

Page 16

Scalable System Software Concept #3

Minimize Compute Node Interdependencies

Page 17

Calculating Weather Minute by Minute

[Timeline: Calc 1, Calc 2, Calc 3, and Calc 4 run back to back at the 0, 1, 2, and 3 minute marks, finishing at 4 minutes]

Page 18

Calculation with Breaks

• Calculation with Asynchronous Breaks

[Timeline: Calc 1, a wait, Calc 2, Calc 3, another wait, then Calc 4 – the same four calculations now finish at 6 minutes because of the asynchronous breaks]
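The effect compounds at scale: when every node must finish a step before the next step can start, any one node's pause stalls all of them. The simulation below uses made-up pause probabilities and durations (not data from the talk) just to show the average step time approaching the worst-case pause as node counts grow.

```python
# Illustrative simulation with made-up daemon timings: every timestep ends
# when the slowest node finishes, so uncoordinated pauses on many nodes
# stretch the average step toward the worst-case pause.
import random

def step_time(n_nodes, work=1.0, pause_prob=0.05, pause=0.5):
    """One synchronized timestep: everyone waits for the slowest node."""
    return max(work + (pause if random.random() < pause_prob else 0.0)
               for _ in range(n_nodes))

random.seed(0)
for n in (1, 100, 10_000):
    avg = sum(step_time(n) for _ in range(100)) / 100
    print(f"{n:6d} nodes: average step time {avg:.2f} (ideal 1.00)")
```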

Page 19

Run Time Impact of Linux System Services (aka Daemons)

• Say breaks take 50 µs and occur once per second
  – On one CPU, wasted time is 50 µs every second
    • Negligible 0.005% impact
  – On 100 CPUs, wasted time is 5 ms every second
    • Negligible 0.5% impact
  – On 10,000 CPUs, wasted time is 500 ms every second
    • Significant 50% impact
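The slide's arithmetic, spelled out: because the pauses are uncoordinated, a tightly coupled job effectively waits out every CPU's pause, so the waste grows linearly with the CPU count. The 50 µs pause and once-per-second rate are the slide's own assumed numbers.

```python
# The slide's arithmetic: uncoordinated 50 µs pauses, once per second per CPU.
# A tightly coupled job waits out each CPU's pause, so waste scales with CPUs.
pause_s = 50e-6        # 50 µs per interruption
rate_hz = 1            # one interruption per CPU per second

for cpus in (1, 100, 10_000):
    wasted_per_second = cpus * pause_s * rate_hz
    print(f"{cpus:6d} CPUs: {wasted_per_second * 1e3:7.3f} ms wasted per second "
          f"= {wasted_per_second * 100:.3f}% of the machine's time")
```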

Page 20

Scalable System Software Concept #4

Avoid linear scaling of buffer requirements

Page 21

Connection-oriented protocols have to reserve buffers for the worst case

• If each node reserves a 100 KB buffer for each of its 10,000 peers, that is 1 GB of memory per node.

• Need to communicate using collective algorithms
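Spelling out the slide's arithmetic (100 KB per peer is the slide's example figure): with a dedicated connection buffer per peer, per-node buffer memory grows linearly with the machine size.

```python
# The slide's arithmetic: a dedicated per-peer buffer makes per-node memory
# grow linearly with the machine size.
buffer_per_peer = 100 * 1000      # 100 KB per connection
for peers in (1_000, 10_000, 100_000):
    per_node = peers * buffer_per_peer
    print(f"{peers:7d} peers -> {per_node / 1e9:.1f} GB of connection buffers per node")
```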

Page 22

Scalable System Software Concept #5

Parallelize wherever possible

Page 23

Use parallel techniques for I/O

[Diagram: compute nodes reach I/O nodes over the high-speed network; the I/O nodes connect to parallel file system servers (190 + MDS) backed by RAIDs, plus 10.0 GigE servers (50) and login servers (10), over 10 Gbit and 1 Gbit Ethernet]

• 140 MB/s per FC X 2 X 190 = 53 GB/s

• 500 MB/s X 50 = 25 GB/s

• 1.0 GigE X 10

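Spelling out the aggregate-bandwidth bullets above (the per-link rates are the slide's figures): the file system reaches tens of GB/s by running many modest links in parallel rather than one fast one.

```python
# The slide's aggregate-bandwidth arithmetic: many modest links in parallel.
fc_per_link_mb_s    = 140    # per Fibre Channel link
fc_links_per_server = 2
fs_servers          = 190
print("file system :",
      fc_per_link_mb_s * fc_links_per_server * fs_servers / 1000, "GB/s")  # ~53 GB/s

gige_servers    = 50
per_server_mb_s = 500
print("10 GigE path:", gige_servers * per_server_mb_s / 1000, "GB/s")      # 25 GB/s
```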

Page 24

Summary of Principles

• Partition the hardware and software
• Hardware
  – For scalability and upgradability, use a mesh network topology
  – Determine the right balance of processor speed, memory bandwidth, network bandwidth, and I/O bandwidth for your applications
• System Software
  – Do things in a hierarchical fashion
  – Minimize compute node OS overhead
  – Minimize compute node interdependencies
  – Avoid linear scaling of buffer requirements
  – Parallelize wherever possible

Page 25

Lessons Learned

• Seek first to emulate
  – Learn from the past
  – Simulate the future
• Need technology philosophers
  – Tilt Meters
  – Historians
  – Even Tiger Woods has a coach
• The big bang only worked once
  – Deploy test platforms early and often
• Build de-scalable, scalable systems
  – Don’t forget that you have to get it running first!
  – Leave the support structures (even non-scalable development tools) in working condition; you’ll need to debug some day
• Only dead systems never change
  – Nobody ever built just one system, even when successfully deploying just one system
  – Nothing is ever done just once
• Build scaffolding that meets the structure
  – Is build and test infrastructure in place FIRST?
  – Will it effectively support both the team and the project?