Page 1

Principles of Scalable HPC System Design

March 6, 2012

Sue Kelly
Sandia National Laboratories

http://www.sandia.gov/~smkelly

Abstract: Sandia National Laboratories has a long history of successfully applying high performance computing (HPC) technology to solve scientific problems. We drew upon our experiences with numerous architectural and design features when planning our most recent computer systems. This talk will present the key issues that were considered. Important principles are performance balance between the hardware components and scalability of the system software. The talk will conclude with lessons learned from the system deployments.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Page 2

Outline

• A definition of HPC for scientific applications
• Design Principles

– Partition Model

– Network Topology

– Balance of Hardware Components

– Scalable System Software

• Lessons Learned

Page 3

What is High Performance Computing?

• (n.) A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors. (http://www.webopedia.com/TERM/H/High_Performance_Computing.html)

• Will not talk about embarrassingly parallel applications

• The idea/premise of scientific parallel processing is not new (http://www.sandia.gov/ASC/news/stories.html#nineteen-twenty-two)
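As a minimal illustration of the definition above (not from the talk), the sketch below divides one large sum into pieces that separate worker processes execute simultaneously; the array size and worker count are arbitrary assumptions.

```python
# Minimal sketch (not from the talk): split one big sum into pieces
# that separate worker processes execute simultaneously.
from multiprocessing import Pool

def partial_sum(chunk):
    """Each worker computes its piece of the global sum."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))   # the "problem" (arbitrary size)
    n_workers = 4                   # assumed worker count
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]
    chunks[-1].extend(data[n_workers * size:])   # remainder goes to the last piece

    with Pool(n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))   # combine the pieces
    assert total == sum(data)
    print(total)
```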

Page 4

The Partition Model: Match the hardware & software to its function

[Diagram: the system divided into partitions – Users /home, Service, Compute Partition, Parallel I/O, and Net I/O]

• Applies to both hardware and software
• Physically and logically divide the system into functional units
• Compute hardware has a different configuration than the service & I/O hardware
• Only run the software necessary to perform the function
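As a sketch of the idea (illustrative only; the partition names and software lists below are assumptions loosely based on later slides, not the deck's exact configuration), each partition gets only the hardware and software its function needs:

```python
# Illustrative sketch: each partition runs only the software its function needs.
partitions = {
    "compute": {"hardware": "many identical nodes, minimal peripherals",
                "software": ["lightweight kernel", "application runtime"]},
    "service": {"hardware": "a few fully featured login/admin nodes",
                "software": ["full Linux", "login shells", "compilers", "batch system"]},
    "io":      {"hardware": "nodes attached to storage",
                "software": ["full Linux", "parallel file system servers"]},
}

for name, cfg in partitions.items():
    print(f"{name:8s} -> {', '.join(cfg['software'])}")
```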

Page 5

Usage Model: Partitions cooperate to appear as one system

[Diagram: Linux login (service) node, compute resource, and I/O partitions presented to the user as one system]

Page 6

Mesh/Torus topologies are scalable

[Diagram: 12,960-node compute mesh with X=27, Y=20, Z=24 and a torus interconnect in Z, plus 310 service & I/O nodes]
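One reason a mesh/torus scales is that each node needs links only to its nearest neighbors, which are trivial to compute from its coordinates. The sketch below is an illustration only, not the machine's routing code; it uses the slide's dimensions and assumes wraparound in Z with a plain mesh in X and Y, matching the slide's labels.

```python
# Illustrative sketch: nearest neighbors of a node in a 27 x 20 x 24 system
# that is a mesh in X and Y (no wraparound) and a torus in Z (wraps around).
X, Y, Z = 27, 20, 24   # dimensions from the slide

def neighbors(x, y, z):
    """Coordinates of the directly connected neighbors of node (x, y, z)."""
    result = []
    for dx, dy, dz in [(1, 0, 0), (-1, 0, 0),
                       (0, 1, 0), (0, -1, 0),
                       (0, 0, 1), (0, 0, -1)]:
        nx, ny = x + dx, y + dy
        nz = (z + dz) % Z                    # Z wraps (torus)
        if 0 <= nx < X and 0 <= ny < Y:      # X and Y do not (mesh)
            result.append((nx, ny, nz))
    return result

print(neighbors(0, 0, 0))          # corner node: 4 neighbors (Z still wraps)
print(len(neighbors(13, 10, 12)))  # interior node: all 6 neighbors
```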

Page 7

Minimize communication interference

• Jobs occupy disjoint regions simultaneously
• Example – red, green, and blue jobs:

[Diagram: red, green, and blue jobs occupying disjoint regions of the 12,960-node compute mesh (X=27, Y=20, Z=24)]
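A sketch of the placement rule (an illustration that assumes jobs are allocated axis-aligned boxes of nodes, which the slide does not state; the box coordinates below are made up): two jobs can interfere only if their boxes overlap in every dimension, so the allocator keeps the boxes disjoint.

```python
# Illustrative sketch: two axis-aligned node boxes interfere only if they
# overlap in all three dimensions.
def overlaps(a, b):
    """a and b are ((xlo, xhi), (ylo, yhi), (zlo, zhi)) with inclusive bounds."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

jobs = {                       # made-up allocations within a 27 x 20 x 24 mesh
    "red":   ((0, 8),   (0, 19), (0, 23)),
    "green": ((9, 17),  (0, 19), (0, 23)),
    "blue":  ((18, 26), (0, 19), (0, 23)),
}

for n1 in jobs:
    for n2 in jobs:
        if n1 < n2:
            status = "OVERLAP" if overlaps(jobs[n1], jobs[n2]) else "disjoint"
            print(n1, n2, status)
```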

Page 8

Hardware Performance Characteristics that Lead to a Balanced System

• Network bandwidth
  must balance with
• Processor speed and operations per second
  must balance with
• Memory bandwidth and capacity
  must balance with
• File system I/O bytes per second
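A common way to check this balance (a standard rule of thumb, not a figure from the talk) is bytes per flop: divide each bandwidth by the peak floating-point rate it must feed. All numbers below are made-up placeholders, not a real machine.

```python
# Back-of-envelope balance check (all numbers are made-up placeholders).
# A common figure of merit is bytes moved per floating-point operation.
peak_flops_per_node = 100e9    # 100 GF/s peak per node (assumed)
mem_bw_per_node     = 25e9     # 25 GB/s memory bandwidth (assumed)
net_bw_per_node     = 5e9      # 5 GB/s network injection bandwidth (assumed)
nodes               = 10_000
fs_bw_aggregate     = 50e9     # 50 GB/s total file system bandwidth (assumed)

print("memory  bytes/flop:", mem_bw_per_node / peak_flops_per_node)             # 0.25
print("network bytes/flop:", net_bw_per_node / peak_flops_per_node)             # 0.05
print("file I/O bytes/flop:", fs_bw_aggregate / (nodes * peak_flops_per_node))  # 5e-05
```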

Page 9

In Addition to Balanced Hardware, System Software must be Scalable

Page 10

Scalable System Software Concept #1

Do things in a hierarchical fashion

Page 11

Job Launch is Hierarchical

[Diagram: the user logs in to a Linux login node and starts the application; the login node's allocator and job launch work with a job scheduler node (batch server, scheduler, batch mom, job queues) and a database node (CPU inventory database), then fan the application out across the compute nodes]
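A minimal sketch of the fan-out idea (an illustration, not the actual launch software; the fan-out factor k and the tree numbering are assumptions): rather than the login node contacting every compute node itself, each node forwards the launch request to k children, so reaching N nodes takes about log_k(N) forwarding rounds instead of N messages from one place.

```python
# Illustrative sketch: launch fans out through a k-ary tree, so reaching
# N compute nodes takes ~log_k(N) rounds, not N point-to-point messages.
def children(rank, k, n):
    """Ranks that `rank` forwards the launch request to in a k-ary tree of n nodes."""
    return [c for c in range(rank * k + 1, rank * k + k + 1) if c < n]

def rounds_to_reach_all(n, k):
    """Simulate the fan-out and count forwarding rounds until all n nodes are reached."""
    reached, frontier, rounds = {0}, [0], 0
    while len(reached) < n:
        frontier = [c for r in frontier for c in children(r, k, n)]
        reached.update(frontier)
        rounds += 1
    return rounds

print(children(0, 4, 12_960))          # [1, 2, 3, 4]
print(rounds_to_reach_all(12_960, 4))  # 7 rounds for a 12,960-node machine
```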

Page 12

System monitoring is hierarchical

[Diagram: RSMS Ethernet tree – system management workstations (SMW) at the root, fanning out to L1 (cabinet) and L0 (board) controllers; the nodes' HyperTransport (HT) links and high-speed network (HSN) are shown separately]

Page 13

Scalable System Software Concept #2

Minimize Compute Node Operating System Overhead

Page 14

Operating System Interruptions Impede Progress of the Application

[Chart: "Interruptions of User Applications" – interruption time in ns (0–350,000) versus wall time in seconds (0–6), with separate series for Linux and Catamount]

Page 15

System monitoring is out of band and non-invasive

[Diagram: the same RSMS Ethernet tree – SMW to L1 (cabinet) and L0 (board) controllers – running out of band from the nodes' HyperTransport (HT) links and high-speed network (HSN)]

Page 16

Scalable System Software Concept #3

Minimize Compute Node Interdependencies

Page 17

Calculating Weather Minute by Minute

[Timeline: Calc 1, Calc 2, Calc 3, and Calc 4 run back to back at the 0, 1, 2, and 3 minute marks, finishing at 4 minutes]

Page 18

Calculation with Breaks

• Calculation with Asynchronous Breaks

[Timeline: Calc 1, a wait, Calc 2, Calc 3, another wait, then Calc 4 – the same four calculations now finish at 6 minutes because of the asynchronous breaks]
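The effect compounds at scale: when every node must finish a step before the next step can start, any one node's pause stalls all of them. The simulation below uses made-up pause probabilities and durations (not data from the talk) just to show the average step time approaching the worst-case pause as node counts grow.

```python
# Illustrative simulation with made-up daemon timings: every timestep ends
# when the slowest node finishes, so uncoordinated pauses on many nodes
# stretch the average step toward the worst-case pause.
import random

def step_time(n_nodes, work=1.0, pause_prob=0.05, pause=0.5):
    """One synchronized timestep: everyone waits for the slowest node."""
    return max(work + (pause if random.random() < pause_prob else 0.0)
               for _ in range(n_nodes))

random.seed(0)
for n in (1, 100, 10_000):
    avg = sum(step_time(n) for _ in range(100)) / 100
    print(f"{n:6d} nodes: average step time {avg:.2f} (ideal 1.00)")
```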

Page 19

Run Time Impact of Linux System Services (aka Daemons)

• Say breaks take 50 µs and occur once per second
  – On one CPU, wasted time is 50 µs every second
    • Negligible 0.005% impact
  – On 100 CPUs, wasted time is 5 ms every second
    • Negligible 0.5% impact
  – On 10,000 CPUs, wasted time is 500 ms every second
    • Significant 50% impact
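The slide's arithmetic, spelled out: because the pauses are uncoordinated, a tightly coupled job effectively waits out every CPU's pause, so the waste grows linearly with the CPU count. The 50 µs pause and once-per-second rate are the slide's own assumed numbers.

```python
# The slide's arithmetic: uncoordinated 50 µs pauses, once per second per CPU.
# A tightly coupled job waits out each CPU's pause, so waste scales with CPUs.
pause_s = 50e-6        # 50 µs per interruption
rate_hz = 1            # one interruption per CPU per second

for cpus in (1, 100, 10_000):
    wasted_per_second = cpus * pause_s * rate_hz
    print(f"{cpus:6d} CPUs: {wasted_per_second * 1e3:7.3f} ms wasted per second "
          f"= {wasted_per_second * 100:.3f}% of the machine's time")
```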

Page 20

Scalable System Software Concept #4

Avoid linear scaling of buffer requirements

Page 21

Connection-oriented protocols have to reserve buffers for the worst case

• If each node reserves a 100 KB buffer for each of its 10,000 peers, that is 1 GB of memory per node.

• Need to communicate using collective algorithms
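Spelling out the slide's arithmetic (100 KB per peer is the slide's example figure): with a dedicated connection buffer per peer, per-node buffer memory grows linearly with the machine size.

```python
# The slide's arithmetic: a dedicated per-peer buffer makes per-node memory
# grow linearly with the machine size.
buffer_per_peer = 100 * 1000      # 100 KB per connection
for peers in (1_000, 10_000, 100_000):
    per_node = peers * buffer_per_peer
    print(f"{peers:7d} peers -> {per_node / 1e9:.1f} GB of connection buffers per node")
```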

Page 22

Scalable System Software Concept #5

Parallelize wherever possible

Page 23

Use parallel techniques for I/O

[Diagram: compute nodes reach I/O nodes over the high-speed network; the I/O nodes connect to parallel file system servers (190 + MDS) backed by RAIDs, plus 10.0 GigE servers (50) and login servers (10), over 10 Gbit and 1 Gbit Ethernet]

• 140 MB/s per FC X 2 X 190 = 53 GB/s

• 500 MB/s X 50 = 25 GB/s

• 1.0 GigE X 10

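Spelling out the aggregate-bandwidth bullets above (the per-link rates are the slide's figures): the file system reaches tens of GB/s by running many modest links in parallel rather than one fast one.

```python
# The slide's aggregate-bandwidth arithmetic: many modest links in parallel.
fc_per_link_mb_s    = 140    # per Fibre Channel link
fc_links_per_server = 2
fs_servers          = 190
print("file system :",
      fc_per_link_mb_s * fc_links_per_server * fs_servers / 1000, "GB/s")  # ~53 GB/s

gige_servers    = 50
per_server_mb_s = 500
print("10 GigE path:", gige_servers * per_server_mb_s / 1000, "GB/s")      # 25 GB/s
```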

Page 24

Summary of Principles

• Partition the hardware and software
• Hardware
  – For scalability and upgradability, use a mesh network topology
  – Determine the right balance of processor speed, memory bandwidth, network bandwidth, and I/O bandwidth for your applications
• System Software
  – Do things in a hierarchical fashion
  – Minimize compute node OS overhead
  – Minimize compute node interdependencies
  – Avoid linear scaling of buffer requirements
  – Parallelize wherever possible

Page 25

Lessons Learned

• Seek first to emulate
  – Learn from the past
  – Simulate the future
• Need technology philosophers
  – Tilt Meters
  – Historians
  – Even Tiger Woods has a coach
• The big bang only worked once
  – Deploy test platforms early and often
• Build de-scalable, scalable systems
  – Don’t forget that you have to get it running first!
  – Leave the support structures (even non-scalable development tools) in working condition; you’ll need to debug some day
• Only dead systems never change
  – Nobody ever built just one system, even when successfully deploying just one system
  – Nothing is ever done just once
• Build scaffolding that meets the structure
  – Is build and test infrastructure in place FIRST?
  – Will it effectively support both the team and the project?