
Page 1:

Clusters to Supercomputers

Schenk’s System Administration April 2008

Matthew Woitaszek, University of Colorado, Boulder

NCAR Computer Science Section

mattheww@ucar.edu

Presented for Chris Schenk’s CSCI 4113 Unix System Administration

Page 2:

We’re Hiring!

System Administrators

Software Developers

Web Technology Geeks: nobody does server-side refresh anymore

Job positions at NCAR and CU: full-time, part-time, and occasional

Page 3:

Outline

Motivation: My other computer is a …

Parallel Computing: Processors, Networks, Storage, Software

Grid Computing: Software, Platforms

Page 4:

James Demmel’s Reasons for HPC

Traditional scientific and engineering paradigm: do theory or paper design, then perform experiments or build the system

Replacing both with numerical experiments: real phenomena are too complicated to model by hand, and real experiments are:

too hard, e.g., build large wind tunnels
too expensive, e.g., build a throw-away passenger jet
too slow, e.g., wait for climate or galactic evolution
too dangerous, e.g., weapons, drug design

Page 5:

Time-Critical Simulations

NCAR’s time-critical HPC simulations: mesoscale meteorology, global climate

My favorite: Traffic simulations

Require more than a single processor to complete in a reasonable amount of time

Realtime Computing: model and simulate to predict

The forecast simulation has to be done before the weather happens.

Page 6:

Performance: Vector vs. Parallel MM5 (1999)

[Chart: non-realtime MM5 run time (sec) vs. number of processors (2-16) for the Cray J90, a Linux cluster (theHIVE), and the Cray T3E.]

J. Dorband, J. Kouatchou, J. Michalakes, and U. Ranawake, “Implementing MM5 on NASA Goddard Space Flight Center computing systems: a performance study”, 1999.

Page 7:

Performance: POP 640x768 (2003)

[Chart: POP 640x768 simulated years per wall clock day vs. number of processors (8-72) for Xeon 2.4 Dual / Dolphin, Xeon 2.4 Single / Dolphin, Xeon 2.4 Single / Myrinet D, Xeon 2.8 Dual / Myrinet D, Xeon 3.06 Dual / Infiniband, Opteron 2.0 / Myrinet D (A), Opteron 2.0 / Myrinet D (B), and the IBM p690.]

POP on Xeon: memory bandwidth limit

M. Woitaszek, M. Oberg, and H. M. Tufo, “Comparing Linux Clusters for the Community Climate Systems Model”, 2003.

Page 8:

Traffic Flow Simulation: 15.5 miles, 17 exit ramps, 19 entry ramps, t = 0.5 s, d = 100 ft, T = 24 hr

Minneapolis I-494 Highway Simulation (1997)

Realtime Computing by Parallelization

Page 9:

Performance: Realtime Computing by Parallelization

Legacy Code and Hardware: Intel P133 single processor, 65.7 minutes (3942 seconds) simulation time

Massively Parallel Implementation: Cray T3E, 450 MHz Alpha 21164
67.04 seconds with 1 PE (60x faster than the P133)
6.26 seconds with 16 PEs (629x)
2.39 seconds with 256 PEs (1649x)

Wait! If it takes ~1 minute on 1 PE, shouldn’t it take… 67 seconds / 16 ≈ 4.2 s on 16 PEs, or 67 seconds / 256 ≈ 0.26 s on 256 PEs?

C. Johnston and A. Chronopoulos, “The parallelization of a highway traffic flow simulation”, 1999.

Page 10:

Speedup and Overhead

Amdahl’s Law The part you don’t optimize comes back to haunt you!

Speedup is limited by: memory latency, disk I/O bottlenecks, network bandwidth and latency, and the algorithm itself
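As a rough back-of-the-envelope check (my own sketch using the Cray T3E timings from the traffic simulation slide, not a figure from the paper), Amdahl’s Law models the speedup on N PEs as

S(N) = 1 / ((1 - p) + p/N)

where p is the parallelizable fraction of the work. Fitting the measured S(16) = 67.04 / 6.26 ≈ 10.7 gives p ≈ 0.967, so roughly 3% of the run is serial or overhead. That fraction alone caps the speedup at about 1 / 0.033 ≈ 30x, and it predicts S(256) ≈ 27x, close to the measured 67.04 / 2.39 ≈ 28x.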

Page 11:

Performance: HOMME on BlueGene/L (2007)

G. Bhanot, J.M. Dennis, J. Edwards, W. Grabowski, M. Gupta, K. Jordan, R.D. Loft, J. Sexton, A. St-Cyr, S.J. Thomas, H.M. Tufo, T. Voran, R. Walkup, and A.A. Wyszogrodski, “Early Experiences with the 360TF IBM BlueGene/L Platform,” International Journal of Computational Methods, 2006.

[Chart: HOMME sustained MFLOP/sec per processor (0-300) on 4096, 8192, 16384, and 32768 processors, comparing coprocessor mode with snake 2x2 mapping, virtual-node mode with snake 2x2 mapping, and virtual-node mode with grouped mapping (eager limit = 500 in all cases).]

Page 12:

Performance: HOMME on BlueGene/L (2007)

[Diagram: processor placement in three dimensions, with X, Y, and Z coordinates running 0-7.]

Page 13:

Outline

Motivation: My other computer is a …

Parallel Computing: Processors, Networks, Storage, Software

Grid Computing: Software, Platforms

Page 14:

Processors

From Gregory Pfister’s In Search of Clusters:

A pack of pooches.

A savage multi-headed pooch (SMP).

A pooch.

Source: Gregory Pfister’s In Search of Clusters

Page 15:

Parallel Architectures

A processor.

A symmetric multiprocessor (SMP).

A pack of processors.

[Diagrams: boxes of CPU, cache ($), and RAM illustrating each architecture.]

Page 16:

Parallel Architectures

A cache-coherent non-uniform [distributed shared] memory (ccNUMA) cluster of chip-multiprocessor (CMP) symmetric multiprocessors (SMP).

[Diagram: two nodes, each with two CPUs (private L1 caches, shared Ln cache) and local RAM, linked into one shared-memory system.]

Page 17:

Scalable Parallel Architectures

Emerging massively parallel architectures: IBM BlueGene – 65,536 chips, two processors each (131,072 processors)

Multi-core commodity architectures: AMD Opteron, now Intel

Two CPUs per chip

Two chips per card

32 chips per node card

32 node cards per rack

64 racks in system

Source: IBM

Page 18:

Networks

Network types: message passing (MPI), file system, job control, system monitoring

Technologies and competitors: 1 Gbps Ethernet and RDMA, 10 Gbps Ethernet, fixed topology (3D torus, tree, Scali, etc.), switched (Infiniband, Myrinet)

Page 19:

Gigabit Ethernet Performance (2006)

RDMA has highest throughput (switched configuration): 110 MB/s RDMA, 66 MB/s legacy, 45 MB/s motherboard

[Chart: throughput (MB/s, 0-120) vs. payload size (1 byte to 4 MB):
(110.21 MB/s) Ammasso RDMA Crossover
(110.07 MB/s) Ammasso RDMA Switch
(66.79 MB/s) Ammasso Legacy Crossover
(66.40 MB/s) Ammasso Legacy Switch
(50.39 MB/s) Intel Crossover
(45.07 MB/s) Intel Switch]

M. Oberg, H. M. Tufo, T. Voran, and M. Woitaszek, “Evaluation of RDMA Over Ethernet Technology for Building Cost Effective Linux Clusters”, May 2006.

Page 20:

RDMA for High-Performance Applications

Single network interface for all communications: RDMA for MPI, DAPL (Direct Access Programming Library), and Sockets Direct Protocol (SDP)

RDMA bypasses the operating system kernel

Legacy interface for standard operating system TCP/IP

[Diagram: two hosts; each user-space application talks to the RDMA NIC directly, bypassing the OS kernel.]

Zero-copy, interrupt-free RDMA for MPI applications

Page 21:

Interconnect Performance (2006)

Interconnect   Minimum Latency   Peak Bandwidth
GigE           30 - 60 us        125 MB/s
10GigE         30 - 60 us        1250 MB/s
Myrinet        4 us              250 MB/s
SCI            4 us              250 MB/s
Atoll          4 us              250 MB/s
Infiniband     5 us              1250 MB/s

(Sources noted on the slide: Atoll benchmarking results; manufacturer's ratings. For reference, 1.5 Mbps = 192 KB/s and 1 Gbps = 125 MB/s.)

Page 22:

10Gbps Ethernet Performance (2007)

Ethernet approaches 10 Gbps (and can be trunked!)
Infiniband (4x) reported at 8 Gbps sustainable

Page 23:

Storage

[Diagram: storage architecture:
Archival storage: tape silo systems (3+ PB)
Supercomputers and local working storage (1 – 100 TB per system)
Grid gateway: GridFTP servers
Archive management and disk cache controller
Shared storage cluster with shared file system (100 – 500 TB)
Visualization systems and local working storage]

Page 24:

Thousands of Disks

Page 25:

Occam - Average Aggregate NFS Bandwidth by Clients

[Chart: average aggregate bandwidth (MB/s, 0-35) vs. concurrent clients (0-30), read and write; annotated "Just you" vs. "SHARED with others".]

The Single Server Limitation

Aggregate bandwidth decreases with increasing concurrent use!

Page 26:

Cluster File Systems – Read Rate (2005)

[Chart: average aggregate read rate (MB/s, 0-120) vs. number of concurrent clients (0-8) for NFS, PVFS2, PVFS2-1Gbps, Lustre, Lustre-1Gbps, TerraFS, and GPFS.]

Page 27:

Cluster File Systems – Write Rate (2005)

[Chart: average aggregate write rate (MB/s, 0-120) vs. number of concurrent clients (0-8) for the same file systems.]

Page 28:

Table of Administrator Pain and Agony

(Kernel and OS requirements by node role, for GPFS 2.3 / Lustre 1.4.0 / PVFS2 / TerraFS)

Intel x86-64 metadata server: Not Used / Restricted SLES .141 / No Change / Not Used
Intel Xeon storage server: Restricted SLES .111 / Restricted SLES .141 / No Change / No Change
PPC970 client: Restricted SLES .111 / Restricted SLES .141 / All (module only) / N/A
Intel Xeon client: Restricted SLES .111 / Restricted SLES .141 / All (module only) / Custom 2.4.26 patch

Original goal was to fit the file system into the environment
File system influences the operating system stack

GPFS required a commercial OS and a specific kernel version
Lustre required a commercial OS and a specific kernel patch
TerraFS required a custom kernel

Page 29:

Bullet Points of Administrator Pain and Agony

Remain responsive even in failure conditions
Filesystem failure should not interrupt standard UNIX commands used by administrators
"ls -la /mnt" or "df" should not hang the console
Zombies should respond to "kill -s 9"

Support clean normal and abnormal termination
Support both service start and shutdown commands

Provide an Emergency Stop feature
Cut losses and let the administrators fix things
Never hang the Linux "reboot" command

Page 30:

Block-Based Access is Complicated!

(Adapted from May, 2001, p. 79)

[Diagram: one file shared for writes on four nodes (cyclic mapping), contrasting the logical access size with the filesystem block size, and the logical file view with the physical block placement on two servers.]

Consider the overhead of correlating blocks to servers:
(Example: Where is the first byte of the red data stored?)
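A generic sketch of that bookkeeping (my own illustration, not the exact layout in the original figure): with a stripe unit of B bytes distributed round-robin over S servers, a logical byte offset x falls in logical block b = floor(x / B), which lives on server b mod S as that server's local block floor(b / S), at offset x mod B within it. Any client request that is not aligned to B must be split across servers using exactly this arithmetic.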

Page 31:

Blue Gene/L Single-Partition Performance (2008)


[Chart: average throughput (MB/s, 0-140) vs. clients on a single I/O node (0-30), comparing GPFS, Lustre, and PVFS on Xyratex storage (2, 4, and 6 hosts) and DDN (4 hosts), plus GPFS on FAStT-900 and FAStT-500 and Terascala (10).]

Page 32:

Blue Gene/L Storage Performance (2008)


[Chart: average throughput (MB/s, 0-1400) vs. number of clients (0-1000) for Xyratex (2, 4, and 6 hosts), DDN (4 hosts), FAStT-900, and FAStT-500.]

Page 33:

Software

Parallel Execution: MPI

Job Control: batch queues (PBS, Torque/Maui)

Libraries: optimized math routines (BLAS, LAPACK); an example compile line follows below
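For instance, a typical build on such a cluster combines the MPI compiler wrapper with the optimized math libraries (a hedged sketch; the source file name and library flags are illustrative and depend on the local toolchain):

  mpicc -O2 model.c -o model -llapack -lblas    # the MPI wrapper supplies MPI include and link paths
  mpirun -np 4 ./model                          # run inside a batch job, not on the head node (next slides)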

The next slides show what we tell the users…

Page 34:

The Batch Queue System

Batch queues control access to compute nodes
Please don’t ssh to a node and run programs
Please don’t mpirun on the head node itself
People expect to have the whole node for performance runs!

Resource management
Flags and disables offline nodes (down or administrative)
Matches job requests to nodes
Reserves nodes, preventing oversubscribing

Scheduling
Queue prioritization spreads CPUs among users
Queue limits prevent a single user from hogging the cluster
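As a concrete illustration, here is a minimal Torque/PBS job script of the kind users hand to the batch system (a hedged sketch: the job name, queue, node counts, and program name are made up, and mpirun options vary by MPI implementation):

  #!/bin/sh
  #PBS -N my_model             # job name shown in qstat output
  #PBS -q friendlyq            # queue to submit to (see the queue list on the next slide)
  #PBS -l nodes=4:ppn=2        # request 4 nodes with 2 processors per node
  #PBS -l walltime=01:00:00    # wall clock limit; the scheduler ends the job after this
  #PBS -j oe                   # merge stderr into stdout

  cd $PBS_O_WORKDIR            # start in the directory the job was submitted from
  mpirun -np 8 -machinefile $PBS_NODEFILE ./my_model

Submitting it with "qsub job.pbs" hands the script to the server, which queues it until the requested nodes are free.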

Page 35:

PBS Queues on CSC Systems

speedq: debugging queue for HPSC students; limited to 8 nodes, 10 minutes

friendlyq: default queue for “friendly” jobs; limited to 16 nodes, 24 hours

workq: queue for large and long-running jobs; no resource limit, only 1 running job

reservedq: queue for users with special projects approved by the people in charge

Page 36:

PBS Commands – Queue Status

[matthew@hemisphere]$ qstat -a

hemisphere.cs.colorado.edu:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
320102.hemisphe jkihm    friendly WL_SIMP_17   7032   1   1 1024mb 24:00 R 09:52
320103.hemisphe jkihm    friendly WL_SIMP_18   7078   1   1 1024mb 24:00 R 09:52
320355.hemisphe jkihm    friendly WL_SIMP_17   4537   1   1 1024mb 24:00 R 08:18
320388.hemisphe jkihm    friendly WL_SIMP_25     --   1   1 1024mb 24:00 Q    --
320389.hemisphe jkihm    friendly WL_SIMP_25     --   1   1 1024mb 24:00 Q    --
320390.hemisphe jkihm    friendly WL_SIMP_30     --   1   1 1024mb 24:00 Q    --
321397.hemisphe barcelos workq    missile     21769  16  32     -- 01:12 R 00:04

What jobs are running? What jobs are waiting?
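A few other everyday commands round out the picture (an illustrative sketch; output formats and the exact job ID syntax vary between PBS/Torque versions):

  qsub job.pbs               # submit a job script; prints the new job ID
  qstat -u jkihm             # show only one user's jobs
  qdel 320390.hemisphere     # remove one of your own queued or running jobs
  pbsnodes -l                # list nodes currently marked down or offline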

Page 37:

Playing Nicely in the Cluster Sandbox

Security considerations
Don’t share your account or your files (o+rw)
Don’t put the current directory (.) in your path

Compute time considerations
Don’t submit more than 10-60 jobs to PBS at a time
Don’t submit from a shell script without a `sleep 1’ statement (a sketch follows after this list)

Storage space considerations
Keep large input and output sets in /quicksand, not /home
Don’t keep large files around forever: compress or delete
Please store your personal media collections elsewhere

Don’t use a password you have ever used anywhere else!
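The polite bulk-submission loop looks something like this (my own sketch; the job script names are hypothetical):

  for script in run_*.pbs; do
      qsub "$script"    # one queued job per parameter file
      sleep 1           # brief pause so the PBS server is not hammered with back-to-back requests
  done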

Page 38:

Outline

Motivation: My other computer is a …

Parallel Computing: Processors, Networks, Storage, Software

Grid Computing: Software, Platforms

Page 39:

Sharing Computing and Data with Grids

Grids link computers together… more than a network!
Networks connect computers
Grids allow distant computers to work on a single problem

Services look like web servers
HTTP for data transfer
XML Simple Object Access Protocol (SOAP) instead of HTML

Grid services
Metadata and Discovery Services (WS MDS)
Job execution (WS GRAM)
Data transfer (GridFTP); a transfer sketch follows below
Workflow management (that’s what we do!)
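For a feel of the client side, a GridFTP transfer with the Globus Toolkit command-line tools looks roughly like this (a hedged sketch; the host name and paths are hypothetical, and flags differ between toolkit versions):

  grid-proxy-init                                   # create a short-lived GSI proxy credential
  globus-url-copy -vb \
      gsiftp://gridftp.example.edu/data/run42.nc \
      file:///home/matthew/run42.nc                 # pull a remote file to local disk over GridFTP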

Page 40:

Grid-BGC Carbon Cycle Model

[Diagram: Grid-BGC architecture components: web portal GUI, project database, Grid-BGC web service (project object manager, job execution interface), web service client, workflow management service with job database and resource broker, WS-GRAM, GridFTP, Globus RFT, scratch space, executables, NCAR mass storage, and a Grid Security Infrastructure boundary.]

J. Cope, C. Hartsough, S. McCreary, P. Thornton, H. M. Tufo, N. Wilhelmi, and M. Woitaszek, “Experiences from Simulating the Global Carbon Cycle in a Grid Computing Environment”, 2005.

Page 41:

TeraGrid – Extensible Terascale Facility

Page 42:

A National Research Priority

2000  $36   Terascale Computing System        PSC
2001  $45   Distributed Terascale Facility    NCSA, SDSC, ANL, CalTech
2002  $35   Extensible Terascale Facility     PSC
2003  $150  TeraGrid Extension ($10M) + Ops   IU, Purdue, ORNL, TACC
2007  $65   Track 2 Mid-Range HPC             ORNL, TACC, NCAR
2007  $208  Track 1 “Blue Waters” Petascale   UIUC / NCSA

All figures are in millions.

http://www.nsf.gov/news/news_summ.jsp?cntn_id=109850
http://www.nsf.gov/news/news_summ.jsp?cntn_id=106875

Page 43:

A Few TeraGrid Resources


RP Name            Vendor     CPUs    CPU Type  TFLOP/s
TACC Ranger        Sun        62,976  Opteron   504
ORNL NICS Kraken   Cray XT4   7,488   Opteron   170
NCSA Abe           Dell       9,600   Intel 64  89
TACC Lonestar      Dell       5,840   Intel 64  62
PSC BigBen         Cray XT4   4,180   Opteron   21
NCSA Cobalt        SGI Altix  1,024   Itanium   6
NCAR Frost         IBM BG/L   2,048   PPC 440   5

Page 44:

Challenges and Definitions

Power consumption: BlueVista 276 kilowatts; average U.S. home 10.5 kilowatts

Physical space

What’s the difference between a cluster and a supercomputer?
Price
Number of SMP processors in a compute node
Network used to connect nodes in the cluster

Page 45:

Page 46:


Cluster Administration

Parallel and distributed shells: pdsh, dsh

$ sudo pdsh -w node0[01-27] /etc/init.d/sshd restart
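Two more illustrative one-liners in the same spirit (assuming the pdsh companion tool dshbak is installed; the node names follow the example above):

  $ pdsh -w node0[01-27] uptime | dshbak -c          # dshbak -c folds identical output from many nodes together
  $ pdsh -w node0[01-27] rpm -q openssh-server       # confirm every node runs the same package version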

Configuration file management: IBM CSM, xCAT

Automated operating system installation

Page 47:

Cluster Security

The most important question: How do you know if you’ve been compromised?

Centralized, inaccessible logging (a remote-logging sketch follows below)

Intrusion detection
Custom scripts
Network monitoring is difficult at 10 Gbps

Desperate measures
Extreme firewalling (but don’t depend on it!)
Virtual hosting for services
One-time passwords (RSA SecurID, CryptoCard)
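A minimal version of centralized, inaccessible logging on a classic syslogd (an illustrative sketch; the log host name is hypothetical, and rsyslog or syslog-ng use slightly different syntax):

  # on every compute and head node: forward everything to the central log host
  echo '*.*   @loghost.example.edu' >> /etc/syslog.conf
  /etc/init.d/syslog restart
  # loghost itself accepts syslog traffic only from the cluster and allows no user logins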

Page 48:

Questions?

Matthew Woitaszek

mattheww@ucar.edu

Thanks to my CU and NCAR colleagues:

Jason Cope, John Dennis, Bobby House,

Rory Kelly, Dustin Leverman, Paul Marshal,

Michael Oberg, Henry Tufo, and Theron Voran
