Jacquard: Architecture and Application Performance Overview NERSC Users’ Group October 2005


Outline

An engineering-level overview of the HW and SW that make up jacquard.

1) CPUs

2) Memory

3) OS

4) Interconnect

Will use seaborg as a point of reference.

seaborg.nersc.gov (review?)

[Diagram labels: Colony Switch, GPFS]

Resource        Speed (latency)   Size
Registers       3 ns              256 B
L1 Cache        5 ns              32 KB
L2 Cache        45 ns             8 MB
Main Memory     300 ns            16 GB
Remote Memory   19 us             7 TB
GPFS            10 ms             50 TB
HPSS            5 s               9 PB

• 6080 dedicated CPUs, 96 shared login CPUs
• Hierarchy of caching, speeds
• Bottleneck determined by first depleted resource

[Diagram, Seaborg: 380 x 16 way SMP NHII Node; labels: crossbar, main memory, GPFS, MPI, CSS0, CSS1, HPSS]

jacquard.nersc.gov basics

[Diagram labels: InfiniBand Switch, GPFS]

Resource        Speed (latency)   Size
Registers       0.5 ns            2 KB
L1 Cache        1.5 ns            64 KB
L2 Cache        45 ns             1 MB
Main Memory     70-117 ns         6 GB
Remote Memory   5 us              2 TB
GPFS            10 ms             15 TB
HPSS            5 s               9 PB
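
One way to read the jacquard table: a working set runs at the speed of the first level large enough to hold it. The sketch below encodes the rows above and looks up that level. The function name `first_level_that_fits` and the use of binary units (1 KB = 1024 bytes) are assumptions made for illustration, and the latency strings are copied from the table, not measured.

```python
# Jacquard memory hierarchy, copied from the table above.
# Assumption: KB/MB/GB/TB/PB are binary multiples (1 KB = 1024 bytes).
KB, MB, GB, TB, PB = (1024 ** i for i in range(1, 6))

# (resource, latency as listed on the slide, capacity in bytes),
# ordered from smallest and fastest to largest and slowest.
JACQUARD_LEVELS = [
    ("Registers", "0.5 ns", 2 * KB),
    ("L1 Cache", "1.5 ns", 64 * KB),
    ("L2 Cache", "45 ns", 1 * MB),
    ("Main Memory", "70-117 ns", 6 * GB),
    ("Remote Memory", "5 us", 2 * TB),
    ("GPFS", "10 ms", 15 * TB),
    ("HPSS", "5 s", 9 * PB),
]


def first_level_that_fits(working_set_bytes, levels=JACQUARD_LEVELS):
    """Return (resource, latency) of the first level that can hold the working set."""
    for name, latency, capacity in levels:
        if working_set_bytes <= capacity:
            return name, latency
    raise ValueError("working set exceeds every level in the table")


if __name__ == "__main__":
    # Hypothetical working-set sizes, chosen only to exercise the lookup.
    for size in (16 * KB, 100 * KB, 512 * MB):
        print(size, first_level_that_fits(size))
```

This ignores sharing and associativity effects; it only illustrates the "first depleted resource" idea with the slide's numbers.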

• 640 dedicated CPUs, 8 shared login CPUs
• Smaller caches, HT, Really Fast
• SMP? NUMA? SUMO.

[Diagram, Jacquard: 320 x 2 way Opteron node; labels: HT, Main Memory, GPFS, MPI, IB, HPSS]

Opteron Block Diagram : Not strictly SMP

1 TLB per CPU: 1K entries x 4K pages = 4 MB coverage

[Diagram labels: SDRAM, SDRAM, Switch, I/O]
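
The TLB coverage figure is just entries times page size. A minimal sketch of that arithmetic follows; the function names `tlb_reach_bytes` and `pages_touched` are hypothetical, and the inputs are the 1K entries and 4K pages quoted on the slide.

```python
def tlb_reach_bytes(entries, page_bytes):
    """Memory that can be addressed without a TLB miss: one page per entry."""
    return entries * page_bytes


def pages_touched(array_bytes, page_bytes):
    """Number of pages a contiguous array spans (ceiling division)."""
    return -(-array_bytes // page_bytes)


if __name__ == "__main__":
    entries = 1024          # 1K entries, from the slide
    page_bytes = 4 * 1024   # 4K pages, from the slide
    reach = tlb_reach_bytes(entries, page_bytes)
    print("TLB reach:", reach // (1024 * 1024), "MB")

    # Hypothetical 64 MB array: it spans more pages than the TLB has entries,
    # so a full sweep over it cannot stay TLB-resident.
    print("pages in a 64 MB array:", pages_touched(64 * 1024 * 1024, page_bytes))
```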

HyperTransport: Good Stuff

Little conflict between data movement and computation

SMP size and memory contention

Jacquard’s numbers: 1 task: 100%, 2 tasks: 98%

Why is Jacquard 2 way SMP?

Flops @ 2.2 GHz

• Peak Theoretical Flops
  – Double (64 bit) floats: 1 add + 1 mult = 2.2 GFlop/s
  – Single (32 bit) floats: 2 add + 2 mult = 4.4 GFlop/s

• Peak Realized Flops
  – Double (64 bit) floats: 1.9 GFlop/s
  – Single (32 bit) floats: 3.4 GFlop/s

• Your Flops?
  – Walltime is more important than flops
  – For a known algorithm flops are a sanity check
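
As an illustration of the sanity check, the sketch below turns a known operation count and a measured walltime into a flop rate and compares it with the realized double-precision peak quoted above. The dense matrix multiply, its conventional 2n^3 flop count, the walltime, and the function names are all assumptions chosen for the example, not measurements from jacquard.

```python
# Realized double-precision peak from the slide, in GFlop/s.
REALIZED_PEAK_DOUBLE_GFLOPS = 1.9


def matmul_flops(n):
    """Conventional flop count for a dense n x n matrix multiply (2 * n^3)."""
    return 2 * n ** 3


def achieved_gflops(flop_count, wall_seconds):
    """Flop rate in GFlop/s for a known operation count and measured walltime."""
    return flop_count / wall_seconds / 1e9


if __name__ == "__main__":
    n = 1000               # hypothetical problem size
    wall_seconds = 1.25    # hypothetical measured walltime
    rate = achieved_gflops(matmul_flops(n), wall_seconds)
    fraction = rate / REALIZED_PEAK_DOUBLE_GFLOPS
    print(f"{rate:.2f} GFlop/s, {fraction:.0%} of realized peak")
```

A rate far below the realized peak suggests looking at memory traffic or communication first; a rate above it suggests the operation count or the timer is wrong.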

Memory BW: 4 GB/sec per CPU

MPI Bandwidth: seaborg

MPI Bandwidth: Jacquard

Linux for AIX Users

Linux and AIX are more similar than different

• Linux is not as good as AIX at keeping processes scheduled on the same CPU (processor affinity work).

• Linux has easy interfaces to architectural and process performance information: /proc/cpuinfo, /proc/self, etc.

• AIX MPI is in /usr/{bin,lib}, Linux MPI is in modules

• Linux doesn’t need -bmaxdata!

• Little vs. Big Endian
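
The byte-order point matters most for raw binary files: seaborg's POWER CPUs are big-endian and jacquard's Opterons are little-endian, so data written on one reads as garbage on the other unless the byte order is stated explicitly. A minimal sketch using Python's standard `struct` module follows; the helper name `read_doubles` and the sample values are hypothetical.

```python
import struct
import sys


def read_doubles(raw, count, byte_order):
    """Decode `count` 64-bit floats; byte_order is '>' (big) or '<' (little)."""
    return struct.unpack(f"{byte_order}{count}d", raw)


if __name__ == "__main__":
    print("native byte order on this host:", sys.byteorder)

    # Two doubles laid out as a big-endian machine would write them.
    big_endian_bytes = struct.pack(">2d", 1.0, 2.5)

    # Decoding with the matching byte order recovers the values.
    print(read_doubles(big_endian_bytes, 2, ">"))

    # Decoding the same bytes as little-endian yields unrelated numbers.
    print(read_doubles(big_endian_bytes, 2, "<"))
```

Stating the byte order in the format string, instead of relying on the native order, keeps the same reader correct on both machines.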

Conclusions

• The underlying HW technologies (HT, IB, etc.) are quite promising. Opteron systems are delivering great price/performance.

• Still working on some SDRAM, OS, and SW issues.

• What’s useful to you? Let us know.
