HPC Hardware Overview
Luke Wilson September 13, 2011
Texas Advanced Computing Center
The University of Texas at Austin
Outline
• Some general comments
• Lonestar System
  – Dell blade-based system
  – InfiniBand (QDR)
  – Intel Processors
• Ranger System
  – Sun blade-based system
  – InfiniBand (SDR)
  – AMD Processors
• Longhorn System
  – Dell blade-based system
  – InfiniBand (QDR)
  – Intel Processors
About this Talk
• We will focus on TACC systems, but much of the information applies to HPC systems in general
• As an applications programmer you may not care about hardware details, but…
  – We need to think about it because of performance issues
  – I will try to give you pointers to the most relevant architecture characteristics
• Do not hesitate to ask questions as we go
High Performance Computing
• In our context, it refers to hardware and software tools dedicated to computationally intensive tasks
• Distinction between HPC center (throughput focused) and Data center (data focused) is becoming fuzzy
• High bandwidth, low latency
  – Memory
  – Network
  – (a simple transfer-cost model is sketched below)
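A rough first-order model for the time to move n bytes (over the network or out of memory) makes the two terms explicit:

  T(n) ≈ α + n / β    (α = latency, β = bandwidth)

With Ranger-like numbers quoted later in this talk (α ≈ 2.3 µs, β ≈ 1 GB/s point-to-point):
  8-byte message:  T ≈ 2.3 µs + 0.008 µs ≈ 2.3 µs  (latency-dominated)
  8 MB message:    T ≈ 2.3 µs + 8000 µs  ≈ 8 ms    (bandwidth-dominated)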
Lonestar: Intel hexa-core system
Lonestar: Introduction
• Lonestar Cluster
  – Configuration & Diagram
  – Server Blades
• Dell PowerEdge M610 Blade (Intel Hexa-Core) Server Nodes
• Microprocessor Architecture Features
  – Instruction Pipeline
  – Speeds and Feeds
  – Block Diagram
• Node Interconnect
  – Hierarchy
  – InfiniBand Switch and Adapters
  – Performance
Lonestar Cluster Overview
lonestar.tacc.utexas.edu
Lonestar Cluster Overview
Hardware         | Components                  | Characteristics
Peak Performance |                             | 302 TFLOPS
Nodes            | 2× Hexa-Core Xeon 5680      | 1,888 nodes / 22,656 cores
Memory           | 1333 MHz DDR3 DIMMs         | 24 GB/node, 45 TB total
Shared Disk      | Lustre parallel file system | 1 PB
Local Disk       | SATA                        | 146 GB/node, 276 TB total
Interconnect     | InfiniBand                  | 4 GB/s P-2-P
Blade : Rack : System
• 1 node: 2 × 6 cores = 12 cores
• 1 chassis: 16 nodes = 192 cores
• 1 rack: 3 chassis × 16 nodes = 576 cores
• 39⅓ racks: 22,656 cores
Lonestar login nodes
• Dell PowerEdge M610
– Dual-socket Intel hexa-core Xeon, 3.33 GHz
– 24 GB DDR3 1333 MHz DIMMs
– Intel 5520 chipset (QPI)
• Dell PowerVault 1200
– 15 TB $HOME disk
– 1 GB user quota (5×)
Lonestar compute nodes
• 16 Blades / 10U chassis
• Dell PowerEdge M610
  – Dual-socket Intel hexa-core Xeon
    • 3.33 GHz (1.25×)
    • 13.3 GFLOPS/core (1.25×)
    • 64 KB L1 cache per core (32 KB instruction + 32 KB data)
    • 12 MB shared L3 cache per socket
  – 24 GB DDR3 1333 MHz DIMMs
  – Intel 5520 chipset
  – 2× QPI at 6.4 GT/s
  – 146 GB 10k RPM SAS-SATA local disk (/tmp)
  – (the node and system peak are worked out below)
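For reference, the quoted per-core figure corresponds to 4 floating-point operations per clock, and the system peak in the overview table follows directly:

  3.33 GHz × 4 FLOPS/cycle          = 13.3 GFLOPS/core
  13.3 GFLOPS/core × 12 cores/node  ≈ 160 GFLOPS/node
  160 GFLOPS/node × 1,888 nodes     ≈ 302 TFLOPS system peak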
Motherboard
[Block diagram: two hexa-core Xeons (12 CPU cores total) connected by QPI, each with DDR3 1333 memory channels; QPI also links the sockets to the I/O Hub (IOH, PCIe), which connects via DMI to the I/O Controller Hub (ICH)]
Intel Xeon 5600 (Westmere)
• 32 KB L1 cache/core
• 256 KB L2 cache/core
• Shared 12 MB L3 cache
Cluster Interconnect
16 independent 4 GB/s connections/chassis
Lonestar Parallel File Systems: Lustre & NFS
[Diagram: compute blades reach the file systems over InfiniBand (Lustre uses IP over IB) and Ethernet]
• $HOME: 15 TB, 1 GB/user quota
• $WORK: ?TB
• $SCRATCH: 841 TB
Ranger: AMD Quad-core system
Ranger: Introduction
• Unique instrument for computational scientific research
• Housed at TACC’s new machine room
• Over 2 ½ years of initial planning and deployment efforts
• Funded by the National Science Foundation as part of a unique program to reinvigorate High Performance Computing in the United States (Office of Cyberinfrastructure)
ranger.tacc.utexas.edu
How Much Did it Cost and Who’s Involved?
• TACC selected for very first NSF ‘Track2’ HPC system
– $30M system acquisition
– Sun Microsystems is the vendor
  – Very large InfiniBand installation
    • ~4,100 endpoint hosts
    • >1,350 MT47396 switches
• TACC, ICES, Cornell Theory Center, and Arizona State HPCI are teamed to operate/support the system for 4 years ($29M)
Ranger: Performance
• Ranger debuted at #4 on the Top 500 list (ranked #11 as of June 2010)
Ranger Cluster Overview
Hardware         | Components                               | Characteristics
Peak Performance |                                          | 579 TFLOPS
Nodes            | 4× Quad-Core Opteron                     | 3,936 nodes / 62,976 cores
Memory           | 667 MHz DDR2 DIMMs, 1 GHz HyperTransport | 32 GB/node, 123 TB total
Shared Disk      | Lustre parallel file system              | 1.7 PB
Local Disk       | NONE (flash-based)                       | 8 Gb
Interconnect     | InfiniBand (generation 2)                | 1 GB/s P-2-P, 2.3 µs latency
Ranger Hardware Summary
• Compute power - 579 Teraflops
  – 3,936 Sun four-socket blades
  – 15,744 AMD “Barcelona” processors
    • Quad-core, four flops/cycle (dual pipelines)
• Memory - 123 Terabytes
  – 2 GB/core, 32 GB/node
  – ~20 GB/sec memory bandwidth per node (667 MHz DDR2)
• Disk subsystem - 1.7 Petabytes
  – 72 Sun x4500 “Thumper” I/O servers, 24 TB each
  – 40 GB/sec total aggregate I/O bandwidth
  – 1 PB raw capacity in largest file system
• Interconnect - 10 Gbps, 1.6–2.9 µs latency
  – Sun InfiniBand-based switches (2), up to 3,456 4x ports each
  – Full non-blocking 7-stage Clos fabric
  – Mellanox ConnectX InfiniBand (second generation)
Ranger Hardware Summary (cont.)
• 25 Management servers - Sun 4-socket x4600s
  – 4 Login servers, quad-core processors
– 1 Rocks master, contains software stack for nodes
– 2 SGE servers, primary batch server and backup
– 2 Sun Connection Management servers, monitors hardware
– 2 InfiniBand Subnet Managers, primary and backup
– 6 Lustre Meta-Data Servers, enabled with failover
– 4 Archive data-movers, move data to tape library
– 4 GridFTP servers, external multi-stream transfer
• Ethernet Networking - 10 Gbps connectivity
  – Two external 10GigE networks: TeraGrid, NLR
– 10GigE fabric for login, data-mover and GridFTP nodes, integrated into existing TACC network infrastructure
– Force10 S2410P and E1200 switches
InfiniBand Cabling in Ranger
• Sun switch design with reduced cable count is manageable, but still a challenge to cable
  – 1,312 InfiniBand 12x-to-12x cables
  – 78 InfiniBand 12x-to-three-4x splitter cables
  – Cable lengths range from 7–16 m, average 11 m
• 9.3 miles of InfiniBand cable total (15.4 km)
Space, Power and Cooling
• System power: 3.0 MW total
• System: 2.4 MW
  – ~90 racks, in a 6-row arrangement
  – ~100 in-row cooling units
  – ~4,000 ft² total footprint
• Cooling: ~0.6 MW
  – In-row units fed by three 400-ton chillers
  – Enclosed hot aisles
  – Supplemental 280 tons of cooling from CRAC units
• Observations:
  – Space is less of an issue than power
  – Cooling above 25 kW per rack is difficult
  – Power distribution is a challenge: more than 1,200 circuits
External Power and Cooling Infrastructure
Switches in Place
InfiniBand Cabling in Progress
Ranger Features
• AMD Processors:
  – HPC features: 4 FLOPS/CP
  – 4 sockets on a board
  – 4 cores per socket
  – HyperTransport (direct connect between sockets)
  – 2.3 GHz cores
  – Any idea what the peak floating-point performance of a node is?
    • 2.3 GHz × 4 FLOPS/CP × 16 cores = 147.2 GFLOPS peak per node (× 3,936 nodes ≈ 579 TFLOPS for the system)
  – Any idea how much an application can sustain?
    • Can sustain over 80% of peak with DGEMM (matrix-matrix multiply)
• NUMA node architecture (16 cores per node; think hybrid — see the MPI+OpenMP sketch below)
• 2-tier InfiniBand (NEM – “Magnum”) switch system
• Multiple Lustre (parallel) file systems
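Because each Ranger node is a 16-core NUMA machine, a common approach is to run one MPI process per node (or per socket) and use OpenMP threads for the cores within it. The sketch below is a minimal, generic illustration of that hybrid style; it is not taken from any TACC-provided code, and the actual process/thread placement would be controlled through the batch system and MPI launcher.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Minimal hybrid MPI + OpenMP sketch: one MPI process per node (or socket),
 * OpenMP threads filling the remaining cores of that NUMA node. */
int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Ask for a thread-tolerant MPI; FUNNELED is enough because only the
     * master thread makes MPI calls in this sketch. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;

    /* Threads share the node's memory; each works on part of the local data. */
    #pragma omp parallel reduction(+:local_sum)
    {
        int tid = omp_get_thread_num();
        local_sum += (double)tid;   /* stand-in for real per-thread work */
    }

    /* MPI handles communication between nodes. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, sum = %g\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}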
Ranger Architecture
[Diagram: login nodes (X4600) reachable from the internet over GigE; 82 racks of “C48” blade chassis holding the compute nodes (4 sockets × 4 cores each); two Magnum InfiniBand switches, 3,456 IB ports each, with every 12x line splitting into three 4x lines and a bisection bandwidth of 110 Tbps; I/O nodes for the WORK file system: 72 Thumper (X4500) servers, 24 TB each, plus an X4600 metadata server (1 per file system)]
Ranger InfiniBand Topology
[Diagram: NEM switches in the compute chassis (…78…) connect to a “Magnum” switch over 12x InfiniBand, with three 4x cables combined into each 12x link]
MPI Tests: P2P Bandwidth
[Plot: point-to-point bandwidth (MB/s, 0–1200) vs. message size (1 byte to ~100 MB) for Ranger (OFED 1.2, MVAPICH 0.9.9) and Lonestar (OFED 1.1, MVAPICH 0.9.8); effective bandwidth is improved at smaller message sizes]
Point-to-Point MPI Measured Performance
• Shelf latencies: ~1.6 µs
• Rack latencies: ~2.0 µs
• Peak bandwidth: ~965 MB/s
• Able to sustain ~73% bisection bandwidth efficiency with all 3,936 nodes communicating simultaneously (82 racks)
• The latency and peak-bandwidth numbers are the kind of thing a simple ping-pong test measures (a minimal sketch follows)
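A generic ping-pong sketch between two ranks (not the actual benchmark used for these plots): half the round-trip time of a tiny message approximates latency, and message size divided by the one-way time approximates bandwidth for large messages.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal MPI ping-pong between ranks 0 and 1. */
int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;     /* message size: 1 MB */
    const int reps   = 100;
    int rank;
    char *buf = malloc(nbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t = (MPI_Wtime() - t0) / reps / 2.0;   /* one-way time */
    if (rank == 0)
        printf("one-way time %.2f us, bandwidth %.1f MB/s\n",
               t * 1e6, nbytes / t / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}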
Ranger: Bisection BW Across 2 Magnums
[Plots: measured vs. ideal bisection bandwidth (GB/s, 0–2000) as a function of the number of Ranger compute racks (0–100), and full-bisection-bandwidth efficiency (0–120%) for 1, 2, 4, 8, 16, 32, 64, and 82 racks]
Sun Motherboard for AMD Barcelona Chips (Compute Blade)
• 4 sockets, 4 cores per socket, 8 memory slots per socket
• 1 GHz HyperTransport between sockets, in a maximum-neighbor NUMA configuration for 3-port HyperTransport; HyperTransport is 6.4 GB/s bidirectional, 3.2 GB/s unidirectional
• Dual-channel, 533 MHz registered ECC memory (8.3 GB/s per socket)
• Two PCIe x8 (32 Gbps) and one PCIe x4 (16 Gbps), connected through a passive midplane to the NEM switch
Intel/AMD Dual- to Quad-core Evolution
[Diagram: AMD dual-core Opteron evolving to the quad-core Barcelona/Phenom, and Intel Woodcrest evolving to Clovertown, with the Intel parts attached to an external memory controller hub (MCH)]

Caches in Quad Core CPUs
[Diagram: Intel quad-core with each L2 shared by a pair of cores vs. AMD quad-core with a private L2 per core plus a shared L3]
• Intel quad-core: L2 caches are not independent
• AMD quad-core: L2 caches are independent
• HT link bandwidth is 24 GB/s (8 GB/s per link)
• Memory controller bandwidth up to 10.7 GB/s (667 MHz); see the check below
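The 10.7 GB/s figure is simply the theoretical peak of a dual-channel DDR2-667 memory controller:

  667 MT/s × 8 bytes/transfer = 5.3 GB/s per channel
  5.3 GB/s × 2 channels       ≈ 10.7 GB/s per memory controller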
Cache Sizes in AMD Barcelona
• 64 KB L1 cache per core
• 512 KB L2 cache per core
• 2 MB L3 cache, shared by all four cores
• Cores connect through the System Request Interface and a crossbar switch to the memory controller and three HyperTransport links (HT 0, HT 1, HT 2)
Other Important Features
• AMD Quad-core (K10, code name Barcelona)
• Instruction fetch bandwidth now 32 bytes/cycle
• 2MB L3 cache on-die; 4 x 512KB L2 caches; 64KB L1 Instruction & Data caches.
• SSE units are now 128 bits wide -> single-cycle throughput; improved ALU and FPU throughput (a small SIMD example follows this list)
• Larger branch prediction tables, higher accuracies
• Dedicated stack engine to pull stack-related ESP updates out of the instruction stream
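To illustrate what the 128-bit SSE units operate on, here is a tiny hand-written SSE2 DAXPY-style loop (two doubles per instruction). This is only an illustrative sketch; in practice the compilers on these systems will usually generate equivalent packed SSE code from the plain C loop on their own.

#include <emmintrin.h>   /* SSE2 intrinsics */

/* y[i] = a*x[i] + y[i], processing two doubles per 128-bit SSE register.
 * Assumes n is even; a production version would handle the remainder. */
void daxpy_sse(int n, double a, const double *x, double *y)
{
    __m128d va = _mm_set1_pd(a);             /* (a, a) */
    for (int i = 0; i < n; i += 2) {
        __m128d vx = _mm_loadu_pd(&x[i]);    /* load two x values */
        __m128d vy = _mm_loadu_pd(&y[i]);    /* load two y values */
        vy = _mm_add_pd(_mm_mul_pd(va, vx), vy);
        _mm_storeu_pd(&y[i], vy);            /* store two results */
    }
}

/* Equivalent scalar loop, which the compiler can typically vectorize itself:
 *   for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
 */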
AMD 10h Processor
Speeds and Feeds
• On die: registers, 64 KB L1 data cache, 512 KB L2, 2 MB L3; external memory behind the controller
• Latency: L1 ≈ 3 CP, L2 ≈ 15 CP, L3 ≈ 25 CP, external memory ≈ 300 CP
• Load speed: 4 W/CP (L1 to registers), 2 W/CP (L2 to L1), 0.5 W/CP (memory to L2)
• Store speed: 2 W/CP (registers to L1), 1 W/CP (L1 to L2), 0.5 W/CP (L2 to memory)
• External memory: 2 channels of DDR2 DIMMs @ 533 MHz
• W: FP word (64 bit); CP: clock period
• Cache line size (L1/L2) is 8 W (see the access-pattern example below)
• 4 FLOPS / CP
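One practical consequence of these numbers (a cache line holds 8 words, and a miss costs hundreds of cycles) is that unit-stride access is much cheaper than strided access. The loops below are a generic illustration, not tied to any particular TACC code: per element summed, the strided version brings in a full cache line and uses only one word of it.

/* Unit-stride vs. strided access.
 * With 64-byte (8-word) cache lines, the unit-stride loop uses every
 * word of each line it loads; the stride-8 loop loads a full line
 * for every single element it touches. */
double sum_unit_stride(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)        /* one new cache line per 8 iterations */
        s += a[i];
    return s;
}

double sum_stride8(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i += 8)     /* one new cache line per iteration */
        s += a[i];
    return s;
}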
Cache states: MOESI (Modified, Owned, Exclusive, Shared, Invalid)
• MOESI is beneficial when the latency/bandwidth between CPUs is significantly better than that to main memory, because a dirty (“Owned”) line can be supplied directly from another CPU's cache instead of going back through memory
Ranger Disk Subsystem - Lustre
• Disk system (OSS) is based on Sun x4500 “Thumper” servers
  – Each server has 48 SATA II 500 GB drives (24 TB total), running internal software RAID
  – Dual-socket/dual-core Opterons @ 2.6 GHz
  – 72 servers total: 1.7 PB raw storage (that’s 288 cores just to drive the file systems)
• Metadata Servers (MDS) based on SunFire x4600s
• MDS is Fibre-channel connected to 9 TB Flexline storage
• Target performance
  – Aggregate bandwidth: 40 GB/sec
Ranger Parallel File Systems: Lustre
[Diagram: compute blades (12 blades/chassis) reach the Lustre file systems through the InfiniBand switch, using IP over IB]
• $HOME: 96 TB, 6 GB/user quota (36 OSTs on 6 Thumpers)
• $WORK: 193 TB (72 OSTs on 12 Thumpers)
• $SCRATCH: 773 TB (300 OSTs on 50 Thumpers)
I/O with Lustre over Native InfiniBand
• Max. total aggregate performance of 38 or 46 GB/sec depending on stripe count (design target = 32 GB/sec)
• External users have reported performance of ~35 GB/sec with a 4K-core application run
• Stripe count is set per file or directory; a minimal MPI-IO sketch with striping hints follows
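For completeness, here is one generic way an application can request a particular Lustre stripe count when creating a file through MPI-IO. The "striping_factor" and "striping_unit" hint names are standard ROMIO hints, but whether they take effect depends on the MPI library and file-system driver in use; the file name and values below are purely illustrative (stripe settings can also be applied from the shell with the lfs utility).

#include <mpi.h>

/* Create a file with MPI-IO, asking the library (via ROMIO hints) to
 * stripe it across 4 OSTs with a 4 MB stripe size. Values are examples only. */
void write_striped(const double *buf, int count)
{
    MPI_Info info;
    MPI_File fh;
    int rank;
    char fname[] = "output.dat";   /* illustrative file name */

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "4");        /* stripe count */
    MPI_Info_set(info, "striping_unit", "4194304");    /* stripe size in bytes */

    MPI_File_open(MPI_COMM_WORLD, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes its own contiguous block of the file. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, (void *)buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
}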
$SCRATCH File System Throughput
[Plot: write speed (GB/s, 0–60) vs. number of writing clients (1–10,000) for stripecount=1 and stripecount=4]

$SCRATCH Single Application Performance
[Plot: write speed (GB/s, 0–60) vs. number of writing clients (1–10,000) within a single application, for stripecount=1 and stripecount=4]
Longhorn: Intel Quad-core system
Longhorn: Introduction
• First NSF eXtreme Digital Visualization grant (XD Vis)
• Designed for scientific visualization and data analysis
– Very large memory per computational core
– Two NVIDIA Graphics cards per node
– Rendering performance 154 billion triangles/second
Longhorn Cluster Overview
Hardware                    | Components                    | Characteristics
Peak Performance (CPUs)     | 512 Intel Xeon E5540          | 20.7 TFLOPS
Peak Performance (GPUs, SP) | 128 NVIDIA Quadroplex 2200 S4 | 500 TFLOPS
System Memory               | DDR3 DIMMs                    | 48 GB/node (240 nodes), 144 GB/node (16 nodes), 13.5 TB total
Graphics Memory             | 512 FX5800 × 4 GB             | 2 TB
Disk                        | Lustre parallel file system   | 210 TB
Interconnect                | QDR InfiniBand                | 4 GB/s P-2-P
Longhorn “fat” nodes
• Dell R710
  – Dual-socket quad-core Intel Xeon E5540 @ 2.53 GHz
  – 144 GB DDR3 (18 GB/core)
  – Intel 5520 chipset
• NVIDIA Quadroplex 2200 S4
– 4 NVIDIA Quadro FX5800
• 240 CUDA cores
• 4 GB Memory
• 102 GB/s Memory bandwidth
Longhorn standard nodes
• Dell R610
  – Dual-socket quad-core Intel Xeon E5540 @ 2.53 GHz
  – 48 GB DDR3 (6 GB/core)
  – Intel 5520 chipset
• NVIDIA Quadroplex 2200 S4
– 4 NVIDIA Quadro FX5800
• 240 CUDA cores
• 4 GB Memory
• 102 GB/s Memory bandwidth
Motherboard (R610/R710)
[Block diagram: two quad-core Intel Xeon 5500s, each with its own DDR3 memory channels, linked to each other and to the Intel 5520 chipset by QPI]
• 42 lanes of PCI Express, or 36 lanes of PCI Express 2.0
• 12 DIMM slots (R610), 18 DIMM slots (R710)
Cache Sizes in Intel Nehalem
• 32 KB L1 cache per core
• 256 KB L2 cache per core
• 8 MB L3 cache, shared by all four cores
• On-die memory controller plus two QPI links (QPI 0 and QPI 1)
• Total QPI bandwidth up to 25.6 GB/s (@ 3.2 GHz); see the worked numbers below
• 2 20-lane QPI links
• 4 quadrants (10 lanes each)
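The 25.6 GB/s figure follows from the link clock: QPI is double-pumped and carries 2 bytes of payload per transfer in each direction of a link.

  3.2 GHz clock × 2 transfers/clock = 6.4 GT/s
  6.4 GT/s × 2 bytes/transfer       = 12.8 GB/s per direction
  12.8 GB/s × 2 directions          = 25.6 GB/s per QPI link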
Nehalem μArchitecture
Storage
• Ranch – Long term tape storage
• Corral – 1 PB of spinning disk
Ranch Archival System
• Sun StorageTek silo
• 10,000 tapes
• 10 PB capacity
• Used for long-term storage
Corral
• 1.2 PB DataDirect Networks online disk storage
• 8 Dell 1950 and 8 Dell 2950 servers
• High-performance parallel file system
  – Multiple databases
  – iRODS data management
  – Replication to tape archive
• Multiple levels of access control
• Web and other data access available globally
References
Lonestar Related References
User Guide
services.tacc.utexas.edu/index.php/lonestar-user-guide
Developers
www.tomshardware.com
www.intel.com/p/en_US/products/server/processor/xeon5000/technical-documents
Ranger Related References
User Guide
services.tacc.utexas.edu/index.php/ranger-user-guide
Forums
forums.amd.com/devforum
www.pgroup.com/userforum/index.php
Developers
developer.amd.com/home.jsp
developer.amd.com/rec_reading.jsp
Longhorn Related References
User Guide
services.tacc.utexas.edu/index.php/longhorn-user-guide
General Information
www.intel.com/technology/architecture-silicon/next-gen/
www.nvidia.com/object/cuda_what_is.html
Developers
developer.intel.com/design/pentium4/manuals/index2.htm
developer.nvidia.com/forums/index.php