SGI's Platform Strategy: Addressing the Productivity Gap in HPC
Dave Parry
Senior Vice President and General Manager, Server and Platform Group
Silicon Graphics, Inc.
Changing Economics in HPC
IT Costs Now in People and Software
[Chart: cost over time, 1990 to 2005 (2002 marked), across the Vector, RISC, and Commodity eras. Curves: Software; IT & Engineering Personnel ~ $50/Hr; Basic Hardware ~ $1/Hr]
Datasets are Getting (Much) Bigger, Too
[Chart: Worldwide Production of Information, in exabytes (axis 40 to 200), 1998 to 2006. Source: Gartner Group]
[Chart: Satellite Systems Archive Growth 2000 - 2014, in terabytes (1,000 TB = 1 petabyte; axis 0 to 14,000). Series: GOES, NEXRAD, DMSP, EOS, METOP, NPP, NPOESS. Source: NOAA]
Programming Is Getting Harder (AKA The Folly of “Least Common Denominator Computing”)
To copy/transfer the value stored in “b” to “a”, where C1 holds the mem. space for “a” and C2 holds the mem. space for “b”...
OpenMP™
. a = b
SHMEM or MPI2 (one-sided)
. C1 "get(b)" i.e. i=shmem_int_get(b) i=MPI_get(b)
MPI (two-sided)
. C2 "recv"; to wait for C1’s request
. C1 "send"; to ask C2 for “b”
. C2 finds that “b” is needed by C1
. C2 does a local "get(b)"
. C2 does a "send(b)"
. C1 does a "recv(b)" i.e. a=MPI_recv(b)
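The step counts on this slide can be made concrete with a small single-process simulation. This is only a sketch: it does not call OpenMP, SHMEM, or MPI, the function names are hypothetical, and plain dictionaries and queues stand in for the memory of C1 and C2 and for the messages between them.

```python
from collections import deque


def shared_memory_copy(mem):
    """Shared-memory style (as with OpenMP): one address space, one assignment."""
    mem["a"] = mem["b"]
    return ["a = b"]


def one_sided_copy(c1_mem, c2_mem):
    """One-sided style: C1 reads b from C2's memory; C2's program does nothing."""
    c1_mem["a"] = c2_mem["b"]
    return ["C1 get(b)"]


def two_sided_copy(c1_mem, c2_mem):
    """Two-sided style: each transfer needs a matching send and recv."""
    to_c2 = deque()  # messages travelling C1 -> C2
    to_c1 = deque()  # messages travelling C2 -> C1
    steps = ["C2 recv: wait for C1's request"]

    to_c2.append("b")
    steps.append("C1 send: ask C2 for b")

    wanted = to_c2.popleft()
    steps.append("C2 finds that b is needed by C1")

    value = c2_mem[wanted]
    steps.append("C2 local get(b)")

    to_c1.append(value)
    steps.append("C2 send(b)")

    c1_mem["a"] = to_c1.popleft()
    steps.append("C1 recv(b)")
    return steps


if __name__ == "__main__":
    runs = [
        ("shared memory", shared_memory_copy({"a": 0, "b": 42})),
        ("one-sided", one_sided_copy({"a": 0}, {"b": 42})),
        ("two-sided", two_sided_copy({"a": 0}, {"b": 42})),
    ]
    for name, steps in runs:
        print(f"{name}: {len(steps)} step(s)")
```

All three functions leave the same value in “a”; what differs is how many actions the programmer has to write and coordinate.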
Memory Is Getting “Slower”
[Chart: remote latency in processor cycles (axis 0 to 400) versus system size, from 2p up to 1024 processors. Series: Origin® 2000 195 MHz, Origin® 3000 400 MHz, Origin 3000 800 MHz]
Summing up the Productivity Picture
Productivity = cost^{-1} * value * efficiency * usability
Where:
cost^{-1} == MFLOPS/dollar (Moore’s Law)
value == hardware cost/cost of ownership
efficiency == productive cycles/MFLOPS (Constant at 5–10%)
usability == programming effort per productive cycle
[Chart: Moore’s Law (MFLOPs per acquisition dollar) and Productivity (productive science per total dollar), with the Productivity Gap marked between the two curves]
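As a rough illustration of the product above, the sketch below multiplies the four factors for two made-up scenarios. Every number is hypothetical and chosen only to show the mechanism: if MFLOPS per dollar doubles while hardware falls to half its former share of cost of ownership, the product does not move.

```python
import math


def productivity(mflops_per_dollar, value, efficiency, usability):
    """Product of the four factors named on the slide (units left abstract)."""
    return mflops_per_dollar * value * efficiency * usability


# Hypothetical inputs, not figures from the presentation.
earlier = productivity(mflops_per_dollar=100.0, value=0.50, efficiency=0.10, usability=1.0)
later = productivity(mflops_per_dollar=200.0, value=0.25, efficiency=0.10, usability=1.0)

if __name__ == "__main__":
    print("earlier:", earlier)
    print("later:  ", later)
    print("unchanged:", math.isclose(earlier, later))
```

Under these assumed inputs the hardware term improves by 2x, yet productivity stays flat because the other factors did not improve with it.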
Technology Directions to Close the Gap
A Data-Centric View of Each Aspect of HPC
Data Access
Focus: One shared view of the data with pervasive access
Computation
Focus: One shared view of the dataset with pervasive access
Visualization
Focus: One shared view of the visual model with pervasive access
Image courtesy of Janssen Pharmaceuticals
We Want to Work Differently
[Diagram labels: Visual Data (x2), HPC/Capability, HPC/Capacity, Grid Infrastructure]
A Different View of System Architecture
Scalable Shared Memory
. Globally addressable
. Thousands of ports
. Flat & high bandwidth
. Flexible & configurable
Terascale to Petascale Data Set: Bring Function to Data
[Diagram labels: Compute (x3), IO (x2), Graphics]
Changing Economics in HPC
Challenges:
• “Impedance match” to HPC applications
• Availability of HPC-class architectures
Use an HPC Processor for HPC Applications
[Chart: SPECint® and SPECfp® scores (axis 0 to 2200) for AMD Athlon 3.2 GHz, AMD Opteron 1.8 GHz, Intel® Pentium® 4 3.2 GHz, and Intel® Itanium® 2 1.5 GHz]
Use an HPC Processor for HPC Applications
SPECfp_rate_base2000 (axis 0 to 40), bar labels in the order extracted:
Opteron 1.4 GHz, ifc, 32-bit: 9.72 (1 CPU), 16.1 (2 CPUs)
Opteron 1.4 GHz, pgf77, 64-bit: 7.98 (1 CPU), 13.2 (2 CPUs)
Altix 1.3 GHz, efc, 64-bit: 20.5 (1 CPU), 36 (2 CPUs)
• 2.2-2.7x advantage for Altix on 2P.
• Best Opteron result was run with single user mode and interleaved memory banks.
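The "2.2-2.7x advantage" quoted for two processors can be checked against the bar labels on this slide. The sketch below assumes that 16.1 and 13.2 are the two Opteron 2-CPU results and that 36 is the Altix 2-CPU result; that pairing is inferred from the chart labels rather than stated outright in the slide.

```python
# Assumed 2-CPU SPECfp_rate_base2000 bar labels (pairing inferred, see above).
altix_2p = 36.0
opteron_2p = [16.1, 13.2]

# Ratio of the Altix result to each Opteron result, rounded to one decimal.
ratios = [round(altix_2p / score, 1) for score in opteron_2p]

if __name__ == "__main__":
    print(ratios)
```

Under that assumption the two ratios round to 2.2 and 2.7, which matches the range in the bullet.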
Combine Your HPC Processor with an HPC Architecture
[Roadmap chart, CY2001 to CY2004 by half-year: Max SMP System Size, Max Kernel Image or Partition Size, and Max NUMAlinked System Size, at 64p, 128p, 256p, 512p, and 1024p+. Products shown: 2p SGI 750, Altix-Itanium2, Altix-Madison, Altix-Madison9M]
Combine Your HPC Processor with an HPC Architecture
Linpack HPC (NxN) Performance
[Chart: Gflops (axis 0 to 600) at 32, 64, and 128 CPUs for HP Int. Superdome™, 1.5 GHz; IBM p690, 1.3 GHz; IBM® eServer™ p690, 1.7 GHz; SGI® Altix™ 3000, 1.3 GHz; SGI® Altix™ 3000, 1.5 GHz. Bar labels as extracted: 95.3, 143.3, 149.3, 171.9, 335.4, 191.4, 276.5, 336, 378.2, 553.3]
• Altix, 1.3 GHz is 1.46x faster than IBM eServer p690, 1.3 GHz at 128P
• Altix, 1.5 GHz is 16% faster than p690, 1.7 GHz in spite of a lower peak flop rate.
Source: http://www.netlib.org/benchmark/performance.ps, July 24, 2003 and SGI performance reports
Combine Your HPC Processor with an HPC Architecture
SPECfp_rate_base2000 Performance
[Chart: SPECfp_rate_base2000 (axis 0 to 1400) at 8, 32, and 64 CPUs for Sun Fire 15K, 1.2 GHz; HP AlphaServer GS1280, 1.15 GHz; IBM® p690/655 (1.7/1.5 GHz); SGI® Altix™ 3000 (1.3 GHz); SGI® Altix™ 3000 (1.5 GHz). Bar labels as extracted: 102, 89, 141, 164, 281, 405, 322, 541, 644, 539, 1053, 1250]
• World-record result for 64 and 32-processor systems
• SGI’s 1.5 GHz, 32P result is 2x better performance than IBM eServer p690, 1.7 GHz
• SGI’s 1.3 GHz, 64P result is 1.95x better than Sun Fire 15K, 1.2 GHz.
New Paradigms (usability)
[Diagram labels, grouped as far as the extraction allows]
Properties: Single Physical Mem.; Single O.S.; Cache Coherent; Single Address Space; Single Admin. View
Bus/Switch: IBM® 32P, Compaq 32P, Sun™ 64P, HP™ 64P
SGI® NUMA: SGI® Origin 3000 SSI 512–1024P
Cluster Tools: IBM >32P, Compaq >32P, Sun >64P, HP >64P
Cluster D.I.Y.: PCs connected
Paradigms: Run OpenMP™ codes; Run MPI codes; New_1? (App Level); New_2? (App Level)
Global Shared Memory between Supercluster Nodes
[Diagram: two 64P Partitions, each with its own Operating System and built from C-Bricks, R-Bricks, Power Bays, and IX-Bricks. Software labels: OpenMP™ app (CPU_SETS); MPI/shmem app; System layer; MPI/shmem app (Parallel Scheduler, Array Services)]
STREAM Triad Results: SSI and SuperCluster Configs
[Chart: GB/sec (axis 0 to 275) at 16, 32, 64, and 128* CPUs for HP Superdome™; HP AlphaServer™ GS1280; IBM® eServer™ p690, 1.7 GHz; SGI® Altix™ 3000, 1.3 GHz. Bar labels as extracted: 7, 101, 32, 14, 41, 64, 27, 127, 255]
• The 64P result is a world-record result for a microprocessor-based system and fifth overall
• 1.56x better performance than IBM eServer p690 at 32P
* 128 CPU result uses MPI code to run on Altix Supercluster with two 64P nodes; for smaller CPU counts OpenMP code was used.
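For readers unfamiliar with the benchmark, the Triad kernel computes a[i] = b[i] + q*c[i] and reports bandwidth by counting three 8-byte words per element (two reads and one write). The sketch below reproduces only that arithmetic in pure Python; the function names are hypothetical, and because Python lists are not contiguous arrays of doubles, the GB/s figure it prints is not comparable to the results on this slide.

```python
import time


def triad(b, c, scalar):
    """Triad kernel: a[i] = b[i] + scalar * c[i]."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def triad_bandwidth(n, scalar=3.0):
    """Run the kernel once and return (result, bytes counted, GB/s)."""
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = triad(b, c, scalar)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    # STREAM convention: read b, read c, write a, 8 bytes each per element.
    bytes_counted = 3 * 8 * n
    return a, bytes_counted, bytes_counted / elapsed / 1e9


if __name__ == "__main__":
    _, counted, gb_per_s = triad_bandwidth(1_000_000)
    print(f"{counted} bytes counted, {gb_per_s:.3f} GB/s (pure Python, illustrative only)")
```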
A Path to Architectural Convergence
Defense and Homeland Security, Media, Manufacturing, Science, Energy
[Diagram labels: Origin, Altix, Application Specific Compute, Multi-Paradigm Architecture]
A Different View of System Architecture
Scalable Shared Memory
. Globally addressable
. Thousands of ports
. Flat & high bandwidth
. Flexible & configurable
Terascale to Petascale Data Set: Bring Function to Data
[Diagram labels: Compute (x3), IO (x2), Graphics, Reconfigurable (x2)]
Multi-Paradigm Computing
UltraViolet
Scalable Shared Memory
. Globally addressable
. Thousands of ports
. Flat & high bandwidth
. Flexible & configurable
Terascale to Petascale Data Set: Bring Function to Data
[Diagram labels: Scalar (x2), Vector (x2), Streaming (x2), IO (x2), Graphics, Reconfigurable]
Innovation workflow means data must be shared
[Workflow diagram labels: Imagine, Design, Compute, Post-process, Visualize, Decide, with Data shared among them]
Adapting to the way people work:
From the original concept to the final result, data is at the core of the workflow
Information is shared between groups, and data is moved between hosts
Data sets grow at each step
Processes improve when data copies are avoided, shortening time to insight
SGI in Data Management: Integrated HW / SW Solutions
SGI® Data Management
Legato, XFS™ / XVM, Snapshot
FailSafe™, Fail Over
Data Migration Facility / Tape Management Facility
SGI® File Server
Scalable Bandwidth
Storage Management, SAN Topology, SAN Cluster Management, TP900, TP9100, TP9500, HDS 9960, Ciprico 7000 and TALON™, Brocade, STK, ADIC, SGI Firmware
DAS Scalability to over 12 GByte/s and up to 18 M TB
Backup
Archive / HSM
Data Sharing
High Availability
RAID, JBOD, Hub, Switch, HBA, Tape
DAS, NAS, SAN
SGI® SAN Server 1000
• Management
• Topology
• Monitoring
CXFS™, Samba / Cifs, BDS, NFS, FTP, ..
SAN with CXFS: High performance data sharing with unlimited scale
[Diagram labels: LAN, SAN]
A unique high performance solution:
• Each host shares one or more volumes consolidated in one or more RAID arrays.
• Centralized storage management
• High modularity
• True high-performance data sharing, with near-local file system performance.
• Fully Resilient (HA)
• Fully POSIX Compliant
• As easy to share files as with NFS, but faster
True Heterogeneity
Windows NT® & 2k, SGI® IRIX, Sun™ Solaris, Linux 64 for Altix, IBM AIX, Linux 32
More Under Development
CXFS Usage - Wide Area & GRID Data Sharing
• Faster than WAN FTP or NFS
• Single name space = easy to administer, no data copies
SAN across distances of up to 8,000 km
Data Lifecycle Management
Storage Hierarchy & TCO Model with DMF
Hardware shown: TP9400, STK L700 w/9840
Primary Storage: Online - high-performance disk
Nearline Disk: High Capacity, Low cost, Lower performance
Tape Libraries:
• high-performance
• archive
Promote: used last 24 hrs
Promote: used last 7 days
Demote: > 7 days < 365
Demote: > 1 Yr < 2 Yr
Archive: > 2 Yr
DMF manages data from one platform to another based on:
• age of file
• size of file
• type of file
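An age-based policy of the kind shown on this slide can be sketched as a single function. This is not DMF's configuration syntax or API: the function name and tier names are hypothetical, and mapping the slide's thresholds onto four tiers (primary disk, nearline disk, high-performance tape, archive tape) is an assumption. File size and type, which the slide also lists as criteria, are left out.

```python
def tier_for(days_since_last_use):
    """Pick a storage tier from file age alone (assumed reading of the slide)."""
    if days_since_last_use < 0:
        raise ValueError("age cannot be negative")
    if days_since_last_use <= 7:
        return "primary"    # used within the last 7 days: stays on fast disk
    if days_since_last_use < 365:
        return "nearline"   # "> 7 days < 365": demote to nearline disk
    if days_since_last_use < 2 * 365:
        return "tape"       # "> 1 Yr < 2 Yr": demote to high-performance tape
    return "archive"        # "> 2 Yr": archive tape


if __name__ == "__main__":
    for age in (1, 30, 500, 1000):
        print(age, "days ->", tier_for(age))
```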
SGI® High-Performance Data Management Leadership
Top performance and virtually unlimited scalability
– Broke 3 Gbyte/sec SAN barrier (2000)
– Delivering first 12 GB/sec (15GB peak) SAN (2002)
– First 2 GB SAN Fabric (2001)
– Wide area data sharing (2002)
– Broke backup record - 10 Tbyte in an hour (2003)
Summing up the Productivity Picture
Productivity = cost^{-1} * value * efficiency * usability
[Chart labels: Moore’s Law, Productivity, Productivity Gap]
Productivity in weather and climate HPC - SGI Altix
• Brings serious supercomputing capability to Linux
• Robust multi-OS shared filesystem with unmatched scale
• Porting of many key development and administration tools
• Ease of use from largest node size in the industry
• Environmental codes being ported, optimized, scaled
POP 1.4.3 performance, 1 degree global problem
[Chart: forecast years per wallclock day (axis 0 to 70) versus number of processors (0 to 300). Series: Altix 1.3 GHz, ES40, Altix 1.5 GHz (scaled)]
MM5 performance, T3a case
[Chart: MFLOPS (axis 0 to 30000) versus Number of Processors (0 to 35). Series: Altix 1.5 GHz, IBM p690 1.3 GHz, Xeon 2.2 GHz/myrinet, Athlon 1.4 GHz/Dolphin SCI (all are MPI)]
ARPS scalability on Altix
[Chart: speedup (axis 0 to 80) versus Processors (0 to 80). Series: speedup, perfect]
LM scalability on Altix
[Chart: speedup (axis 0 to 80) versus Processors (0 to 80). Series: speedup, perfect]
Other applications
© 2003 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, Origin, OpenGL, XFS, InfiniteReality, IRIX, and the SGI logo are registered trademarks and OpenMP, NUMAflex, CXFS, InfinitePerformance, and the Silicon Graphics logo are trademarks of Silicon Graphics, Inc., in the United States and/or other countries worldwide. R10000 is a registered trademark of MIPS Technologies, Inc. Pentium and Itanium are registered trademarks of Intel Corporation. Windows is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries worldwide. Linux is a registered trademark of Linus Torvalds. All other trademarks mentioned herein are the property of their respective owners. (06/03)