Exascale: Power, Cooling, Reliability, and Future Arithmetic

John Gustafson
HPC User Forum
Seattle, September 2010
The First Cluster: The “Cosmic Cube”
64 Intel 8086/8087 nodes
700 watts total
6 cubic feet
In 1984, Chuck Seitz and Geoffrey Fox developed it as a $50,000 alternative to Cray vector mainframes. It was motivated by QCD physics problems, but soon proved useful for a very wide range of applications.
But Fox & Seitz did not invent the Cosmic Cube.
Stan Lee did.
Tales of Suspense #79, July 1966
From Wikipedia: “A device created by a secret society of scientists to further their ultimate goal of world conquest.”
Terascale Power Use Today (Not to Scale)
A 1 Tflop/s machine today:

Component          Basis                                        Power
Compute            1 Tflop/s @ 200 pJ per flop                  200 W
Memory             0.1 byte/flop @ 1.5 nJ per byte              150 W
Communication      100 pJ of communication per flop             100 W
Disk               10 TB @ 10 W per TB                          100 W
Control            7.5 nJ/instruction @ 0.2 instructions/flop   1500 W
Power supply loss  19% (81%-efficient power supplies)           950 W
Heat removal       all levels, chip to facility                 2000 W
Total                                                           5 kW

Heat removal is 40% of the total power consumed.

Derived from data from S. Borkar, Intel Fellow
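As a sanity check on the budget above, here is a minimal sketch in Python that tallies the rows and verifies the derived percentages (the dictionary keys are just labels for the table rows):

```python
# Tally the terascale power budget above and check the derived percentages.
budget_w = {
    "compute":       200,   # 1e12 flop/s x 200e-12 J/flop = 200 W
    "memory":        150,   # 0.1 byte/flop x 1e12 flop/s x 1.5e-9 J/byte
    "communication": 100,   # 100e-12 J/flop x 1e12 flop/s
    "disk":          100,   # 10 TB x 10 W/TB
    "control":      1500,   # 0.2 instr/flop x 1e12 flop/s x 7.5e-9 J/instr
    "psu_loss":      950,   # 19% of the 5 kW total
    "heat_removal": 2000,   # 40% of the 5 kW total
}
total = sum(budget_w.values())
print(total)                             # 5000 W = 5 kW
print(budget_w["psu_loss"] / total)      # 0.19 (81%-efficient supplies)
print(budget_w["heat_removal"] / total)  # 0.40
```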
Let’s See that Drawn to Scale…
A SIMD accelerator approach gives up Control to reduce wattage per Tflop/s. That can work for applications that are very regular and SIMD-like (vectorizable with long vectors).
Energy Cost by Operation Type

Operation                  Approximate energy consumed today
64-bit multiply-add          200 pJ
Read 64 bits from cache      800 pJ
Move 64 bits across chip    2000 pJ
Execute an instruction      7500 pJ
Read 64 bits from DRAM     12000 pJ
Notice that 12,000 pJ @ 3 GHz = 36 watts!

SiCortex’s solution: drop the memory speed, but the performance dropped proportionately. Larger caches actually reduce power consumption.
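Converting per-operation energy to sustained power is just energy times rate; a quick sketch, assuming one operation per cycle at 3 GHz (operation names follow the table above):

```python
# Sustained power = energy per operation x operation rate.
RATE_HZ = 3e9  # assume one operation per cycle at 3 GHz

energy_pj = {
    "64-bit multiply-add":        200,
    "read 64 bits from cache":    800,
    "move 64 bits across chip":  2000,
    "execute an instruction":    7500,
    "read 64 bits from DRAM":   12000,
}
for op, pj in energy_pj.items():
    print(f"{op}: {pj * 1e-12 * RATE_HZ:.0f} W")  # DRAM read -> 36 W
```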
Energy Cost of a Future HPC Cluster
Scale     Power   Size
Exaflop   20 MW   Data center
Petaflop  20 kW   Cabinet
Teraflop  20 W    Chip/module
But while we’ve been building HPC clusters, Google and Microsoft have been very, very busy…
No cost for interconnect? Hmm…
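Note that every row of the Power/Size table above implies the same energy efficiency, 20 pJ per flop; a quick check:

```python
# Every row of the table implies the same energy efficiency: 20 pJ/flop.
scales = {
    "exaflop data center": (20e6, 1e18),  # 20 MW, 1e18 flop/s
    "petaflop cabinet":    (20e3, 1e15),  # 20 kW, 1e15 flop/s
    "teraflop chip":       (20.0, 1e12),  # 20 W,  1e12 flop/s
}
for name, (watts, flops) in scales.items():
    print(name, watts / flops * 1e12, "pJ/flop")  # 20.0 in every case
```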
Cloud computing has already eclipsed HPC for sheer scale

• Cloud computing means using a remote data center to manage scalable, reliable, on-demand access to applications
• Provides applications and infrastructure over the Internet
• “Scalable” here means:
  – Possibly millions of simultaneous users of the app.
  – Exploiting thousand-fold parallelism in the app.
From Tony Hey, Microsoft
Mega-Data Center Economy of Scale
Over 18 million square feet. Each.
Data courtesy of James Hamilton
Technology      Cost in small data center   Cost in large data center   Ratio
Network         $95 per Mbps per month      $13 per Mbps per month       7.1
Storage         $2.20 per GB per month      $0.40 per GB per month       5.7
Administration  ~140 servers per admin      >1000 servers per admin      7.1
• A 50,000-server facility is 6–7x more cost-effective than a 1,000-server facility in key respects
• Don’t expect a TOP500 score soon.
• Secrecy?
• Or… not enough interconnect?
Each data center is the size of 11.5 football fields
Computing by the truckload
From Tony Hey, Microsoft
• Build racks and cooling and communication together in a “container”
• Hookups: power, cooling, and interconnect
• I estimate each center is already over 70 megawatts… and 20 petaflops total!
• But: designed for capacity computing, not capability computing
Arming for search engine warfare
It’s starting to look like… a steel mill!
A steel mill takes ~500 megawatts
• Self-contained power plant
• Is this where “economy of scale” will top out for clusters as well?
• Half the steel mills in the US are abandoned
• Maybe some should be converted to data centers!
With great power comes great responsibility. –Uncle Ben

Yes, and also some really big heat sinks. –John G.
An unpleasant math surprise lurks…
64-bit precision is looking long in the tooth. (gulp!)
[Chart: floating-point word size in bits (20–80) vs. year, 1940–2010, roughly tracking Moore’s Law: Zuse 22, Univac/IBM 36, CDC 60, Cray 64, most vendors 64, x86 80 (stack only)]

At 1 exaflop/s (10^18 flop/s), 15 decimals don’t last long.
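To see why 15 decimals wear thin at exascale: IEEE double precision carries roughly 16 significant decimals, so a running sum of comparable terms loses ground to rounding long before 10^18 operations. A minimal sketch of the effect, scaled down to 10^7 additions so it runs in seconds (the exact error printed varies slightly by platform):

```python
# Double precision has ~1e-16 relative rounding error per operation.
# Naively summing n comparable terms loses roughly log10(n) digits,
# so at n ~ 1e18 (one second of exaflop/s arithmetic) little is left.
n = 10**7
x = 0.1                    # not exactly representable in binary
naive = 0.0
for _ in range(n):
    naive += x
exact = n * x              # 1e6, up to the representation error of 0.1
print(naive)                         # drifts noticeably below 1000000.0
print(abs(naive - exact) / exact)    # relative error already ~1e-10 or worse
```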
It’s unlikely a code uses the best precision.
[Chart: distribution of the optimum precision (0–80 bits) across all floating-point operations in an application, versus the fixed 64 bits of IEEE 754 double precision]
• Too few bits gives unacceptable errors
• Too many bits wastes memory, bandwidth, joules
• This goes for integers as well as floating point
Ways out of the dilemma…
• Better hardware support for 128-bit, if only for use as a check
• Interval arithmetic has promise, if programmers can learn to use it properly (not just apply it to point arithmetic methods); a minimal sketch follows this list
• Increasing precision automatically with the memory hierarchy might even allow a return to 32-bit
• Maybe it’s time to restore Numerical Analysis to the standard curriculum of computer scientists?
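To make the interval idea concrete: interval arithmetic computes with ranges guaranteed to contain the true result, so precision loss shows up as interval width instead of silent error. A toy sketch (this Interval class is hypothetical, and a real implementation must also round endpoints outward, which plain Python floats do not):

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Sum of intervals: add corresponding endpoints.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # Product: the true result lies between the extreme endpoint products.
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def width(self):
        return self.hi - self.lo

# An input known only to lie in [0.1, 0.2]:
x = Interval(0.1, 0.2)
y = x * x + x          # result interval is guaranteed to contain the truth
print(y, y.width())    # a wide interval warns us how much certainty we lost
```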
The hierarchical precision idea
(Going up the hierarchy: lower latency, more precision, more range. Going down: more capacity.)

Level      Capacity             Latency     Precision                Dynamic range
Registers  16 FP values         1 cycle     69 significant decimals  10^-2525221 to 10^+2525221
L1 cache   4096 FP values       6 cycles    33 significant decimals  10^-9864 to 10^+9864
L2 cache   2 million FP values  30 cycles   15 significant decimals  10^-307 to 10^+307
Local RAM  2 billion FP values  200 cycles  7 significant decimals   10^-38 to 10^+38
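Today one can approximate the bottom two levels of this scheme in software with mixed precision: store arrays at 7 significant decimals (32-bit) to halve memory traffic, but accumulate at 15 decimals (64-bit) as the “register” precision. A minimal sketch with NumPy (the wider register formats in the table have no standard hardware support):

```python
import numpy as np

# Store in "RAM" at 7 significant decimals: half the memory traffic.
data = np.full(10**7, np.float32(0.1), dtype=np.float32)

low  = data.sum(dtype=np.float32)  # accumulate at storage precision
high = data.sum(dtype=np.float64)  # accumulate at "register" precision

exact = float(data[0]) * data.size  # each element is exactly float32(0.1)
print(abs(float(low)  - exact) / exact)   # visibly larger rounding error
print(abs(float(high) - exact) / exact)   # tiny: the wide accumulator hides
                                          # the narrow storage format
```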
Another unpleasant surprise Another unpleasant surprise lurking…lurking…
• Hardware cache policies are designed to Hardware cache policies are designed to minimize minimize miss ratesmiss rates, at the expense of low , at the expense of low cache cache utilizationutilization (typically around 20%). (typically around 20%).
• Memory transfers will soon be half the power Memory transfers will soon be half the power consumed by a computer, and computing is consumed by a computer, and computing is already power-constrained.already power-constrained.
• Software will need to manage memory Software will need to manage memory hierarchies explicitly. And source codes need to hierarchies explicitly. And source codes need to expose expose memory moves, not hide them.memory moves, not hide them.
No more automatic caches?
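What “exposing memory moves” can look like in source code: explicit blocking, where the program copies a tile into a fast local buffer, computes on it, and writes it back, rather than letting a cache guess. A minimal sketch of the pattern (the tile size and the scaling operation are illustrative):

```python
import numpy as np

TILE = 256  # sized to fit the fast local memory; illustrative value

def scale_blocked(a, factor):
    """Process `a` tile by tile, making every memory move explicit."""
    out = np.empty_like(a)
    for start in range(0, a.size, TILE):
        stop = min(start + TILE, a.size)
        tile = a[start:stop].copy()   # explicit move: RAM -> local buffer
        tile *= factor                # compute entirely on local data
        out[start:stop] = tile        # explicit move: local buffer -> RAM
    return out

a = np.arange(10**6, dtype=np.float64)
print(scale_blocked(a, 2.0)[:4])      # [0. 2. 4. 6.]
```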
Summary
• Mega-data center clusters have eclipsed HPC clusters, but HPC can learn a lot from their methods in getting to exascale.
• Clusters may grow to the size of steel mills, dictated by economies of scale.
• We may have to rethink the use of 64-bit flops everywhere, for a variety of reasons.
• Speculative data motion (like automatic caching) reduces operations per watt… it’s on the way out.