Exascale: Power, Cooling, Reliability, and Future Arithmetic

John Gustafson
HPC User Forum
Seattle, September 2010
The First Cluster: The “Cosmic Cube”
64 Intel 8086/8087 nodes
700 watts total
6 cubic feet
In 1984, Chuck Seitz and Geoffrey Fox developed it as a $50,000 alternative to Cray vector mainframes. It was motivated by QCD physics problems, but soon proved useful for a very wide range of applications.
But Fox & Seitz did not invent the Cosmic Cube.
Stan Lee did.
Tales of Suspense #79, July 1966
From Wikipedia: “A device created by a secret society of scientists to further their ultimate goal of world conquest.”
Terascale Power Use Today (Not to Scale)
A 1 Tflop/s machine today:

Component          Basis                                        Power
Compute            1 Tflop/s @ 200 pJ per flop                  200 W
Memory             0.1 byte/flop @ 1.5 nJ per byte              150 W
Communication      100 pJ of communication per flop             100 W
Disk               10 TB @ 10 W per TB                          100 W
Control            7.5 nJ/instruction @ 0.2 instructions/flop   1500 W
Power supply loss  19% (81%-efficient power supplies)           950 W
Heat removal       all levels, chip to facility                 2000 W
Total                                                           5 kW

Heat removal is 40% of the total power consumed.

Derived from data from S. Borkar, Intel Fellow
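As a sanity check on the budget above, here is a minimal sketch in Python that tallies the rows and verifies the derived percentages (the dictionary keys are just labels for the table rows):

```python
# Tally the terascale power budget above and check the derived percentages.
budget_w = {
    "compute":       200,   # 1e12 flop/s x 200e-12 J/flop = 200 W
    "memory":        150,   # 0.1 byte/flop x 1e12 flop/s x 1.5e-9 J/byte
    "communication": 100,   # 100e-12 J/flop x 1e12 flop/s
    "disk":          100,   # 10 TB x 10 W/TB
    "control":      1500,   # 0.2 instr/flop x 1e12 flop/s x 7.5e-9 J/instr
    "psu_loss":      950,   # 19% of the 5 kW total
    "heat_removal": 2000,   # 40% of the 5 kW total
}
total = sum(budget_w.values())
print(total)                             # 5000 W = 5 kW
print(budget_w["psu_loss"] / total)      # 0.19 (81%-efficient supplies)
print(budget_w["heat_removal"] / total)  # 0.40
```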
Let’s See that Drawn to Scale…
A SIMD accelerator approach gives up Control to reduce wattage per Tflop/s. That can work for applications that are very regular and SIMD-like (vectorizable with long vectors).
Energy Cost by Operation Type

Operation                  Approximate energy consumed today
64-bit multiply-add          200 pJ
Read 64 bits from cache      800 pJ
Move 64 bits across chip    2000 pJ
Execute an instruction      7500 pJ
Read 64 bits from DRAM     12000 pJ
Notice that 12,000 pJ @ 3 GHz = 36 watts!

SiCortex’s solution: drop the memory speed, but the performance dropped proportionately. Larger caches actually reduce power consumption.
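Converting per-operation energy to sustained power is just energy times rate; a quick sketch, assuming one operation per cycle at 3 GHz (operation names follow the table above):

```python
# Sustained power = energy per operation x operation rate.
RATE_HZ = 3e9  # assume one operation per cycle at 3 GHz

energy_pj = {
    "64-bit multiply-add":        200,
    "read 64 bits from cache":    800,
    "move 64 bits across chip":  2000,
    "execute an instruction":    7500,
    "read 64 bits from DRAM":   12000,
}
for op, pj in energy_pj.items():
    print(f"{op}: {pj * 1e-12 * RATE_HZ:.0f} W")  # DRAM read -> 36 W
```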
Energy Cost of a Future HPC Cluster
Scale     Power   Size
Exaflop   20 MW   Data center
Petaflop  20 kW   Cabinet
Teraflop  20 W    Chip/module
But while we’ve been building HPC clusters, Google and Microsoft have been very, very busy…
No cost for interconnect? Hmm…
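Note that every row of the Power/Size table above implies the same energy efficiency, 20 pJ per flop; a quick check:

```python
# Every row of the table implies the same energy efficiency: 20 pJ/flop.
scales = {
    "exaflop data center": (20e6, 1e18),  # 20 MW, 1e18 flop/s
    "petaflop cabinet":    (20e3, 1e15),  # 20 kW, 1e15 flop/s
    "teraflop chip":       (20.0, 1e12),  # 20 W,  1e12 flop/s
}
for name, (watts, flops) in scales.items():
    print(name, watts / flops * 1e12, "pJ/flop")  # 20.0 in every case
```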
Cloud computing has already eclipsed HPC for sheer scale

• Cloud computing means using a remote data center to manage scalable, reliable, on-demand access to applications
• Provides applications and infrastructure over the Internet
• “Scalable” here means:
  – Possibly millions of simultaneous users of the app.
  – Exploiting thousand-fold parallelism in the app.
From Tony Hey, Microsoft
Mega-Data Center Economy of Scale
Over 18 million square feet. Each.
Data courtesy of James Hamilton
Technology      Cost in small data center   Cost in large data center   Ratio
Network         $95 per Mbps per month      $13 per Mbps per month       7.1
Storage         $2.20 per GB per month      $0.40 per GB per month       5.7
Administration  ~140 servers per admin      >1000 servers per admin      7.1
• A 50,000-server facility is 6–7x more cost-effective than a 1,000-server facility in key respects
• Don’t expect a TOP500 score soon.
• Secrecy?
• Or… not enough interconnect?
Each data center is the size of 11.5 football fields
Computing by the truckload
From Tony Hey, Microsoft
• Build racks and cooling and communication together in a “container”
• Hookups: power, cooling, and interconnect
• I estimate each center is already over 70 megawatts… and 20 petaflops total!
• But: designed for capacity computing, not capability computing
Arming for search engine warfare
It’s starting to look like… a steel mill!
A steel mill takes ~500 megawatts
• Self-contained power plant
• Is this where “economy of scale” will top out for clusters as well?
• Half the steel mills in the US are abandoned
• Maybe some should be converted to data centers!
With great power comes great responsibility. –Uncle Ben

Yes, and also some really big heat sinks. –John G.
An unpleasant math surprise lurks…
64-bit precision is looking long in the tooth. (gulp!)
[Chart: floating-point word size in bits (20–80) vs. year, 1940–2010, roughly tracking Moore’s Law: Zuse 22, Univac/IBM 36, CDC 60, Cray 64, most vendors 64, x86 80 (stack only)]

At 1 exaflop/s (10^18 flop/s), 15 decimals don’t last long.
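To see why 15 decimals wear thin at exascale: IEEE double precision carries roughly 16 significant decimals, so a running sum of comparable terms loses ground to rounding long before 10^18 operations. A minimal sketch of the effect, scaled down to 10^7 additions so it runs in seconds (the exact error printed varies slightly by platform):

```python
# Double precision has ~1e-16 relative rounding error per operation.
# Naively summing n comparable terms loses roughly log10(n) digits,
# so at n ~ 1e18 (one second of exaflop/s arithmetic) little is left.
n = 10**7
x = 0.1                    # not exactly representable in binary
naive = 0.0
for _ in range(n):
    naive += x
exact = n * x              # 1e6, up to the representation error of 0.1
print(naive)                         # drifts noticeably below 1000000.0
print(abs(naive - exact) / exact)    # relative error already ~1e-10 or worse
```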
It’s unlikely a code uses the best precision.
[Chart: distribution of the optimum precision (0–80 bits) across all floating-point operations in an application, versus the fixed 64 bits of IEEE 754 double precision]
• Too few bits gives unacceptable errors
• Too many bits wastes memory, bandwidth, joules
• This goes for integers as well as floating point
Ways out of the dilemma…
• Better hardware support for 128-bit, if only for use as a check
• Interval arithmetic has promise, if programmers can learn to use it properly (not just apply it to point arithmetic methods); a minimal sketch follows this list
• Increasing precision automatically with the memory hierarchy might even allow a return to 32-bit
• Maybe it’s time to restore Numerical Analysis to the standard curriculum of computer scientists?
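To make the interval idea concrete: interval arithmetic computes with ranges guaranteed to contain the true result, so precision loss shows up as interval width instead of silent error. A toy sketch (this Interval class is hypothetical, and a real implementation must also round endpoints outward, which plain Python floats do not):

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Sum of intervals: add corresponding endpoints.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # Product: the true result lies between the extreme endpoint products.
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def width(self):
        return self.hi - self.lo

# An input known only to lie in [0.1, 0.2]:
x = Interval(0.1, 0.2)
y = x * x + x          # result interval is guaranteed to contain the truth
print(y, y.width())    # a wide interval warns us how much certainty we lost
```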
The hierarchical precision idea
(Going up the hierarchy: lower latency, more precision, more range. Going down: more capacity.)

Level      Capacity             Latency     Precision                Dynamic range
Registers  16 FP values         1 cycle     69 significant decimals  10^-2525221 to 10^+2525221
L1 cache   4096 FP values       6 cycles    33 significant decimals  10^-9864 to 10^+9864
L2 cache   2 million FP values  30 cycles   15 significant decimals  10^-307 to 10^+307
Local RAM  2 billion FP values  200 cycles  7 significant decimals   10^-38 to 10^+38
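Today one can approximate the bottom two levels of this scheme in software with mixed precision: store arrays at 7 significant decimals (32-bit) to halve memory traffic, but accumulate at 15 decimals (64-bit) as the “register” precision. A minimal sketch with NumPy (the wider register formats in the table have no standard hardware support):

```python
import numpy as np

# Store in "RAM" at 7 significant decimals: half the memory traffic.
data = np.full(10**7, np.float32(0.1), dtype=np.float32)

low  = data.sum(dtype=np.float32)  # accumulate at storage precision
high = data.sum(dtype=np.float64)  # accumulate at "register" precision

exact = float(data[0]) * data.size  # each element is exactly float32(0.1)
print(abs(float(low)  - exact) / exact)   # visibly larger rounding error
print(abs(float(high) - exact) / exact)   # tiny: the wide accumulator hides
                                          # the narrow storage format
```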
Another unpleasant surprise Another unpleasant surprise lurking…lurking…
• Hardware cache policies are designed to Hardware cache policies are designed to minimize minimize miss ratesmiss rates, at the expense of low , at the expense of low cache cache utilizationutilization (typically around 20%). (typically around 20%).
• Memory transfers will soon be half the power Memory transfers will soon be half the power consumed by a computer, and computing is consumed by a computer, and computing is already power-constrained.already power-constrained.
• Software will need to manage memory Software will need to manage memory hierarchies explicitly. And source codes need to hierarchies explicitly. And source codes need to expose expose memory moves, not hide them.memory moves, not hide them.
No more automatic caches?
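What “exposing memory moves” can look like in source code: explicit blocking, where the program copies a tile into a fast local buffer, computes on it, and writes it back, rather than letting a cache guess. A minimal sketch of the pattern (the tile size and the scaling operation are illustrative):

```python
import numpy as np

TILE = 256  # sized to fit the fast local memory; illustrative value

def scale_blocked(a, factor):
    """Process `a` tile by tile, making every memory move explicit."""
    out = np.empty_like(a)
    for start in range(0, a.size, TILE):
        stop = min(start + TILE, a.size)
        tile = a[start:stop].copy()   # explicit move: RAM -> local buffer
        tile *= factor                # compute entirely on local data
        out[start:stop] = tile        # explicit move: local buffer -> RAM
    return out

a = np.arange(10**6, dtype=np.float64)
print(scale_blocked(a, 2.0)[:4])      # [0. 2. 4. 6.]
```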
Summary
• Mega-data center clusters have eclipsed HPC clusters, but HPC can learn a lot from their methods in getting to exascale.
• Clusters may grow to the size of steel mills, dictated by economies of scale.
• We may have to rethink the use of 64-bit flops everywhere, for a variety of reasons.
• Speculative data motion (like automatic caching) reduces operations per watt… it’s on the way out.