© 2005 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
The Future Evolution of High-Performance Microprocessors
Norm Jouppi, HP Labs
9/27/2006
Talk Overview
• Why and When of Evolution
• Microprocessor Environmental Constraints
• The Power Wall
• Power: From the Transistor to the Data Center
• The Evolution of Computer Architecture Ideas
• Summary
Disclaimer
• These views are mine, not necessarily HP's
• "Never make forecasts, especially about the future" – Samuel Goldwyn
Why Evolution
• Evolution is a very efficient means of building new things
  – Reuse, recycle
  – Minimum of new stuff
  – Much easier than revolution
Technology
• Usually evolution, not revolution
• Many revolutionary technologies have a bad history:
  – Bubble memories
  – Josephson junctions
  – Anything but Ethernet
  – Etc.
• Moore's Law has been a key force driving evolution of the technology
Moore's Law
• Originally presented in 1965
• Number of transistors per chip is 1.59^(year−1959) (originally 2^(year−1959))
• Classical scaling theory (Dennard, 1974)
  – With every feature-size scaling of n:
    • You get O(n²) transistors
    • They run O(n) times faster
• Subsequently proposed:
  – "Moore's Design Law" (Law #2)
  – "Moore's Fab Law" (Law #3)
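The growth rate quoted above can be checked numerically. A minimal sketch (the function name is mine, not from the talk):

```python
import math

# Transistors per chip per the slide: 1.59^(year - 1959)
# (Moore's original 1965 figure was 2^(year - 1959)).
def transistors_per_chip(year, rate=1.59):
    return rate ** (year - 1959)

# A 1.59x/year growth rate implies doubling roughly every 1.5 years:
doubling_years = math.log(2) / math.log(1.59)
print(f"doubling time: {doubling_years:.2f} years")
```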
Microprocessor Efficiency Eras (Jouppi)
• Moore's Law says the number of transistors scales as O(n²) and speed as O(n)
• Microprocessor performance should therefore scale as O(n³)
[Figure: number of transistors (log scale) vs. performance and efficiency, marking the N³, N², N¹, N⁰, and N⁻¹ eras]
N³ Era
• n from device speed, n² from transistor count
• 4004 to 386
• Expansion of data path widths from 4 to 32 bits
• Basic pipelining
• Hardware support for complex ops (FP multiply)
• Memory range and virtual memory
• Hard to measure performance
  – Measured in MIPS
  – But how many 4-bit ops = one 64-bit FP multiply?
    • More than 1500!
N² Era
• n from device speed, only n from n² transistors
• 486 through Pentium III/IV
• Era of large on-chip caches
  – Miss rate halves per quadrupling of cache size
• Superscalar issue
  – 2X performance from quad issue?
• Superpipelining
  – Diminishing returns
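The cache rule of thumb above (miss rate halves per quadrupling of size) is equivalent to saying miss rate scales as 1/√size. A quick sketch; the 5% baseline miss rate at 16 KB is an illustrative assumption, not a number from the talk:

```python
# Rule of thumb from the slide: miss rate halves for every quadrupling of
# cache size, i.e. miss_rate ∝ size^(-1/2).
# The baseline (5% at 16 KB) is illustrative only.
def miss_rate(cache_kb, base_kb=16, base_rate=0.05):
    return base_rate * (base_kb / cache_kb) ** 0.5

for kb in (16, 64, 256, 1024):
    print(f"{kb:5d} KB -> miss rate {miss_rate(kb):.4f}")
```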
N Era
• n from device frequency, 1 from n² transistor count
• Very wide issue machines
  – Little help to many apps
  – Need SMT to justify
• Increasing complexity and size too much → slowdown
  – Long global wires
  – Structure access times go up
  – Time to market
Environmental Constraints on Microprocessor Evolution
• Several categories:
  – Technology scaling
    • Economics
    • Devices
    • Voltage scaling
  – System-level constraints
    • Power
Supply Economics: Moore's Fab Law
• Fab cost is scaling as 1/feature size
• New 65nm full-size fabs cost >2 billion dollars
• Few can afford one by themselves
  – Fabless startups
  – Fab partnerships
    • IBM/Toshiba/etc.
    • Large foundries
• But the number of transistors scales as 1/feature size²
  – Transistors still getting cheaper
  – Transistors still getting faster
Supply Economics: Moore's Design Law
• The number of designers goes as O(1/feature)
• The 4004 had 3 designers at 10µm
• 90nm microprocessors have ~300 designers
• Implication: design cost becomes very large
• Consolidation in # of viable microprocessors
• Microprocessor cores often reused
  – Too much work to design from scratch
  – But "shrinks and tweaks" becoming difficult in DSM
Devices
• Transistors historically get faster as feature size shrinks
• But transistors are getting much leakier
  – Gate leakage (fix with high-K gate dielectrics)
  – Channel leakage (dual-gate or vertical transistors?)
• Even CMOS has significant static power
  – Power is roughly proportional to # of transistors
  – Static power approaching dynamic power
  – Static power increases with chip temperature
    • Positive feedback is bad
Voltage Scaling
• High-performance MOS started out with 12V supplies
• Max voltage scales roughly as sqrt(feature)
• Power is ∝ CV²f
  – Lower voltage reduces power as the square
  – But speed goes down with lower voltage
• Current high-performance microprocessors have 1.1V supplies
• Power reduced (12/1.1)² ≈ 119X over 24 years!
Limits of Voltage Scaling
• Beyond a certain voltage, transistors don't turn off
• ITRS projects a minimum voltage of 0.7V in 2018
  – Limited by threshold-voltage variation, etc.
  – But high-performance microprocessors are now at 1.1V
• Only a (1.1/0.7)² ≈ 2.5X reduction left in the next 14 years!
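Both voltage-scaling figures fall out of the P ∝ CV²f relation with C and f held fixed. A quick check:

```python
# With capacitance and frequency fixed, dynamic power scales as V², so the
# power saving from a supply-voltage drop is the square of the voltage ratio.
def power_reduction(v_old, v_new):
    return (v_old / v_new) ** 2

print(f"12 V -> 1.1 V: {power_reduction(12.0, 1.1):.0f}X")   # historical saving
print(f"1.1 V -> 0.7 V: {power_reduction(1.1, 0.7):.1f}X")   # remaining headroom
```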
System-Level Power
• Per-chip power envelope is near air-cooling limits
• 1U servers and blades reduce heat-sink height
  – Increases thermal resistance
  – Lowers max power dissipation
• The cost of power in and heat out over several years can equal the original system cost
• Power is a first-class design constraint
Microprocessor Power Trends
• Figure source: Shekhar Borkar, "Low Power Design Challenges for the Decade", Proceedings of the 2001 Conference on Asia South Pacific Design Automation, IEEE.
Pushing Out the Power Wall
• "Waste not, want not" (Ben Franklin) for circuits
• Power-efficient microarchitecture
• Single-threaded vs. throughput tradeoff
Single-Threaded / Throughput Tradeoff
• Reducing transistors/core can yield higher MIPS/W
• Move back towards N³ scaling efficiency
• Thus, expect a trend to simpler processors
  – Narrower issue width
  – Shallower pipelines
  – More in-order processors or smaller OOO windows
• "Back to the Future"
• But this gives lower single-thread performance
  – Can't simplify cores too quickly
• Tradeoffs on what to eliminate are not always obvious
  – Examples: speculation, multithreading
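One way to make the MIPS/W argument concrete is with Pollack's rule (single-core performance ≈ √area). Neither that rule nor the numbers below appear in the slides; this is purely an illustrative sketch under those assumptions:

```python
# Sketch: one big core vs. many simpler cores on a fixed die, assuming
# perf-per-core ≈ sqrt(core area) (Pollack's rule) and power roughly
# proportional to total area. Both are rules of thumb, not claims from the talk.
def relative_throughput_per_watt(core_area, die_area=16.0):
    n_cores = die_area / core_area        # how many cores fit on the die
    perf_per_core = core_area ** 0.5      # Pollack's rule
    power = die_area                      # power tracks total transistor count
    return n_cores * perf_per_core / power

for area in (16.0, 4.0, 1.0):             # one big core down to many small ones
    print(f"core area {area:4.1f} -> throughput/W {relative_throughput_per_watt(area):.2f}")
```

Under these assumptions, each quartering of core area doubles aggregate throughput per watt, at the cost of single-thread performance.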
Speculation
• Is speculation uniformly bad?
  – No
• Example: branch prediction
  – Executing down the wrong path wastes performance & power
  – But stalling at every branch would also hurt performance & power
    • Circuits leak when not switching
• Predicting a branch can save power
  – Plus predictor memory takes less power/area than logic
• But the current amount of speculation seems excessive
Multithreading
• SMT is very useful in wide-issue OOO machines
  – Good news: increases power efficiency
  – Bad news: wide issue is still power-inefficient
• Multithreading is useful even in simple machines
  – During cache misses, transistors still leak
    • Not enough time to gate power
  – May only need 2- or 3-thread non-simultaneous MT
Recap: Possible Future Evolution in Terms of P = CV²f
• The formula doesn't change, but the terms do:
  – Power ↓
    • Already near system limits, better if lower
  – Voltage ↓
    • Only a factor of 2.5 left over the next 14 years
  – Clock frequency ↓
    • Not scaling with device speed
    • Fewer pipestages, higher efficiency
    • Move from 30 to 10 stages from 90nm to 32nm
  – Capacitance (~area) needs to be repartitioned for higher system performance
Number of Cores Per Die
• Scale processor complexity roughly as 1/feature
  – Number of cores could go up dramatically, as n³
  – From 1 core at 90nm to 27 per die at 30nm!
  – The number of threads/socket will increase even more
• Industry moves from competing on clock frequency to competing on number of cores
• But can we efficiently use that many cores?
The Coming Golden Age
• We are on the cusp of the golden age of parallel programming
  – Every major application needs to become parallel
  – "Necessity is the mother of invention"
• How to make use of many processors effectively?
  – Some apps are inherently parallel (Web, CFD, TPC-C)
  – Some applications are very hard to parallelize
  – Parallel programming is a hard & costly problem
    • Cherri Pancake
• Can't use large amounts of speculation in software
  – It just moves power inefficiency to a higher level
Why Didn't the Golden Age Happen a Decade Ago?
• In the 1990s there was lots of parallel-processing research, BUT:
  – At most one processor per chip
  – Creating an SMP took expensive glue chips, thus
  – Even single-threaded apps had worse performance/cost
  – Any parallel speedup less than linear made it worse still
  – Programmed by gurus
• In the CMP era:
  – Your microprocessor will come with N processors, whether you like it or not
  – Given a fixed cost, any speedup looks good
• But lots of work is still needed on tools for the masses (not just gurus) of future parallel programmers
Important Architecture Research: A Sampler
• How to wire up CMPs:
  – Memory hierarchy?
  – Interconnection networks?
• How to build cores:
  – Heterogeneous CMP design
  – Conjoined-core CMPs
  – Stacked die w/ heterogeneous technologies
• How to program?
  – Transactional memory
• Power: from the transistor through the data center
Important Architecture Research #2
• Many system-level problems:
  – Cluster interconnects
  – Manageability
  – Availability
  – Security
• Beyond the scope of this talk
• Hardware/software tradeoffs (ASPLOS) increasingly important
Power: From the Transistor through the Data Center
• Heterogeneous CMP
• Conjoined cores
• Data center power management
Heterogeneous Chip Multiprocessors
• Kumar et al. in IEEE Computer
• A.k.a. asymmetric, non-homogeneous, synergistic, …
• Single ISA vs. multiple ISA
• Many benefits:
  – Power
  – Throughput
  – Mitigating Amdahl's Law
• Open questions
  – Best mix of heterogeneity
Potential Power Benefits
• Grochowski et al., ICCD 2004:
  – Asymmetric CMP => 4-6X
  – Further voltage scaling => 2-4X
  – More gating => 2X
  – Controlling speculation => 1.6X
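The slide lists four power-saving levers but does not say how they combine. Assuming (my assumption, not the slide's) that the factors compose multiplicatively, the total potential saving would span:

```python
from math import prod

# Factors from the slide: asymmetric CMP, voltage scaling, gating, speculation.
# Multiplicative composition is an assumption made here for illustration.
low  = [4, 2, 2, 1.6]
high = [6, 4, 2, 1.6]
print(f"{prod(low):.1f}X to {prod(high):.1f}X")
```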
Mitigating Amdahl's Law
• Amdahl's law: parallel speedups are limited by serial portions
• Annavaram et al., ISCA 2005:
  – Basic idea:
    • Use a big core for serial portions
    • Use many small cores for parallel portions
  – Prototype built from a discrete 4-way SMP
    • Ran one socket at regular voltage, the other 3 at low voltage
    • 38% wall-clock speedup within a fixed power budget
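Amdahl's law itself is one line of code. A minimal sketch (the 10% serial fraction is an illustrative choice, not a measurement from the paper):

```python
# Amdahl's law: with serial fraction s, speedup on n processors is
#   1 / (s + (1 - s) / n),
# which is capped at 1/s no matter how many cores are added.
def amdahl_speedup(serial_fraction, n_procs):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

for n in (1, 4, 16, 64):
    print(f"{n:3d} cores -> speedup {amdahl_speedup(0.10, n):.2f}")
```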
Conjoined Cores
• Ideally, provide for the peak needs of any single thread without multiplying that cost by the number of cores
[Figure: adjacent cores sharing resources – baseline-usage resource vs. peak-usage resource]
Conjoined-core Architecture Benefits
[Figure: % area savings per core (core-area savings alone, and core-area + crossbar-area per core) vs. % performance degradation, for INT and FP workloads]
• Performance degradation even less if < 8 threads!
Data center thermal management at HP
• Power density also becoming a significant reliability issue
• Use 3D modeling and measurement to understand the thermal characteristics of data centers
  – Saving 25% today
• Exploit this for dynamic resource allocation and provisioning
• Chandrakant Patel et al.
A Theory on the Evolution of Computer Architecture Ideas
• Conjecture: there is no such thing as a bad or discredited idea in computer architecture, only ideas at the wrong level, place, or time
• Reuse, recycle
• Evolution vs. revolution
• Examples:
  – SIMD
  – Dataflow
  – HLL architectures
  – Capabilities
  – Vectors
SIMD
• Efficient way of computing, proposed in the late '60s
  – 64-PE Illiac-IV operational in 1972
  – Difficult to program
• Intel MMX introduced in 1996
  – Efficient use of larger word sizes for small parallel data
  – Used in libraries or specialized code
  – Only a small increase in hardware (<1%)
Dataflow
• Widely researched in the late '80s
• Lots of problems:
  – Complicated machines
  – New programming model
• Out-of-order (OOO) execution machines developed in the '90s are a limited form of dataflow
  – Issue queues issue instructions when operands are ready
  – But keeps the same instruction set architecture
  – Keeps the same programming model
  – Still complex internally
High-Level Language Architectures
• Popular research in the 1970s and early 1980s
• "Closing the semantic gap"
• A few attempts to implement in hardware failed
  – Machines interpreted HLLs in hardware
• Now we have Java interpreted in software
  – JIT compilers
  – Portability
  – Modest performance loss doesn't matter for some apps
Capabilities
• Popular research topic in the 1970s
• Intel 432 implemented capabilities in hardware
  – Checked on every memory reference
  – Poor performance
• The idea died with the 432
• Security increasingly important
• Capabilities at the file-system level, combined with standard memory-protection models
  – Much less overhead
• Virtual machine support
Lessons from "Idea Evolution Theory"
• Don't be afraid to look at past ideas that didn't work out as a source of inspiration
• Some ideas may be successful if reinterpreted at a different level, place, or time, when they can be made more evolutionary than revolutionary
Conclusions
• The Power Wall is here
  – It is the wall to worry about
  – It has dramatic implications for the industry
    • From the transistor through the data center
• We need to reclaim past efficiencies
  – Microarchitectural complexity needs to be reduced
• The power wall will usher in the "Golden Age of Parallel Programming"
• Much open research in architecture
• It may be time to reexamine some previously discarded architecture ideas