The Myth of the Optimal FO4
James A. Kahle, IBM Fellow, Austin, TX
TAU, Feb. 2 2004

Performance

Technology driven definition:
Perf ~= (1/#i) (#i *Pa) (1/Pa*Pp) (Pp/FO4) (FO4/ps)

- (1/#i): Path length
- (#i *Pa): Application parallelism
- (1/Pa*Pp): Impedance match between software domain & hardware domain
- (Pp/FO4): Processor efficiency
- (FO4/ps): Technology

#i = number of instructions
Pa = Application parallelism
Pp = Processor parallelism
FO4 = Fan Out 4 delay
ps = picoseconds

Diagram label: Application Domain

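Because the definition above is a pure product, relative gains in the individual terms multiply. A minimal Python sketch of that arithmetic follows. The function name, the term names, and the example gains are hypothetical illustrations, not figures from the talk, and each term is treated as an opaque multiplier.

```python
from math import prod

# Names for the five terms of the slide's performance definition
# (illustrative identifiers, not taken from the talk).
TERMS = ("path_length", "application_parallelism", "impedance_match",
         "processor_efficiency", "technology")

def relative_performance(gains):
    """Overall speedup when each term improves by the given ratio.

    `gains` maps a term name to its improvement ratio (1.0 = unchanged).
    Terms that are not mentioned are assumed unchanged.
    """
    unknown = set(gains) - set(TERMS)
    if unknown:
        raise ValueError(f"unknown terms: {sorted(unknown)}")
    return prod(gains.get(term, 1.0) for term in TERMS)

# Hypothetical example: a 1.10x gain in the path-length term combined
# with a 1.20x gain in the technology term.
speedup = relative_performance({"path_length": 1.10, "technology": 1.20})
print(f"{speedup:.2f}x")
```

The same structure explains why the later slides treat each term separately: a gain in any one term scales the whole product, and a loss in one term can cancel a gain in another.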
Path Length
(1/#i) (#i *Pa) (1/Pa*Pp) (Pp/FO4) (FO4/ps)

- Path length Improvements
  - Compiler Maturity
- Instruction Compaction
  - VLIW
  - SIMD

Application Parallelism
(1/#i) (#i *Pa) (1/Pa*Pp) (Pp/FO4) (FO4/ps)

- Major Classification
  - Instruction Parallelism
  - Multiprocessor Parallelism (single image)
  - Cluster / Multi Computer
- Multi-core Era
  - Processor tradeoffs need to be made at a higher level
  - Single Thread vs. Multi-thread
  - Symmetric vs. Asymmetric Multi-core

Processor Efficiency
(1/#i) (#i *Pa) (1/Pa*Pp) (Pp/FO4) (FO4/ps)

- Total logic levels for pipeline
  - Branch Redirect (Ld / Cmp / Br)
  - Load Latency
  - Computation
  - Floating point latency
- Processor Parallelism
  - Super Scalar Width
  - Super Pipeline Length

Processor Efficiency
(1/#i) (#i *Pa) (1/Pa*Pp) (Pp/FO4) (FO4/ps)

- Limits of current studies
  - Ignore Power
  - Ignore Design Improvements
    - Power Management
    - New Latch structure
  - Fix critical parameters
- Optimal FO4 per pipe stage
  - It Depends!!

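One way to see why the answer is "It Depends!!" is a textbook-style first-order pipeline model. This is a sketch under stated assumptions, not the model used in the talk: the total logic depth `logic_fo4`, the per-stage latch overhead `latch_fo4`, and a stall cost that grows linearly with depth at rate `hazard_per_stage` are all hypothetical parameters. Like the studies criticized above, it ignores power.

```python
import math

def time_per_instruction(stages, logic_fo4, latch_fo4, hazard_per_stage):
    """Average FO4 delays per instruction for a pipeline of `stages` stages."""
    cycle_fo4 = logic_fo4 / stages + latch_fo4   # logic slice + latch overhead
    cpi = 1.0 + hazard_per_stage * stages        # stalls grow with depth
    return cycle_fo4 * cpi

def optimal_design_point(logic_fo4, latch_fo4, hazard_per_stage):
    """Return (stages, FO4 per stage) minimising time per instruction.

    Setting d/dn [(L/n + o) * (1 + h*n)] = 0 gives n = sqrt(L / (o*h)).
    """
    stages = math.sqrt(logic_fo4 / (latch_fo4 * hazard_per_stage))
    return stages, logic_fo4 / stages + latch_fo4

# Hypothetical workloads: same logic and latches, different hazard rates.
for hazard in (0.01, 0.05, 0.20):
    stages, fo4 = optimal_design_point(200.0, 3.0, hazard)
    print(f"hazard={hazard:.2f}: {stages:5.1f} stages, {fo4:4.1f} FO4 per stage")
```

Even in this simplified model the best FO4 per stage moves with the workload's hazard rate and with the latch overhead, so no single value is optimal for every design.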
Technology
(1/#i) (#i *Pa) (1/Pa*Pp) (Pp/FO4) (FO4/ps)

- CMOS limitations
- Power limitations
- Wire limitations

CMOS Device Performance
New Device Structures are Needed to Maintain Performance
[Figure: relative device performance vs. year. Series: Conventional Bulk CMOS; SOI (silicon-on-insulator); High mobility; Double-Gate]

Leakage Current Trends
[Figure: IOFF @ 25°C (nA/µm) vs. year]

Technology Power Limitations
[Figure: module heat flux (W/cm2), 0 to 14, vs. year of announcement, 1950 to 2010, grouped as Vacuum, Bipolar, and CMOS. Labeled systems: IBM 360, IBM 370, IBM 3033, IBM 4381, IBM 3081, IBM 3090, IBM 3090S, IBM ES9000, CDC Cyber 205, Fujitsu M380, Fujitsu M-780, Fujitsu VP2000, NTT, IBM RY4, IBM RY5, IBM RY6, IBM RY7, IBM GP, Apache, Pulsar]
(Ghoshal and Schmidt)

Power Limits Performance, but perhaps not quite the way you think

- Limiters for consumer systems
  - Box cross section
  - Air in vs. out temperature
  - Noise/Max Airflow
  - Box temperature
  - Tj-max
  - ...
- Only first few pose hard limit
- Many platforms do not tolerate substantially increased power. (Form-factor/noise/heat constrained.)

[Images: Intel BTX; Apple Dual G5]

AC vs. DC Power Trend (End of Conventional CMOS Scaling)
[Figure: DC power and AC power in W/cm2 (log scale, 0.0001 to 1000) vs. Lpoly (1 to 0.01). Annotations: "Not realistic for most applications"; "Based on Intel and IBM data"]

Buy-All vs. High-Frequency Timing
[Figure: target performance and process distribution; x-axis frequency, 3 to 8 GHz; y-axis 0 to 1]

- 232 ps +-12%: 4 GHz, 97%
- 172 ps +-18%: 6 GHz, 30%

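The two design points quoted on this slide (232 ps +-12% against a 4 GHz target with 97%, and 172 ps +-18% against a 6 GHz target with 30%) can be checked with a short calculation. The sketch below is an interpretation, not the author's method: it assumes the percentages are the share of the process distribution that meets the target frequency, that the delay distribution is Gaussian, and that the +- figure is a 3-sigma spread. The function name and the `sigmas` parameter are hypothetical.

```python
import math

def fraction_meeting_target(nominal_ps, tolerance, target_ghz, sigmas=3.0):
    """Fraction of a Gaussian delay distribution that meets the target cycle time.

    Assumes `tolerance` (e.g. 0.12 for +-12%) is the `sigmas`-sigma spread
    around the nominal critical-path delay; this reading is an assumption.
    """
    sigma_ps = nominal_ps * tolerance / sigmas
    cycle_ps = 1000.0 / target_ghz                      # 1 GHz <-> 1000 ps
    z = (cycle_ps - nominal_ps) / sigma_ps
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF

for nominal, tol, ghz in ((232.0, 0.12, 4.0), (172.0, 0.18, 6.0)):
    share = fraction_meeting_target(nominal, tol, ghz)
    print(f"{nominal:.0f} ps +-{tol:.0%} at {ghz:.0f} GHz: {share:.0%}")
```

Under these assumptions the sketch gives roughly 97% and 30%, in line with the slide. The faster design point has a nominal delay slightly longer than the 6 GHz cycle time, so most of the distribution misses the target.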
Scaling Between Design Points
[Figure: two panels of delay vs. frequency, delay axis 140 to 260. Series: Circuit; 30% RC; 50% RC. Markers: 1x freq target; 1.5x freq target]

scaling the late mode timing point

slack adjustments
- slack on demand
- timing at two corners
- timing at one corner, overachieve

Process Variation, Dieter Wendel

Wires

- Fundamentally, wires do not scale
- Have had a steady stream of "tricks" to compensate
  - Aspect ratio changes (taller wires)
  - Resistance improvements (copper)
  - Capacitance improvements (low-K dielectric)
- Still things to improve, but wires won't keep up with transistors

On-Chip Wires
[Figure: relative delay (log scale, 0.1 to 100) vs. process technology node (nm), from 250 down. Series: Gate delay; Scaled wire delay; Global w. buffers; Global w/o buffer]
ITRS Data/M. Horowitz

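The gap between the buffered and unbuffered global-wire curves can be illustrated with a standard distributed-RC estimate. The sketch below is a simplified model under stated assumptions, not the ITRS data behind the figure: the `RC` and `BUFFER` values are hypothetical, and each buffer is modelled as a fixed delay with its drive resistance and input capacitance ignored.

```python
def unbuffered_delay_ps(length_mm, rc_ps_per_mm2):
    """Elmore delay of a distributed RC wire: 0.5 * r * c * L^2."""
    return 0.5 * rc_ps_per_mm2 * length_mm ** 2

def buffered_delay_ps(length_mm, rc_ps_per_mm2, buffer_ps, max_segments=1000):
    """Best delay when the wire is split into k equal buffered segments.

    Each segment costs its own RC delay plus a fixed `buffer_ps`; the buffer's
    drive resistance and input capacitance are ignored (a simplification).
    """
    return min(
        k * (unbuffered_delay_ps(length_mm / k, rc_ps_per_mm2) + buffer_ps)
        for k in range(1, max_segments + 1)
    )

RC = 100.0      # hypothetical r*c product, ps per mm^2
BUFFER = 20.0   # hypothetical buffer delay, ps
for length in (10.0, 20.0):
    print(length, unbuffered_delay_ps(length, RC), buffered_delay_ps(length, RC, BUFFER))
```

In this model, doubling the wire length quadruples the unbuffered delay but only about doubles the buffered delay, which is why buffered global wires degrade more slowly, at the cost of added buffers and their power.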
Timing Metrics for the Future

- FO4 does not tell it all
  - Does not really scale with Technology
  - Does not measure Wires
  - Gives a broad view of design
- Future Metrics
  - Measure of wire design
  - Circuit / Wire Balance
  - How well design will scale

Pushing Towards Lower FO4

- Complexity Boundaries
  - Exponentially increasing number of latches
  - Logic complexity increase
  - Area Increase
- How do we know the edge?
  - Design teams too large
  - Tools can no longer handle
  - Performance from other techniques
  - But maybe not the whole design is at same FO4

> 16 FO4 Design Methodology
[Diagram labels: Dataflow (e.g. adder); Synthesized Control; LATCH; Cycle Boundary; Re-buffering solution, not pre-planned]

<16 FO4 (Template Based Design)
[Diagram labels: Dataflow (e.g. adder); Synthesized or Array-based Control; LATCH; Cycle Boundary; Dynamic Mux-latch; Pre-planned re-buffering solution; Semi-automated re-buffering solution]
(Posluszny et al, DAC 2000)

<12 FO4 (Input & Output Latch Bound Macros)
[Diagram labels: Dataflow (e.g. adder); Synthesized or Array-based Control; LATCH; Cycle Boundary; Pre-planned wire level & re-buffering solution]

Sub 9-FO4 Designs

- Latches are All That is Left
  - Logic Fully Integrated in the Latch (LSDL, IBM)
  - Every Edge The Same (GASP, Sun)
- SRAM Design in Doubt
- Simultaneous setup and hold time constraints are a major challenge

<9 FO4 (Latches are All That's Left)
[Diagram labels: Highly Structured Control (precharacterized gates AND wires); L1 and L2 latches; LATCH; Half Cycle Boundary; Wires pre-designed; Wires are a MAJOR challenge]

High-Frequency Design Summary

- Wire and buffer planning increasingly important as design frequency increases
- Structure increases with lower FO4
- Latch types proliferate as frequency increases
- Ultimately latches and wires are all that's left

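The FO4 tiers named on the preceding slides can be tied back to a frequency target with a small conversion. The sketch below is illustrative only: the 20 ps FO4 delay is a hypothetical value that depends on the process, the function names are invented for this example, and the handling of values exactly on a tier boundary is an assumption, since the slides give only ">16", "<16", "<12" and "<9".

```python
def fo4_per_cycle(frequency_ghz, fo4_delay_ps):
    """Cycle time expressed in FO4 delays."""
    cycle_ps = 1000.0 / frequency_ghz   # 1 GHz <-> 1000 ps
    return cycle_ps / fo4_delay_ps

def methodology(fo4):
    """Map FO4 per cycle to the design style named on the slides.

    Treating exact boundary values as belonging to the looser tier
    is an assumption of this sketch.
    """
    if fo4 < 9:
        return "<9 FO4 (Latches are All That's Left)"
    if fo4 < 12:
        return "<12 FO4 (Input & Output Latch Bound Macros)"
    if fo4 < 16:
        return "<16 FO4 (Template Based Design)"
    return "> 16 FO4 Design Methodology"

FO4_DELAY_PS = 20.0   # hypothetical FO4 delay; depends on the process
for ghz in (4.0, 6.0):
    fo4 = fo4_per_cycle(ghz, FO4_DELAY_PS)
    print(f"{ghz:.0f} GHz -> {fo4:.1f} FO4: {methodology(fo4)}")
```

With this assumed FO4 delay, moving from a 4 GHz to a 6 GHz target shifts the design from the template-based tier into the tier where latches and wires dominate.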
Needed Tools

- Much more refined timing
  - Variability (ACLV not uniform)
    - Statistical approach
  - Clock distribution skew
  - Local temperature
  - Local supply droop
  - More sophisticated wire models (e.g. liner)
  - Continue to include
    - Noise
- Optimization vs. analysis for new problems
  - E.g. leakage-driven synthesis

Timed Circuits

- Timed circuits use aggressive timing assumptions to increase speed.
  - Delayed Reset Domino Logic
  - Self Resetting Dynamic Logic
  - Pulse Clocking
  - Many others ...
- Timing assumptions can replace transistors -> Increased speed, decreased area
  - Very simple example: If A and B always fall before Clk we can make this transformation:
    [Figure: circuit transformation; label: nfet logic]
- Timing Challenges:
  - Verify that the assumption holds in the design
  - Verify that the assumption is sufficient to guarantee correctness.

Merging Logic and Latches
[Figure labels: Logic Gates; Latch; Logic Devices; Latching Devices]

- Merging logic and latches allows each device to perform multiple functions.
  - Logic and data capture overlapped.
  - Latching and gain overlapped in the latch.
- Demonstrated to provide up to 2x increase in density for arithmetic functions - ISSCC 2003
- Higher density -> shorter wires -> smaller drivers
  - Higher frequency from shorter wires.
  - Lower active power.
  - Lower leakage power (90% of leakage in drivers)
- Timing Challenge: Everything is a latch.
  - Need to characterize them all
  - This is labor intensive and needs to be more automated.

Statistical Timing

- Process variation continues to increase.
  - Designing for the worst case does not always make sense.
  - Continually adding margin consumes die area and power.
  - Recent work has shown worst case timing to be up to 20% overly pessimistic - ICCAD 2003
  - This will only get worse as process variation increases.
- Designers and managers need to be able to quantify the impact of circuit decisions on frequency and yield.
  - Choice X: 70% yield at 3GHz, 75% yield at 2GHz
  - Choice Y: 50% yield at 3GHz, 99% yield at 2GHz
  - The choice depends on the target market and manufacturing cost.
  - Currently the choice is made without this level of information -> more conservative design
- Timing challenge:
  - Continue to improve the capacity and speed of statistical timing engines.
  - Bring statistical timing analysis into mainstream timing flows.

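The Choice X versus Choice Y trade-off above can be made concrete with a two-bin revenue estimate. This is a minimal sketch, assuming the 2 GHz yield is cumulative (it includes dies that also reach 3 GHz) and that every die sells in the fastest bin it reaches. The prices are hypothetical, and manufacturing cost is left out because it is taken to be the same per die for both choices.

```python
def revenue_per_die(yield_3ghz, yield_2ghz, price_3ghz, price_2ghz):
    """Expected revenue per manufactured die with two speed bins.

    Assumes the 2 GHz yield is cumulative (it includes dies that also reach
    3 GHz) and that every die is sold in the fastest bin it reaches.
    """
    only_2ghz = yield_2ghz - yield_3ghz
    return yield_3ghz * price_3ghz + only_2ghz * price_2ghz

# (yield at 3 GHz, yield at 2 GHz) for each choice, from the slide.
CHOICES = {"X": (0.70, 0.75), "Y": (0.50, 0.99)}

def better_choice(price_3ghz, price_2ghz):
    """Name of the choice with the higher expected revenue per die."""
    return max(
        CHOICES,
        key=lambda name: revenue_per_die(*CHOICES[name], price_3ghz, price_2ghz),
    )

# Hypothetical prices: the answer flips with the premium paid for 3 GHz parts.
print(better_choice(price_3ghz=300.0, price_2ghz=100.0))
print(better_choice(price_3ghz=200.0, price_2ghz=100.0))
```

With these yields the break-even point is a 3 GHz price of about 2.2 times the 2 GHz price: above that premium Choice X earns more per die, below it Choice Y does. This is the kind of quantified comparison the slide says designers currently lack.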
Hope for the Future

- New Technologies
  - Faster Transistors
  - New Memory Structures
- New Applications
  - New drivers for technology

New Device Structures
[Images: SiGe transistor; Strained silicon; Carbon nanotube; FinFET]

MRAM Operating Principles and Prototype

MRAM Architecture

"Millipede" Storage

Impedance Match
(1/#i) (#i *Pa) (1/Pa*Pp) (Pp/FO4) (FO4/ps)

- New design optimizations to match applications
  - Maturing Server / Desktop
    - Super Scalar design limits
    - Super Pipeline design limits
    - Maximize performance efficiency
  - Power limited design space
    - Maximize power efficiency
- New Application Spaces
  - Games driving new architecture organizations
  - New levels of Architecture efficiency
  - Will drive the lowest FO4 design spaces

Summary

- Timing constraints are growing
  - Architectural
  - Logical
  - Physical
  - Power
- Multitude of Optimization points
  - e-business
  - High Performance Computing
  - Low Power
  - Gaming / Media
- Technology limits
  - Maturity in CMOS
  - New optimizations for new structures