www.compaq.com
Design Challenges in Multi-GHz Microprocessors
Bill Herrick
Director, Alpha Microprocessor Development
ASP DAC 2000

Introduction
• Moore’s Law (the trend that the demand for IC functions, and the capability of the semiconductor industry to meet that demand, will double every 1.5 to 2 years) has worked well during the last 30 years
• Difficult challenges face the industry attempting to maintain the pace
• With collaboration, understanding, vision and innovation this trend can continue for high performance microprocessors

Topics
• Historical Trends
  - Intel
  - Alpha chips and design style
  - Observations and trends
• Technology Predictions
  - ITRS 1999
• Key Design Challenges
  - Clocking and power: how Alpha has managed
  - Clocking and power: long term solutions

Historical Trends: Then and Now
Circa 1970
• 12µ PMOS
• 1000 transistors
• 5 - 10 mm² die size
• 10V supply
• 50 - 100 kHz frequency
• 100 - 200 mW
• 16 pin DIPs
Circa 2000
• 0.18µ CMOS
• 10 - 100 million transistors
• 300 - 400 mm² die size
• 2.5V supply
• 500 - 1000 MHz frequency
• 50 - 100 W
• 500 - 1000 pin BGAs

Intel Performance History
[Figure: MIPS (log scale, 0.01 to 10000) versus date of introduction (1971 to 2000) for the 4004, 8080, 8086, 80286, 386, 486, Pentium, Pentium Pro, Pentium II, Xeon and Pentium III.]

Intel Trends
• The 4004 (1971)
  - 2300 transistors in a 10µ process
  - 108kHz operation, executing 0.06 MIPS
• Pentium III (1999)
  - 28 million transistors in a 0.18µ process
  - 733MHz operation, executing 2000 MIPS
• Over nearly 30 years
  - performance has increased 30,000x
  - transistor count has increased 10,000x
  - frequency has increased 7,000x
  - die size has increased only 25x
  - Moore’s law predicts a 30,000x to 1,000,000x improvement over this period
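
The 30,000x to 1,000,000x range is what doubling every 2 years or every 1.5 years gives over 30 years. The sketch below checks that arithmetic and recomputes the observed ratios from the 4004 and Pentium III figures on this slide. The function and variable names are illustrative, not from the talk.

```python
def moore_growth(years: float, doubling_period_years: float) -> float:
    """Growth factor after `years` if capability doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


# Ratios of Pentium III (1999) to 4004 (1971) figures quoted on the slide.
observed = {
    "performance (MIPS)": 2000 / 0.06,
    "transistor count": 28e6 / 2300,
    "frequency": 733e6 / 108e3,
}

if __name__ == "__main__":
    for name, ratio in observed.items():
        print(f"{name}: {ratio:,.0f}x")
    for period in (2.0, 1.5):
        growth = moore_growth(30, period)
        print(f"doubling every {period} years for 30 years: {growth:,.0f}x")
```

Only the performance ratio (about 33,000x) reaches the lower end of the predicted range; transistor count and frequency fall below it.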

Alpha Architecture
• Alpha is a true 64-bit load/store RISC architecture
• Alpha is designed for high clock speed
  - Simple, fixed length (32-bit) instructions
  - Minimal instruction ordering constraints
  - No condition codes
  - No branch delay slots
• Chip micro-architecture is carefully chosen to maximize performance without impacting cycle time

Alpha Performance History
[Figure: SPECint95 (log scale, 1 to 100) versus date of introduction (1992 to 1999) for EV4-200, EV45-275, EV5-300, EV56-400, EV56-500, EV56-600, EV6-575 and EV67-700.]

EV4 Chip Overview
• 0.75µm 3LM N-well CMOS, Leff=0.5µm, Tox=10.5nm
• 3.3V Vdd
• 200MHz @100°C & 3.3V
• 16 gate delays per cycle
• 30W @200MHz & 3.3V
• 13.9mm x 16.8mm (233 mm²)
• 1.7 Million Transistors
  ~ 0.85 Million Logic Transistors
• 431 pin PGA (291 signals)

EV4 Micro-Architecture
• Dual In-Order Instruction Issue
  - single-issue Integer & single-issue FP
• Fully Pipelined (except Integer MUL and FP DIV)
  - 7-stage Integer and 10-stage FP pipelines
• 1-bit Branch Prediction: 2k-entry BHT
• 8kB direct-mapped I-Cache and 8kB direct-mapped write-through D-Cache
• 32 Integer and 32 FP Registers, 64b/entry
• Flexible external interface: shared 128b/64b data, 34b address L2 cache and system interface

EV5 Chip Overview
• 0.50µm 4LM N-well CMOS, Leff=0.365µm, Tox=9.0nm
• 3.3V Vdd
• 350MHz @100°C & 3.3V
• 14 gate delays per cycle
• 60W @350MHz & 3.3V
• 16.5mm x 18.1mm (298 mm²)
• 9.3 Million Transistors
  ~ 2.5 Million Logic Transistors
• 499 pin PGA (294 signals)

EV5 Micro-Architecture
• Quad In-Order Instruction Issue
  - dual-issue Integer & dual-issue FP
• 7-stage Integer and 9-stage FP pipelines
  - FP latencies reduced by 2 cycles
• 2-bit Branch Prediction: 2k-entry BHT
• 8kB I-Cache and 8kB write-through D-Cache
• 96kB unified on-chip L2 Cache
• Improved external interface supports a non-blocking cache scheme

EV6 Chip Overview
• 0.35µm 6LM N-well CMOS, Leff=0.25µm, Tox=6.0nm
• 2.2V Vdd
• 575MHz @100°C & 2.2V
• 12 gate delays per cycle
• 90W @575MHz & 2.2V
• 16.7mm x 18.8mm (314 mm²)
• 15.2 Million Transistors
  ~ 6 Million Logic Transistors
• 587 pin PGA (374 signals)

EV6 Micro-Architecture
• Four-wide Instruction Fetch
• Tournament Branch Predictor
• Out-of-Order Execution Pipelines
  - Quad-speculative-issue integer pipeline
  - Dual-speculative-issue floating-point pipeline
• 80 In-flight Instructions
• Registers: 80 Integer, 72 Floating Point
• Queue Entries: 20 Integer, 15 Floating Point
• 2-Way 64KB L1 On-Chip Instruction and Data Caches
• Up to 16 outstanding off-chip memory references

EV7 Chip Overview
• 0.18µm CMOS technology
• 1.5V Vdd
• Clock frequency >1.0GHz
• 100W
• ~350mm²
• ~100 Million transistors
• EV6 core
• Integrated L2 Cache (1.75 MB 7-way)
• Integrated memory controller (RAMBUS)
• Integrated network interface

EV8 Chip Overview
• Clock frequency range 1.0-2.0GHz
• Leading edge 0.125µm CMOS technology
• ~1.2V Vdd
• <150W
• ~250 Million transistors
• Enhanced out-of-order execution
• 8-wide superscalar
• 4-way simultaneous multi-threading (SMT)
• EV7 memory and system enhancements

Alpha Circuit Design Philosophy
• Transistor level circuit design
• Broad range of circuit styles and logic families
  - Complementary CMOS
  - Dynamic logic
  - DCVSL (cascode)
  - Ratioed logic
• Key components to enable high performance
  - On-chip clock generation and distribution (low-skew, fast edge)
  - Latching (low latency)
  - Low noise on-chip power distribution
  - On-chip signal integrity management

Complexity Trends
• Process scaling has continued steadily
• Planarization has enabled an increase in the number of interconnect layers
• Transistor counts have increased dramatically with the L2 cache SRAMs
• Additionally, design team size has increased ~40% per generation
• Opportunities to manage complexity and productivity
  - Fundamental understanding and modeling of process and circuit element behaviors
  - High level design methods
  - CAD
  - Design reuse
  - Micro-architecture
[Figure: Process Features. Dimension (µm) and metal layers for EV4, EV5, EV6, EV7 and EV8.]
[Figure: Chip Features. Transistors (M) and die size (mm²) for EV4, EV5, EV6, EV7 and EV8.]

Power Dissipation Trends
• Power consumption is increasing
  - Better cooling technology needed
• Supply current is increasing faster!
• On-chip signal integrity will be a major issue
• Power and current distribution are critical
• Opportunities to slow power growth
  - Accelerate Vdd scaling
  - Low-κ dielectrics, thinner (Cu) interconnect
  - SOI circuit innovations
  - Clock system design
  - Micro-architecture
[Figure: Power Dissipation. Power (W) and voltage (V) for EV4, EV5, EV6, EV7 and EV8.]
[Figure: Supply Current. Current (A) and voltage (V) for EV4, EV5, EV6, EV7 and EV8.]
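
Supply current grows faster than power because I = P/Vdd and Vdd falls with each generation. The sketch below applies that relation to the power and Vdd figures on the chip overview slides. The names are illustrative, and the EV8 entry uses the slide's "<150W" bound, so its current is an upper limit rather than a reported value.

```python
# (power in W, Vdd in V) as listed on the chip overview slides.
# EV8 power is only bounded (<150W) in the source, so its current is an upper bound.
CHIPS = {
    "EV4": (30.0, 3.3),
    "EV5": (60.0, 3.3),
    "EV6": (90.0, 2.2),
    "EV7": (100.0, 1.5),
    "EV8": (150.0, 1.2),
}


def supply_current(power_w: float, vdd_v: float) -> float:
    """Average supply current in amperes, from I = P / V."""
    return power_w / vdd_v


if __name__ == "__main__":
    for name, (power, vdd) in CHIPS.items():
        print(f"{name}: {power:.0f} W at {vdd} V -> {supply_current(power, vdd):.1f} A")
```

Under these figures, power rises at most 5x from EV4 to EV8 while current rises from about 9 A to as much as 125 A.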

Performance Trends
• Performance has increased significantly (7x) faster than frequency
• Performance tracks transistor count when L2 cache ignored
  - Transistor budget has increased more than performance when L2 cache is considered, but benchmarks did not reflect larger applications
• Opportunities to continue performance improvements
  - Continued scaling of devices, interconnect and dielectrics
  - Clock distribution
  - Micro-architecture
  - System design
[Figure: Clock Speed. Relative performance and frequency (MHz) for EV4, EV5, EV6, EV7 and EV8.]
[Figure: Transistor Count. Relative performance and transistors (M) for EV4, EV5, EV6, EV7 and EV8.]

Micro-Architecture Trends
• Trends have included
  - Wider super-scalar machines, deep pipelines
  - Larger register files, L1 caches
  - On-chip L2 caches
  - Out of order execution
  - Sophisticated branch prediction, predication, speculation
  - Integrated memory and network controllers
  - SMT
  - Less idle logic but more bookkeeping logic
• Future opportunities include
  - Floating point performance improvements
  - Vectors
  - Thread-level speculation
  - More pipelining
  - Better on-chip communications
    - Banking, replicating structures
    - Clustering functional units
  - On-chip SMP

Challenging Design Trends
• Micro-architecture and logic design are stressed as frequency has increased faster than scaling
• Further reducing the number of gate delays per cycle will be difficult
• Cycles to communicate across chip track with frequency
• Clock edge rates are not scaling
• Opportunities to continue performance increases
  - Chip implementation design
  - Clock system design
  - Micro-architecture
[Figure: Logic Levels per Cycle. Gate delays per cycle and frequency (MHz) for EV4, EV5, EV6, EV7 and EV8.]
[Figure: Cycles Across Chip. Cycles and frequency (MHz) for EV4, EV5, EV6, EV7 and EV8.]
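
One way to see "frequency has increased faster than scaling" is to divide each chip's cycle time by its gate delays per cycle, giving the average time budget per gate delay. The sketch below does this for the EV4, EV5 and EV6 figures on the chip overview slides; the names are illustrative and the calculation is a rough average, not a circuit-level delay model.

```python
# (clock frequency in Hz, gate delays per cycle) from the EV4/EV5/EV6 chip overview slides.
CHIPS = {
    "EV4": (200e6, 16),
    "EV5": (350e6, 14),
    "EV6": (575e6, 12),
}


def gate_delay_ps(freq_hz: float, gates_per_cycle: int) -> float:
    """Average time budget per gate delay, in picoseconds."""
    cycle_ps = 1e12 / freq_hz
    return cycle_ps / gates_per_cycle


if __name__ == "__main__":
    for name, (freq, gates) in CHIPS.items():
        print(f"{name}: {gate_delay_ps(freq, gates):.0f} ps per gate delay")
```

From EV4 to EV6 the frequency rises about 2.9x, but the per-gate budget shrinks only about 2.2x. The rest of the gain comes from cutting the gate delays per cycle from 16 to 12, which is the lever the slide says will be hard to pull further.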

ITRS-1999 Key Messages
• No major issues through 130 nm generation, but significant issues for 100 nm generation (2005)
• Continued technology scaling will require the introduction of new process materials and new devices
• Transistor densities will continue the historical trends ~2X / 2yrs
• Clock frequency increases will slow compared to historical trends

ITRS-1999 The Roadmap
                   1999   2001   2003   2005   2008   2011   2014
Generation (nm)     180    130    130    100     70     50     35
Leff (nm)           140    100     80     65     45     31     21
Devices (M)         110    220    441    882   2494   7053  19949
Chip Size (mm²)     450    450    567    622    713    817    937
Signals             768   1024   1024   1024   1280   1408   1472
Pins               1600   2007   2518   3158   4437   6234   8758
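
The Devices row can be checked against the "~2X / 2yrs" message by solving for the doubling period implied by the 1999 and 2014 endpoints. A minimal sketch, assuming steady exponential growth between those two columns; the function name is illustrative.

```python
import math


def doubling_period_years(start_value: float, end_value: float, years: float) -> float:
    """Doubling period implied by exponential growth from start_value to end_value."""
    return years * math.log(2) / math.log(end_value / start_value)


if __name__ == "__main__":
    # Devices (M): 110 in 1999, 19949 in 2014.
    period = doubling_period_years(110, 19949, 2014 - 1999)
    print(f"implied doubling period: {period:.2f} years")
```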

ITRS-1999 The Roadmap (continued)
                   1999   2001   2003   2005   2008   2011   2014
Generation (nm)     180    130    130    100     70     50     35
Clock (MHz)        1200   1454   1724   2000   2500   3000   3600
Local Clk (MHz)    1250   1767   2490   3500   6000  10000  13500
IO (MHz)            480    722    932   1035   1285   1540   1800
Wiring Levels         7      7      8      9      9     10     10
Vdd (V)             1.8    1.5    1.5    1.2    0.9    0.6    0.6
Power (W)            90    115    140    160    170    174    183
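
Dividing the Power row by the Vdd row gives the average supply current the roadmap implies, which is what the power distribution network has to deliver. The sketch below uses the table values as printed; the names are illustrative, and I = P/Vdd ignores transient peaks.

```python
YEARS = [1999, 2001, 2003, 2005, 2008, 2011, 2014]
VDD_V = [1.8, 1.5, 1.5, 1.2, 0.9, 0.6, 0.6]
POWER_W = [90, 115, 140, 160, 170, 174, 183]


def implied_currents() -> dict:
    """Average supply current (A) per roadmap year, from I = P / Vdd."""
    return {year: p / v for year, p, v in zip(YEARS, POWER_W, VDD_V)}


if __name__ == "__main__":
    for year, amps in implied_currents().items():
        print(f"{year}: {amps:.0f} A")
```

On these numbers power roughly doubles between 1999 and 2014, while the implied current grows about sixfold, from 50 A to about 305 A.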

ITRS-1999 Highlights
• Transistors
  - Drive currents will remain constant through 2014 at 750 µA/µm for NFETs and 350 µA/µm for PFETs
  - Leakage currents will double every 3 years from 5 nA/µm in 1999 to 160 nA/µm in 2014
• Interconnect
  - Use of Cu and low-κ dielectrics will become standard
  - Local interconnect delays will scale with gate delays
  - Global interconnect delays, even with repeaters, will not scale with gate delays
  - Coplanar waveguides, free space RF and optical interconnect may be needed longer term
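
The leakage projection is a plain doubling series: 15 years at one doubling every 3 years is five doublings, or 32x. A short sketch of that arithmetic, with illustrative names and the slide's 1999 starting value:

```python
def leakage_na_per_um(year: int, base_year: int = 1999, base_na: float = 5.0,
                      doubling_years: float = 3.0) -> float:
    """Projected leakage (nA/µm) if it doubles every `doubling_years` from the base year."""
    return base_na * 2.0 ** ((year - base_year) / doubling_years)


if __name__ == "__main__":
    for year in (1999, 2002, 2005, 2008, 2011, 2014):
        print(f"{year}: {leakage_na_per_um(year):.0f} nA/µm")
```

Since the same slide holds drive current constant, the ratio of on-current to leakage current falls by the same factor of 32 over the period.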

ITRS-1999 Highlights (2)
• Packaging
  - Maximum junction temperature must be reduced from 100°C to 85°C by 2002 for reliability concerns
  - Significant θja improvements will be required to maintain air cooling system solutions: a 50% reduction by 2002 and another 30% by 2014
• Modeling & Simulation
  - 2D and 3D interconnect models with inductance and transmission line effects will be needed
  - Transistor models of non-quasi-static effects and quantum mechanical gate effects will be needed; gate currents will become important
  - OCV modeling will become necessary
  - CPU efficient and accurate models will be essential

ITRS-1999 Highlights (3)
• Design Productivity
  - Design team sizes will not exceed 300 people
  - Design cycle times will decrease from 36 months in 1999 to 30 months in 2005 to 24 months in 2014
• Verification and Test
  - Verification has become more than 50% of the total design effort
  - Use of formal verification will increase from 15% now to 30% in 2005 to 60% in 2014
  - BIST coverage will increase from 20% now to 40% in 2005 to 70% in 2014

EV4 Clocking
• 2 phase single wire clock, distributed globally
  - Low skew
  - Fast edge rate
• 1 clock driver channel
  - 3.5nF clock load
  - 35 cm final driver width
[Figure: clock waveform (trise = 0.5ns, tskew = 0.3ns, tcycle = 6.0ns) and location of clock driver on die.]

EV4 Latches
• First single wire clock implementation
  - Race immune latch
  - Level sensitive design
  - 2 latches per cycle
  - Can build logic into first stage of latch
  - tcycle latch overhead is approximately 25%
[Figure: CLK high loading latch schematic (D, CLK, Q).]

EV4 Thermal Gradient
[Figure: EV4 die thermal map with labeled temperatures of 46°C and 76°C.]

EV5 Clocking
• 2 phase single wire clock, distributed globally
• 2 distributed driver channels
  - Reduced RC delay/skew
  - Improved thermal distribution
  - 3.75nF clock load
  - 58 cm final driver width
• Local inverters for latching
• Conditional clocks in caches to reduce power
• More complex race checking
• Device variation
[Figure: clock waveform (trise = 0.35ns, tskew = 150ps, tcycle = 3.3ns) and location of clock drivers on die (pre-driver and final drivers).]

EV5 Latches
• Reduce tdq and reduce clock load
  - Local clock inverter complicated race issues
  - Level sensitive design
  - 2 latches per cycle
  - Can build logic into first and last stages of latch
  - tcycle latch overhead is approximately 15%
  - Smaller, faster and lower power than EV4 latch
[Figure: CLK high loading latch schematic (D, CLK, CLK_L, Q).]

EV5 Thermal Gradient
[Figure: EV5 thermal gradient.]

EV5 Global Clock Skew
[Figure: EV5 global clock skew.]

EV5 Local Clock Skew
[Figure: EV5 local clock skew.]

EV6 Clocking
• 2 Phase, with multiple conditional buffered clocks
  - 2.8 nF clock load
  - 40 cm final driver width
• Local clocks can be gated “off” to save power
• Reduced load/skew
• Reduced thermal issues
• Multiple clocks complicate race checking
[Figure: global clock waveform (trise = 0.35ns, tskew = 50ps, tcycle = 1.67ns), with PLL.]
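
The waveform figures on the EV4, EV5 and EV6 clocking slides can be compared by expressing skew and rise time as fractions of the cycle. The sketch below uses the trise, tskew and tcycle values as printed on those slides; the names are illustrative.

```python
# (t_rise, t_skew, t_cycle) in nanoseconds, as printed on the clocking slides.
CLOCKS = {
    "EV4": (0.5, 0.3, 6.0),
    "EV5": (0.35, 0.15, 3.3),
    "EV6": (0.35, 0.05, 1.67),
}


def fraction_of_cycle(t_ns: float, t_cycle_ns: float) -> float:
    """Share of the clock cycle taken up by an interval of t_ns."""
    return t_ns / t_cycle_ns


if __name__ == "__main__":
    for name, (t_rise, t_skew, t_cycle) in CLOCKS.items():
        skew_pct = 100 * fraction_of_cycle(t_skew, t_cycle)
        rise_pct = 100 * fraction_of_cycle(t_rise, t_cycle)
        print(f"{name}: skew {skew_pct:.1f}% of cycle, rise time {rise_pct:.1f}% of cycle")
```

On these figures skew falls from 5% to about 3% of the cycle, while rise time grows from about 8% to about 21%, which is the "edge rates are not scaling" concern in numbers.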

EV6 Latches
• Conditional clocks to reduce power
  - Static design
  - 1 latch per cycle
  - Edge triggered to simplify race rules
  - Can build logic into latch
  - tcycle latch overhead is approximately 15%
[Figure: latch schematic (D, CLK, Q_H, Q_L).]

EV6 Clock Results
[Figure: GCLK skew (at Vdd/2 crossings), scale 5 to 50 ps.]
[Figure: GCLK rise times (20% to 80% extrapolated to 0% to 100%), scale 300 to 345 ps.]

EV7 Clock Hierarchy
Active Skew Management and Multiple Clock Domains
+ widely dispersed drivers
+ DLLs compensate static and low-frequency variation
+ divides design and verification effort
- DLL design and verification is added work
+ tailored clocks
[Figure: clock hierarchy with SYSCLK, a PLL, three DLLs, and clock domains GCLK (CPU core), L2L_CLK (L2 cache), L2R_CLK (L2 cache) and NCLK (memory controller).]

Power Consumption
• Clocks consume the largest fraction of power
• Driving inter-unit busses consumes as much power as intra-unit gates and interconnect
[Figure: pie chart of power by category (clocks, caches, execution units, control, I/O drivers) with slices of 40%, 20%, 15%, 15% and 10%.]
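
A first-order estimate of clock power is P = C·Vdd²·f, since the clock node charges and discharges once per cycle. The sketch below applies it to the clock loads on the EV4, EV5 and EV6 clocking slides, using Vdd and frequency from the chip overview slides. It assumes full-swing switching on the listed load only; any local or conditional clock capacitance not included in those load figures is left out, so it is not expected to reproduce the 40% share. Names are illustrative.

```python
# (clock load in F, Vdd in V, frequency in Hz, total chip power in W) from the slides.
CHIPS = {
    "EV4": (3.5e-9, 3.3, 200e6, 30.0),
    "EV5": (3.75e-9, 3.3, 350e6, 60.0),
    "EV6": (2.8e-9, 2.2, 575e6, 90.0),
}


def clock_power_w(load_f: float, vdd_v: float, freq_hz: float) -> float:
    """Dynamic power of a full-swing clock node: P = C * Vdd^2 * f."""
    return load_f * vdd_v ** 2 * freq_hz


if __name__ == "__main__":
    for name, (load, vdd, freq, total) in CHIPS.items():
        p = clock_power_w(load, vdd, freq)
        print(f"{name}: {p:.1f} W on the listed clock load ({100 * p / total:.0f}% of {total:.0f} W)")
```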

EV4 - 3 Metal Layers
3rd “coarse and thick” metal layer added to the technology for EV4 design
Power supplied from two sides of the die via 3rd metal layer
2nd metal layer used to form power grid
90% of 3rd metal layer used for power/clock routing
[Figure: metal stack cross-section: Metal 1, Metal 2, Metal 3.]

EV5 - 4 Metal Layers
4th “coarse and thick” metal layer added to the technology for EV5 design
Power supplied from four sides of the die
Grid strapping done all in coarse metal
90% of 3rd and 4th metals used for power/clock routing
[Figure: metal stack cross-section: Metal 1, Metal 2, Metal 3, Metal 4.]

EV6 - 6 Metal Layers
2 reference plane metal layers added to the technology for EV6 design
Solid planes dedicated to Vdd/Vss
Significantly lowers resistance of grid
Lowers on-chip inductance
[Figure: metal stack cross-section: Metal 1, Metal 2, RP1/Vss, Metal 3, Metal 4, RP2/Vdd.]

Reference Plane Example
Simulation Methodology
Extract Inductance & Resistance versus Frequency
Model Skin Effect Both Vertically and Horizontally
Construct Time-Domain SPICE Model and Simulate with SPICE
Use FF Devices, High Vdd & Low Temperature to Aggravate Inductive Effects
[Figure: two stack cross-sections, one with RP2 only (Substrate, Metal 1, Metal 2, Metal 3, Metal 4, RP2) and one with RP1 and RP2 (Substrate, Metal 1, Metal 2, RP1, Metal 3, Metal 4, RP2); Metal 3 lines are the victims.]

Reference Plane Example (continued)
[Figure: simulated waveforms from 1.0 to 2.0 ns for the M3 victim, M3 aggressors and M1 aggressors, in two panels: RP2 only, and RP1 & RP2.]

De-coupling Capacitor Ratios
• EV4
  - total effective switching capacitance = 12.5nF
  - 128nF of de-coupling capacitance
  - de-coupling/switching capacitance ~ 10x
• EV5
  - 13.9nF of switching capacitance
  - 160nF of de-coupling capacitance
• EV6
  - 34nF of effective switching capacitance
  - 320nF of de-coupling capacitance (not enough!)
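
The slide states the ~10x ratio only for EV4. Applying the same division to the EV5 and EV6 figures shows the EV6 ratio slipping below 10x, which may be part of why the slide calls it "not enough". The names in the sketch are illustrative.

```python
# (switching capacitance in nF, de-coupling capacitance in nF) from the slide.
CAPS_NF = {
    "EV4": (12.5, 128.0),
    "EV5": (13.9, 160.0),
    "EV6": (34.0, 320.0),
}


def decap_ratio(switching_nf: float, decoupling_nf: float) -> float:
    """Ratio of de-coupling capacitance to switching capacitance."""
    return decoupling_nf / switching_nf


if __name__ == "__main__":
    for name, (c_sw, c_dec) in CAPS_NF.items():
        print(f"{name}: {decap_ratio(c_sw, c_dec):.1f}x")
```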

EV6 De-coupling Capacitance
Design for ΔIdd = 25 A @ Vdd = 2.2 V, f = 600 MHz
  - 0.32-µF of on-chip de-coupling capacitance was added
    - Under major busses and around major gridded clock drivers
    - Occupies 15-20% of die area
  - 1-µF 2-cm² Wirebond Attached Chip Capacitor (WACC) significantly increases “Near-Chip” de-coupling
    - 160 Vdd/Vss bondwire pairs on the WACC minimize inductance
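
A rough feel for these numbers comes from the charge balance ΔV = ΔI·Δt/C. The sketch assumes the de-coupling capacitance alone supplies the full 25 A step for one 600 MHz cycle. That one-cycle window, the ideal lumped capacitor, and treating the WACC as directly in parallel with the on-chip capacitance (ignoring bondwire inductance) are simplifying assumptions for illustration, not the talk's analysis.

```python
def droop_v(delta_i_a: float, delta_t_s: float, cap_f: float) -> float:
    """Voltage droop when a capacitor alone supplies a current step: dV = dI * dt / C."""
    return delta_i_a * delta_t_s / cap_f


DELTA_I_A = 25.0          # current step from the slide
VDD_V = 2.2               # supply voltage from the slide
CYCLE_S = 1.0 / 600e6     # one clock cycle at 600 MHz (assumed droop window)
ON_CHIP_F = 0.32e-6       # on-chip de-coupling capacitance from the slide
WACC_F = 1.0e-6           # WACC capacitance from the slide

if __name__ == "__main__":
    on_chip = droop_v(DELTA_I_A, CYCLE_S, ON_CHIP_F)
    with_wacc = droop_v(DELTA_I_A, CYCLE_S, ON_CHIP_F + WACC_F)
    print(f"on-chip only: {1000 * on_chip:.0f} mV ({100 * on_chip / VDD_V:.1f}% of Vdd)")
    print(f"with WACC:    {1000 * with_wacc:.0f} mV ({100 * with_wacc / VDD_V:.1f}% of Vdd)")
```

Under these assumptions the on-chip capacitance alone droops by about 130 mV per cycle, and adding the WACC cuts that to roughly 32 mV.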

EV6 WACC
[Figure: package cross-section showing the microprocessor, WACC and heat slug in a 587 IPGA (389 signal and 198 VDD/VSS pins), with 389 signal bondwires, 395 VDD/VSS bondwires and 320 VDD/VSS bondwires.]

Clocking Futures
• Frequencies will continue to scale
• Clock edge rates are not scaling as well
• Multiple clock zones required
  - Architectures minimizing global communications
  - Adaptive and passive synchronization techniques
  - DLLs in the near term
  - Clocking schemes utilizing encoding, extraction, multi-state and local phase optimization to compensate for skew and latency
• Asynchronous or quasi-synchronous architectures

Power Futures
• Low power modes
• Power tradeoffs in the micro-architecture
• More emphasis on low power circuits
  - Reducing clock load
  - Low swing differential clocks
  - Low swing buses
  - Adiabatic circuits, clocked drivers, retractile logic
  - Asynchronous design
• Reference planes also help to minimize inductive and wave effects

Conclusion
• Physical technology advances will enable multi-GHz chips
• Key challenges in power, clocking, complexity, and verification must be addressed
• New tools and methods will be needed
• CAD developers and chip designers must collaborate more closely than ever
• With a solid understanding of the fundamentals, a clear vision of the product and ingenuity, we will realize multi-GHz microprocessors