Stefan Rusu – July 2012 ©2012 Intel Corporation
Microprocessor Design in the Nanoscale Era
Stefan Rusu Senior Principal Engineer
Intel Corporation
IEEE Fellow
Agenda
•Microprocessor Design Trends
•Process Technology Directions
•Active Power Management
• Leakage Reduction Techniques
•Packaging and Thermal Modeling
•Future Directions and Summary
Microprocessor Evolution
              4004 Processor   Westmere-EX Processor
Year          1971             2011
Transistors   2300             2.6 B
Process       10 µm            32 nm
Die area      12 mm²           513 mm²
Die photos not at scale
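As a rough check on the figures above, the transistor counts of the two processors imply about 20 doublings in 40 years, or roughly one doubling every two years. The sketch below uses only the numbers from the comparison; the variable names are illustrative.

```python
import math

# Figures taken from the 4004 vs. Westmere-EX comparison.
transistors_1971 = 2300
transistors_2011 = 2.6e9
years = 2011 - 1971

# Number of times the transistor count doubled over the period.
doublings = math.log2(transistors_2011 / transistors_1971)
years_per_doubling = years / doublings

print(f"{doublings:.1f} doublings, one every {years_per_doubling:.2f} years")
```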
Scaling Trends
Transistor dimensions scale to improve performance, reduce power and reduce cost per transistor
M. Bohr
[Figure: feature size in microns (log scale, 0.01 to 10) vs. year, 1970 to 2020, with the 65nm, 45nm, 32nm and 22nm nodes marked; trend of 0.7x every 2 years]
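The 0.7x-every-2-years trend can be sketched as a simple geometric projection. The starting point (65 nm in 2005) is taken from the node timeline later in the deck; the function name and parameters are illustrative assumptions. Since 0.7² ≈ 0.49, each step also roughly halves the area per transistor.

```python
def projected_feature_nm(start_nm, start_year, year, factor=0.7, period_years=2):
    """Project feature size assuming a fixed shrink factor per period."""
    generations = (year - start_year) / period_years
    return start_nm * factor ** generations


# Starting from 65 nm in 2005, project the next three generations.
for year in (2007, 2009, 2011):
    print(year, round(projected_feature_nm(65, 2005, year), 1))
```

The projected values (about 45.5, 31.9 and 22.3 nm) line up with the 45nm, 32nm and 22nm node names.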
Client Processor Trend: Integrated Graphics
• Ivy Bridge 22nm client processor with monolithic integrated graphics
• Up to 4 dual-threaded cores and 8MB L3 cache
• Dual channel DDR3 memory controller at 1600MT/s
• Integrated PCIe interface (16 Gen3 + 4 Gen2 + 4 DMI lanes)
- First Client CPU to support PCIe Gen3
• Three independent displays
• 1.4B transistors in 160mm2 die
S. Damaraju, ISSCC 2012
Client Processor Trend: Integrated WiFi RF
• Sensitive RF circuits integrated with 32nm ATOM and PCH
• Integration of traditional III-V RF components
- 21dBm power amplifier, 34dBm T/R switch, and 3.5dB NF LNA
H. Lakdawala, ISSCC 2012
[Figure: RF front-end block diagram. On chip: differential PA, balun, filter, TX and RX switches / matching network, Gm stage, TX RF and RX RF paths. On package: 50Ω transmission line to the antenna]
Server Processor Trends: More Cores
Server core count increases every generation while staying within a flat power budget
[Figure: number of cores (0 to 12) for Xeon® EX processors across the 65nm, 45nm and 32nm generations: Tulsa, Tigerton, Dunnington, Nehalem-EX, Westmere-EX]
Server Processor Trends: More Cache
Process                180nm  130nm  90nm  65nm  45nm  32nm
On-Die L3 Cache [MB]   1      4      8     16    24    30
Cache size increases with every process generation
Xeon® EX Processors
Server Processors Power Trends
[Figure: power in watts (log scale, 0.001 to 1000) vs. year, 1990 to 2010, showing leakage, active power and total power curves]
Voltage Scaling Has Slowed Down
[Figure: supply voltage in volts (log scale, 0.1 to 10) vs. year, '91 to '13; an earlier trend of ~0.7X scaling followed by ~0.95X scaling]
30 Years of MOSFET Scaling
                        Dennard 1974   Intel 2005
Gate Length:            1.0 µm         35 nm
Gate Oxide Thickness:   35 nm          1.2 nm
Operating Voltage:      4.0 V          1.2 V
[Figure: transistor cross-sections, scale bar 1 µm]
Classical scaling ended in the early 2000s due to gate oxide leakage limits
M. Bohr, ISSCC 2009
90 nm Strained Silicon Transistors
• NMOS: high-stress film (SiN cap layer), tensile channel strain
• PMOS: SiGe source-drain, compressive channel strain
Strained silicon provided increased drive currents, making up for lack of gate oxide scaling
M. Bohr, ISSCC 2009
45 nm High-k + Metal Gate Transistors
• 65 nm transistor: SiO2 dielectric, polysilicon gate electrode
• 45 nm HK+MG transistor: hafnium-based dielectric, metal gate electrode
High-k + metal gate transistors break through the gate oxide scaling barrier
M. Bohr, ISSCC 2009
HK/MG Gate Leakage Reduction
HK+MG significantly reduces gate leakage
[Figure: gate leakage comparison, with 25x and 1000x reductions marked]
K. Mistry, IEDM 2007
6T SRAM Bit Cell Leakage Reduction
SRAM bit cell leakage reduced ~10x
[Figure: normalized 6T SRAM cell leakage (0 to 12) at 65nm vs. 45nm, split into I_GATE, I_OFF and I_JUNCT components, at 1.0V, 25C; 10x reduction marked]
M. Bohr, ISSCC 2009
Traditional Planar Transistor
Traditional 2-D planar transistors form a conducting channel in the silicon region under the gate electrode when in the “on” state
[Figure labels: silicon substrate, oxide, gate, source, drain, high-k dielectric]
M. Bohr, 2011
22 nm Tri-Gate Transistor
3-D Tri-Gate transistors form conducting channels on three sides of a vertical fin structure, providing “fully depleted” operation
[Figure labels: silicon substrate, oxide, source, drain, gate]
Transistor Scaling Trends
[Images: 32 nm planar transistors (gates) next to 22 nm Tri-Gate transistors (gates and fins)]
M. Bohr, 2011
Transistor Gate Delay
[Figure: normalized transistor gate delay (0.6 to 2.2) vs. operating voltage (0.5 to 1.1 V) for 45nm planar and 32nm planar; 22% faster marked at both ends of the voltage range]
32nm planar transistors are 22% faster than 45nm planar
D. Perlmutter, ISSCC 2012
Transistor Gate Delay
[Figure: normalized transistor gate delay (0.6 to 2.2) vs. operating voltage (0.5 to 1.1 V) for 45nm planar, 32nm planar and 22nm planar; 22% faster marked for 32nm vs. 45nm and 14% faster for 22nm planar vs. 32nm]
22nm planar transistors would have been only 14% faster
Transistor Gate Delay
[Figure: normalized transistor gate delay (0.6 to 2.2) vs. operating voltage (0.5 to 1.1 V) for 32nm planar and 22nm Tri-Gate; 18% faster and 37% faster marked]
22nm Tri-Gate transistors provide improved performance at high voltage and an unprecedented 37% speedup at low voltage
Intel Transistor Leadership
Year   Node    Milestone
2003   90 nm   Invented SiGe strained silicon
2005   65 nm   2nd gen. SiGe strained silicon
2007   45 nm   Invented gate-last high-k metal gate
2009   32 nm   2nd gen. gate-last high-k metal gate
2011   22 nm   First to implement Tri-Gate
Lithography Challenges
[Figure: feature size and lithography wavelength in nm (log scale, 10 to 1000) vs. year of initial production, ’89 to ’15; wavelengths 248nm, 193nm and 13nm (EUVL); the gap between feature size and wavelength is marked]
193 nm enhancements enable the 22 nm generation
Extreme Ultraviolet Lithography
•EUV lithography uses extremely short wavelength light
-Visible light – 400 to 700 nm
-DUV lithography – 193 and 248 nm
-EUV lithography – 13 nm
[Images: world’s first EUV mask; EUV micro exposure tool]
Layout Restrictions
• 65 nm layout style: bi-directional features, varied gate dimensions, varied pitches
• 32 nm layout style: uni-directional features, uniform gate dimension, gridded layout
M. Bohr, ISSCC 2009
450mm in the Era of Complex Scaling: Must coordinate demand drivers,
technical requirements and resources
• End-user demand drivers
• Integrated IC maker coordination
• Equipment & materials development
• University and government support
[Image: projected 2000 wafer, circa 1975 (Gordon Moore, ISSCC ’03)]
Process Variations
• Die-to-die variations
• Within-die variations: systematic and random
• Sources shown: resist thickness, lens aberrations, random placement of dopant atoms
Voltage and Temperature Variations
• Voltage
-Chip activity change
-Current delivery—RLC
-Dynamic: ns to 10-100µs
-Within-die variation
• Temperature
-Activity & ambient change
-Dynamic: 100-1000µs
-Within-die variation
Impact on Design Methodology
[Figures: path delay probability distribution due to variations in Vdd, Vt and temperature; number of paths vs. delay target for deterministic and probabilistic design; frequency vs. leakage power for deterministic and probabilistic design, with 10X variation and ~50% total power marked]
Major paradigm shift from deterministic design to probabilistic / statistical design
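The difference between the two views can be sketched with a small Monte Carlo experiment. A deterministic check asks only whether the nominal path delay meets the target; a probabilistic check asks what fraction of sampled paths meets it. The Gaussian per-stage variation, the stage count and the sigma value below are illustrative assumptions, not data from the presentation.

```python
import random


def sample_path_delay(rng, nominal=1.0, sigma=0.05, stages=10):
    """Sum per-stage delays, each with independent Gaussian variation (assumed model)."""
    return sum(rng.gauss(nominal / stages, sigma / stages) for _ in range(stages))


def timing_yield(target, samples=10000, seed=0):
    """Fraction of sampled paths whose delay meets the target."""
    rng = random.Random(seed)
    hits = sum(sample_path_delay(rng) <= target for _ in range(samples))
    return hits / samples


# Deterministic view: the nominal delay of 1.0 "meets" a target of 1.0.
# Probabilistic view: only about half of the sampled paths actually do.
print(timing_yield(1.0), timing_yield(1.05))
```

Under these assumptions a 5% guard band on the target raises the fraction of passing paths from about one half to nearly all of them.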
SRAM Cell Size Scaling
Memory density continues to double every 2 years
SRAM cell area: 45 nm, 0.346 µm²; 32 nm, 0.171 µm²; 22 nm, 0.092 µm²
[Figure: SRAM cell area in µm² (log scale, 0.01 to 10) vs. process technology, 180 nm to 22 nm; trend of 0.5x every 2 years]
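The three published cell areas can be checked against the 0.5x-per-generation trend directly. The sketch below only divides consecutive values from the slide; the dictionary and key names are illustrative.

```python
# SRAM cell area in square microns, keyed by process node in nm.
cell_area_um2 = {45: 0.346, 32: 0.171, 22: 0.092}

# Walk the nodes from largest to smallest and compute each shrink ratio.
nodes = sorted(cell_area_um2, reverse=True)
ratios = {
    f"{a}->{b}": cell_area_um2[b] / cell_area_um2[a]
    for a, b in zip(nodes, nodes[1:])
}

for step, ratio in ratios.items():
    print(step, round(ratio, 3))
```

Both steps come out close to 0.5x (about 0.49 and 0.54).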
Interconnect Trends
[Figure: number of metal layers (0 to 10) vs. technology generation, 500 nm to 22 nm, with the transition from Al to Cu interconnects marked]
22nm Interconnects
• M1 to M8 cross-section
• M1-M6 use ultra-low-k ILD and self-aligned vias providing 13-18% capacitance reduction
• Cross-section of integrated MIM capacitor
• Enables capacitance density of >20fF/µm²
C. Auth, VLSI Symposium 2012
On-chip Interconnect Trend
• Local interconnects scale with gate delay
• Global interconnects do not keep up with scaling
[Figure: relative delay (log scale, 0.1 to 100) vs. feature size, 250 nm to 32 nm, for gate delay (FO4), local interconnect (M1,2), global interconnect with repeaters and global interconnect without repeaters. Source: ITRS, 2001]
Voltage and Frequency Scaling
1 - Fixed VDD, Frequency Scaling: Linear Power Reduction
2 - Fixed Frequency, VDD Scaling: Square Power Reduction
3 - Voltage and Frequency scaling: Cubic Power Reduction
[Figure: frequency vs. VDD showing the three scaling paths (1, 2, 3) between VLOW and VHI. Marked: VT, the sub-threshold logic region, the data retention limit, the performance limit, the reliability limit, VDD +/-10%, max target frequency and required frequency]
J. Rosal, ISSCC 2006
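The linear, square and cubic reductions follow from the usual first-order model of dynamic power, P = C·V²·f. That formula is not stated on the slide and is assumed here; the 0.8 scale factor is an arbitrary illustration.

```python
def dynamic_power(v, f, c=1.0):
    """First-order dynamic power model P = C * V^2 * f (assumed, normalized units)."""
    return c * v ** 2 * f


base = dynamic_power(1.0, 1.0)
s = 0.8  # hypothetical scale factor applied to frequency and/or voltage

freq_only = dynamic_power(1.0, s) / base  # case 1: linear in s
volt_only = dynamic_power(s, 1.0) / base  # case 2: quadratic in s
both = dynamic_power(s, s) / base         # case 3: cubic in s

print(round(freq_only, 3), round(volt_only, 3), round(both, 3))
```

With s = 0.8 the three cases give 0.8, 0.64 and 0.512 of the original power.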
Memory and RF Vmin Reduction
• Write Assist circuit temporarily drops the array supply node to make it easier to write into the bit-cell
• Both Cache and Register Files use this technique to improve write Vmin in 22nm Ivy Bridge processor
[Figure: write-assist circuit. Labels: WL (0 --> 1), BL, BL#, write data ‘0’, write data# ‘1’, Data (1 --> 0), Data# (0 --> 1), CVCC shared across several cells]
22nm transistor and circuit improvements enable Vmin reduction of >100mV for Cache and 60mV for RF
S. Damaraju, ISSCC 2012
NTV Pentium® Processor
[Figure: energy efficiency vs. voltage (zero to max), peaking in the NTV region; ~5x demonstrated]
[Die photo: IA-32 core logic, scan, ROM, L1$-I, L1$-D, level shifters + clk spine; 1.1 mm x 1.8 mm]

             Subthreshold       NTV                Normal operating range
             Ultra-low Power    Energy Efficient   High Performance
Voltage      280 mV             0.45 V             1.2 V
Frequency    3 MHz              60 MHz             915 MHz
Power        2 mW               10 mW              737 mW
Efficiency   1500 Mips/W        5830 Mips/W        1240 Mips/W

S. Jain, ISSCC 2012
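The "~5x" figure can be reproduced from the table by dividing the efficiency at the NTV point by the efficiency at the high-performance point. The sketch below also computes MHz per watt at each point for comparison with the published Mips/W; the dictionary layout and names are illustrative.

```python
# Operating points from the table: (frequency in MHz, power in mW, Mips/W).
points = {
    "subthreshold": (3, 2, 1500),
    "ntv": (60, 10, 5830),
    "high_performance": (915, 737, 1240),
}


def mhz_per_watt(name):
    """Clock frequency delivered per watt at the named operating point."""
    freq_mhz, power_mw, _ = points[name]
    return freq_mhz / (power_mw / 1000)


# Energy-efficiency gain of near-threshold over high-performance operation.
gain = points["ntv"][2] / points["high_performance"][2]

print(round(gain, 2), [round(mhz_per_watt(n)) for n in points])
```

The ratio comes out at about 4.7, consistent with the "~5x demonstrated" label.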
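Clock Gating
[Figure: two register implementations. Left: register with an input multiplexer (0/1, select S) controlled by En. Right: register whose clock is gated by En]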
• Save power by gating the clock when data activity is low
• Requires detailed logic validation
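A toy behavioral model shows why gating saves power without changing results: the register holds the same values either way, but the gated version receives far fewer clock edges when the enable is mostly low. The class, the stimulus and the edge counter are illustrative assumptions, and clock edges are used here only as a proxy for clock power.

```python
class Register:
    """Toy register model that counts the clock edges reaching the flop."""

    def __init__(self, gated):
        self.gated = gated
        self.q = 0
        self.clock_edges = 0

    def tick(self, d, enable):
        if self.gated and not enable:
            return self.q          # gated clock: no edge reaches the flop
        self.clock_edges += 1      # the flop is clocked this cycle
        if enable:
            self.q = d             # load new data
        return self.q              # otherwise the mux recirculates q


def run(gated, stimulus):
    reg = Register(gated)
    outputs = [reg.tick(d, en) for d, en in stimulus]
    return outputs, reg.clock_edges


# Hypothetical stimulus: (data, enable) per cycle, with low data activity.
stimulus = [(5, 1), (7, 0), (7, 0), (9, 1), (2, 0), (2, 0), (2, 0), (4, 1)]
mux_out, mux_edges = run(False, stimulus)
gated_out, gated_edges = run(True, stimulus)

print(mux_out == gated_out, mux_edges, gated_edges)
```

The equality check stands in for the logic validation the slide calls for: gating is only safe if the gated design is functionally identical to the ungated one.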
Core Power Management
Modulating the processor core voltage and frequency enables lower power states
Gerosa, A-SSCC 2008
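[Table: core voltage, core clock, PLL, L1 caches, L2 caches, wakeup time and power for the states C0 HFM, C0 LFM, C1/C2, C4 and C6. Recoverable entries: core clock OFF in C1/C2, C4 and C6; PLL OFF in C4 and C6; wakeup time <1us, <30us and <100us for C1/C2, C4 and C6; cache entries range from active through partial flush and flushed to off]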
Multiple Voltage Domains
Multiple voltage domains minimize power consumption across the core and uncore areas
[Figure: die floorplan with SMI and QPI interfaces, fuse block, four core supplies and the uncore supply]
• Core Domain: 0.85-1.1V, variable
• Un-Core Domain: 0.9-1.1V, fixed
• I/O Domain: 1.1V, fixed
Rusu, ISSCC 2009
Multiple Clock Domains
[Figure: die floorplan with SMI and QPI interfaces, showing BCLK, the filter PLL, core PLLs, un-core PLL, IO PLLs and IO DLLs]
Three primary clock domains: core, un-core, I/O
Total of 16 PLLs and 8 DLLs
Rusu, ISSCC 2009
Subthreshold Leakage Trend
[Figure: I_off in A/µm (log scale, 1E-14 to 1E-04) vs. physical gate length (10 to 1000 nm); research data and production data from the literature, with Intel 15nm, 20nm and 30nm transistors marked]
Leakage Reduction Techniques
• Stack effect (equal loading)
• Body bias (Vbp and Vbn, +Ve / -Ve, around Vdd)
• Sleep transistor (in series with the logic block)
[Figure annotations: 2-3X reduction, 2-1000X reduction]
Leakage is a Strong Function of Voltage
Both sub-threshold and gate leakage fall as the supply voltage is lowered
[Figure: normalized leakage (0 to 100) vs. voltage (0 to 1.5 V) for a 130nm process, with subthreshold leakage and gate leakage curves]
Cache Sleep and Shut-off Modes
[Figure: sub-array voltage (0V to 1.1V) in active, sleep and shut-off modes; virtual VSS marked at 0V, 250mV and 520mV, with 2x lower leakage annotated at each step. Circuit: sub-array on virtual VSS with block select, sleep bias and shut-off controls]
S. Rusu, US Pat. 7,657,767
Leakage Shut-off Infrared Images
Part        Cache state                  Leakage reduction
16MB part   16MB in sleep mode           -
8MB part    8MB sleep, 8MB shut-off      3W
4MB part    4MB sleep, 12MB shut-off     5W
Cache Dynamic Shut-off
Normal Operation
In the full-load state, all 16 ways are enabled (green)
Cache-by-Demand Operation
Under idle or low-load states, cache ways are dynamically flushed out and put in shut-off mode (red)
[Figure: tag and data arrays for ways 15 to 0 with the cache controller, shown for both operating modes]
Cache Leakage Management
Three PMOS sleep transistor groups for sub-array leakage reduction
Y. Wang, ISSCC 2009
Cache Leakage Reduction Benefit
Leakage management circuit reduces sub-array leakage by 58%
Leakage Mitigation: Long-Le Transistors
• All transistors can be either nominal or long-Le
• Most library cells are available in both flavors
• Long-Le transistors are ~10% slower, but have 3x lower leakage
• All paths with timing slack use long-Le transistors
• Initial design uses only long channel devices
[Figure: nominal-Le vs. long-Le (Nom+10%) transistor]
S. Rusu, ISSCC 2006
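The slack-based choice can be sketched as follows: a path keeps long-Le devices if it still meets the delay target when slowed by about 10%, and falls back to nominal devices otherwise. Working at whole-path granularity is a simplification (a real flow would swap individual cells), and the path names, delays and target are hypothetical.

```python
def assign_devices(path_delays, target, slowdown=0.10):
    """Use long-Le on every path that still meets the target when ~10% slower."""
    return {
        name: "long-Le" if delay * (1 + slowdown) <= target else "nominal"
        for name, delay in path_delays.items()
    }


def relative_leakage(assignment, reduction=3.0):
    """Average leakage relative to an all-nominal design (long-Le leaks 3x less)."""
    per_path = [
        1 / reduction if kind == "long-Le" else 1.0
        for kind in assignment.values()
    ]
    return sum(per_path) / len(per_path)


# Hypothetical nominal path delays, normalized to a cycle-time target of 1.0.
paths = {"a": 0.80, "b": 0.95, "c": 0.60}
assignment = assign_devices(paths, target=1.0)
leakage = relative_leakage(assignment)

print(assignment, round(leakage, 3))
```

In this example only the near-critical path "b" stays nominal, and leakage drops to about 56% of the all-nominal case.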
Long-Le Transistors Usage Map
[Figure: Nehalem-EX die map of long-channel device usage in percent, binned 50-60, 60-70, 70-80, 80-90 and 90-100; SMI and QPI interfaces labeled]
Heavy long-channel device usage in the uncore reduces leakage
Power & Leakage Breakdown
Nehalem-EX 45nm example
[Pie charts: power breakdown and leakage breakdown]
• By supply: Vcore 54.6%, Vuncore 33.4%, Vio 11.2%, Vpll 0.8%
• Active 84%, leakage 16%
Reduction techniques:
• Clock gating
• Run uncore at 0.9V
• Long channel device usage: 58% cores, 85% uncore
S. Rusu, ISSCC 2009
Core and Cache Recovery Example
Defective core and cache slices can be disabled in horizontal pairs
[Figure: die floorplan with QPI0 to QPI3, SMI and the system interface; Core0, Core1, Core2, Core5, Core6 and Core7 are labeled and four blocks are marked Disabled]
S. Rusu, ISSCC 2009
Minimize Leakage in Disabled Blocks
• Disabled cores ► Power gated
• Disabled cache slices ► All major arrays in shut-off
[Figure: SRAM array virtual VCC (0V to 0.9V) in active, sleep and shut-off modes, annotated with 35% and 83% leakage reduction; core virtual VCC (0V to 0.85V) in active and shut-off modes, annotated with 40x leakage reduction]
S. Rusu, ISSCC 2009
Core/Cache Recovery – Infrared Image
All cores and cache slices are enabled
S. Rusu, ISSCC 2009
Core/Cache Recovery – Infrared Image
Shut-off 2 cores (top row) and 2 cache slices (bottom row)
Disabled blocks are clock and power gated
S. Rusu, ISSCC 2009
Microprocessor Package Evolution
•1971 – 4004 Processor
- 16-pin ceramic package
- Wire bond attach
- 750 kHz I/O
•2012 – Xeon® E5 Processor
- 2011-contact organic package
- Flip-chip attach
- 8.0 GHz I/O
Power Density Models
With increasing power density and large on-die caches, detailed, non-uniform power models are required
[Figure: power map (heat flux, 0 to 250 W/cm²) and on-die temperature map (40 to 110 C)]
Thermal Modeling
Simulated power density Infrared emission microscope measurement
D. Genossar and N. Shamir “Intel® Pentium® M Processor Power Estimation,
Budgeting, Optimization and Validation”, Intel Technology Journal 5/2003
Thermal Sensors
• Multiple temperature sensors
-One in each core hot spot
-One in the die center
• Temperature information is available through PECI bus for system fan management
Power Management Unit
[Figure: power management unit with external voltage regulator control and power gates control; each of Core 0 to Core 7 has its own power gate and sensors]
PMU controls processor voltage and frequency based on compute loading and thermal data
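A minimal sketch of the kind of decision such a unit makes, assuming a small table of operating points and a single temperature limit. The operating points, the limit and the function name are hypothetical and are not taken from the presentation or from any Intel design.

```python
# Hypothetical operating points: (frequency ratio, voltage ratio), fastest first.
OPERATING_POINTS = [(1.0, 1.0), (0.75, 0.85), (0.5, 0.7)]


def select_operating_point(load, temp_c, temp_limit_c=90.0):
    """Pick the slowest point that covers the load; throttle when too hot."""
    if temp_c >= temp_limit_c:
        return OPERATING_POINTS[-1]          # thermal data overrides the load
    for freq, volt in reversed(OPERATING_POINTS):
        if freq >= load:
            return freq, volt                # slowest point that still covers the load
    return OPERATING_POINTS[0]               # load exceeds every point: run flat out


print(select_operating_point(0.3, 60.0))
print(select_operating_point(0.9, 60.0))
print(select_operating_point(0.9, 95.0))
```

The load is expressed as the fraction of full-speed throughput required, so a lightly loaded core drops to the lowest voltage and frequency while a hot core is throttled regardless of demand.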
Future Directions
• 2D mesh network with multiple Voltage / Frequency islands
• Communication across islands achieved through FIFOs
Ogras (CMU), DAC 2007
Fine Grain Power Management
25-core processor example:
• Cores with critical tasks: Freq = f, at Vdd; TPT = 1, Power = 1
• Non-critical cores: Freq = f/2, at 0.7*Vdd; TPT = 0.5, Power = 0.25
• Temporarily shut down: TPT = 0, Power = 0
• Permanently disabled: TPT = 0, Power = 0
[Figure: grid of 25 cores, each labeled f, f/2 or 0; voltage levels VDD (Hi-Act, Pwr=1), 0.7*VDD (Lo-Act, Pwr=¼) and 0V (Shut-off, Pwr=0)]
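The Power = 0.25 entry for non-critical cores follows from the first-order dynamic power model P ∝ f·V²: 0.5 × 0.7² ≈ 0.245. The sketch below applies that model to a 25-core chip. The model is an assumption (it is not stated on the slide), and the split of cores between states is a hypothetical mix, not read from the figure.

```python
def core_power(freq_ratio, vdd_ratio):
    """Normalized dynamic power, assuming P is proportional to f * V^2."""
    return freq_ratio * vdd_ratio ** 2


# Per-state (frequency ratio, voltage ratio) from the bullet list above.
states = {"critical": (1.0, 1.0), "non_critical": (0.5, 0.7), "off": (0.0, 0.0)}

# Hypothetical mix of core states on a 25-core chip.
counts = {"critical": 7, "non_critical": 8, "off": 10}

# Throughput scales with frequency; power follows the model above.
throughput = sum(counts[s] * states[s][0] for s in counts)
power = sum(counts[s] * core_power(*states[s]) for s in counts)

print(round(core_power(0.5, 0.7), 3), throughput, round(power, 2))
```

With this mix the chip delivers 11 units of throughput for about 9 units of power, compared with 25 and 25 if every core ran at full speed.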
Summary
• Moore’s Law has fueled the worldwide technology revolution for over 40 years and will continue for at least another decade
-0.7x transistor dimension scaling every two years
-Tri-gate devices provide significant benefits
• Continued microprocessor performance improvement depends on our ability to manage active power and leakage
-Clock and power gate unused or disabled blocks
-Multiple voltage and clock domains
-Dynamic voltage and frequency adjustment
• Core and cache recovery enables multiple product options
-Disabled cores and cache slices are clock and power gated