ELEN 468 Lecture 29 1
ELEN 468 Advanced Logic Design
Lecture 29: Low Power Design
Power Dissipation
[Figure: power (Watts, log scale from 0.1 to 100) versus year (1971 to 2000) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® processor, and P6.]
Power increases despite Vdd decrease
Courtesy, Intel
Power Density
[Figure: power density (W/cm², log scale from 1 to 10000) versus year (1970 to 2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® processor, and P6, with reference levels marked for a hot plate, a nuclear reactor, and a rocket nozzle.]
Courtesy, Intel
Why Power Increased
Growing die size, fast frequency scaling
[Figure: clock frequency (MHz, log scale from 10 to 10000) versus year (1985 to 2005).]
Gate Power Dissipation
• Leakage power
• Dynamic power
• Short circuit power
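Dynamic Power
• Occurs at each switching
• Pd = CL · Vdd² · fp
• fp: switching frequency
[Figure: two inverter circuits with supply Vdd and output node out; the conducting transistor's operating regions are labeled Saturation and Linear.]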
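As a rough numerical illustration of the dynamic power expression Pd = CL · Vdd² · fp, the sketch below compares two supply voltages at the same switching frequency. The function name and the load capacitance, voltage, and frequency values are assumptions chosen for illustration, not figures from the lecture.

```python
def dynamic_power(c_load_farads, vdd_volts, f_switch_hz):
    """Dynamic power Pd = CL * Vdd^2 * fp, in watts."""
    return c_load_farads * vdd_volts ** 2 * f_switch_hz


# Illustrative (assumed) values: 1 pF switched load, 400 MHz switching frequency.
p_high = dynamic_power(1e-12, 1.5, 400e6)
p_low = dynamic_power(1e-12, 1.2, 400e6)

# Because Pd scales with Vdd squared, the ratio is (1.2 / 1.5)^2.
print(f"Pd at 1.5 V: {p_high * 1e3:.3f} mW")
print(f"Pd at 1.2 V: {p_low * 1e3:.3f} mW")
print(f"ratio: {p_low / p_high:.2f}")
```

The quadratic dependence on Vdd is why lowering the supply voltage is usually the most effective lever on dynamic power.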
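Leakage Power
• Static
• Leakage current = a · Vdd
• Leakage current = b / Vt
• Killer to CMOS technology
[Figure: two inverter circuits with supply Vdd and output node out, operating regions labeled Saturation and Linear, with the leakage path marked through the off transistor in each.]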
Short Circuit Power
• During switching, there is a short moment when both the PMOS and NMOS transistors are partially on
• Ps = Q · (Vdd − Vt)³ · tr · fp
• tr: rise time
[Figure: two inverter circuits with supply Vdd and output node out, one with the input rising and one with the input falling.]
Where Does Power Go?
[Figure: two stacked-bar charts of power percentages (0% to 100%) split into core transistor leakage, gate leakage, cache leakage, and active power.]
• Scalable X86 CPU Design for 90nm: low VT devices are <1% of total non-memory transistor width [J. Schultz and C. Webb, ISSCC 2004]
• Total chip power based on ITRS roadmap: in 2004, we are just breaking even [Kim, et al, Computer 2003]
Energy – Performance Space
Every design is a point on a 2-D plane
[Figure: plane with Performance on the horizontal axis and Energy on the vertical axis.]
Low Power Design
Reduce dynamic power
• α (activity factor): clock gating, sleep mode
• C: small transistors (esp. on clock), short wires
• VDD: lowest suitable voltage
• f: lowest suitable frequency
Reduce static power
• Selectively use low Vt devices
• Power gating, MTCMOS
• Stacked devices
• Body bias
Clock Gating
Gate off clock to idle functional units
• e.g., floating point units
• need logic to generate disable signal
  • increases complexity of control logic
  • consumes power
  • timing critical to avoid clock glitches at OR gate output
• additional gate delay on clock signal
  • gating OR gate can replace a buffer in the clock distribution tree
[Figure: a register (Reg) feeding a functional unit, with the register clock produced by gating clock with disable.]
Active Power Reduction - Supply Voltage Reduction
Static
Pros:
• Always active in saving
Cons:
• Additional power delivery network
• Needs special care of interface between power domains
• Signals close to Vt: excessive leakage and reduced noise margins
Dynamic
Adjusting operation voltage and frequency to performance requirements:
• High performance: high Vdd & frequency
• Power saving: low Vdd & frequency
Pros:
• Doesn't limit performance
Cons:
• Penalty of transition between different power states can be high (in performance and power)
• Additional control logic
[Figure: blocks labeled Slow, Fast, and Slow, supplied from a high supply voltage and a low supply voltage.]
Voltage Islands (Multi-Vdd)
• Allow both macro and cell voltage assignment
• Allow different voltage islands in the same circuit row
• Lift unnatural layout restrictions
• Minimal placement disturbance
[Figure: Vddh and Vddl layout styles compared: Lackey+ ICCAD'02, Usami+ JSSC'98, GVI DAC'03.]
Level Converter
Interface circuit needed when a Vddl gate drives a Vddh gate, to avoid leakage
[Figure: a VddL gate driving a VddH gate leaves a transistor weakly on. Two circuits are shown: the conventional dual supply level converter (Vddh, Vddl, IN, OUT) and the new single supply level converter (Vddh, IN, OUT).]
Adjacency Metrics for Clustering
Logic adjacency metric (LAM): Vddl fanin cone of level shifter without going through Vddh
[Figure: two views of Vddh and Vddl cells with level converters, the first with LC1, LC2, and LC3 and the second with LC2 and LC3.]
Physical adjacency metric (PAM): for each candidate Vddl cell, compute total size of its neighbor Vddl cells
• LAM to guide logic aware voltage assignment
• PAM to guide placement aware voltage re-assignment
Level Converter Optimizations
Logic replacement (or gate sizing)
[Figure: two versions of a circuit with inputs A and B, decoders (DEC), multiplexers (ZMUX1, ZMUX2), and level converters (LC).]
LC/Buffer co-optimization
Placement to Form Voltage Islands with Power Grid Co-design
• Based on Vddl and Vddh cell placement after voltage assignment, define Vddl/Vddh power grids on demand
• Detailed placement to form Vddl/Vddh voltage islands that can hit their corresponding power supplies
[Figure: power grids on demand, with alternating Vddl and Vddh rails.]
Example of Voltage Islands
Vddl = 1.2V
Vddh = 1.5V
No timing degradation, no area increase!
- IBM Cu11
- 0.13um
- 400 MHz
(courtesy IBM)
Dynamic Frequency and Voltage Scaling
Always run at the lowest supply voltage that meets the timing constraints
• DFS (dynamic frequency scaling) saves only power
• DVS (dynamic voltage scaling) + DFS saves both energy and power
A DVS+DFS system requires the following
• A programmable clock generator (PLL)
  • PLL from 200MHz to 700MHz in increments of 33MHz
• A supply regulation loop that sets the minimum VDD necessary for operation at the desired frequency
  • 32 levels of VDD from 1.1V to 1.6V
• An operating system that sets the required frequency + supply voltage to meet the task completion deadlines
  • heavier load: ramp up VDD, when stable speed up clock
  • lighter load: slow down clock, when PLL locks onto new rate, ramp down VDD
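The claim that DFS saves only power while DVS + DFS saves both energy and power can be checked with a small calculation. The sketch below assumes a task with a fixed cycle count and the simple model P = C · Vdd² · f; the capacitance, the cycle count, and the pairing of 1.1V with 200MHz are illustrative assumptions, not values stated in the lecture.

```python
def power(c_farads, vdd_volts, f_hz):
    """Dynamic power P = C * Vdd^2 * f, in watts."""
    return c_farads * vdd_volts ** 2 * f_hz


def task_energy(c_farads, vdd_volts, f_hz, cycles):
    """Energy for a fixed-cycle task: power multiplied by runtime (cycles / f)."""
    runtime_s = cycles / f_hz
    return power(c_farads, vdd_volts, f_hz) * runtime_s


C = 1e-9        # assumed switched capacitance per cycle (F)
CYCLES = 7e8    # assumed task length in clock cycles

p_base, e_base = power(C, 1.6, 700e6), task_energy(C, 1.6, 700e6, CYCLES)
p_dfs, e_dfs = power(C, 1.6, 200e6), task_energy(C, 1.6, 200e6, CYCLES)
p_dvfs, e_dvfs = power(C, 1.1, 200e6), task_energy(C, 1.1, 200e6, CYCLES)

# DFS alone lowers power, but the task runs longer, so energy is unchanged.
# Lowering Vdd as well reduces the energy per cycle.
for name, p, e in [("baseline", p_base, e_base),
                   ("DFS only", p_dfs, e_dfs),
                   ("DVS+DFS", p_dvfs, e_dvfs)]:
    print(f"{name:9s} power = {p:.3f} W, energy = {e:.3f} J")
```

In this model the frequency cancels out of the energy expression, which is why only a voltage reduction changes the energy per task.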
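Leakage Reduction Techniques
[Figure: four leakage reduction techniques.]
• Stack effect: stacked devices of widths Wu and Wl below the pullup (Vdd), with intermediate node Vx
• Dual Vt partitioning: high Vt devices and low Vt devices
• Variable threshold (VTCMOS): Vnwell ≥ Vdd, Vpwell ≤ 0
• Multi-threshold (MTCMOS): low Vt logic between virtual Vdd and virtual Gnd, connected to Vdd and ground through HVT sleep transistors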
Natural Transistor Stacks
• Reduce the leakage by stacking the devices
How?
• Reduced Vds
• Negative Vgs
• Negative Vbs
Design with Dual Vth
Dual Vth design
• Two flavors of transistors: slow (high Vth) and fast (low Vth)
• Low Vth devices are faster, but have ≈10X leakage
Dual Vth evaluation
Impacts of Variable VT
Reducing the VT increases the sub-threshold leakage current (exponentially)
$V_T = V_{T0} + \gamma\left(\sqrt{\left|-2\phi_F + V_{SB}\right|} - \sqrt{\left|-2\phi_F\right|}\right)$
where VT0 is the threshold voltage at VSB = 0, VSB is the source-bulk (substrate) voltage, γ is the body-effect coefficient, and φF is the Fermi potential
But, reducing VT decreases gate delay (increases performance)
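To get a feel for the body-effect relation, the sketch below evaluates VT = VT0 + γ(√|−2φF + VSB| − √|−2φF|) for an NMOS device. All parameter values (VT0 = 0.45 V, γ = 0.4 V^0.5, φF = −0.3 V) are assumptions for illustration, and the reverse bias is passed as a positive source-to-bulk magnitude, which is an assumption about the sign convention.

```python
import math


def threshold_voltage(vt0, gamma, phi_f, v_sb):
    """Body-effect threshold: VT0 + gamma * (sqrt(|-2*phi_f + v_sb|) - sqrt(|-2*phi_f|))."""
    return vt0 + gamma * (math.sqrt(abs(-2.0 * phi_f + v_sb))
                          - math.sqrt(abs(-2.0 * phi_f)))


# Assumed NMOS parameters (illustrative only).
VT0, GAMMA, PHI_F = 0.45, 0.4, -0.3

# Sweep the reverse-bias magnitude from 0 V to 2.5 V.
for v_sb in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    vt = threshold_voltage(VT0, GAMMA, PHI_F, v_sb)
    print(f"reverse bias {v_sb:.1f} V -> VT = {vt:.3f} V")
```

With these assumed parameters the threshold rises monotonically with reverse bias, which is the behavior that reverse body biasing relies on to cut standby leakage.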
Variable VT through Body Bias
[Figure: VT (V), from 0.4 to 0.9, plotted against VSB (V), from −2.5 to 0.]
• For NMOS, the substrate is normally tied to ground (VSB = 0)
• A negative bias on VSB causes VT to increase
• Adjusting the substrate bias at runtime is called adaptive body-biasing (ABB) or dynamic threshold scaling (DTS)
• Requires a triple well fab process
[Figure: circuit with body bias terminals VSB,p and VSB,n.]
Forward/Reverse Body Biasing
RBB (Reverse Body Bias): zero body bias in active mode, a deep reverse bias in standby mode.
Disadvantages:
• Increases PN junction reverse leakage
• Technology scaling worsens short channel effects and weakens the Vth modulation capability
FBB (Forward Body Bias): high Vth in standby mode, forward body biasing to achieve better current drive in active mode.
Disadvantages:
• Larger junction capacitance
• High body effect for stacked devices
Implementation of Dynamic Vth Scaling (DTS)
• The lowest Vth is delivered (NBB, no body bias) if the highest performance is required.
• When the performance demand is low, clock frequency is lowered and Vth is raised via RBB to reduce the run time leakage power dissipation.
How?
• When the critical path replica frequency is less than the reference CLK, adjust bias to decrease Vth.
• Otherwise adjust bias to increase Vth.
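The feedback rule above (compare a critical path replica against the reference clock, then nudge the body bias) can be sketched as one controller step. This is a minimal illustration, not the circuit from the lecture: the function name, the step size, and the bias limit are hypothetical, and the bias is modeled as a reverse-bias magnitude where 0 V means no body bias (lowest Vth).

```python
def adjust_body_bias(replica_freq_hz, ref_clk_hz, reverse_bias_v,
                     step_v=0.05, max_reverse_bias_v=1.0):
    """Return the next reverse body bias magnitude (hypothetical controller step).

    A larger reverse bias means a higher Vth (less leakage, slower gates).
    """
    if replica_freq_hz < ref_clk_hz:
        # Replica is too slow: lower Vth by backing off the reverse bias.
        return max(0.0, reverse_bias_v - step_v)
    # Replica meets the clock: raise Vth to save leakage.
    return min(max_reverse_bias_v, reverse_bias_v + step_v)


# Example: the replica runs at 90 MHz against a 100 MHz reference,
# so the reverse bias is reduced by one step.
bias = adjust_body_bias(90e6, 100e6, reverse_bias_v=0.5)
print(f"next reverse bias: {bias:.2f} V")
```

Clamping at 0 V corresponds to the NBB case, where the lowest Vth is delivered for the highest performance.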
Results:
Power Gating Using Sleep Transistors
Leakage can be reduced by gating the supply rails when the circuit is in sleep mode
• in normal mode, sleep = 0 and the sleep transistors must present as small a resistance as possible (via sizing)
• in sleep mode, sleep = 1 and the transistor stack effect reduces leakage by orders of magnitude
Alternatively, leakage can be eliminated by switching off the power supply (but the memory state is lost)
Example of Power Gating
[Figure: layout with embedded power switches, rows of standard cells, and power switch control signals.]
• Can reduce power 1000X
• Smaller voltage swing (IR drop on sleep transistors)
• Lower performance
• Increased noise coupling
• Local power grid design
Power Dissipation on Variation Tolerance
Conventional variation tolerance
• Using large timing safety margin
• Implies aggressive timing target
• Greater power dissipation
Observation
• Near-worst-case variations occur rarely
• Safety margin is applied continuously to guard against the small chance of variations
• Poor power efficiency
Question..
• Can we deal with errors instead of preventing them from occurring by conservative binning/clocking?
• How far can we speed up the circuit while keeping the error rate in a manageable range?
Fault tolerant system
Begin with reference values
Introduce redundancy
• Hardware: Triple Modular Redundancy
• Time: Repeated process
• Information: Code
• Software: various algorithms
What about delay faults? How do we detect (and maybe correct) errors?
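As a concrete instance of the hardware redundancy option, Triple Modular Redundancy runs three copies of a unit and takes a bitwise majority vote on their outputs. The sketch below is a minimal software model of such a voter; the function name and the example bit patterns are illustrative assumptions.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three module outputs (integers)."""
    return (a & b) | (a & c) | (b & c)


# Two modules agree on 0b1010; the third has two corrupted bits.
good = 0b1010
faulty = 0b0110
voted = majority_vote(good, good, faulty)
print(f"voted output: {voted:04b}")
```

The voter masks any fault confined to a single module, at the cost of roughly tripling the hardware, which is why cheaper schemes are of interest for delay faults.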
Delay fault tolerant system
Delay fault detection
• Redundant timing margin in signal path
  • +: Second sampling at increased clock period
  • −: Decrease delay of reference signal between pipeline registers
[Figure: timing diagram over t showing t1, t2, the timing margin, and the 2nd sampling point.]
Delay fault tolerant system
Delay fault removal
• Reference signal (SR)
• Reprocessing at slower clock period (t')
[Figure: timing diagram over t showing t1, t2, the timing margin, the reference signal SR, and the slower clock period t'.]
Delay fault tolerant system: Example
RAZOR*
• Dynamic Voltage Scaling design
• Reduce supply voltage down to a manageable failure rate
[Figure: timing diagram showing t1, t2, and the timing margin.]
* D. Ernst et al., "Razor: a low-power pipeline based on circuit-level timing speculation," 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003
Delay fault tolerant system: Example
RAZOR continued
• Implemented at 120MHz clock frequency
• But for high speed circuits…
  • Managing two clocks
  • Minimum path delay constraint
  • Delay of MUX
Delay fault tolerant system: Example
Parity coding
• Parity generation based on output correlation
• Avoid well-correlated outputs for pairing
[Figure: timing diagram over t showing the timing margin.]
Now.. Let’s look at delay distribution(s)
Clock speed achieved for contained error rate
Delay fault tolerant system: Example
Parity coding (continued)
• Complexity
• Example: C499 ISCAS Benchmark
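Recently Proposed Design
Fault detection
• Partial hardware and time redundancy
[Figure: logic between latches Ln and Ln+1, with gates g0 to gm, divided at gate gi into FL and BL; a copy BL' drives latch L'n+1. The timing margin is marked on the time axis t.]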
Proposed Design
Fault removal
• Pipeline flush & reprocessing at lower clock
[Figure: logic between latches Ln and Ln+1, with gates g0 to gm, divided at gate gi into FL and BL; a copy BL' drives latch L'n+1.]
Proposed Design
Division of FL and BL
[Figure: block diagram with PI, FL, BL, and PO, plus a latch, a second BL block, and the comparison block CP producing the Error? signal.]
Proposed Design
Division of FL and BL
Considerations
• The effects on the original circuit should be minimal.
• Maximize delay fault detection coverage
• Minimize added complexity
Proposed Design
Division of FL and BL
• First, POs to BL
• Gate with longest delay to gate with shortest delay
• For the gates connected to BL, choose the gate with maximum delay
• Then, any gate whose number of fanouts > number of fanins
Proposed Design
Delay fault detection coverage
dFL: delay from PI to any gate in FL
di: delay from PI to any gate in original circuit
$C_F = 1 - \dfrac{\max\{d_{FL}\}}{\max\{d_i\}}$
Add graphical view
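A small sketch of the coverage metric, assuming it has the form C_F = 1 − max{dFL} / max{di}, where dFL are the delays from PI to gates in FL and di are the delays from PI to gates in the original circuit. The function name and the delay values are hypothetical, chosen only to show the calculation.

```python
def detection_coverage(fl_delays, all_delays):
    """Assumed form: C_F = 1 - max(d_FL) / max(d_i)."""
    return 1.0 - max(fl_delays) / max(all_delays)


# Hypothetical path delays (ns) from the primary inputs.
fl_delays = [1.0, 2.0, 3.0]          # gates assigned to FL
all_delays = [1.0, 2.0, 3.0, 5.0]    # every gate in the original circuit

print(f"C_F = {detection_coverage(fl_delays, all_delays):.2f}")
```

Under this assumed form, moving more of the long paths into BL lowers max{dFL} and raises the coverage.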
Proposed Design
Delay simulation
• SPICE simulation
  • TSMC 0.18um tech.
  • Vcc=1.6V
  • Gate delay for rising and falling signal
  • Load: inverter
  • Different input combinations are considered
Delay simulation
• Randomly generated test vectors
• 10^6 to 10^8 vectors, according to number of primary inputs (PI)
Proposed Design
Area complexity
Ngate: Number of gates in the original circuit
Nff: Number of flip-flops in each pipeline, (NPI+NPO)/2
Ngate_BL: Number of gates in BL
Ngate_CP: Number of gates in comparison block
NLatch: Number of latches = number of connections between FL and BL
w: Complexity ratio of flip-flop to gate
$C_A = \dfrac{N_{gate\_BL} + N_{gate\_CP} + N_{Latch}}{N_{gate} + w \cdot N_{ff}}$
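The area complexity ratio can be evaluated directly once the gate and latch counts are known. The sketch below assumes the form C_A = (Ngate_BL + Ngate_CP + NLatch) / (Ngate + w · Nff); the function name and all counts are hypothetical and are not taken from any benchmark result in the lecture.

```python
def added_complexity(n_gate_bl, n_gate_cp, n_latch, n_gate, n_ff, w):
    """Assumed form: C_A = (N_gate_BL + N_gate_CP + N_Latch) / (N_gate + w * N_ff)."""
    return (n_gate_bl + n_gate_cp + n_latch) / (n_gate + w * n_ff)


# Hypothetical circuit: 200 gates, 12 primary inputs, 8 primary outputs.
n_pi, n_po = 12, 8
n_ff = (n_pi + n_po) / 2   # flip-flops in each pipeline, as defined above

c_a = added_complexity(n_gate_bl=40, n_gate_cp=10, n_latch=10,
                       n_gate=200, n_ff=n_ff, w=10)
print(f"C_A = {c_a:.2f}")
```

The weight w lets flip-flops count as several gate equivalents, so the denominator reflects the full area of the original pipeline stage.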
Fault Coverage vs. Complexity
[Figure: four panels titled "Fault Detection Coverage vs. Added Complexity" for circuits C499, C432, C880, and C6288, each plotting added complexity CA against fault detection coverage CF.]
Complexity
Effective complexity penalty
• Depends on application
• More than half of area is cache
• Speed critical part: integer unit
$C_{AE} = C_A \cdot \dfrac{\text{Applicable area}}{\text{Total chip area}} < 0.5\,C_A$
Estimation of Complexity
[Figure: Intel® Pentium® 4 Processor on 90 nm process, with labeled blocks: data cache & AGU, align mux, registers, ALUs.]
Conclusion
• Delay fault tolerant design is proposed
• Possible operation clock frequency gain is estimated from modeling and experiments
• Delay fault detection coverage and complexity are analyzed for optimal implementation
• It shows that 10% clock frequency gain is possible with proposed design at a moderate (8-25%) complexity increase