ELEN 468 Advanced Logic Design, Lecture 29: Low Power Design


Power Dissipation

[Figure: power (watts, log scale from 0.1 to 100) versus year (1971 to 2000) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6. Courtesy, Intel]

Power increases despite Vdd decrease


Power Density

[Figure: power density (W/cm², log scale from 1 to 10000) versus year (1970 to 2010) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6, with reference levels marked Hot Plate, Nuclear Reactor, and Rocket Nozzle. Courtesy, Intel]


Why Power Increased

Growing die size, fast frequency scaling

[Figure: clock frequency (MHz, log scale from 10 to 10000) versus year (1985 to 2005)]


Gate Power Dissipation

• Leakage power
• Dynamic power
• Short circuit power


Dynamic Power

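Occurs at each switching event

$$P_d = C_L \cdot V_{dd}^2 \cdot f_p$$

where f_p is the switching frequency

[Figure: two inverter schematics (Vdd, out) with the device operating regions labeled Saturation and Linear]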
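The dynamic power relation above, Pd = CL · Vdd² · fp, can be checked numerically. The following is a minimal sketch; the load, supply, and frequency values are assumptions chosen for illustration and do not come from the slides.

```python
def dynamic_power(c_load, vdd, f_switch):
    """Dynamic power in watts: Pd = CL * Vdd^2 * fp."""
    return c_load * vdd ** 2 * f_switch


# Assumed example values: 10 fF load, 1 GHz switching frequency.
C_LOAD = 10e-15
F_SWITCH = 1e9

p_full = dynamic_power(C_LOAD, 1.2, F_SWITCH)      # supply at 1.2 V
p_half_vdd = dynamic_power(C_LOAD, 0.6, F_SWITCH)  # supply halved to 0.6 V

print(f"Pd at 1.2 V: {p_full * 1e6:.2f} uW")
print(f"Pd at 0.6 V: {p_half_vdd * 1e6:.2f} uW")
print(f"ratio: {p_full / p_half_vdd:.1f}")
```

Because Vdd enters as a square, halving the supply cuts dynamic power by a factor of four, whereas halving CL or fp only halves it.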


Leakage Power

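• Static: flows even when the gate is not switching
• Leakage current = a · Vdd
• Leakage current = b / Vt (leakage rises as Vt falls; the sub-threshold component grows exponentially as Vt is reduced)
• Killer to CMOS technology

[Figure: two inverter schematics (Vdd, out) with operating regions labeled Saturation and Linear and the leakage paths marked]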


Short Circuit Power

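During switching, there is a short moment when both the PMOS and NMOS transistors are partially on

$$P_s = Q \cdot (V_{dd} - V_t)^3 \cdot t_r \cdot f_p$$

where t_r is the input rise time

[Figure: two inverter schematics (Vdd, out) showing the short-circuit current for input rising and input falling]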
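As a rough numeric illustration of the short-circuit relation Ps = Q · (Vdd − Vt)³ · tr · fp, the sketch below evaluates it for two input rise times. The slides do not define Q, so it is treated here as an unspecified constant, and every numeric value is an assumption.

```python
def short_circuit_power(q, vdd, vt, t_rise, f_switch):
    """Ps = Q * (Vdd - Vt)^3 * tr * fp; Q is a constant the slides leave undefined."""
    return q * (vdd - vt) ** 3 * t_rise * f_switch


# All values below are illustrative assumptions, not data from the lecture.
Q = 1e-3
VDD, VT, F_SWITCH = 1.2, 0.4, 1e9

slow_edge = short_circuit_power(Q, VDD, VT, 200e-12, F_SWITCH)  # 200 ps rise time
fast_edge = short_circuit_power(Q, VDD, VT, 100e-12, F_SWITCH)  # 100 ps rise time

print(f"slow edge / fast edge = {slow_edge / fast_edge:.1f}")
```

Under this model the short-circuit power scales linearly with rise time, so sharper input edges shrink the window in which both devices conduct.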


Where Does Power Go?

[Figure: two stacked bar charts of power percentages (0% to 100%), each split into core transistor leakage, gate leakage, cache leakage, and active power]

• Scalable X86 CPU Design for 90nm: low VT devices are <1% of total non-memory transistor width [J. Schultz and C. Webb, ISSCC 2004]
• Total chip power based on ITRS roadmap: in 2004, we are just breaking even [Kim, et al, Computer 2003]


Energy – Performance Space

Every design is a point on a 2-D plane

[Figure: design points on a plane with Performance on the horizontal axis and Energy on the vertical axis]


Low Power Design

Reduce dynamic power
• α (activity factor): clock gating, sleep mode
• C: small transistors (esp. on clock), short wires
• VDD: lowest suitable voltage
• f: lowest suitable frequency

Reduce static power
• Selectively use low Vt devices
• Power gating, MTCMOS
• Stacked devices
• Body bias


Clock Gating

Gate off clock to idle functional units
• e.g., floating point units
• need logic to generate disable signal
  - increases complexity of control logic
  - consumes power
  - timing critical to avoid clock glitches at OR gate output
• additional gate delay on clock signal
  - gating OR gate can replace a buffer in the clock distribution tree

[Figure: clock and disable feed an OR gate whose output clocks the register (Reg) in front of the functional unit]
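The OR-gate scheme above can be illustrated with a small sampled-waveform model. This is only a sketch: the waveform resolution, the disable pattern, and the helper names are assumptions made for the example.

```python
def gated_clock(clock, disable):
    """OR-gate clock gating: the register sees (clock OR disable)."""
    return [c | d for c, d in zip(clock, disable)]


def rising_edges(wave):
    """Count 0 -> 1 transitions in a sampled waveform."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)


# Eight clock cycles, two samples per cycle (low half, then high half).
clock = [0, 1] * 8

# Hypothetical disable pattern: asserted for samples 3..10.
# It changes only at samples where the clock is high, so the OR output
# never produces an extra (glitch) edge.
disable = [0] * 3 + [1] * 8 + [0] * 5

gated = gated_clock(clock, disable)

print("edges on free-running clock:", rising_edges(clock))
print("edges seen by the register: ", rising_edges(gated))
```

While disable is high the OR output is held at 1, so the register sees no rising edges and its clock load stops switching. If disable were allowed to change while the clock is low, the OR output could show a spurious edge, which is the glitch hazard the slide warns about.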


Active Power Reduction - Supply Voltage Reduction

Static
Pros:
• Always active in saving
Cons:
• Additional power delivery network
• Needs special care of the interface between power domains
• Signals close to Vt: excessive leakage and reduced noise margins

Dynamic
Adjusting operating voltage and frequency to performance requirements:
• High performance: high Vdd & frequency
• Power saving: low Vdd & frequency
Pros:
• Doesn’t limit performance
Cons:
• Penalty of transition between different power states can be high (in performance and power)
• Additional control logic

[Figure: blocks labeled Slow, Slow, and Fast assigned to a high supply voltage and a low supply voltage]


Voltage Islands (Multi-Vdd)

• Allow both macro and cell voltage assignment
• Allow different voltage islands in the same circuit row
• Lift unnatural layout restrictions
• Minimal placement disturbance

[Figure: layout styles compared, labeled Lackey+ ICCAD’02, Usami+ JSSC’98, and GVI DAC’03, with Vddh and Vddl regions]


Level Converter

Interface circuit needed when a Vddl cell drives a Vddh cell, to avoid leakage

[Figure: a VddL gate driving a VddH gate, whose PMOS is left "weak on!"; a conventional dual supply level converter (Vddh, Vddl, IN, OUT); a new single supply level converter (Vddh, IN, OUT)]


Adjacency Metrics for Clustering

Logic adjacency metric (LAM): the Vddl fanin cone of a level shifter, traced without going through Vddh cells

[Figure: Vddh and Vddl cells with level converters LC1, LC2, and LC3, before and after clustering]

Physical adjacency metric (PAM): for each candidate Vddl cell, compute total size of its neighbor Vddl cells

• LAM to guide logic aware voltage assignment
• PAM to guide placement aware voltage re-assignment
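The PAM definition above (total size of the neighboring Vddl cells of a candidate) can be sketched as follows. The slides do not say how "neighbor" is defined, so this example assumes a Manhattan-distance radius; the `Cell` type, the radius, and the sample placement are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    x: float
    y: float
    size: float
    is_vddl: bool


def physical_adjacency(candidate, cells, radius):
    """Sum the sizes of Vddl cells within `radius` (Manhattan distance) of the candidate.

    The radius-based neighbor definition is an assumption of this sketch.
    """
    total = 0.0
    for other in cells:
        if other is candidate or not other.is_vddl:
            continue
        if abs(other.x - candidate.x) + abs(other.y - candidate.y) <= radius:
            total += other.size
    return total


# Hypothetical placement.
cells = [
    Cell("a", 0, 0, 2.0, True),
    Cell("b", 1, 0, 3.0, True),
    Cell("c", 0, 1, 1.5, False),  # Vddh neighbor: not counted
    Cell("d", 5, 5, 4.0, True),   # Vddl but too far away
]

pam_a = physical_adjacency(cells[0], cells, radius=2)
print(pam_a)  # 3.0
```

A candidate with a high PAM already sits among Vddl cells, so assigning it to Vddl helps form a contiguous island rather than a scattered one.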


Level Converter Optimizations

• Logic replacement (or gate sizing)
• LC/Buffer co-optimization

[Figure: multiplexers ZMUX1 and ZMUX2 with decoders (DEC), inputs A and B, and level converters (LC)]


Placement to Form Voltage Islands with Power Grid Co-design

• Based on Vddl and Vddh cell placement after voltage assignment, define Vddl/Vddh power grids on demand
• Detailed placement to form Vddl/Vddh voltage islands that can hit their corresponding power supplies

[Figure: power grids on demand, with alternating Vddl and Vddh rails over the placed cells]


Example of Voltage Islands

[Figure: chip layout with voltage islands (courtesy IBM)]

• Vddl = 1.2V, Vddh = 1.5V
• IBM Cu11, 0.13um, 400 MHz

No timing degradation, no area increase!


Dynamic Frequency and Voltage Scaling

Always run at the lowest supply voltage that meets the timing constraints
• DFS (dynamic frequency scaling) saves only power
• DVS (dynamic voltage scaling) + DFS saves both energy and power

A DVS+DFS system requires the following:
• A programmable clock generator (PLL)
  - PLL from 200MHz to 700MHz in increments of 33MHz
• A supply regulation loop that sets the minimum VDD necessary for operation at the desired frequency
  - 32 levels of VDD from 1.1V to 1.6V
• An operating system that sets the required frequency + supply voltage to meet the task completion deadlines
  - heavier load: ramp up VDD, when stable speed up clock
  - lighter load: slow down clock, when PLL locks onto new rate, ramp down VDD
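Two points from this slide lend themselves to a short sketch: why DFS alone saves power but not energy for a fixed amount of work, and the order in which voltage and frequency must change. The capacitance, cycle count, and step names below are assumptions for illustration; only the 1.1V to 1.6V and 200MHz to 700MHz end points come from the slide.

```python
def power(c_eff, vdd, freq):
    """Average dynamic power in watts: C * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq


def task_energy(c_eff, vdd, cycles):
    """Energy in joules for a fixed cycle count: power * (cycles / f) = C * Vdd^2 * cycles."""
    return c_eff * vdd ** 2 * cycles


def transition_steps(current, target):
    """Ordered steps to move between (vdd, freq) operating points."""
    _, cur_f = current
    new_v, new_f = target
    if new_f > cur_f:
        # Heavier load: raise VDD first, speed up the clock once VDD is stable.
        return [("set_vdd", new_v), ("wait_vdd_stable",), ("set_freq", new_f)]
    # Lighter load: slow the clock first, lower VDD once the PLL has locked.
    return [("set_freq", new_f), ("wait_pll_lock",), ("set_vdd", new_v)]


C_EFF = 1e-9   # assumed effective switched capacitance (F)
CYCLES = 1e9   # assumed task length in clock cycles

fast = (1.6, 700e6)
dfs_only = (1.6, 200e6)
dvs_dfs = (1.1, 200e6)

for name, (vdd, freq) in [("fast", fast), ("DFS only", dfs_only), ("DVS+DFS", dvs_dfs)]:
    print(f"{name:9s} power = {power(C_EFF, vdd, freq):.3f} W, "
          f"energy = {task_energy(C_EFF, vdd, CYCLES):.2f} J")

print(transition_steps(dvs_dfs, fast))
```

In this model the frequency cancels out of the task energy, so lowering f alone stretches the same energy over a longer time. Only the lower VDD reduces the energy per task. The ordering in `transition_steps` keeps the circuit from ever running at a frequency its present VDD cannot support.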


Leakage Reduction Techniques

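[Figure: four leakage reduction techniques]
• Stack effect: stacked devices of widths Wu and Wl below a pullup to Vdd, with intermediate node Vx
• Dual Vt partitioning: high Vt devices and low Vt devices
• Variable threshold (VTCMOS): Vnwell ≥ Vdd, Vpwell ≤ 0
• Multi-threshold (MTCMOS): low Vt logic between virtual Vdd and virtual Gnd, connected to Vdd and ground through HVT sleep transistors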


Natural Transistor Stacks

• Reduce the leakage by stacking the devices

How?
• Reduced Vds
• Negative Vgs
• Negative Vbs


Design with Dual Vth

Dual Vth design
• Two flavors of transistors: slow (high Vth) and fast (low Vth)
• Low Vth devices are faster, but have ≈10X the leakage

Dual Vth evaluation [figure not recovered]


Impacts of Variable VT

Reducing the VT increases the sub-threshold leakage current (exponentially)

$$V_T = V_{T0} + \gamma\left(\sqrt{\left|-2\phi_F + V_{SB}\right|} - \sqrt{\left|-2\phi_F\right|}\right)$$

where VT0 is the threshold voltage at VSB = 0, VSB is the source-bulk (substrate) voltage, γ is the body-effect coefficient, and φF is the Fermi potential

But, reducing VT decreases gate delay (increases performance)
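The body-effect relation can be evaluated directly, assuming the standard form VT = VT0 + γ(√|−2φF + VSB| − √|−2φF|). The sketch below uses illustrative NMOS parameters (VT0 = 0.45 V, γ = 0.4 V^0.5, φF = −0.3 V); these are assumptions and are not stated in the slides.

```python
import math


def threshold_voltage(vsb, vt0=0.45, gamma=0.4, phi_f=-0.3):
    """VT = VT0 + gamma * (sqrt(|-2*phi_F + VSB|) - sqrt(|-2*phi_F|)).

    vsb > 0 means the source sits above the bulk (reverse body bias).
    Default parameters are assumed, illustrative NMOS values.
    """
    return vt0 + gamma * (math.sqrt(abs(-2 * phi_f + vsb)) - math.sqrt(abs(-2 * phi_f)))


for vsb in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    print(f"VSB = {vsb:.1f} V -> VT = {threshold_voltage(vsb):.3f} V")
```

With these parameters VT equals VT0 at zero bias and rises by roughly 0.4 V at 2.5 V of reverse bias, growing with the square root of the bias rather than linearly.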


Variable VT through Body Bias

[Figure: VT (V), from 0.4 to 0.9, plotted against VSB (V), from -2.5 to 0]

For NMOS, the substrate is normally tied to ground (VSB = 0)

A negative bias on the substrate (a reverse body bias) causes VT to increase

Adjusting the substrate bias at runtime is called adaptive body-biasing (ABB) or dynamic threshold scaling (DTS)

Requires a triple well fab process



Forward/Reverse Body Biasing

RBB (Reverse Body Bias): zero body bias in active mode, a deep reverse bias in standby mode.
Disadvantages:
• Increases PN junction reverse leakage
• Technology scaling worsens short channel effects and weakens the Vth modulation capability

FBB (Forward Body Bias): high Vth in standby mode, forward body biasing to achieve better current drive in active mode.
Disadvantages:
• Larger junction capacitance
• High body effect for stacked devices


Implementation of Dynamic Vth Scaling (DTS)

• The lowest Vth is delivered (NBB, no body bias) if the highest performance is required.
• When the performance demand is low, clock frequency is lowered and Vth is raised via RBB to reduce the run time leakage power dissipation.

How?
• When the critical path replica frequency is less than the reference CLK, adjust bias to decrease Vth.
• Otherwise adjust bias to increase Vth.

Results: [figure not recovered]
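The "How?" rule above is a simple feedback loop, which can be sketched as follows. The step size, bias limits, and the linear replica model are hypothetical. The slides describe only the comparison between the critical path replica and the reference clock.

```python
def adjust_body_bias(replica_freq, ref_clk, bias, step=0.05, min_bias=0.0, max_bias=1.0):
    """One control step.

    `bias` is the reverse body bias magnitude in volts: larger means higher Vth.
    Step size and limits are assumptions of this sketch.
    """
    if replica_freq < ref_clk:
        bias -= step  # replica too slow: reduce reverse bias to lower Vth
    else:
        bias += step  # timing met: increase reverse bias to raise Vth and cut leakage
    return min(max(bias, min_bias), max_bias)


def replica_frequency(bias, f_nbb=500e6, slope=200e6):
    """Hypothetical linear model: the replica slows down as reverse bias grows."""
    return f_nbb - slope * bias


ref_clk = 400e6  # lowered clock for a low performance demand
bias = 0.0       # start at NBB (no body bias)
for _ in range(40):
    bias = adjust_body_bias(replica_frequency(bias), ref_clk, bias)

print(f"settled reverse bias: about {bias:.2f} V")
```

The loop settles into a small oscillation around the bias at which the replica just meets the reference clock, which is the highest Vth (lowest leakage) that still satisfies timing in this model.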


Power Gating Using Sleep Transistors

Leakage can also be reduced by gating the supply rails when the circuit is in sleep mode
• in normal mode, sleep = 0 and the sleep transistors must present as small a resistance as possible (via sizing)
• in sleep mode, sleep = 1 and the transistor stack effect reduces leakage by orders of magnitude

Or leakage can be eliminated by switching off the power supply (but the memory state is lost)


Example of Power Gating

[Figure: embedded power switches placed among rows of standard cells, driven by power switch control signals]

• Can reduce power 1000X
• Smaller voltage swing (IR drop on sleep transistors)
  - Lower performance
  - Increased noise coupling
• Local power grid design


Power Dissipation on Variation Tolerance

Conventional variation tolerance
• Using large timing safety margin
• Implies aggressive timing target
• Greater power dissipation

Observation
• Near-worst-case variations occur rarely
• Safety margin is applied continuously to guard against the small chance of variations
• Poor power efficiency


Question..

Can we deal with errors instead of preventing them from occurring by conservative binning/clocking?

How much can we speed up the circuit while keeping the error rate in a manageable range?


Fault tolerant system

Begin with reference values

Introduce redundancy
• Hardware: Triple Modular Redundancy
• Time: Repeated process
• Information: Code
• Software: various algorithms

What about delay faults? How do we detect (and maybe correct) errors?
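Triple Modular Redundancy, the hardware option listed above, is easy to illustrate: three copies of a module run on the same input and a bitwise majority voter masks a fault in any single copy. The `adder` module and the injected fault below are hypothetical examples.

```python
def majority(a, b, c):
    """Bitwise majority vote of three module outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x, fault_in=None):
    """Run three copies of `module`; optionally corrupt one copy's output."""
    outputs = [module(x) for _ in range(3)]
    if fault_in is not None:
        outputs[fault_in] ^= 0b0100  # hypothetical single-bit fault in one copy
    return majority(*outputs)


def adder(x):
    """Hypothetical 4-bit functional unit used as the replicated module."""
    return (x + 3) & 0xF


print(tmr(adder, 5))              # fault-free result
print(tmr(adder, 5, fault_in=1))  # same result with copy 1 corrupted
```

The voter masks any fault confined to one copy, at the cost of roughly three times the hardware. That cost is what motivates the cheaper, delay-specific schemes discussed next.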


Delay fault tolerant system

Delay fault detection
• Redundant timing margin in signal path
  +: Second sampling at increased clock period
  -: Decrease delay of reference signal between pipeline registers

[Figure: timing diagram showing signal arrival times t1 and t2, the timing margin, and the 2nd sampling relative to clock period t]


Delay fault tolerant system

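Delay fault removal
• Reference signal (SR)
• Reprocessing at slower clock period (t’)

[Figure: timing diagram showing t1, t2, the timing margin, clock period t, the reference signal SR, and the slower clock period t’]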
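Detection by a second, later sample and removal by reprocessing at the slower period t’ can be modeled together in a few lines. This is a behavioral sketch under simplifying assumptions: a single path with a fixed delay, and a register that captures the new value only if the path has settled by the sampling time. Function names and numbers are illustrative.

```python
def sample(path_delay, sample_time, old_value, new_value):
    """Value a register captures: the new data only if the path settled in time."""
    return new_value if path_delay <= sample_time else old_value


def run_stage(path_delay, t, margin, t_slow, old_value, new_value):
    """Return (captured value, fault_detected) for one pipeline stage."""
    first = sample(path_delay, t, old_value, new_value)
    second = sample(path_delay, t + margin, old_value, new_value)  # reference sample
    if first == second:
        return first, False
    # Mismatch: delay fault detected, reprocess at the slower clock period t'.
    return sample(path_delay, t_slow, old_value, new_value), True


# Assumed timing (arbitrary units): period t = 1.0, margin = 0.3, slow period t' = 1.5.
print(run_stage(0.9, 1.0, 0.3, 1.5, old_value=0, new_value=1))  # on time
print(run_stage(1.2, 1.0, 0.3, 1.5, old_value=0, new_value=1))  # late, caught and fixed
print(run_stage(1.4, 1.0, 0.3, 1.5, old_value=0, new_value=1))  # beyond the margin
```

The third case shows the limit of the scheme: a delay that exceeds t plus the margin corrupts both samples identically and goes undetected, so the margin bounds how aggressively the clock can be shortened.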


Delay fault tolerant system: Example

RAZOR*
• Dynamic voltage scaling design
• Reduce supply voltage down to a manageable failure rate

[Figure: timing diagram showing t1, t2, and the timing margin]

* Razor: a low-power pipeline based on circuit-level timing speculation, D. Ernst et al., 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003


RAZOR continued
• Implemented at 120MHz clock frequency
• But for high speed circuits…
  - Managing two clocks
  - Minimum path delay constraint
  - Delay of MUX




Parity coding
• Parity generation based on output correlation
• Avoid well-correlated outputs for pairing

[Figure: timing diagram showing the timing margin relative to clock period t]


Now.. Let’s look at delay distribution(s)


Clock speed achieved for contained error rate



Parity coding (continued)
• Complexity
• Example: C499 ISCAS Benchmark


Recently Proposed Design

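Fault detection
• Partial hardware and time redundancy

[Figure: pipeline registers Ln and Ln+1 with gates g0 to gm split into FL and BL, a duplicate BL' with gates gi to gm feeding L'n+1, and a timing diagram showing the timing margin relative to clock period t]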


Proposed Design

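Fault removal
• Pipeline flush & reprocessing at lower clock

[Figure: pipeline registers Ln and Ln+1 with gates g0 to gm split into FL and BL, and a duplicate BL' with gates gi to gm feeding L'n+1]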


Proposed Design

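Division of FL and BL

[Figure: logic between PI and PO divided into FL and BL, with a latch at the FL/BL boundary feeding a second copy of BL and a comparison block (CP) that raises "Error?"]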


Proposed Design

Division of FL and BL

Considerations
• The effects on the original circuit should be minimal.
• Maximize delay fault detection coverage
• Minimize added complexity


Proposed Design

Division of FL and BL
• First, POs to BL
  - Gate with longest delay to gate with shortest delay
• For the gates connected to BL, choose the gate with maximum delay
• Then, any gate whose number of fanouts > number of fanins


Proposed Design

Delay fault detection coverage
• d_FL: delay from PI to any gate in FL
• d_i: delay from PI to any gate in original circuit

$$C_F = 1 - \frac{\max\{d_{FL}\}}{\max\{d_i\}}$$
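Assuming the coverage metric has the form C_F = 1 − max{d_FL} / max{d_i}, as the surviving symbols on this slide suggest, it can be computed from two lists of path delays. The delay values below are made up for illustration.

```python
def fault_detection_coverage(d_fl, d_all):
    """C_F = 1 - max(d_FL) / max(d_i).

    d_fl:  delays from PI to each gate in FL
    d_all: delays from PI to each gate in the original circuit
    """
    return 1 - max(d_fl) / max(d_all)


# Hypothetical delays (arbitrary units); FL holds the earlier part of the logic.
d_all = [1.0, 2.0, 3.5, 5.0]
d_fl = [1.0, 2.0, 3.5]

c_f = fault_detection_coverage(d_fl, d_all)
print(f"C_F = {c_f:.2f}")
```

Under this reading, moving more gates from FL into BL shortens the longest FL delay and raises C_F, but it also enlarges the duplicated BL. That trade-off is what the complexity metric on the following slide captures.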


Proposed Design

Delay simulation
• SPICE simulation
  - TSMC 0.18um tech., Vcc = 1.6V
  - Gate delay for rising and falling signal
  - Load: inverter
  - Different input combinations are considered

Delay simulation
• Randomly generated test vectors
• 10^6 to 10^8 vectors, according to number of primary inputs (PI)


Proposed Design

Area complexity
• N_gate: number of gates in the original circuit
• N_ff: number of flip-flops in each pipeline, (N_PI + N_PO)/2
• N_gate_BL: number of gates in BL
• N_gate_CP: number of gates in comparison block
• N_Latch: number of latches = number of connections between FL and BL
• w: complexity ratio of flip-flop to gate

$$C_A = \frac{N_{gate\_BL} + N_{gate\_CP} + N_{Latch}}{N_{gate} + w \cdot N_{ff}}$$
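Assuming the area metric is C_A = (N_gate_BL + N_gate_CP + N_Latch) / (N_gate + w · N_ff), which is how the fragments on this slide appear to fit together, a small helper makes the trade-off easy to explore. All counts and the weight w below are hypothetical.

```python
def added_complexity(n_gate_bl, n_gate_cp, n_latch, n_gate, n_ff, w):
    """C_A = (N_gate_BL + N_gate_CP + N_Latch) / (N_gate + w * N_ff)."""
    return (n_gate_bl + n_gate_cp + n_latch) / (n_gate + w * n_ff)


# Hypothetical circuit: 200 gates, 40 primary inputs, 32 primary outputs.
n_gate = 200
n_ff = (40 + 32) // 2  # N_ff = (N_PI + N_PO) / 2
w = 5                  # assumed flip-flop to gate complexity ratio

c_a = added_complexity(n_gate_bl=40, n_gate_cp=30, n_latch=25,
                       n_gate=n_gate, n_ff=n_ff, w=w)
print(f"C_A = {c_a:.2f}")
```

The numerator counts only the hardware added for detection (duplicated BL gates, the comparison block, and the boundary latches), while the denominator is the cost of the original pipeline stage including its flip-flops.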


Fault Coverage vs. Complexity

[Figure: four plots of added complexity CA (vertical axis, 0 to about 0.5) against fault detection coverage CF (horizontal axis, 0 to about 0.6). Two panels are titled "Fault Detection Coverage vs. Added Complexity: C499" and "Fault Detection Coverage vs. Added Complexity: C432"; the titles of the other two panels were not recovered]


Complexity

Effective complexity penalty
• Depends on application
• More than half of area is cache
• Speed critical part: integer unit

$$C_{AE} = \frac{\text{Applicable area}}{\text{Total chip area}} \cdot C_A < 0.5 \cdot C_A$$


Estimation of Complexity

[Figure: Intel® Pentium® 4 Processor on 90 nm process, with blocks labeled "& AGU", Data Cache, Align Mux, Registers, and ALUs]


Conclusion

• Delay fault tolerant design is proposed
• Possible operation clock frequency gain is estimated from modeling and experiments
• Delay fault detection coverage and complexity are analyzed for optimal implementation
• It shows that 10% clock frequency gain is possible with proposed design at a moderate (8-25%) complexity increase
