Review: Designing Inverters for Performance Reduce C L l internal diffusion capacitance of the gate itself l interconnect capacitance l fanout Increase

Review: Designing Inverters for Performance Reduce CL

internal diffusion capacitance of the gate itself interconnect capacitance fanout

Increase W/L ratio of the transistor the most powerful and effective performance optimization

tool in the hands of the designer watch out for self-loading!

Increase VDD only minimal improvement in performance at the cost of

increased energy dissipation

Slope engineering - keeping signal rise and fall times smaller than or equal to the gate propagation delays and of approximately equal values good for performance good for power consumption

Switch Delay Model

A

Req

A

Rp

A

Rp

A

Rn CL

A

Cint

CintCL

A

Rn

A

Rp

B

Rp

B

Rn

NAND

INVERTER

B

Rp

A

Rp

A

Rn

B

Rn CL

NOR

Input Pattern Effects on Delay

Delay is dependent on the pattern of inputs

Low to high transition both inputs go low

- delay is 0.69 Rp/2 CL since two p-resistors are on in parallel

one input goes low

- delay is 0.69 Rp CL

High to low transition both inputs go high

- delay is 0.69 2Rn CL

Adding transistors in series (without sizing) slows down the circuit

CL

A

Rn

A

Rp

B

Rp

B

Rn Cint

Delay Dependence on Input Patterns

-0.5

0

0.5

1

1.5

2

2.5

3

0 100 200 300 400

A=B=10

A=1, B=10

A=1 0, B=1

time, psec

Vo

ltage

, V

Input Data

Pattern

Delay

(psec)

A=B=01 69

A=1, B=01 62

A= 01, B=1 50

A=B=10 35

A=1, B=10 76

A= 10, B=1 57

2-input NAND withNMOS = 0.5m/0.25 mPMOS = 0.75m/0.25 m CL = 10 fF

Transistor Sizing a Complex CMOS Gate

OUT = !(D + A • (B + C))

D

A

B C

D

A

B

C

1

2

2 2

2

2

4

4

6

6

12

12

Fan-In Considerations

DCBA

D

C

B

A CL

C3

C2

C1

Distributed RC model (Elmore delay)

tpHL = 0.69 Reqn(C1+2C2+3C3+4CL)

Propagation delay deteriorates rapidly as a function of fan-in – quadratically in the worst case.

tp as a Function of Fan-In

0

250

500

750

1000

1250

2 4 6 8 10 12 14 16

tpHL

tpLH

t p (

pse

c)

fan-in

quadratic function of fan-in

linear function of fan-in

Gates with a fan-in greater than 4 should be avoided.

tp

Fast Complex Gates: Design Technique 1 Transistor sizing

as long as fan-out capacitance dominates

Progressive sizing

InN CL

C3

C2

C1In1

In2

In3

M1

M2

M3

MN

Distributed RC line

M1 > M2 > M3 > … > MN

(the fet closest to the output should be the smallest)

Can reduce delay by more than 20%; decreasing gains as technology shrinks

Fast Complex Gates: Design Technique 2

Input re-ordering when not all inputs arrive at the same time

C2

C1In1

In2

In3

M1

M2

M3 CL

C2

C1In3

In2

In1

M1

M2

M3 CL

critical path critical path

charged1

01charged

charged1

delay determined by time to discharge CL, C1 and C2

delay determined by time to discharge CL

1

1

01 charged

discharged

discharged

Sizing and Ordering Effects

DCBA

D

C

B

A CL

C3

C2

C1

Progressive sizing in pull-down chain gives up to a 23% improvement.

Input ordering saves 5% critical path A – 23% critical path D – 17%

3 3 3 3

4

4

4

4

4

5

6

7

= 100 fF

Fast Networks: Design Technique 5 - Logical Effort The optimum fan-out for a chain of N inverters driving a

load CL isf = (CL/Cin)

so, if we can, keep the fan-out per stage around 4.

Can the same approach (logical effort) be used for any combinational circuit?

For a complex gate, we expand the inverter equation

tp = tp0 (1 + Cext/ Cg) = tp0 (1 + f/) to

tp = tp0 (p + g f/)

- tp0 is the intrinsic delay of an inverter

- f is the effective fan-out (Cext/Cg) – also called the electrical effort

- p is the ratio of the instrinsic (unloaded) delay of the complex gate and a simple inverter (a function of the gate topology and layout style)

- g is the logical effort

N

Intrinsic Delay Term, p

The more involved the structure of the complex gate, the higher the intrinsic delay compared to an inverter

Gate Type p

Inverter 1

n-input NAND n

n-input NOR n

n-way mux 2n

XOR, XNOR n 2n-1

Ignoring second order effects such as internal node capacitances

Logical Effort Term, g g represents the fact that, for a given load, complex gates

have to work harder than an inverter to produce a similar (speed) response

the logical effort of a gate tells how much worse it is at producing an output current than an inverter (how much more input capacitance a gate presents to deliver it same output current)

Gate Type g (for 1 to 4 input gates)

1 2 3 4

Inverter 1

NAND 4/3 5/3 (n+2)/3

NOR 5/3 7/3 (2n+1)/3

mux 2 2 2

XOR 4 12

Example of Logical Effort Assuming a pmos/nmos ratio of 2, the input capacitance

of a minimum-sized inverter is three times the gate capacitance of a minimum-sized nmos (Cunit)

A + B

A

B

A B

A

B

A • B

A B

AA

A 2

1

Cunit = 3

2 2

2

2

Cunit = 4

4

4

1 1

Cunit = 5

Delay as a Function of Fan-Out

The slope of the line is the logical effort of the gate

The y-axis intercept is the intrinsic delay

0

1

2

3

4

5

6

7

0 1 2 3 4 5

nor

ma

lize

d d

ela

y

fan-out f

NAND2: g=4/3, p

= 2

INV: g=1, p=1

intrinsic delay

effort delay Can adjust the delay by

adjusting the effective fan-out (by sizing) or by choosing a gate with a different logical effort

Gate effort: h = fg

Path Delay of Complex Logic Gate Network Total path delay through a combinational logic block

tp = tp,j = tp0 (pj + (fj gj)/ )

So, the minimum delay through the path determines that each stage should bear the same gate effort

f1g1 = f2g2 = . . . = fNgN

Consider optimizing the delay through the logic network

how do we determine a, b, and c sizes?

1a b c

CL5

Path Delay Equation Derivation The path logical effort, G = gi

And the path effective fan-out (path electrical effort) is F = CL/g1

The branching effort accounts for fan-out to other gates in the network

b = (Con-path + Coff-path)/Con-path

The path branching effort is then B = bi

And the total path effort is then H = GFB

So, the minimum delay through the path is

D = tp0 ( pj + (N H)/ )

N

Path Delay of Complex Logic Gates, con’t

For gate i in the chain, its size is determined by

si = (g1 s1)/gi (fj/bj)j=1

i-1

1a b c

CL5

For this network F = CL/Cg1 = 5 G = 1 x 5/3 x 5/3 x 1 = 25/9 B = 1 (no branching) H = GFB = 125/9, so the optimal stage effort is H = 1.93

- Fan-out factors are f1=1.93, f2=1.93 x 3/5 = 1.16, f3 = 1.16, f4 = 1.93

So the gate sizes are a = f1g1/g2 = 1.16, b = f1f2g1/g3 = 1.34 and c = f1f2f3g1/g4 = 2.60

4

Fast Complex Gates: Design Technique 6

Reducing the voltage swing

linear reduction in delay also reduces power consumption requires use of “sense amplifiers” on the receiving end to

restore the signal level (will look at their design when covering memory design)

tpHL = 0.69 (3/4 (CL VDD)/ IDSATn )

= 0.69 (3/4 (CL Vswing)/ IDSATn )

TG Logic Performance Effective resistance of the TG is modeled as a parallel

connection of Rp (= (VDD – Vout)/(-IDp)) and Rn (=VDD – Vout)/IDn)

0

5

10

15

20

25

30

0 1 2Vout, V

Res

ista

nce,

k

Rp

Rn

2.5V

0V

2.5V VoutRp

Rn

Req = Rn || Rp W/Ln=0.50/0.25

W/Lp=0.50/0.25

So, the assumption that the TG switch has a constant resistive value, Req, is acceptable

Delay of a TG Chain

C C C C

VN

V1 Vi Vi+1

5

0

5

0

5

0

5

0

C C C C

Req Req Req ReqVin

VN

V1 Vi Vi+1

Vin

Delay of the RC chain (N TG’s in series) is

tp(Vn) = 0.69 kCReq = 0.69 CReq (N(N+1))/2 0.35 CReqN2

k=1

N

TG Delay Optimization Can speed it up by inserting buffers every M switches

Delay of buffered chain (M TG’s between buffer)

tp = 0.69 N/M CReq (M(M+1))/2 + (N/M - 1) tpbuf

Mopt = 1.7 (tpbuf/CReq ) 3 or 4

C

VinVN

5

0

5

0

5

0

C C5

0

5

0

C 5

0

C

M

Why Power Matters

Packaging costs

Power supply rail design

Chip and system cooling costs

Noise immunity and system reliability

Battery life (in portable systems)

Environmental concerns Office equipment accounted for 5% of total US commercial

energy usage in 1993 Energy Star compliant systems

Chip Power Density Distribution

Power density is not uniformly distributed across the chip

Silicon is not a good heat conductor

Max junction temperature is determined by hot-spots Impact on packaging, w.r.t. cooling

Power Map On-Die Temperature

Power and Energy Figures of Merit

Power consumption in Watts determines battery life in hours

Peak power determines power ground wiring designs sets packaging limits impacts signal noise margin and reliability analysis

Energy efficiency in Joules rate at which power is consumed over time

Energy = power * delay Joules = Watts * seconds lower energy number means less power to perform a

computation at the same frequency

Power versus Energy

Watts

time

Power is height of curve

Watts

time

Approach 1

Approach 2

Approach 2

Approach 1

Energy is area under curve

Lower power design could simply be slower

Two approaches require the same energy

PDP and EDP Power-delay product (PDP) = Pav * tp = (CLVDD

2)/2 PDP is the average energy consumed per switching event

(Watts * sec = Joule) lower power design could simply be a slower design

allows one to understand tradeoffs better

0

5

10

15

0.5 1 1.5 2 2.5

Vdd (V)

Energ

y-Dela

y (no

rmali

zed)

energy-delay

energy

delay

Energy-delay product (EDP) = PDP * tp = Pav * tp2

EDP is the average energy consumed multiplied by the computation time required

takes into account that one can trade increased delay for lower energy/operation (e.g., via supply voltage scaling that increases delay, but decreases energy consumption)

Understanding Tradeoffs

Ene

rgy

1/Delay

a

b

c

d

Lower EDP

Which design is the “best” (fastest, coolest, both) ?

bett

er

better

CMOS Energy & Power Equations

E = CL VDD2 P01 + tsc VDD Ipeak P01 + VDD Ileakage

P = CL VDD2 f01 + tscVDD Ipeak f01 + VDD Ileakage

Dynamic power

Short-circuit power

Leakage power

f01 = P01 * fclock

Dynamic Power Consumption

Energy/transition = CL * VDD2 * P01

Pdyn = Energy/transition * f = CL * VDD2 * P01 * f

Pdyn = CEFF * VDD2 * f where CEFF = P01 CL

Not a function of transistor sizes!Data dependent - a function of switching activity!

Vin Vout

CL

Vdd

f01

Lowering Dynamic Power

Pdyn = CL VDD2 P01 f

Capacitance:Function of fan-out, wire length, transistor sizes

Supply Voltage:Has been dropping with successive generations

Clock frequency:Increasing…

Activity factor:How often, on average, do wires switch?

Short Circuit Power Consumption

Finite slope of the input signal causes a direct current path between VDD and GND for a short period of time during switching when both the NMOS and PMOS transistors are conducting.

Vin Vout

CL

Isc

Short Circuit Currents Determinates

Duration and slope of the input signal, tsc

Ipeak determined by the saturation current of the P and N transistors which

depend on their sizes, process technology, temperature, etc. strong function of the ratio between input and output slopes

- a function of CL

Esc = tsc VDD Ipeak P01

Psc = tsc VDD Ipeak f01

Impact of CL on Psc

Vin Vout

CL

Isc 0

Vin Vout

CL

Isc Imax

Large capacitive load

Output fall time significantly larger than input rise time.

Small capacitive load

Output fall time substantially smaller than the input rise

time.

Ipeak as a Function of CL

-0.5

0

0.5

1

1.5

2

2.5

0 2 4 6

I pea

k (A

)

time (sec)

x 10-10

x 10-4

CL = 20 fF

CL = 100 fF

CL = 500 fF

500 psec input slope

Short circuit dissipation is minimized by matching the rise/fall times of the input and output signals - slope engineering.

When load capacitance is small, Ipeak is large.

Psc as a Function of Rise/Fall Times

0

1

2

3

4

5

6

7

8

0 2 4

P n

orm

aliz

ed

tsin/tsout

VDD= 3.3 V

VDD = 2.5 V

VDD = 1.5V

normalized wrt zero input rise-time dissipation

When load capacitance is small (tsin/tsout > 2 for VDD > 2V) the power is dominated by Psc

If VDD < VTn + |VTp| then Psc is eliminated since both devices are never on at the same time.

W/Lp = 1.125 m/0.25 mW/Ln = 0.375 m/0.25 mCL = 30 fF

Leakage (Static) Power Consumption

Sub-threshold current is the dominant factor.

All increase exponentially with temperature!

VDD Ileakage

Vout

Drain junction leakage

Sub-threshold currentGate leakage

Leakage as a Function of VT

0 0.2 0.4 0.6 0.8 1

VGS (V)

ID (A

)

VT=0.4VVT=0.1V

10-2

10-12

10-7

Continued scaling of supply voltage and the subsequent scaling of threshold voltage will make subthreshold conduction a dominate component of power dissipation.

An 90mV/decade VT roll-off - so each 255mV increase in VT gives 3 orders of magnitude reduction in leakage (but adversely affects performance)

TSMC Processes Leakage and VT

80

0.25 V

13,000

920/400

0.08 m

24 Å

1.2 V

CL013 HS

52

0.29 V

1,800

860/370

0.11 m

29 Å

1.5 V

CL015 HS

42 Å42 Å42 Å42 ÅTox (effective)

43142230FET Perf. (GHz)

0.40 V0.73 V0.63 V0.42 VVTn

3000.151.6020Ioff (leakage) (A/m)

780/360320/130500/180600/260IDSat (n/p) (A/m)

0.13 m 0.18 m 0.16 m 0.16 m Lgate

2 V1.8 V1.8 V1.8 VVdd

CL018 HS

CL018 ULP

CL018 LP

CL018 G

From MPR, 2000

Exponential Increase in Leakage Currents

1

10

100

1000

10000

30 40 50 60 70 80 90 100 110

0.25

0.18

0.13

0.1

Temp(C)

I leak

age(n

A/

m)

From De,1999

Review: Energy & Power Equations

E = CL VDD2 P01 + tsc VDD Ipeak P01 + VDD Ileakage

P = CL VDD2 f01 + tscVDD Ipeak f01 + VDD Ileakage

Dynamic power(~90% today and

decreasing relatively)

Short-circuit power

(~8% today and decreasing absolutely)

Leakage power(~2% today and

increasing)

f01 = P01 * fclock

Power and Energy Design Space

Constant Throughput/Latency

Variable Throughput/Latency

Energy Design Time Non-active Modules Run Time

Active

Logic Design

Reduced Vdd

Sizing

Multi-Vdd

Clock Gating

DFS, DVS

(Dynamic Freq, Voltage

Scaling)

Leakage + Multi-VT

Sleep Transistors

Multi-Vdd

Variable VT

+ Variable VT

Dynamic Power as a Function of Device Size

Device sizing affects dynamic energy consumption gain is largest for networks with large overall effective fan-outs (F

= CL/Cg,1) The optimal gate sizing factor

(f) for dynamic energy is smaller than the one for performance, especially for large F’s

e.g., for F=20, fopt(energy) = 3.53 while fopt(performance) = 4.47

If energy is a concern avoid oversizing beyond the optimal 1 2 3 4 5 6 7

0

0.5

1

1.5

f

norm

aliz

ed e

nerg

y

F=1

F=2

F=5

F=10

F=20

From Nikolic, UCB

Dynamic Power Consumption is Data Dependent

A B Out

0 0 1

0 1 0

1 0 0

1 1 0

2-input NOR Gate

With input signal probabilities PA=1 = 1/2 PB=1 = 1/2

Static transition probability P01 = Pout=0 x Pout=1

= P0 x (1-P0)

Switching activity, P01, has two components A static component – function of the logic topology A dynamic component – function of the timing behavior (glitching)

NOR static transition probability = 3/4 x 1/4 = 3/16

NOR Gate Transition Probabilities

CL

A

B

BA

P01 = P0 x P1 = (1-(1-PA)(1-PB)) (1-PA)(1-PB)

PA

PB

0

1 0 1

Switching activity is a strong function of the input signal statistics PA and PB are the probabilities that inputs A and B are one

Transition Probabilities for Some Basic Gates

P01 = Pout=0 x Pout=1

NOR (1 - (1 - PA)(1 - PB)) x (1 - PA)(1 - PB)

OR (1 - PA)(1 - PB) x (1 - (1 - PA)(1 - PB))

NAND PAPB x (1 - PAPB)

AND (1 - PAPB) x PAPB

XOR (1 - (PA + PB- 2PAPB)) x (PA + PB- 2PAPB)

B

AZ

X0.5

0.5

For Z: P01 = P0 x P1 = (1-PXPB) PXPB

For X: P01 = P0 x P1 = (1-PA) PA

= 0.5 x 0.5 = 0.25

= (1 – (0.5 x 0.5)) x (0.5 x 0.5) = 3/16

Inter-signal Correlations

B

A

Z

X

P(Z=1) = P(B=1) & P(A=1 | B=1)

0.5

0.5

(1-0.5)(1-0.5)x(1-(1-0.5)(1-0.5)) = 3/16

(1- 3/16 x 0.5) x (3/16 x 0.5) = 0.085Reconvergent

Determining switching activity is complicated by the fact that signals exhibit correlation in space and time reconvergent fan-out

Have to use conditional probabilities

Logic Restructuring

Chain implementation has a lower overall switching activity than the tree implementation for random inputs

Ignores glitching effects

Logic restructuring: changing the topology of a logic network to reduce transitions

A

BC

D F

AB

CD Z

FW

X

Y0.5

0.5

(1-0.25)*0.25 = 3/16

0.50.5

0.5

0.5

0.5

0.5

7/64

15/256

3/16

3/16

15/256

AND: P01 = P0 x P1 = (1 - PAPB) x PAPB

Input Ordering

Beneficial to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)

A

BC

X

F

0.5

0.20.1

B

CA

X

F

0.2

0.10.5

(1-0.5x0.2)x(0.5x0.2)=0.09 (1-0.2x0.1)x(0.2x0.1)=0.0196

Glitching in Static CMOS Networks

ABC

X

Z

101 000

Unit Delay

AB

X

ZC

Gates have a nonzero propagation delay resulting in spurious transitions or glitches (dynamic hazards) glitch: node exhibits multiple transitions in a single cycle before

settling to the correct logic value

Glitching in an RCA

S0S1S2S14S15

Cin

0

1

2

3

0 2 4 6 8 10 12

Time (ps)

S O

utp

ut

Vo

ltag

e (

V)

Cin

S0

S1

S2

S3

S4

S5S10

S15

Balanced Delay Paths to Reduce Glitching

So equalize the lengths of timing paths through logic

F1

F2

F3

0

0

0

0

1

2

F1

F2

F3

0

0

0

0

1

1

Glitching is due to a mismatch in the path lengths in the logic network; if all input signals of a gate change simultaneously, no glitching occurs





Active

Logic Design

Reduced Vdd

Sizing

Multi-Vdd

Clock Gating

DFS, DVS


Scaling)

Leakage + Multi-VT

Sleep Transistors

Multi-Vdd

Variable VT

+ Variable VT

Dynamic Power as a Function of VDD

Decreasing the VDD

decreases dynamic energy consumption (quadratically)

But, increases gate delay (decreases performance)

1

1.5

2

2.5

3

3.5

4

4.5

5

5.5

0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4

VDD (V) t p

( no

r ma

l ize

d)

Determine the critical path(s) at design time and use high VDD for the transistors on those paths for speed. Use a lower VDD on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits).

Multiple VDD Considerations How many VDD? – Two is becoming common

Many chips already have two supplies (one for core and one for I/O)

When combining multiple supplies, level converters are required whenever a module at the lower supply drives a gate at the higher supply (step-up)

If a gate supplied with VDDL drives a gate at VDDH, the PMOS never turns off

- The cross-coupled PMOS transistors do the level conversion

- The NMOS transistor operate on a reduced supply

Level converters are not needed for a step-down change in voltage

Overhead of level converters can be mitigated by doing conversions at register boundaries and embedding the level conversion inside the flipflop (see Figure 11.47)

VDDH

Vin

VoutVDDL

Dual-Supply Inside a Logic Block Minimum energy consumption is achieved if all logic

paths are critical (have the same delay)

Clustered voltage-scaling Each path starts with VDDH and switches to VDDL (gray logic

gates) when delay slack is available Level conversion is done in the flipflops at the end of the paths





Active

Logic Design

Reduced Vdd

Sizing

Multi-Vdd

Clock Gating

DFS, DVS


Scaling)

Leakage + Multi-VT

Sleep Transistors

Multi-Vdd

Variable VT

+ Variable VT

Stack Effect Leakage is a function of the circuit topology and the value

of the inputs

VT = VT0 + (|-2F + VSB| - |-2F|)

where VT0 is the threshold voltage at VSB = 0; VSB is the source- bulk (substrate) voltage; is the body-effect coefficient

A B

B

A

Out

VX

A B VX ISUB

0 0 VT ln(1+n) VGS=VBS= -VX

0 1 0 VGS=VBS=0

1 0 VDD-VT VGS=VBS=0

1 1 0 VSG=VSB=0

Leakage is least when A = B = 0

Leakage reduction due to stacked transistors is called the stack effect

Short Channel Factors and Stack Effect In short-channel devices, the subthreshold leakage

current depends on VGS,VBS and VDS. The VT of a short-channel device decreases with increasing VDS due to DIBL (drain-induced barrier loading).

Typical values for DIBL are 20 to 150mV change in VT per voltage change in VDS so the stack effect is even more significant for short-channel devices.

VX reduces the drain-source voltage of the top nfet, increasing its VT and lowering its leakage

For our 0.25 micron technology, VX settles to ~100mV in steady state so VBS = -100mV and VDS = VDD -100mV which is 20 times smaller than the leakage of a device with VBS = 0mV and VDS = VDD

Leakage as a Function of Design Time VT

Reducing the VT increases the sub-threshold leakage current (exponentially)

90mV reduction in VT increases leakage by an order of magnitude

But, reducing VT decreases gate delay (increases performance)

0 0.2 0.4 0.6 0.8 1

VGS (V)ID

(A)

VT=0.4VVT=0.1V

Determine the critical path(s) at design time and use low VT devices on the transistors on those paths for speed. Use a high VT on the other logic for leakage control.

A careful assignment of VT’s can reduce the leakage by as much as 80%

Dual-Thresholds Inside a Logic Block

Minimum energy consumption is achieved if all logic paths are critical (have the same delay)

Use lower threshold on timing-critical paths Assignment can be done on a per gate or transistor basis; no

clustering of the logic is needed No level converters are needed

Variable VT (ABB) at Run Time VT = VT0 + (|-2F + VSB| - |-2F|)

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

-2.5 -2 -1.5 -1 -0.5 0

VSB (V)

VT (

V)

A negative bias on VSB causes VT to increase

Adjusting the substrate bias at run time is called adaptive body-biasing (ABB)

Requires a dual well fab process

For an n-channel device, the substrate is normally tied to ground (VSB = 0)

Documents

Review: Designing Inverters for Performance Reduce C L l internal diffusion capacitance of the gate itself l interconnect capacitance l fanout Increase